
isovic / racon

Ultrafast consensus module for raw de novo genome assembly of long uncorrected reads. http://genome.cshlp.org/content/early/2017/01/18/gr.214270.116 Note: This was the original repository which will no longer be officially maintained. Please use the new official repository here:

https://github.com/lbcb-sci/racon

MIT License

268 stars 48 forks

reading in files #94

Open devonorourke opened 5 years ago

devonorourke commented 5 years ago

Hi Robert, Apologies for a potentially simple question: what is the fastest way to read the sequence data into memory with Racon? I'm using a pretty big AWS instance (r5.24xlarge - see details here) and have 96 cores at my disposal. It's my very non-technical understanding that reading in the file may not be able to leverage all these cores? Thanks very much, Devon

rvaser commented 5 years ago

Hi Devon, unfortunately I only have a single-threaded parser, which is kinda problematic when you want to read really big files. What sizes of files are you dealing with?

Best regards, Robert

devonorourke commented 5 years ago

Thanks for the quick reply. Bummer though! Values are for uncompressed data:

.paf is about 80G,
.fq reads are about 187G
.fa draft reference is just 2G

It seems to take about an hour and a half at the moment to load the target sequences, sequences, and overlaps. I think there should be sufficient memory to run the program given that there was about 768G of RAM to start. Or, maybe I've just wasted about $40 of my money on a machine that's about to crash, ha.

Thanks for any suggestions you might have.

rvaser commented 5 years ago

Probably the slowest part is .paf parsing. You can drastically reduce its size if you replace the read names in your .fq file with integers and then rerun minimap or whatever you are running. But do not stop racon now, so you do not lose valuable time on the server. :D Did the aligning part start?
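The renaming trick described above can be sketched in a few lines of Python. This is only an illustration, not the one-liner referenced elsewhere in the thread: the function name `rename_fastq_records` is hypothetical, and it assumes a plain, uncompressed .fq file with exactly 4 lines per record and no trailing blank lines.

```python
from typing import Iterable, Iterator


def rename_fastq_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTQ lines with every read name replaced by a running integer.

    Assumption: exactly 4 lines per record (header, sequence, '+', quality).
    Folded (multi-line) records are NOT handled by this sketch.
    """
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        position = i % 4
        if position == 0:
            if not line.startswith("@"):
                raise ValueError(f"line {i + 1}: expected '@' header, got {line[:20]!r}")
            # Replace the whole header (name and description) with 1, 2, 3, ...
            yield f"@{i // 4 + 1}"
        elif position == 2:
            # The separator line may repeat the read name; keep only '+'.
            yield "+"
        else:
            # Sequence and quality lines pass through unchanged.
            yield line


# Small in-memory demonstration with two made-up reads.
example = ["@read_abc some description\n", "ACGT\n", "+read_abc\n", "IIII\n",
           "@read_xyz\n", "GG\n", "+\n", "II\n"]
for renamed_line in rename_fastq_records(example):
    print(renamed_line)
```

For a real file, the same generator can be fed an open file handle and its output written to a new .fq file. The overlaps then have to be regenerated from the renamed reads so that the names in the .paf match.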

devonorourke commented 5 years ago

I think so; here's what has been reported to my tmux window so far:

[racon::Polisher::initialize] loaded target sequences
[racon::Polisher::initialize] loaded sequences
[racon::Polisher::initialize] loaded overlaps

Would it be much work to modify these [racon::Polisher::initialize] lines to include a timestamp? Something simple like the Linux date program, so we could see a log that produces the same information as above, but with a little bit of time info. I'm sure other things like CPU/wall time, etc. might also be of value, but even knowing how long each task ran would certainly help.
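For a build without timestamps, the log lines can be stamped from the outside as they arrive. The sketch below is a generic workaround, not part of racon: the helper name `stamp_lines` is hypothetical, and it assumes the log is available as a line-by-line text stream (for example, the program's log output piped into a script). Each stamp records when the line was read, which only approximates when the step finished.

```python
import time
from typing import Callable, Iterable, Iterator


def stamp_lines(lines: Iterable[str],
                clock: Callable[[], float] = time.monotonic) -> Iterator[str]:
    """Prefix each log line with total elapsed time and time since the last line."""
    start = clock()
    previous = start
    for line in lines:
        now = clock()
        # Total seconds since start, then seconds since the previous line.
        yield f"[{now - start:.1f}s total, +{now - previous:.1f}s] {line.rstrip()}"
        previous = now


# Demonstration on the three log lines quoted above (elapsed times will be
# near zero here because the lines come from a list, not a running job).
log = [
    "[racon::Polisher::initialize] loaded target sequences\n",
    "[racon::Polisher::initialize] loaded sequences\n",
    "[racon::Polisher::initialize] loaded overlaps\n",
]
for stamped in stamp_lines(log):
    print(stamped)
```

Passing `sys.stdin` instead of the list would stamp a live stream. The `clock` parameter exists only so the function can be exercised with a fake clock.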

I recall you mentioning the trick about reducing file sizes by renaming headers in a previous Issues thread; totally forgot about it - thanks! I'll give that a shot for the next iteration.

rvaser commented 5 years ago

I had planned to integrate timestamps in several projects but did not have the time yet. Have to decrease the number of log messages as well! :)

devonorourke commented 5 years ago

Well two cheers for you for creating such a great tool, but three cheers when you've finished that timestamp issue!

Quick question from a recent issues post 93 - can you confirm that if I have: 100G .fastq, 45G .paf, and 5G .fasta files being loaded in, then the memory requirements for this job are 150G of RAM (give or take, like 5G)? In other words, I shouldn't need 300G to do this, right?

It seems like a lot of your issue posts are about insufficient memory. Maybe just toss your recommended memory requirements into the README.md of this repo? Maybe that and the .fastq renaming trick (this post had a really simple one-liner to use for just such a trick).

rvaser commented 5 years ago

If you have long reads then yes, you can just sum up the file sizes and add some epsilon (add the target size twice because of the newly generated consensus; I forgot about that in the mentioned issue). If you have Illumina, the memory requirements are a bit higher because the sequence class overhead equals half the .fastq file if all sequences are around 100bp. So you would get something like 1.5 * .fastq + .paf + 2 * .fasta. If you have a .sam file instead, the memory requirements are higher due to cigar strings being loaded.
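The rule of thumb above can be written out as a small calculation. This is a rough sketch of the estimate as stated in this comment, not an official formula: the function name `estimate_memory_gb` is hypothetical, the "epsilon" is left out, and the .sam case (which needs more memory) is not modeled.

```python
def estimate_memory_gb(reads_gb: float, overlaps_gb: float, target_gb: float,
                       short_reads: bool = False) -> float:
    """Estimate RAM in GB from uncompressed file sizes, excluding the epsilon.

    long reads:        .fastq + .paf + 2 * .fasta
    ~100bp short reads: 1.5 * .fastq + .paf + 2 * .fasta
    """
    read_factor = 1.5 if short_reads else 1.0
    # The target is counted twice because the new consensus is held as well.
    return read_factor * reads_gb + overlaps_gb + 2 * target_gb


# Sizes from the question: 100G .fastq, 45G .paf, 5G .fasta.
print(estimate_memory_gb(100, 45, 5))                    # long reads
print(estimate_memory_gb(100, 45, 5, short_reads=True))  # Illumina-like reads
```

With the long-read sizes from the question this gives 155 rather than 150, the difference being the second copy of the target.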

The awk command you referenced is fine if the .fastq file has 4 lines per sequence, but I came across folded files which are now supported in racon as well.

I have noted all your requests and will update racon with them when I get the time :)

devonorourke commented 5 years ago

Sounds great; appreciate your feedback

rvaser commented 5 years ago

Timestamps and reduced logging are implemented in the latest commit (v1.3.3).