paoloshasta / shasta

De novo assembly from Oxford Nanopore reads.
https://paoloshasta.github.io/shasta/

Shasta with gzipped input FASTQ #19

Closed Adoni5 closed 8 months ago

Adoni5 commented 8 months ago

Hi @paoloshasta - thanks for the great work improving Shasta.

I was wondering if there is currently a way to pass compressed input (.gz or .xz in this case) into Shasta? I've tried directly assembling compressed FASTQ and I can't get it to work.

Thanks, Rory

paoloshasta commented 8 months ago

No, you have to separately decompress the file first. This has been requested before but I decided not to implement it for the following reason. The assembly runs on a large, expensive machine with a large number of CPUs and a lot of memory. It does not make economic sense to tie up that machine for a long time just to do a decompression, an essentially sequential step that can instead run on a much less expensive machine.

You could use shasta/scripts/FastqGzToFasta.py to decompress the fastq.gz file and convert it to fasta in a single step, writing the result directly to disk. Because the fasta file is smaller than the uncompressed fastq, the decompression step does less I/O and so runs faster, and Shasta can also read the smaller file faster. Finally, less disk space is required: the uncompressed fasta is usually comparable in size to the compressed fastq.gz.
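For anyone curious what such a one-step conversion looks like, here is a minimal sketch. It is not the contents of FastqGzToFasta.py; the function names `open_reads` and `fastq_to_fasta` are made up for illustration, and it assumes strict four-line FASTQ records (sequences not wrapped across lines). It also accepts .xz input via Python's standard `lzma` module, since that was part of the question.

```python
import gzip
import lzma
from pathlib import Path


def open_reads(path):
    """Open a plain, .gz, or .xz FASTQ file in text mode (hypothetical helper)."""
    suffix = Path(path).suffix
    if suffix == ".gz":
        return gzip.open(path, "rt")
    if suffix == ".xz":
        return lzma.open(path, "rt")
    return open(path, "r")


def fastq_to_fasta(fastq_path, fasta_path):
    """Stream FASTQ records to FASTA, dropping qualities; return the read count.

    Assumes each record is exactly four lines: header, sequence, '+', qualities.
    """
    count = 0
    with open_reads(fastq_path) as fastq, open(fasta_path, "w") as fasta:
        while True:
            header = fastq.readline()
            if not header:
                break  # end of file
            if not header.strip():
                continue  # skip blank lines between or after records
            sequence = fastq.readline()
            fastq.readline()  # '+' separator line, discarded
            fastq.readline()  # quality line, discarded
            if not header.startswith("@"):
                raise ValueError(f"Expected '@' at start of record, got: {header!r}")
            fasta.write(">" + header[1:].rstrip("\n") + "\n")
            fasta.write(sequence.rstrip("\n") + "\n")
            count += 1
    return count
```

Calling `fastq_to_fasta("reads.fastq.gz", "reads.fasta")` (file names are placeholders) decompresses and converts in one pass without ever writing the uncompressed fastq to disk, so it can run on a small machine before the assembly is started on the large one.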

Adoni5 commented 8 months ago

Interesting, I do see your logic there! I'm assuming that if I can input FASTA, Shasta doesn't factor the FASTQ qualities into the assembly?

In which case I will definitely do that to save space. Thanks!

paoloshasta commented 8 months ago

Even if you use a fastq file as input, Shasta does not use base qualities in the assembly. So the presence of the base qualities makes no difference in the assembly results.

Adoni5 commented 8 months ago

Brilliant, thanks very much.