jkbonfield / cram31_bench

Scripts for the CRAM 3.1 paper benchmarking

question re FASTQ --> CRAM (samtools import: truncated file. Aborting) #1

Open · smrgit opened this issue 2 years ago

smrgit commented 2 years ago

I'm trying to follow the example in samtools_encode_fastq.sh to test converting FASTQ to CRAM, but I'm getting the error indicated in the title. I'm not sure whether this is even worth doing, based on my understanding of CRAM, but I wanted to ask.
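For reference, this is roughly the kind of invocation involved (a sketch with placeholder file names; the exact flags in samtools_encode_fastq.sh may differ):

```sh
# Placeholder paths; substitute the real paired fastq files.
# samtools import converts FASTQ (plain or gzipped) into unaligned BAM/CRAM.
samtools import \
    -1 reads_1.fastq.gz \
    -2 reads_2.fastq.gz \
    -O cram \
    -o reads.cram
```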

jkbonfield commented 2 years ago

It's not clear specifically why samtools import would give this error. It could be that sam_read1() failed (perhaps the input data isn't what it expects), something odd with the index files, or a failure in bam_aux_append() for some reason. It should probably be clearer about the cause of the error.

How quickly does it fail? Can you trim the input data down to reproduce the error with something small? If so, is it possible to make it public? (Even with identifiers and sequences edited, as I doubt those specifically are the cause.)

(Also, I'm thinking this issue really belongs with samtools rather than cram31_bench, although I'm active on samtools too, so it'd probably be me dealing with it there anyway.)

PS. It definitely is worth experimenting with FASTQ in CRAM, as it should be better than naive gzip while still sticking to a recognised standard. There are certainly better fastq-specific compressors out there, but depending on the circumstances it may be preferable to stick with standard tooling.

smrgit commented 2 years ago

I was trying this with two compressed fastq files (name suffix .fastq.gz); when I uncompressed them and used them that way, it worked. No index files, just the two fastq files. Perhaps it was an issue with the compression, although I was able to uncompress without errors, then recompress, and the recompressed files worked fine. Have you ever encountered that type of issue?
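Roughly what I did to narrow it down (a sketch; the real file names differ):

```sh
# The original gzip streams test clean...
gzip -t reads_1.fastq.gz reads_2.fastq.gz

# ...but decompressing them and recompressing to fresh files
gunzip -k reads_1.fastq.gz reads_2.fastq.gz
gzip -c reads_1.fastq > reads_1.redo.fastq.gz
gzip -c reads_2.fastq > reads_2.redo.fastq.gz

# produces copies that samtools import accepts without error.
```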

Also, the uncompressed fastq files are 27G each, the re-gzip'd versions are 5.5G each, and the resulting CRAM is 5.2G -- does that sound right?

Thanks for your help!

jkbonfield commented 2 years ago

Hmm, maybe there's something wrong with the htslib bgzf code that decompresses the fastq.gz then. I'm not sure what, though, as I haven't noticed this myself before. I take it these files aren't public? Does it work if you don't enable threading but still use the .gz files? If so, that may point to the multi-threaded decoding of gzip.
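Something along these lines would isolate it (a sketch; adjust file names and thread count to match your actual command):

```sh
# Single-threaded: the gzip stream is decompressed inline.
samtools import -1 reads_1.fastq.gz -2 reads_2.fastq.gz -o test.cram

# Multi-threaded: decompression goes through the threaded bgzf path,
# which is the code I'd suspect if only this variant fails.
samtools import -@ 8 -1 reads_1.fastq.gz -2 reads_2.fastq.gz -o test.cram
```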

As for file size, I'd normally expect CRAM to be quite a bit smaller than normal gzip, but it depends on the instrument type. What platform is the data from?

Some platforms produce data that is essentially random noise as far as naive compression goes, making it challenging to compress (e.g. 96 different quality values, or fewer values but virtually no correlation between neighbouring ones). There are also many options in CRAM for improving compression, albeit at the expense of speed. The next fastest method would be import -O cram,small, which enables bigger block sizes and bzip2 (if I recall correctly).
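For example (a sketch; the small/archive profile names come from htslib's CRAM output options and need a reasonably recent samtools/htslib):

```sh
# Default CRAM profile.
samtools import -1 reads_1.fastq.gz -2 reads_2.fastq.gz \
    -O cram -o out.cram

# "small" profile: bigger blocks plus bzip2, trading speed for size.
samtools import -1 reads_1.fastq.gz -2 reads_2.fastq.gz \
    -O cram,small -o out_small.cram

# "archive" profile: slower and smaller still, if your build supports it.
samtools import -1 reads_1.fastq.gz -2 reads_2.fastq.gz \
    -O cram,archive -o out_archive.cram
```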

smrgit commented 2 years ago

I should clarify that the 5.2G CRAM takes the place of both 5.5G fastq.gz files (11G combined), so that is quite good (I think).

jkbonfield commented 2 years ago

Oh! That's more like it. Phew. Thanks for clarifying. :-)