Open jkbonfield opened 4 years ago
@jkbonfield Thanks for the detailed analysis. It's certainly the case that htsjdk converts the crai to bai internally, and it does seem to result in lots of unnecessary I/O.
I suspect that spending time on fixing BAI performance issues might be less useful than just switching directly to a native CRAI implementation, since htsjdk already has code to generate CRAI anyway. I'll dive deeper into this as part of my refactoring branch work, which has additional indexing tests and some other perf fixes as well.
Thanks for the response.
I thought it was interesting though to see differences in htslib's BAI usage and htsjdk's BAI usage. I'm sure BAM users would like faster random access too. I can't quite figure out what's going on there.
BAI has an R-tree style index, so there are multiple bins that may potentially hold data covering any specific region. Given the file is sorted, a very long read may overlap a query region yet be a considerable distance in the file from all the short-read data; hence the R-tree. I don't fully understand how the queries work, but evidently htslib has a way of culling the bins that don't contain content for any given query, presumably via the linear index component.
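For reference, the binning scheme is given as C pseudocode in the SAMv1 spec and ports almost verbatim to Java. This is a transcription of the spec's reg2bin/reg2bins functions, not htsjdk's code; it shows which bins a query region has to consider:

```java
import java.util.ArrayList;
import java.util.List;

public class BaiBins {
    // Port of reg2bin from the SAMv1 spec: the smallest bin fully
    // containing the 0-based half-open interval [beg, end).
    static int reg2bin(int beg, int end) {
        end--;
        if (beg >> 14 == end >> 14) return ((1 << 15) - 1) / 7 + (beg >> 14);
        if (beg >> 17 == end >> 17) return ((1 << 12) - 1) / 7 + (beg >> 17);
        if (beg >> 20 == end >> 20) return ((1 << 9)  - 1) / 7 + (beg >> 20);
        if (beg >> 23 == end >> 23) return ((1 << 6)  - 1) / 7 + (beg >> 23);
        if (beg >> 26 == end >> 26) return ((1 << 3)  - 1) / 7 + (beg >> 26);
        return 0;
    }

    // Port of reg2bins: every bin, at every level of the tree, that may
    // contain alignments overlapping [beg, end).
    static List<Integer> reg2bins(int beg, int end) {
        List<Integer> bins = new ArrayList<>();
        end--;
        bins.add(0);
        for (int k = 1    + (beg >> 26); k <= 1    + (end >> 26); k++) bins.add(k);
        for (int k = 9    + (beg >> 23); k <= 9    + (end >> 23); k++) bins.add(k);
        for (int k = 73   + (beg >> 20); k <= 73   + (end >> 20); k++) bins.add(k);
        for (int k = 585  + (beg >> 17); k <= 585  + (end >> 17); k++) bins.add(k);
        for (int k = 4681 + (beg >> 14); k <= 4681 + (end >> 14); k++) bins.add(k);
        return bins;
    }
}
```

For a narrow 10kb query like the ones below, this returns one bin per level (six in total), most of which turn out to be empty for any given region.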
CRAI in comparison is unbelievably naive, but it can be implemented efficiently with care. Every line in the CRAI (it's gzipped text) is just the same meta-data that appears in the slice headers: chromosome, start, end, slice offset & size, file offset, etc. The index is sorted by the chr/start coord. If we have a wide variety of read lengths, we'll possibly get many short slices (by range, not bytes) following a long one; those can be put inside the long range (the "nested" bit), potentially recursively, so that in memory we just have a large array that can be binary searched, with another binary search within an entry if it's a container entry. Maybe not the best explanation, but it seems to work. It does lack the linear index component of BAI though, so I wonder if it'll be inefficient on some data access patterns.
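To make that concrete, here's a rough Java sketch of the nested binary-search idea. All class and field names here are illustrative, not htsjdk's; it assumes the top-level array has been made non-overlapping by folding short slices into the preceding long entry's nested list:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory CRAI entry (field names are illustrative).
class CraiEntry {
    final int refId, start, span;        // chromosome, alignment start, span
    final long containerOffset;          // absolute file offset of the container
    final List<CraiEntry> nested = new ArrayList<>(); // short slices folded
                                                      // inside this long range
    CraiEntry(int refId, int start, int span, long containerOffset) {
        this.refId = refId; this.start = start;
        this.span = span; this.containerOffset = containerOffset;
    }
    int end() { return start + span; }
}

class CraiQuery {
    // `entries` is sorted by (refId, start) with overlapping short ranges
    // pushed down into `nested`, so the top level can be binary searched;
    // recurse the same search into any nested lists.
    static List<Long> query(List<CraiEntry> entries,
                            int refId, int qStart, int qEnd) {
        List<Long> hits = new ArrayList<>();
        int lo = 0, hi = entries.size();
        while (lo < hi) {                // first entry ending after qStart
            int mid = (lo + hi) >>> 1;
            CraiEntry e = entries.get(mid);
            if (e.refId < refId || (e.refId == refId && e.end() <= qStart))
                lo = mid + 1;
            else
                hi = mid;
        }
        for (int i = lo; i < entries.size(); i++) {
            CraiEntry e = entries.get(i);
            if (e.refId != refId || e.start >= qEnd)
                break;                   // past the query region; stop
            if (e.end() > qStart)
                hits.add(e.containerOffset);
            hits.addAll(query(e.nested, refId, qStart, qEnd));
        }
        return hits;
    }
}
```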
This is interesting. I will try to look into the bad behavior in bam files that you've mentioned here. It's not obvious to me why it would be reading additional chunks but it's clearly doing something suboptimal.
This is possibly related to https://github.com/samtools/htsjdk/issues/1162, but I see issues with BAM too so this is not entirely a CRAM problem. I think it's a generic index related thing, which is exacerbated in CRAM due to larger block sizes.
This is a long post, so I'll summarise here. NA12878 plat genomes; 10 small regions; picard vs samtools.
BAM: picard reads around 60% more data than samtools. CRAM: picard reads ~14x more data than samtools! (and is slooow).
My htsjdk is via picard 2.20.0. Java is 1.8.0_201. OS is Ubuntu 12.04 LTS (Precise - yes it's ancient).
My access pattern is 10 regions for chr1 to chr10 at pos 10Mb to 10.01Mb. Ie:
I'm running picard ViewSam to extract regions using this interval list, along with strace to dump all the file I/O, thus:
Similarly for CRAM:
We can plot these reads of Nth read vs Pth position in file using gnuplot and some hideous sed/awk hackery:
Or alternatively if you just want the summary stats, use https://github.com/jkbonfield/io_trace with e.g.
io_trace -p ../.. -- /software/bin/java -jar /nfs/sam_scratch/rmd/picard.jar ViewSam I=../../NA12878/NA12878_S1_1.3.0.cram ALIGNMENT_STATUS=All PF_STATUS=All INTERVAL_LIST=_j R=~/scratch/data/indices/hgc19.fa
So we see 75x as many read calls and 8x as many bytes read using CRAM as BAM. That's a huge performance hit. Running unix time on my viewsam (without the slow trace) gives me 2.2s for BAM and 19.4s for CRAM. So that's around 9x slower on CRAM than BAM.
On the surface, this appears to be more evidence for #1162, but it's not that simple. Let's look at samtools:
Several things spring to mind.
The number of reads returned in all cases (52996) is identical, so clearly the programs work OK. As expected, samtools BAM/CRAM to SAM conversion times were considerably faster, and did not show the same ratio of BAM vs CRAM degradation (around 0.2s and 0.5s for BAM and CRAM respectively).
Looking at my _pos.samtools.bam vs _pos.picard.bam I see samtools has a whole load of 32Kb reads (buffered I/O) and htsjdk alternates 18 byte reads and 16-20Kb reads (presumably a bgzf block). Neither are hideously inefficient. However htsjdk did a lot of interim reads at positions that samtools didn't bother to check. Are these the higher up bins being probed in the BAI index to check whether there is data matching there? I don't have a good familiarity with the index format, but I think it's possible to know they're empty without an attempted seek/read to check (linear index maybe? not sure).
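For what it's worth, my understanding of the linear-index trick (hedged; I'm not deeply familiar with this part of BAI) is that each 16kb window stores the minimum virtual file offset of reads starting in it, so candidate chunks that end before that offset can be discarded without any seek/read. A sketch with illustrative names, not htsjdk's code:

```java
import java.util.ArrayList;
import java.util.List;

public class LinearIndexPrune {
    static class Chunk {
        final long begVfo, endVfo;       // virtual file offsets of the chunk
        Chunk(long b, long e) { begVfo = b; endVfo = e; }
    }

    // linear[i] holds the smallest virtual offset of any read starting in
    // 16kb window i.  A chunk from a candidate bin that ends at or before
    // linear[qBeg >> 14] cannot contain reads overlapping the query, so it
    // can be dropped with no I/O at all.
    static List<Chunk> prune(List<Chunk> candidates, long[] linear, int qBeg) {
        long minVfo = linear[qBeg >> 14];   // 16kb window = 1 << 14 bases
        List<Chunk> keep = new ArrayList<>();
        for (Chunk c : candidates)
            if (c.endVfo > minVfo)
                keep.add(c);
        return keep;
    }
}
```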
We can plot this visually: X axis is Nth read call and Y axis is location in the file. They're actually diagonal lines, but the file is so vast they look horizontal. The length of the line isn't the number of bytes read, but the number of read calls.
Samtools:
Picard:
Much the same regions for the bulk of the I/O, but we can clearly see picard's fetching of other chunks of file.
Now looking at the same chart for CRAM queries...
Samtools:
Picard:
Again we see the same tail of extra locations being scanned, but also the horizontal section is visibly no longer horizontal in picard, meaning it's reading much, much more data. The X axis units are clearly vastly different too. Looking at the log files we see samtools is mostly multiples of 32Kb reads (due to the buffered hfile layer) while picard is huge swathes of single-byte reads. Eg:
This is tragically slow on some file systems. Can it please be replaced with a buffered I/O stream instead to prevent this behaviour?
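For illustration, the standard-library idiom is simply to wrap the underlying stream so per-byte read() calls are served from an in-memory buffer and the OS only sees large block reads. Nothing here is htsjdk code, just a minimal demonstration:

```java
import java.io.*;

public class BufferedReadDemo {
    // Read a file one byte at a time through a 64 KiB buffer and count the
    // bytes.  Each in.read() returns from memory; the filesystem only sees
    // 64 KiB block reads instead of one syscall per byte.
    static long countBytes(File f) throws IOException {
        long n = 0;
        try (InputStream in =
                 new BufferedInputStream(new FileInputStream(f), 1 << 16)) {
            while (in.read() != -1)
                n++;
        }
        return n;
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("demo", ".bin");
        f.deleteOnExit();
        try (OutputStream out = new FileOutputStream(f)) {
            out.write(new byte[1 << 20]);   // 1 MiB of zeros
        }
        System.out.println(countBytes(f));  // 1048576
    }
}
```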
The second problem is the extra data being read. Picard CRAM read around 14x as many bytes as Samtools CRAM. I think this is the union of more reads that we don't need (see BAM analysis) coupled with the CRAM block size being much larger, which exacerbates the issue further. (With Samtools, CRAM is more compressed, meaning fewer bytes, but the blocks are larger so the wasted bytes for a small range were higher - flukily more or less cancelling out.)
I'm curious why we even see the same access pattern in CRAM as BAM though. My cram has a .crai index. Is htsjdk converting this in memory to a .bai index instead, and then using the same bam indexing code (hence demonstrating the same dummy reads)?
In a nutshell, as it stands htsjdk is not performant for CRAM random access, but I also see substantial wins for BAM too.
Edit: I should also add that the CRAM containers and slices have meta-data in them detailing the reference number and start/end, so the rest of the slice does not need to be decoded. This permits fast skipping (seeking) if we know the region being queried won't be in this slice. So even with the additional index queries, it could be made more performant.
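A sketch of that skip test (field names are hypothetical; they just illustrate the meta-data the format carries):

```java
// Hypothetical slice meta-data: CRAM container and slice headers record the
// reference id, alignment start and span, plus the compressed size, so a
// non-overlapping slice can be skipped with a seek of byteSize bytes rather
// than being decompressed and decoded.
public class SliceHeader {
    final int refId, start, span;
    final long byteSize;                 // compressed size, i.e. how far to seek
    SliceHeader(int refId, int start, int span, long byteSize) {
        this.refId = refId; this.start = start;
        this.span = span; this.byteSize = byteSize;
    }
    // half-open overlap test against query [qStart, qEnd) on reference qRef
    boolean overlaps(int qRef, int qStart, int qEnd) {
        return refId == qRef && start < qEnd && start + span > qStart;
    }
}
```

So even when the index hands back a superset of containers, each one only costs a header read plus a seek if this test fails.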