Open jkbonfield opened 4 years ago
@jkbonfield Thanks for the detailed analysis. It's certainly the case that htsjdk converts the crai to bai internally, and it does seem to result in lots of unnecessary I/O.
I suspect that spending time on fixing BAI performance issues might be less useful than just switching directly to a native CRAI implementation, since htsjdk already has code to generate CRAI anyway. I'll dive deeper into this as part of my refactoring branch work, which has additional indexing tests and some other perf fixes as well.
Thanks for the response.
I thought it was interesting though to see differences in htslib's BAI usage and htsjdk's BAI usage. I'm sure BAM users would like faster random access too. I can't quite figure out what's going on there.
BAI has an R-tree style index, so there are multiple bins that may potentially hold data covering any specific region. Given the file is sorted, a very long read may overlap a query region yet be a considerable distance in the file from all the short-read data; hence the R-tree. I don't fully understand how the queries work, but evidently htslib has a way of culling the bins that don't contain content for any given query, presumably via the linear index component.
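For reference, the binning scheme is given as C pseudocode in the SAMv1 spec and ports almost verbatim to Java. This is a transcription of the spec's reg2bin/reg2bins functions, not htsjdk's code; it shows which bins a query region has to consider:

```java
import java.util.ArrayList;
import java.util.List;

public class BaiBins {
    // Port of reg2bin from the SAMv1 spec: the smallest bin fully
    // containing the 0-based half-open interval [beg, end).
    static int reg2bin(int beg, int end) {
        end--;
        if (beg >> 14 == end >> 14) return ((1 << 15) - 1) / 7 + (beg >> 14);
        if (beg >> 17 == end >> 17) return ((1 << 12) - 1) / 7 + (beg >> 17);
        if (beg >> 20 == end >> 20) return ((1 << 9)  - 1) / 7 + (beg >> 20);
        if (beg >> 23 == end >> 23) return ((1 << 6)  - 1) / 7 + (beg >> 23);
        if (beg >> 26 == end >> 26) return ((1 << 3)  - 1) / 7 + (beg >> 26);
        return 0;
    }

    // Port of reg2bins: every bin, at every level of the tree, that may
    // contain alignments overlapping [beg, end).
    static List<Integer> reg2bins(int beg, int end) {
        List<Integer> bins = new ArrayList<>();
        end--;
        bins.add(0);
        for (int k = 1    + (beg >> 26); k <= 1    + (end >> 26); k++) bins.add(k);
        for (int k = 9    + (beg >> 23); k <= 9    + (end >> 23); k++) bins.add(k);
        for (int k = 73   + (beg >> 20); k <= 73   + (end >> 20); k++) bins.add(k);
        for (int k = 585  + (beg >> 17); k <= 585  + (end >> 17); k++) bins.add(k);
        for (int k = 4681 + (beg >> 14); k <= 4681 + (end >> 14); k++) bins.add(k);
        return bins;
    }
}
```

For a narrow 10kb query like the ones below, this returns one bin per level (six in total), most of which turn out to be empty for any given region.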
CRAI in comparison is unbelievably naive, but it can be implemented efficiently with care. Every line in the CRAI (it's gzipped text) is just the same meta-data that appears in the slice headers: chromosome, start, end, slice offset & size, file offset, etc. The index is sorted by the chr/start coord. If we have a wide variety of read lengths, we'll possibly get many short slices (by range, not bytes) following a long one; those can be put inside the long range (the "nested" bit), potentially recursively, so that in memory we just have a large array that can be binary searched, with another binary search within an entry if it's a container entry. Maybe not the best explanation, but it seems to work. It does lack the linear index component of BAI though, so I wonder if it'll be inefficient on some data access patterns.
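To make that concrete, here's a rough Java sketch of the nested binary-search idea. All class and field names here are illustrative, not htsjdk's; it assumes the top-level array has been made non-overlapping by folding short slices into the preceding long entry's nested list:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory CRAI entry (field names are illustrative).
class CraiEntry {
    final int refId, start, span;        // chromosome, alignment start, span
    final long containerOffset;          // absolute file offset of the container
    final List<CraiEntry> nested = new ArrayList<>(); // short slices folded
                                                      // inside this long range
    CraiEntry(int refId, int start, int span, long containerOffset) {
        this.refId = refId; this.start = start;
        this.span = span; this.containerOffset = containerOffset;
    }
    int end() { return start + span; }
}

class CraiQuery {
    // `entries` is sorted by (refId, start) with overlapping short ranges
    // pushed down into `nested`, so the top level can be binary searched;
    // recurse the same search into any nested lists.
    static List<Long> query(List<CraiEntry> entries,
                            int refId, int qStart, int qEnd) {
        List<Long> hits = new ArrayList<>();
        int lo = 0, hi = entries.size();
        while (lo < hi) {                // first entry ending after qStart
            int mid = (lo + hi) >>> 1;
            CraiEntry e = entries.get(mid);
            if (e.refId < refId || (e.refId == refId && e.end() <= qStart))
                lo = mid + 1;
            else
                hi = mid;
        }
        for (int i = lo; i < entries.size(); i++) {
            CraiEntry e = entries.get(i);
            if (e.refId != refId || e.start >= qEnd)
                break;                   // past the query region; stop
            if (e.end() > qStart)
                hits.add(e.containerOffset);
            hits.addAll(query(e.nested, refId, qStart, qEnd));
        }
        return hits;
    }
}
```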
This is interesting. I will try to look into the bad behavior in bam files that you've mentioned here. It's not obvious to me why it would be reading additional chunks but it's clearly doing something suboptimal.
This is possibly related to https://github.com/samtools/htsjdk/issues/1162, but I see issues with BAM too so this is not entirely a CRAM problem. I think it's a generic index related thing, which is exacerbated in CRAM due to larger block sizes.
This is a long post, so I'll summarise here. NA12878 plat genomes; 10 small regions; picard vs samtools.
BAM: picard reads around 60% more data than samtools. CRAM: picard reads ~14x more data than samtools! (and is slooow).
My htsjdk is via picard 2.20.0. Java is 1.8.0_201. OS is Ubuntu 12.04 LTS (Precise - yes it's ancient).
My access pattern is 10 regions for chr1 to chr10 at pos 10Mb to 10.01Mb. Ie:
I'm running picard ViewSam to extract regions using this interval list, along with strace to dump all the file I/O, thus:
Similarly for CRAM:
We can plot these reads of Nth read vs Pth position in file using gnuplot and some hideous sed/awk hackery:
Or alternatively if you just want the summary stats, use https://github.com/jkbonfield/io_trace with e.g.
io_trace -p ../.. -- /software/bin/java -jar /nfs/sam_scratch/rmd/picard.jar ViewSam I=../../NA12878/NA12878_S1_1.3.0.cram ALIGNMENT_STATUS=All PF_STATUS=All INTERVAL_LIST=_j R=~/scratch/data/indices/hgc19.fa
So we see 75x as many read calls and 8x as many bytes read using CRAM as BAM. That's a huge performance hit. Running unix time on my viewsam (without the slow trace) gives me 2.2s for BAM and 19.4s for CRAM. So that's around 9x slower on CRAM than BAM.
On the surface, this appears to be more evidence for #1162, but it's not that simple. Let's look at samtools:
Several things spring to mind.
The number of reads returned in all cases (52996) is identical, so clearly the programs work OK. As expected, samtools BAM/CRAM to SAM conversion times were considerably faster, and did not show the same ratio of BAM vs CRAM degradation (around 0.2s and 0.5s for BAM and CRAM respectively).
Looking at my _pos.samtools.bam vs _pos.picard.bam I see samtools has a whole load of 32Kb reads (buffered I/O) and htsjdk alternates 18 byte reads and 16-20Kb reads (presumably a bgzf block). Neither are hideously inefficient. However htsjdk did a lot of interim reads at positions that samtools didn't bother to check. Are these the higher up bins being probed in the BAI index to check whether there is data matching there? I don't have a good familiarity with the index format, but I think it's possible to know they're empty without an attempted seek/read to check (linear index maybe? not sure).
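For what it's worth, my understanding of the linear-index trick (hedged; I'm not deeply familiar with this part of BAI) is that each 16kb window stores the minimum virtual file offset of reads starting in it, so candidate chunks that end before that offset can be discarded without any seek/read. A sketch with illustrative names, not htsjdk's code:

```java
import java.util.ArrayList;
import java.util.List;

public class LinearIndexPrune {
    static class Chunk {
        final long begVfo, endVfo;       // virtual file offsets of the chunk
        Chunk(long b, long e) { begVfo = b; endVfo = e; }
    }

    // linear[i] holds the smallest virtual offset of any read starting in
    // 16kb window i.  A chunk from a candidate bin that ends at or before
    // linear[qBeg >> 14] cannot contain reads overlapping the query, so it
    // can be dropped with no I/O at all.
    static List<Chunk> prune(List<Chunk> candidates, long[] linear, int qBeg) {
        long minVfo = linear[qBeg >> 14];   // 16kb window = 1 << 14 bases
        List<Chunk> keep = new ArrayList<>();
        for (Chunk c : candidates)
            if (c.endVfo > minVfo)
                keep.add(c);
        return keep;
    }
}
```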
We can plot this visually: X axis is Nth read call and Y axis is location in the file. They're actually diagonal lines, but the file is so vast they look horizontal. The length of the line isn't the number of bytes read, but the number of read calls.
Samtools:
Picard:
Much the same regions for the bulk of the I/O, but we can clearly see picard's fetching of other chunks of file.
Now looking at the same chart for CRAM queries...
Samtools:
Picard:
Again we see the same tail of extra locations being scanned, but also the horizontal section is visibly no longer horizontal in picard, meaning it's reading much, much more data. The X axis units are clearly vastly different too. Looking at the log files we see samtools is mostly multiples of 32Kb reads (due to the buffered hfile layer) while picard is huge swathes of single-byte reads. Eg:
This is tragically slow on some file systems. Can it please be replaced with a buffered I/O stream instead to prevent this behaviour?
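For illustration, the standard-library idiom is simply to wrap the underlying stream so per-byte read() calls are served from an in-memory buffer and the OS only sees large block reads. Nothing here is htsjdk code, just a minimal demonstration:

```java
import java.io.*;

public class BufferedReadDemo {
    // Read a file one byte at a time through a 64 KiB buffer and count the
    // bytes.  Each in.read() returns from memory; the filesystem only sees
    // 64 KiB block reads instead of one syscall per byte.
    static long countBytes(File f) throws IOException {
        long n = 0;
        try (InputStream in =
                 new BufferedInputStream(new FileInputStream(f), 1 << 16)) {
            while (in.read() != -1)
                n++;
        }
        return n;
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("demo", ".bin");
        f.deleteOnExit();
        try (OutputStream out = new FileOutputStream(f)) {
            out.write(new byte[1 << 20]);   // 1 MiB of zeros
        }
        System.out.println(countBytes(f));  // 1048576
    }
}
```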
The second problem is the extra data being read. Picard CRAM read around 14x as many bytes as Samtools CRAM. I think this is the union of more reads that we don't need (see BAM analysis) coupled with the CRAM block size being much larger, which exacerbates the issue further. (With Samtools, CRAM is more compressed, meaning fewer bytes, but the blocks are larger so the wasted bytes for a small range were higher - flukily more or less cancelling out.)
I'm curious why we even see the same access pattern in CRAM as BAM though. My cram has a .crai index. Is htsjdk converting this in memory to a .bai index instead, and then using the same bam indexing code (hence demonstrating the same dummy reads)?
In a nutshell, as it stands htsjdk is not performant for CRAM random access, but I also see substantial wins for BAM too.
Edit: I should also add that the CRAM containers and slices have meta-data in them detailing the reference number and start/end, so the rest of the slice does not need to be decoded. This permits fast skipping (seeking) if we know the region being queried won't be in this slice. So even with the additional index queries, it could be made more performant.
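A sketch of that skip test (field names are hypothetical; they just illustrate the meta-data the format carries):

```java
// Hypothetical slice meta-data: CRAM container and slice headers record the
// reference id, alignment start and span, plus the compressed size, so a
// non-overlapping slice can be skipped with a seek of byteSize bytes rather
// than being decompressed and decoded.
public class SliceHeader {
    final int refId, start, span;
    final long byteSize;                 // compressed size, i.e. how far to seek
    SliceHeader(int refId, int start, int span, long byteSize) {
        this.refId = refId; this.start = start;
        this.span = span; this.byteSize = byteSize;
    }
    // half-open overlap test against query [qStart, qEnd) on reference qRef
    boolean overlaps(int qRef, int qStart, int qEnd) {
        return refId == qRef && start < qEnd && start + span > qStart;
    }
}
```

So even when the index hands back a superset of containers, each one only costs a header read plus a seek if this test fails.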