Open mdehoon opened 1 year ago
Hi @mdehoon , can you tell us why you think that? A leaf has a size, an index doesn't have a size.
Also, the bigBed format was published almost two decades ago, we can certainly not play with the byte counts anymore.
I'm curious: why are you looking into this at this detail? Did you find a particular bug or test case that the format failed for you?
@maximilianh Thank you for your quick reply.
There is a bigBed parser implementation in Biopython, and I was looking at the source code of bedToBigBed
to confirm it works correctly.
I understand that the bigBed format cannot be changed at this point. I am just trying to understand if this really is a bug. If it is, it may leave some parts of the output bigBed file undefined.
The first loop is executed countOne
times; each iteration writes 32 bytes.
The second loop fills the remaining itemsPerSlot - countOne
slots with zeros. Then I would expect this to also write out 32 bytes in each iteration, so in total itemsPerSlot
* 32 bytes are written.
For comparison, see writeIndexLevel
in bPlusTree.c
. In the first loop:
for (j=i; j<endIx; j += slotSizePer)
{
void *item = items + j*itemSize;
memset(keyBuf, 0, keySize);
(*fetchKey)(item, keyBuf);
mustWrite(f, keyBuf, keySize);
writeOne(f, nextChild);
nextChild += bytesInNextLevelBlock;
++slotsUsed;
}
we write keySize + sizeof(nextChild)
= keySize + sizeof(bits64)
bytes in each iteration; in the second loop
int slotSize = keySize + sizeof(bits64);
for (j=countOne; j<blockSize; ++j)
repeatCharOut(f, 0, slotSize);
}
we fill each of the remaining blockSize - countOne
slots with keySize + sizeof(bits64)
bytes, i.e. the same number of bytes in each iteration as in the first loop.
I think you're right, I just struggle why this has never lead to problems with billions of requests over this time on our website... I'll run it by more senior engineers here and get back to you.
Thank you.
I just struggle why this has never lead to problems with billions of requests over this time on our website...
Maybe it's because the bigBed file is interpreted in the same way regardless of whether indexSlotSize
or leafSlotSize
is used in rWriteLeaves
, at least by bigBedToBed
, so it may just affect performance.
Using original bedToBigBed:
$ bedToBigBed -as=bed12.as ucsc.bed hg38.chrom.sizes ucsc.old.bb
pass1 - making usageList (47 chroms): 94 millis
pass2 - checking and writing primary data (251295 records, 12 fields): 2068 millis
After replacing indexSlotSize
by leafSlotSize
in rWriteLeaves
:
$ bedToBigBed -as=bed12.as ucsc.bed hg38.chrom.sizes ucsc.new.bb
pass1 - making usageList (47 chroms): 76 millis
pass2 - checking and writing primary data (251295 records, 12 fields): 2076 millis
This increases the size of the bigBed file a little:
$ ls -l ucsc.old.bb
-rw-r--r-- 1 mdehoon staff 11892583 Apr 7 08:32 ucsc.old.bb
$ ls -l ucsc.new.bb
-rw-r--r-- 1 mdehoon staff 11909687 Apr 7 08:32 ucsc.new.bb
Running bigBedToBed
(without modifications) on these bigBed files results in the exact same bed file:
$ bigBedToBed ucsc.old.bb ucsc.old.bed
$ bigBedToBed ucsc.new.bb ucsc.new.bed
$ ls -l ucsc.old.bed ucsc.new.bed
-rw-r--r-- 1 mdehoon staff 33084176 Apr 7 08:33 ucsc.new.bed
-rw-r--r-- 1 mdehoon staff 33084176 Apr 7 08:33 ucsc.old.bed
$ md5 ucsc.old.bed ucsc.new.bed
MD5 (ucsc.old.bed) = 982fac74c700bc69ce85735a93bbab9c
MD5 (ucsc.new.bed) = 982fac74c700bc69ce85735a93bbab9c
$ diff ucsc.old.bed ucsc.new.bed
$
It looks like this will be backwards compatible, and I still do want to ask others here, but to me, it looks very likely that we'll make this change. This is as good as an error report can be - thank you!
In
kent/src/lib/cirTree.c
, in the functionrWriteLeaves
, in this section:I believe
indexSlotSize
(which is #defined as 24) should beleafSlotSize
(which is #defined as 32) to be consistent with the first loop, where 4 x 4 +2 x 8 bytes are written. See the definitions ofindexSlotSize
andleafSlotSize
in the same file: