Incorrect number of bytes for empty slots in rWriteLeaves

mdehoon commented 1 year ago

In kent/src/lib/cirTree.c, in the function rWriteLeaves, in this section:

    /* Write out elements of this node. */
    struct rTree *el;
    for (el = tree->children; el != NULL; el = el->next)
        {
        writeOne(f, el->startChromIx);
        writeOne(f, el->startBase);
        writeOne(f, el->endChromIx);
        writeOne(f, el->endBase);
        writeOne(f, el->startFileOffset);
        bits64 size = el->endFileOffset - el->startFileOffset;
        writeOne(f, size);
        }

    /* Write out zeroes for empty slots in node. */
    int i;
    for (i=countOne; i<itemsPerSlot; ++i)
        repeatCharOut(f, 0, indexSlotSize);
    }

I believe indexSlotSize (which is #defined as 24) should be leafSlotSize (which is #defined as 32) to be consistent with the first loop, where 4 x 4 +2 x 8 bytes are written. See the definitions of indexSlotSize and leafSlotSize in the same file:

#define indexSlotSize (24)      /* Size of startChrom,startBase,endChrom,endBase,offset */
#define leafSlotSize (32)       /* Size of startChrom,startBase,endChrom,endBase,offset,size */

maximilianh commented 1 year ago

Hi @mdehoon , can you tell us why you think that? A leaf has a size, an index doesn't have a size.

Also, the bigBed format was published almost two decades ago, we can certainly not play with the byte counts anymore.

I'm curious: why are you looking into this at this detail? Did you find a particular bug or test case that the format failed for you?

mdehoon commented 1 year ago

@maximilianh Thank you for your quick reply.

There is a bigBed parser implementation in Biopython, and I was looking at the source code of bedToBigBed to confirm it works correctly.

I understand that the bigBed format cannot be changed at this point. I am just trying to understand if this really is a bug. If it is, it may leave some parts of the output bigBed file undefined.

The first loop is executed countOne times; each iteration writes 32 bytes. The second loop fills the remaining itemsPerSlot - countOne slots with zeros. Then I would expect this to also write out 32 bytes in each iteration, so in total itemsPerSlot * 32 bytes are written.

For comparison, see writeIndexLevel in bPlusTree.c. In the first loop:

    for (j=i; j<endIx; j += slotSizePer)
        {
        void *item = items + j*itemSize;
        memset(keyBuf, 0, keySize);
        (*fetchKey)(item, keyBuf);
        mustWrite(f, keyBuf, keySize);
        writeOne(f, nextChild);
        nextChild += bytesInNextLevelBlock;
        ++slotsUsed;
        }

we write keySize + sizeof(nextChild) = keySize + sizeof(bits64) bytes in each iteration; in the second loop

   int slotSize = keySize + sizeof(bits64);
    for (j=countOne; j<blockSize; ++j)
        repeatCharOut(f, 0, slotSize);
    }

we fill each of the remaining blockSize - countOne slots with keySize + sizeof(bits64) bytes, i.e. the same number of bytes in each iteration as in the first loop.

maximilianh commented 1 year ago

I think you're right, I just struggle why this has never lead to problems with billions of requests over this time on our website... I'll run it by more senior engineers here and get back to you.

mdehoon commented 1 year ago

Thank you.

I just struggle why this has never lead to problems with billions of requests over this time on our website...

Maybe it's because the bigBed file is interpreted in the same way regardless of whether indexSlotSize or leafSlotSize is used in rWriteLeaves, at least by bigBedToBed, so it may just affect performance.

Using original bedToBigBed:

$ bedToBigBed -as=bed12.as ucsc.bed hg38.chrom.sizes ucsc.old.bb
pass1 - making usageList (47 chroms): 94 millis
pass2 - checking and writing primary data (251295 records, 12 fields): 2068 millis

After replacing indexSlotSize by leafSlotSize in rWriteLeaves:

$ bedToBigBed -as=bed12.as ucsc.bed hg38.chrom.sizes ucsc.new.bb
pass1 - making usageList (47 chroms): 76 millis
pass2 - checking and writing primary data (251295 records, 12 fields): 2076 millis

This increases the size of the bigBed file a little:

$ ls -l ucsc.old.bb 
-rw-r--r--  1 mdehoon  staff  11892583 Apr  7 08:32 ucsc.old.bb
$ ls -l ucsc.new.bb
-rw-r--r--  1 mdehoon  staff  11909687 Apr  7 08:32 ucsc.new.bb

Running bigBedToBed (without modifications) on these bigBed files results in the exact same bed file:

$ bigBedToBed ucsc.old.bb ucsc.old.bed

$ bigBedToBed ucsc.new.bb ucsc.new.bed

$ ls -l ucsc.old.bed ucsc.new.bed
-rw-r--r--  1 mdehoon  staff  33084176 Apr  7 08:33 ucsc.new.bed
-rw-r--r--  1 mdehoon  staff  33084176 Apr  7 08:33 ucsc.old.bed

$ md5 ucsc.old.bed ucsc.new.bed
MD5 (ucsc.old.bed) = 982fac74c700bc69ce85735a93bbab9c
MD5 (ucsc.new.bed) = 982fac74c700bc69ce85735a93bbab9c

$ diff ucsc.old.bed ucsc.new.bed 

$

maximilianh commented 1 year ago

It looks like this will be backwards compatible, and I still do want to ask others here, but to me, it looks very likely that we'll make this change. This is as good as an error report can be - thank you!

ucscGenomeBrowser / kent

Incorrect number of bytes for empty slots in rWriteLeaves #76