igvteam / igv

Integrative Genomics Viewer. Fast, efficient, scalable visualization tool for genomics data and annotations
https://igv.org
MIT License
634 stars 379 forks source link

Support bigRmsk #1236

Open jrobinso opened 1 year ago

jrobinso commented 1 year ago

"bigRmsk" files will currently not work in IGV due to negative block starts. See issue #1130. The format is valid BED9, but we need the extra fields. Fix is to properly interpret the negative block starts.

Autosql: http://genome.ucsc.edu/goldenPath/help/examples/bigRmskBed.as Also see: http://genome.ucsc.edu/goldenPath/help/examples/bigRmskAlignBed.as

maximilianh commented 1 year ago

bigRmsk is a perfectly valid BED9. Does IGV not read the BED Type from the bigBed file? The bigBed header stores the number of defined fields (9 here) and so IGV should not even try to parse fields 10-12…

jrobinso commented 1 year ago

@maximilianh Perhaps I didn't phrase this correctly, will correct it, yes its perfectly valid but not really useful for display without properly treating the extra fields.

In general we read the number of defined fields and don't try to parse extra fields, but for certain special fields we do. This came up previously, and you had indicated (I think) that the defined names (e.g. blockstarts in this case) are never re-defined and if found in an extra field should be safe to use. So when we see "blockstarts" and "blocksizes" in extra fields we use them. This came up, I think, in "bigCat" format but perhaps for different fields.

In the bigRmsk case fields 10-12 are neccessary to do a reasonable display. Now that I know that blockstarts can be negative to table name bigRmsk I will just make a special case.

So yes I could strictly just look at defined fields but for bigGenePred, bigCat, and bigRmsk this would yield suboptimal displays, e.g. without the blocks. The extra fields are not just comments.

jrobinso commented 1 year ago

Another one we look for is exonFrames. So the complete set of "extra" fields currently parsed are as follows

blockCount 
blockSizes
blockStarts
exonFrames

If we see these attribute names we assume they are as defined for bed format, or in the case of exonFrames for genePredExt format. This has worked so far except for this special case, I need to dig more deeply into this to determine what IGV will do with the negative blockStarts, thus I opened this ticket for myself, there's no implication that anything is incorrect in the files. Just unexpected.

maximilianh commented 1 year ago

Maybe I didn't explain this well: the header of all bigBed files contains the number of fields that are "parsable".

It's here in the C code: https://github.com/ucscGenomeBrowser/kent/blob/34d24ed91f2d7a04467840c6763532345e4c0529/src/lib/bbiRead.c#L116

/ Read rest of defined bits of header, byte swapping as needed. / bbi->version = udcReadBits16(udc, isSwapped); bbi->zoomLevels = udcReadBits16(udc, isSwapped); bbi->chromTreeOffset = udcReadBits64(udc, isSwapped); bbi->unzoomedDataOffset = udcReadBits64(udc, isSwapped); bbi->unzoomedIndexOffset = udcReadBits64(udc, isSwapped); bbi->fieldCount = udcReadBits16(udc, isSwapped);

bbi->fieldCode should be "9" here. And IGV should not parse more than 9 fields as a consequence. This is independent of any field names in the autoSql definition.

I imagine the javascript parser also reads this field?

On Wed, Oct 12, 2022 at 6:38 AM Jim Robinson @.***> wrote:

Another one we look for is exonFrames. So the complete set of "extra" fields current parsed are as follows

blockCount blockSizes blockStarts exonFrames

If we see these attribute names we assume they are as defined for bed format, or in the case of exonFrames for genePredExt format. This has worked so far except for this special case, I need to dig more deeply into this to determine what IGV will do with the negative blockStarts, thus I opened this ticket for myself, there's no implication that anything is incorrect in the files. Just unexpected.

— Reply to this email directly, view it on GitHub https://github.com/igvteam/igv/issues/1236#issuecomment-1275577642, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACL4TNZZMIKIYZYZ4QNINTWCY6DXANCNFSM6AAAAAARCSW6T4 . You are receiving this because you were mentioned.Message ID: @.***>

jrobinso commented 1 year ago

@maximilianh Yes I understand, and fieldCount is used. However its necessary to parse some of the extra fields to support IGV tracks, otherwise the "big" versions of formats such as genePredExt/bigGenePred and rmsk/bigRmsk will be a regression from the point of view of the user, for example for rmsk we will not be able to draw the blocks, for bigGenePred we will not be able to do protein translation without "exonFrames" . This is not a problem in general, this special case of unexpected negative blockStarts just needs to be handled. This is not a major problem.