igvteam / igv

Integrative Genomics Viewer. Fast, efficient, scalable visualization tool for genomics data and annotations
https://igv.org
MIT License

Consider using autoSQL in bigbed files to present user flexibility in choosing label field. (Was Remote BigBed File JASPAR2022_hg19.bb displays matrix ID and not transcription factor name) #1089

Closed malcook closed 2 years ago

malcook commented 2 years ago

Due to a change in the design of the bigbed files (as discussed in wassermanlab/JASPAR-UCSC-tracks#11), the Matrix ID is displayed as the label on the glyph when loaded in the IGV browser.

eg: https://user-images.githubusercontent.com/484282/148572447-ecbdbed0-b798-4bc5-824b-122608323bfe.png

This display is less useful to most end users than displaying the TF name.

The TF name is now present in column 7 of the underlying bed file instead of column 4 (as before).

UCSC genome browser accommodates the change by continuing to display the TF name.

IGV does not.

Arguably IGV could be improved by displaying column 7 value if present, otherwise displaying the name column (4).

(Note: I brought this up tangentially in https://github.com/igvteam/igv/issues/1085#issuecomment-1007533381, which was resolved without addressing this tangent, so I thought I'd give it its own issue...)

(Note: A workaround could be to reformat the bigbed to use IGV's neat ability to display GFF column 9 formatted attribute value pairs when they appear in column 4, however you might agree it would be advantageous to use the bigbeds as produced by wassermanlab).

jrobinso commented 2 years ago

I'm not sure what you are suggesting. Obviously we can't have a general rule that if column 7 is present in a bigbed file that it is interpreted as a name.

malcook commented 2 years ago

I agree it is probably a "bad idea"(tm).

But...

Would you then conclude with me that the agreement between the wassermanlab and UCSC that "the bigbeds contain the TF name as an extra field" was a bad idea, insofar as these files do not conform to the bigbed spec (despite bigBedToBed happily rematerializing them, as below)?

bigBedToBed -chrom=chr1 -start=10001 -end=10005 http://expdata.cmmt.ubc.ca/JASPAR/downloads/UCSC_tracks/2022/JASPAR2022_hg19.bb /dev/stdout 
chr1    10001   10018   MA0883.1    328 -   Dmbx1
chr1    10003   10013   MA0599.1    239 +   KLF5
chr1    10003   10015   MA0712.2    275 -   OTX2
chr1    10004   10013   MA0714.1    268 +   PITX3
chr1    10004   10014   MA0467.2    314 -   Crx
chr1    10004   10014   MA0891.1    265 +   GSC2
chr1    10004   10019   MA1574.1    341 -   THRB

FWIW: Consistent with their use of this arguably non-conforming bigbed format, UCSC track configuration provides choice to display 'TF Name' (col 7) instead of MatrixID (col 4) only for 2022 version of this resource, as can be seen in this screenshot:

image

I guess I was suggesting that IGV follow suit somehow, but I understand if you close issue as not really being IGV's.

jrobinso commented 2 years ago

I don't know that there is a "spec" for bed files, after the first 3 columns anything goes. It makes it somewhat challenging. In some contexts I think this bed file would be referred to as "bed6+" as the first 6 columns are standard.

This is a custom UCSC track, in general I don't have the resources, in the parlance of our times, to do custom tracks and we don't host this file in any event.

It's possible we could do something for this problem using the autoSQL; the solution would be (probably) to add a menu item the user could use to choose available columns for name (in this case they would choose TFNAME). If you don't mind we can leave this open and I'll rename it accordingly.

brainstorm commented 2 years ago

There's actually a very recent, official, spec for BED files. It was merged a month ago: https://github.com/samtools/hts-specs/pull/570 ;)

jrobinso commented 2 years ago

@brainstorm ok, great, might be helpful in the future. I don't see how it helps with this situation, however. In fact the document says that it does not specify a means of identifying the contents of columns 4-12. This information must be supplied "out-of-band". These are the columns I am referring to when I say there's not really a spec, they are not nailed down and you have to know what they mean by other means.

> Some information about a BED file can only be supplied unambiguously separately from the data lines of the BED file. This specification does not contain a means for interchanging this information. Information that must be supplied out-of-band include:
> • Which of the first 4 to 12 fields are standard BED fields and which are custom fields.

maximilianh commented 2 years ago

Hi,

Hmm, the Wasserman lab made this bed file and it’s entirely compatible with the bigBed spec, there is nothing different than for other bigBed files.

Why is it a “bad” idea to store the TF name in an extra field ?

jrobinso commented 2 years ago

@maximilianh I don't think it is a bad idea, and I agree it's entirely compatible. The issue here is that IGV uses the "name" field (column 4) as a label, and @malcook would prefer column 7 for this particular bigBed file. I renamed this issue to suggest the autoSQL might be useful for IGV to present a choice of fields to the user to use for the label. I will look into this possibility when I have time, so I'll leave the issue open. This is not a bigBed or UCSC issue, sorry for any confusion.

malcook commented 2 years ago

> the solution would be (probably) to add a menu item the user could use to choose available columns for name (in this case they would choose TFNAME). If you don't mind we can leave this open and I'll rename it accordingly

who could ask for anything more?

malcook commented 2 years ago

> I don't know that there is a "spec" for bed files,

Referring to the samtools BedV1 specification, I see now that the wassermanlab's files might be thought of as "bed6+1" with a single custom field.

I had been looking at https://genome.ucsc.edu/FAQ/FAQformat.html#format1 which purports to define the range for each column and does not refer to custom fields.

jrobinso commented 2 years ago

@malcook Understood, in practice we deal with the files as they exist. I think the autoSql might be helpful here.

maximilianh commented 2 years ago

I don't know the exact reason why the label field was changed, but this is not the only track where we did it like this. The labelFields setting has been a valid trackDb statement for many years.

The trackDb of this track looks like this:

track jaspar
compositeTrack on
shortLabel JASPAR Transcription Factors
longLabel JASPAR Transcription Factor Binding Site Database
group regulation
visibility hide
type bigBed 6 .
pennantIcon New red ../goldenPath/newsarch.html#010522 "Released Jan. 6, 2022"
url http://jaspar.genereg.net/search?q=$$&collection=all&tax_group=all&tax_id=all&type=all&class=all&family=all&version=all
urlLabel View on JASPAR:
filter.score 400
filterByRange.score 0:1000
maxItems 100000
maxWindowCoverage 50000
exonArrows on
spectrum on

    track jaspar2022
    parent jaspar on
    shortLabel JASPAR 2022 TFBS
    longLabel JASPAR CORE 2022 - Predicted Transcription Factor Binding Sites
    priority 1
    type bigBed 6 +
    visibility pack
    motifPwmTable hgFixed.jasparCore2022
    labelFields TFName
    bigDataUrl /gbdb/$D/jaspar/JASPAR2022.bb

You can see that for this particular track, the field that is used for labeling is not "name" anymore but "TFName". This is not unusual. I guess the problem is that IGV doesn't read our trackDb, but if it's not doing that, then IGV will not be able to display the majority of our tracks as we show them, so this problem is not specific to this particular JASPAR file.
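
To make the labelFields point concrete, here is a minimal Python sketch of how a client could pull the label field out of trackDb text. It is not how IGV or the UCSC browser actually does it. The function names `parse_trackdb` and `label_fields` are hypothetical, stanzas are assumed to be separated by blank lines, and labelFields is assumed to hold a comma-separated list of field names with "name" as the fallback.

```python
def parse_trackdb(text):
    """Parse trackDb text into a list of stanzas (dicts of setting -> value).

    Assumption: stanzas are separated by blank lines and each non-blank
    line is "<setting> <value>". Indentation of child tracks is ignored.
    """
    stanzas = []
    current = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line closes the stanza collected so far.
            if current:
                stanzas.append(current)
                current = {}
            continue
        if line.startswith("#"):
            continue  # skip comment lines
        parts = line.split(None, 1)
        current[parts[0]] = parts[1] if len(parts) > 1 else ""
    if current:
        stanzas.append(current)
    return stanzas


def label_fields(stanza):
    """Return the label field names for a stanza, defaulting to 'name'."""
    value = stanza.get("labelFields", "name")
    return [f.strip() for f in value.split(",") if f.strip()]


# Abbreviated copy of the stanzas quoted above.
EXAMPLE = """\
track jaspar
compositeTrack on
type bigBed 6 .

    track jaspar2022
    parent jaspar on
    type bigBed 6 +
    labelFields TFName
    bigDataUrl /gbdb/$D/jaspar/JASPAR2022.bb
"""

stanzas = parse_trackdb(EXAMPLE)
for s in stanzas:
    print(s["track"], label_fields(s))
```

A client that loads the bigBed directly never sees this file, which is why the autoSql embedded in the bigBed is the only in-band hint available to it.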

jrobinso commented 2 years ago

From IGV's perspective this is just a bigBed file, so no, the trackDB is not read and I'm not even sure how it could be.

jrobinso commented 2 years ago

@maximilianh @malcook Perhaps a general fix for this problem, which maybe you are suggesting, would be to support loading from a track hub rather than directly from the bigBed. Of course loading directly from bigBed will always be supported.

malcook commented 2 years ago

> the trackDB is not read and I'm not even sure how it could be

Hmm. Does it seem like I suggested it could? Did you mean to direct this comment to @maximilianh ?

jrobinso commented 2 years ago

@malcook yes (meant for maximilianh).

jrobinso commented 2 years ago

Note to self:

bigBedInfo -as http://expdata.cmmt.ubc.ca/JASPAR/downloads/UCSC_tracks/2022/JASPAR2022_hg19.bb
version: 4
fieldCount: 7
hasHeaderExtension: yes
isCompressed: yes
isSwapped: 0
extraIndexCount: 0
itemCount: 12,473,778,656
primaryDataSize: 119,887,888,128
primaryIndexSize: 782,301,588
zoomLevels: 10
chromCount: 93
as:
table JASPAR_TFBS
"TFBS predictions for profiles in the JASPAR CORE collections"
(
    string  chrom;      "Reference sequence chromosome or scaffold"
    uint    chromStart; "Start position of feature on chromosome"
    uint    chromEnd;   "End position of feature on chromosome"
    string  name;       "Matrix ID"
    uint    score;      "Score"
    char[1] strand;     "+ or - for strand"
    string  TFName;     "TF name"
)
basesCovered: 2,897,225,363
meanDepth (of bases covered): 46.102859
minDepth: 1.000000
maxDepth: 993.000000
std of depth: 43.105940
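
As a rough illustration of how the autoSql above could drive a label chooser, the Python sketch below extracts the field names and offers the string-typed ones as candidates. It is an assumption-laden sketch, not IGV's implementation: `autosql_fields` and `label_candidates` are hypothetical names, and the regular expression only handles simple one-line field declarations like the ones shown here.

```python
import re

# Matches one simple autoSql field declaration, e.g.
#     string  TFName;     "TF name"
FIELD_RE = re.compile(r'^\s*(\S+)\s+(\w+)\s*;\s*"([^"]*)"')


def autosql_fields(autosql):
    """Return (name, type, comment) tuples in column order."""
    fields = []
    for line in autosql.splitlines():
        m = FIELD_RE.match(line)
        if m:
            ftype, name, comment = m.groups()
            fields.append((name, ftype, comment))
    return fields


def label_candidates(autosql):
    """String-typed fields other than chrom: plausible label choices."""
    return [name for name, ftype, _ in autosql_fields(autosql)
            if ftype == "string" and name != "chrom"]


AUTOSQL = '''table JASPAR_TFBS
"TFBS predictions for profiles in the JASPAR CORE collections"
(
    string  chrom;      "Reference sequence chromosome or scaffold"
    uint    chromStart; "Start position of feature on chromosome"
    uint    chromEnd;   "End position of feature on chromosome"
    string  name;       "Matrix ID"
    uint    score;      "Score"
    char[1] strand;     "+ or - for strand"
    string  TFName;     "TF name"
)'''

names = [f[0] for f in autosql_fields(AUTOSQL)]
print(label_candidates(AUTOSQL))       # candidate label fields
print(names.index("TFName") + 1)       # 1-based column of TFName
```

For this file the candidates are the standard name field (the Matrix ID) and the custom TFName field in column 7, which is the choice a "set label field" style menu would need to present.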

maximilianh commented 2 years ago

Track hubs are supported by Ensembl, NCBI and UCSC. So yes, it would be great if IGV had some support for track hubs. A basic version could be very minimal, shortLabel and longLabel, visibility and type are the most important keywords.

jrobinso commented 2 years ago

@maximilianh I will do this. Although IGV is not in the same class as the big server-based browsers you mention, it is certainly worth doing. As a quick fix for JASPAR I'm thinking of just defining a "hosted" track in IGV for at least human and mouse assemblies using the basic data from the trackDB. I will not copy those 100GB bb files but rather reference them. Anyway, thanks for the tips and help as always.

maximilianh commented 2 years ago

Let us know if we can help with something. The trackDb specs are sometimes not documented well (e.g. genomes and hub.txt). It would be nice to implement useOneFile, I find it very useful, it packs the three files into a single file.

https://genome.ucsc.edu/goldenPath/help/hubQuickStart.html

jrobinso commented 2 years ago

Thanks @maximilianh. RE "useOneFile", that would be the decision of the track hub creator, correct? I will support it where it's available.

malcook commented 2 years ago

> defining a "hosted" track in IGV for at least human and mouse

@jrobinso - could you please include zebrafish in any short term patch solution - that is the use case that drove my initial request

malcook commented 2 years ago

@jrobinso - I'm still hoping somehow to be able to display as glyph label in IGV the bigbed's column 7 (TFName). Any chance of providing such functionality, possibly as a "workaround", in the near term (preferably not requiring reference to remote track hubs)?

jrobinso commented 2 years ago

@malcook A workaround would be to convert that file to a standard 12 column bed with the name you want in the standard name column. You can do this with a simple script.
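
One possible shape for such a script, as a minimal Python sketch: it swaps the TF name (column 7) into the name column (column 4) of the tab-separated text that bigBedToBed writes, keeping the Matrix ID in column 7. The function names `relabel`, `relabel_stream` and `main` are hypothetical, and the column positions are assumptions taken from the output shown earlier in this thread.

```python
import sys


def relabel(line, label_col=7, name_col=4):
    """Swap the label column into the name column of one BED line.

    Columns are 1-based. The original name (the Matrix ID here) moves to
    the label column's old position, so no information is lost.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < max(label_col, name_col):
        return line.rstrip("\n")  # leave short or malformed lines untouched
    i, j = name_col - 1, label_col - 1
    fields[i], fields[j] = fields[j], fields[i]
    return "\t".join(fields)


def relabel_stream(src, dst, label_col=7, name_col=4):
    """Apply relabel() to every line read from src, writing to dst."""
    for line in src:
        dst.write(relabel(line, label_col, name_col) + "\n")


def main():
    # Intended use: pipe bigBedToBed output through this script.
    relabel_stream(sys.stdin, sys.stdout)


sample = "chr1\t10001\t10018\tMA0883.1\t328\t-\tDmbx1\n"
print(relabel(sample))
```

To use it as a filter, call `main()` at the end of the file and pipe the output of the bigBedToBed command shown earlier through it. The resulting text can be loaded as a BED file, since the TF name now sits in the column IGV uses for labels.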

malcook commented 2 years ago

snapshot looks good in my hands. Thanks so much!

jrobinso commented 2 years ago

@malcook I assume you found the "set label field" menu item.