jackh726 / bigtools

A high-performance BigWig and BigBed library in Rust
MIT License
70 stars 5 forks source link

Missing fields when using `-a` with bedtobigbed #48

Open mrvollger opened 3 months ago

mrvollger commented 3 months ago

Hello,

I have found a discrepancy in the number of fields and some other metadata when using bigtools vs ucsc and it is having an impact on how things appear in the browser.

Basically, when I test with bigbedinfo I see differences in field number and other metadata columns (the correct number of fields is 17):

# bigtools
version: 4
fieldCount: 3
isCompressed: yes
isSwapped: 0
itemCount: 1100122467552
primaryDataSize: 367,623,108
primaryIndexSize: 20,200
zoomLevels: 6
chromCount: 2
basesCovered: 97,494,196
meanDepth: 44.449296
minDepth: 1.000000
maxDepth: 3195.000000
std of depth: 69.740202
version: 4

# UCSC
fieldCount: 17
isCompressed: yes
isSwapped: 0
itemCount: 1100122467552
primaryDataSize: 386,514,961
primaryIndexSize: 46,952
zoomLevels: 8
chromCount: 2
basesCovered: 101,269,035
meanDepth: 82.861921
minDepth: 1.000000
maxDepth: 4056.000000
std of depth: 96.352168

I have uploaded all the data in my test to: https://s3-us-west-2.amazonaws.com/stergachis-public1/index.html?prefix=Mitchell/temp/bigtools-test/

And here is the script I am running:

zcat my.bed.gz | bigtools bedtobigbed -a decoration.as -s start - hg38.fai bigtools.bb
bedToBigBed -as=decoration.as -type=bed12+ my.bed.gz hg38.fai ucsc.bb

bigbedinfo bigtools.bb
bigbedinfo ucsc.bb

Any help with an obvious error on my part would be great!

Cheers, Mitchell

jackh726 commented 3 months ago

working on this, should have a solution soon

mrvollger commented 3 months ago

Sorry to be a bother but have you had a chance to look at this?

jackh726 commented 3 months ago

Sorry for the delay - just published bigtools 0.5.1 which should fix the fieldCount and definedFieldCount when an autosql is passed.

mrvollger commented 3 months ago

Awesome, it's working for me. I'll approve the bioconda PR once it passes tests :)

mrvollger commented 3 months ago

Hi @jackh726,

Found an issue with the new release that I think is requiring the autosql argument to be passed everytime.

I have a standard three column bedfile and when I try:

zcat results/ONT_heuristic/coverage/unreliable-coverage-regions.bed.gz | bigtools bedtobigbed -s start  - ~/assemblies/hg38.analysisSet.fa.fai tmp.bb

I get

Error: Os { code: 2, kind: NotFound, message: "No such file or directory" }

But if I add the SQL argument, it works:

zcat results/ONT_heuristic/coverage/unreliable-coverage-regions.bed.gz | bigtools bedtobigbed -s start -a tmp.as - ~/assemblies/hg38.analysisSet.fa.fai tmp.bb

And here tmp.as is just the three-column definition:

table bed3 "bed3"
(
string  chrom;  "Reference sequence chromosome or scaffold"
uint    chromStart; "Start position in chromosome"
uint    chromEnd;   "End position in chromosome"
)

I am also noticing that I have to filter out lines starting with the BED comment character (#), whereas I don't think I needed to before, but I am not sure about that.

For now, I can add the as files for all my commands, but I wanted to let you know.

Cheers, Mitchell

jackh726 commented 2 months ago

v0.5.2 should at least not error when not passing an autosql (this only happened when passing the bed via stdin)

I still need to add the following: 1) autosql parsing when using stdin 2) Filtering of commented lines

mrvollger commented 2 months ago

I have not had time to put together a test case to share yet, but now it seems to me that the --autosql argument is ignored when using stdin with 0.5.2 whereas it worked with 0.5.1. But I am realizing maybe that is what you meant with:

autosql parsing when using stdin

?

jackh726 commented 2 months ago

Ah no, that should work but I made a mistake. Will update with fix either later today or tomorrow.

mrvollger commented 2 months ago

Great thank you!

Sorry for all the questions/comments, but I think I am also noticing that setting the zoom level on the command line is not respected when I check after the file is created with bigbedinfo.

jackh726 commented 2 months ago

No worries - the extra use/testing is helpful. I'll take a look at that too

mrvollger commented 2 months ago

Awesome thanks!

I was playing with the zoom level because I thought it might fix an issue I am seeing where some data will disappear as I zoom in when using decorator tracks.

e.g.

image

to

image
jackh726 commented 2 months ago

Ah that's weird, if you can get a reproduction, I can look into it.

mrvollger commented 2 months ago

Thanks, I made a test case here that reproduces the issue in this region of hg38 chrX:47191183-47195924, and uploaded it here: https://s3-us-west-2.amazonaws.com/stergachis-public1/index.html?prefix=Mitchell/tests/bigtools-test/ And within that dir is a "hub.txt" that you can load on UCSC to see the issue:

image

When making the small test case, I did find that depending on the size of the dataset, I sometimes would or wouldn't see the issue pop up. This was the smallest file I could get where the issue appeared.

(and my commands are in run.sh)

mrvollger commented 1 month ago

Just wondering if you got a chance to look at this? Thanks in advance!

jackh726 commented 1 month ago

Sorry, this fell off my radar. I'm looking at this now - will let you know.

jackh726 commented 1 month ago

So, I ran out of time looking at this for now, but here are my general thoughts:

It doesn't seem like this is an issue with the data that is written. I.e. it doesn't look like there's a bug that is causing some data to not be written to the final bigBed.

Rather, I think this is more likely a quark fundamentally attributed to zooms. My hunch is that because you're writing the bigBed from stdin, the heuristics for calculating/including zoom levels just has a poor interaction here. And really, I think it all comes down to the decoration bigBed. I think likely the calculated zoom sizes are just too big.

The easiest check here is just write the declaration bigBed with nzooms=0, which should force the genome browser to use the primary data. I have wanted to add an option to be a little bit more explicit about what zooms to write, so this might be the push to do that. I expect you would then want to manually set the initial zoom to be small.

jackh726 commented 1 month ago

I'll try to get back to this maybe later today or this week and test out my theory about writing the decoration bigBed with no zooms (if you don't get to it).

mrvollger commented 1 month ago

Thanks for starting to look into this!

To help out a little I just tested creating from stdin and from a pre-made bed file, and both with a variety of zoom levels (including 0), and unfortunately none of the combinations made these items visible on UCSC.

I extended the script and trackhub so that it also contains a ucsc test just to confirm that it works with the OG tooling:

image

So I don't think it is something I can fix by forcing the zoom level.

(reminder: I am using bigtools 0.5.1 since 0.5.2 since the --autosql argument seems to be ignored in this case with 0.5.2 https://github.com/jackh726/bigtools/issues/48#issuecomment-2289489443).

jackh726 commented 1 month ago

Thanks, this is helpful. I'll keep looking into it, but I'm really not sure!

mrvollger commented 1 month ago

Just being a squeaky wheel. Thanks in advance!

jackh726 commented 1 month ago

Sorry! Gotten back to this a bit. Still looking, but it's definitely a zoom-related issue with the decoration track.

Using the ucsc read bigBed with the bigtools decoration bigBed results in the same pattern that you see above (no zooms). Manually specifying 4 zooms (haven't yet added the ability to specify zoom sizes, so the default sizes are 40,160,640,2560) actually results in no decorated items showing. Next step here is for me to manually specify the zooms and see if that fixes it.

As a side point, I'm noting that the basesCovered summary item (and the others) are not the same across ucsc and bigtools, which is probably unrelated, but interesting. itemCount is the same (and calling bigBedToBed on both results in the same output).

mrvollger commented 1 week ago

Just making a note that I am still interested in this.

I'd very much like to move this into our production pipeline since it is a fantastic speed up.

Thanks again for the tool!

jackh726 commented 1 week ago

Gah! Sorry! I keep forgetting to get back to this... really sorry about the slow progress here.

jackh726 commented 1 week ago

So, have been investigating this a bit further. And, well, I'm perplexed.

To recap: previously, making a hub with the ucsc data file and bigtools decoration file did not work, but ucsc decoration file and bigtools data file did work.

I added the functionality to manually specify zooms (instead of bigtools calculating them, which is a bit different than ucsc), tried that and it didn't work. So, these none of 0 zooms, 4 zooms (different), or 4 matched zooms work.

As a side point, I'm noting that the basesCovered summary item (and the others) are not the same across ucsc and bigtools, which is probably unrelated, but interesting. itemCount is the same (and calling bigBedToBed on both results in the same output).

I then looked into this, in case it's related. Which ended up being it's own rabbit hole. Essentially, rather than focusing on basesCovered, I looked at maxDepth. I made a smaller file (dec.cut1-3.maxdebug.bed.txt) that differed between ucsc and bigtools (ucsc says maxDepth as 200, bigtools says 185). I did a separate pileup with bedtools (bedtools genomecov -bg -g hg38.fai -i dec.cut1-3.maxdebug.bed > dec.cut1-3.maxdebug.pileup.bed) and that shows the max depth as 185...so it seems like a probably bug in the ucsc bedToBigBed, or I'm just missing something.

Along the way, I tried making a bigBed with a single zoom of 1 bp, and realized there was a bug in bigtools (essentially, when there are overlapping regions, I was not counting past the first region). I fixed this, but it still doesn't fix the decoration problem (which, is not that surprising given that no zooms also doesn't work).

So...still working on it.

(I have a testing branch here with ongoing fixes: https://github.com/jackh726/bigtools/tree/dec-testing2)

jackh726 commented 1 week ago

Okay next weird thing - subsetting the input bed file where the fourth column == "m6A" seems to result in identical outputs...

mrvollger commented 1 week ago

Thanks for starting to investigate!

I will say the second thing doesn't surprise me that much. For example, depending on the size of the region I was making for the test case different fibers in different regions disappeared.

Would it be helpful to ask UCSC anything? I know a person or two on the browser team.

jackh726 commented 1 week ago

Yeah, at this point its probably worth asking the ucsc folks.

I've tried "minimizing" the decoration file in different ways prior to running it through bigtools (subsetting to only m6A, removing FIRE lines, only including regions overlapping the target window) and each time i can't reproduce the problem in the genome browser. Subsetting the data file to one entry that is missing the decorations and one that isn't does reproduce the problem though.

I might have some time to poke at this a bit more today. I'm really just perplexed because it doesn't seem like a file format issue (given that just subsetting the initial bed file has yet to reproduce the issue) and some of the decorations are correct even in the reproduction...

mrvollger commented 1 week ago

I reached out the to creator of the decorator tracks and linked the issue and asked for any advise he might have. I do know he's has since started working on his masters degree so not sure he'll have time to look.

Did I understand you correctly in that you were able to get a single read version that recreates the issue? If so I'd add it to the track hub if you could share the file.

mrvollger commented 1 week ago

I did have one idea. There is an option in ucsc bedtobigbed that allows a 1bp overlap between blocks in bed12 files which is something I take advantage of with the m6a track specifically. Is that something that might hit some corner case in bigtools?

jackh726 commented 1 week ago

My test hub is here: https://users.abdenlab.org/hueyj/bigtools-test/hub.txt

re. 1bp overlap (https://github.com/jackh726/bigtools/issues/49?) I think that makes no difference - ucsc does some checks that needs that flag, and bigtools doesn't do those

jackh726 commented 1 week ago

For my hub (as it is right now, might be down - we're having server troubles):

jackh726 commented 1 week ago

I was starting to dig through the kent genome browser code to see exactly how the decorator tracks are loaded/processed to see if there's something special happening that relies on exact bigBed layout or something, then we started having server trouble.

jackh726 commented 2 days ago

Btw, I just published version 0.5.3, which should fix https://github.com/jackh726/bigtools/issues/48#issuecomment-2258917404 and a few other things I found when testing this. It doesn't fix the main issue though, which I unfortunately still haven't found the cause of.

mrvollger commented 6 hours ago

Great, I will update to that version for the other fixes!

One more corner case I have thought of for this data. These bed12 files do sometimes have a zero-length "block/exon". Do you think that could be causing an issue in bigtools?

mrvollger commented 2 hours ago

Nevermind, on the zero length. That is not something I am doing here, got confused with another tool I am working on