Open mrvollger opened 3 months ago
working on this, should have a solution soon
Sorry to be a bother but have you had a chance to look at this?
Sorry for the delay - just published bigtools 0.5.1 which should fix the fieldCount and definedFieldCount when an autosql is passed.
Awesome, it's working for me. I'll approve the bioconda PR once it passes tests :)
Hi @jackh726,
Found an issue with the new release that I think is requiring the autosql argument to be passed everytime.
I have a standard three column bedfile and when I try:
zcat results/ONT_heuristic/coverage/unreliable-coverage-regions.bed.gz | bigtools bedtobigbed -s start - ~/assemblies/hg38.analysisSet.fa.fai tmp.bb
I get
Error: Os { code: 2, kind: NotFound, message: "No such file or directory" }
But if I add the SQL argument, it works:
zcat results/ONT_heuristic/coverage/unreliable-coverage-regions.bed.gz | bigtools bedtobigbed -s start -a tmp.as - ~/assemblies/hg38.analysisSet.fa.fai tmp.bb
And here tmp.as
is just the three-column definition:
table bed3 "bed3"
(
string chrom; "Reference sequence chromosome or scaffold"
uint chromStart; "Start position in chromosome"
uint chromEnd; "End position in chromosome"
)
I am also noticing that I have to filter out lines starting with the BED comment character (#
), whereas I don't think I needed to before, but I am not sure about that.
For now, I can add the as
files for all my commands, but I wanted to let you know.
Cheers, Mitchell
v0.5.2 should at least not error when not passing an autosql (this only happened when passing the bed via stdin)
I still need to add the following: 1) autosql parsing when using stdin 2) Filtering of commented lines
I have not had time to put together a test case to share yet, but now it seems to me that the --autosql
argument is ignored when using stdin
with 0.5.2 whereas it worked with 0.5.1. But I am realizing maybe that is what you meant with:
autosql parsing when using stdin
?
Ah no, that should work but I made a mistake. Will update with fix either later today or tomorrow.
Great thank you!
Sorry for all the questions/comments, but I think I am also noticing that setting the zoom level on the command line is not respected when I check after the file is created with bigbedinfo.
No worries - the extra use/testing is helpful. I'll take a look at that too
Awesome thanks!
I was playing with the zoom level because I thought it might fix an issue I am seeing where some data will disappear as I zoom in when using decorator tracks.
e.g.
to
Ah that's weird, if you can get a reproduction, I can look into it.
Thanks, I made a test case here that reproduces the issue in this region of hg38 chrX:47191183-47195924, and uploaded it here: https://s3-us-west-2.amazonaws.com/stergachis-public1/index.html?prefix=Mitchell/tests/bigtools-test/ And within that dir is a "hub.txt" that you can load on UCSC to see the issue:
When making the small test case, I did find that depending on the size of the dataset, I sometimes would or wouldn't see the issue pop up. This was the smallest file I could get where the issue appeared.
(and my commands are in run.sh
)
Just wondering if you got a chance to look at this? Thanks in advance!
Sorry, this fell off my radar. I'm looking at this now - will let you know.
So, I ran out of time looking at this for now, but here are my general thoughts:
It doesn't seem like this is an issue with the data that is written. I.e. it doesn't look like there's a bug that is causing some data to not be written to the final bigBed.
Rather, I think this is more likely a quark fundamentally attributed to zooms. My hunch is that because you're writing the bigBed from stdin, the heuristics for calculating/including zoom levels just has a poor interaction here. And really, I think it all comes down to the decoration bigBed. I think likely the calculated zoom sizes are just too big.
The easiest check here is just write the declaration bigBed with nzooms=0, which should force the genome browser to use the primary data. I have wanted to add an option to be a little bit more explicit about what zooms to write, so this might be the push to do that. I expect you would then want to manually set the initial zoom to be small.
I'll try to get back to this maybe later today or this week and test out my theory about writing the decoration bigBed with no zooms (if you don't get to it).
Thanks for starting to look into this!
To help out a little I just tested creating from stdin and from a pre-made bed file, and both with a variety of zoom levels (including 0), and unfortunately none of the combinations made these items visible on UCSC.
I extended the script and trackhub so that it also contains a ucsc test just to confirm that it works with the OG tooling:
So I don't think it is something I can fix by forcing the zoom level.
(reminder: I am using bigtools 0.5.1 since 0.5.2 since the --autosql
argument seems to be ignored in this case with 0.5.2 https://github.com/jackh726/bigtools/issues/48#issuecomment-2289489443).
Thanks, this is helpful. I'll keep looking into it, but I'm really not sure!
Just being a squeaky wheel. Thanks in advance!
Sorry! Gotten back to this a bit. Still looking, but it's definitely a zoom-related issue with the decoration track.
Using the ucsc read bigBed with the bigtools decoration bigBed results in the same pattern that you see above (no zooms). Manually specifying 4 zooms (haven't yet added the ability to specify zoom sizes, so the default sizes are 40,160,640,2560) actually results in no decorated items showing. Next step here is for me to manually specify the zooms and see if that fixes it.
As a side point, I'm noting that the basesCovered
summary item (and the others) are not the same across ucsc and bigtools, which is probably unrelated, but interesting. itemCount
is the same (and calling bigBedToBed on both results in the same output).
Just making a note that I am still interested in this.
I'd very much like to move this into our production pipeline since it is a fantastic speed up.
Thanks again for the tool!
Gah! Sorry! I keep forgetting to get back to this... really sorry about the slow progress here.
So, have been investigating this a bit further. And, well, I'm perplexed.
To recap: previously, making a hub with the ucsc data file and bigtools decoration file did not work, but ucsc decoration file and bigtools data file did work.
I added the functionality to manually specify zooms (instead of bigtools calculating them, which is a bit different than ucsc), tried that and it didn't work. So, these none of 0 zooms, 4 zooms (different), or 4 matched zooms work.
As a side point, I'm noting that the
basesCovered
summary item (and the others) are not the same across ucsc and bigtools, which is probably unrelated, but interesting.itemCount
is the same (and calling bigBedToBed on both results in the same output).
I then looked into this, in case it's related. Which ended up being it's own rabbit hole. Essentially, rather than focusing on basesCovered
, I looked at maxDepth
. I made a smaller file (dec.cut1-3.maxdebug.bed.txt) that differed between ucsc and bigtools (ucsc says maxDepth as 200, bigtools says 185). I did a separate pileup with bedtools (bedtools genomecov -bg -g hg38.fai -i dec.cut1-3.maxdebug.bed > dec.cut1-3.maxdebug.pileup.bed) and that shows the max depth as 185...so it seems like a probably bug in the ucsc bedToBigBed, or I'm just missing something.
Along the way, I tried making a bigBed with a single zoom of 1 bp, and realized there was a bug in bigtools (essentially, when there are overlapping regions, I was not counting past the first region). I fixed this, but it still doesn't fix the decoration problem (which, is not that surprising given that no zooms also doesn't work).
So...still working on it.
(I have a testing branch here with ongoing fixes: https://github.com/jackh726/bigtools/tree/dec-testing2)
Okay next weird thing - subsetting the input bed file where the fourth column == "m6A" seems to result in identical outputs...
Thanks for starting to investigate!
I will say the second thing doesn't surprise me that much. For example, depending on the size of the region I was making for the test case different fibers in different regions disappeared.
Would it be helpful to ask UCSC anything? I know a person or two on the browser team.
Yeah, at this point its probably worth asking the ucsc folks.
I've tried "minimizing" the decoration file in different ways prior to running it through bigtools (subsetting to only m6A, removing FIRE lines, only including regions overlapping the target window) and each time i can't reproduce the problem in the genome browser. Subsetting the data file to one entry that is missing the decorations and one that isn't does reproduce the problem though.
I might have some time to poke at this a bit more today. I'm really just perplexed because it doesn't seem like a file format issue (given that just subsetting the initial bed file has yet to reproduce the issue) and some of the decorations are correct even in the reproduction...
I reached out the to creator of the decorator tracks and linked the issue and asked for any advise he might have. I do know he's has since started working on his masters degree so not sure he'll have time to look.
Did I understand you correctly in that you were able to get a single read version that recreates the issue? If so I'd add it to the track hub if you could share the file.
I did have one idea. There is an option in ucsc bedtobigbed that allows a 1bp overlap between blocks in bed12 files which is something I take advantage of with the m6a track specifically. Is that something that might hit some corner case in bigtools?
My test hub is here: https://users.abdenlab.org/hueyj/bigtools-test/hub.txt
re. 1bp overlap (https://github.com/jackh726/bigtools/issues/49?) I think that makes no difference - ucsc does some checks that needs that flag, and bigtools doesn't do those
For my hub (as it is right now, might be down - we're having server troubles):
I was starting to dig through the kent genome browser code to see exactly how the decorator tracks are loaded/processed to see if there's something special happening that relies on exact bigBed layout or something, then we started having server trouble.
Btw, I just published version 0.5.3, which should fix https://github.com/jackh726/bigtools/issues/48#issuecomment-2258917404 and a few other things I found when testing this. It doesn't fix the main issue though, which I unfortunately still haven't found the cause of.
Great, I will update to that version for the other fixes!
One more corner case I have thought of for this data. These bed12 files do sometimes have a zero-length "block/exon". Do you think that could be causing an issue in bigtools?
Nevermind, on the zero length. That is not something I am doing here, got confused with another tool I am working on
Hello,
I have found a discrepancy in the number of fields and some other metadata when using bigtools vs ucsc and it is having an impact on how things appear in the browser.
Basically, when I test with
bigbedinfo
I see differences in field number and other metadata columns (the correct number of fields is 17):I have uploaded all the data in my test to: https://s3-us-west-2.amazonaws.com/stergachis-public1/index.html?prefix=Mitchell/temp/bigtools-test/
And here is the script I am running:
Any help with an obvious error on my part would be great!
Cheers, Mitchell