Open jmattock5 opened 5 months ago
hi @jmattock5, I don't offhand know what the timeout is for when the branchwater backend will give up, but I can give you some insight into why it is taking so long:
You're running into the fact that the runtime for branchwater-web scales with the size of the query. So, a 10 MB genome will take about twice as long to query as a 5 MB genome. That's why it's slow.
(Interestingly, the database functionality underlying this scaling is what makes branchwater possible; handling large queries against a large database is much harder!)
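To put rough numbers on that scaling, here is a minimal sketch. It assumes a scaled sketch keeps about 1 in every `scaled` k-mers and that the number of distinct k-mers is about the genome length; both are approximations, and `expected_hashes` is a hypothetical helper, not part of sourmash or branchwater.

```python
def expected_hashes(genome_bp: int, scaled: int) -> int:
    """Rough number of hashes in a scaled sketch (about genome_bp / scaled)."""
    return genome_bp // scaled


# Query sizes mentioned in this thread: 5 MB, 10 MB, 0.3 G and 1 G.
for size in (5_000_000, 10_000_000, 300_000_000, 1_000_000_000):
    for scaled in (1000, 100_000):
        print(f"{size:>13,} bp  scaled={scaled:<7} ~{expected_hashes(size, scaled):,} hashes")
```

Under these assumptions a query twice as large carries twice as many hashes to look up, and a 100x higher scaled value carries 100x fewer.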
Ok, thanks for the explanation. I'll leave it running for longer and see if anything happens.
Would it be possible to run this locally instead? I have sourmash_plugin_branchwater installed, is the database that branchwater uses shareable?
Thanks, Jenny
IIRC, the on-disk index for branchwater is 1-2 TB. The raw data is in the 8-10 TB range (and that's something that we can search using the plugin). So, umm, probably a bit too large for download :).
@luizirber would the branchwater web site work faster if a signature with a higher scaled value were used? e.g. 100,000 rather than 1000? Does that even work? cc @bluegenes
Couple of notes on this:
s=100000, edit the sig file to use the max_hash value for s=1000, and search should work (the only check is same k and scaled as index, and all the data for s=100000 is contained in the s=1000 index).
In fact, I tried this crime from the last item with a rumen metagenome I had around (98M originally, 285k after downsample to 100,000), and got 171,599 results back, 285 of those above 20% containment.
So yeah, this definitely works. We can change
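For anyone wanting to try the max_hash edit described above, here is a rough Python sketch. It assumes the signature file is JSON holding a list of records, each with a `signatures` list whose entries carry `max_hash` and `mins`, and that the threshold is about 2**64 / scaled. The helper names are hypothetical, and sourmash's own rounding may differ slightly from `approx_max_hash`.

```python
import copy

HASH_SPACE = 2**64  # assumption: 64-bit hashes, so max_hash is about 2**64 / scaled


def approx_max_hash(scaled: int) -> int:
    """Approximate max_hash threshold for a scaled value (rounded integer division)."""
    return (HASH_SPACE + scaled // 2) // scaled


def relabel_scaled(sig_records: list, target_scaled: int) -> list:
    """Return a copy of parsed signature JSON with max_hash set for target_scaled."""
    edited = copy.deepcopy(sig_records)
    new_max = approx_max_hash(target_scaled)
    for record in edited:
        for sketch in record["signatures"]:
            # Only valid when every kept hash already lies below the new threshold,
            # which holds when going from a higher scaled value to a lower one.
            if any(h > new_max for h in sketch["mins"]):
                raise ValueError("sketch holds hashes above the target max_hash")
            sketch["max_hash"] = new_max
    return edited


# Made-up toy record standing in for a sketch downsampled to scaled=100000.
downsampled = [{"signatures": [{"ksize": 21,
                                "max_hash": approx_max_hash(100_000),
                                "mins": [12345, 184467440737000]}]}]
posing = relabel_scaled(downsampled, 1000)
print(posing[0]["signatures"][0]["max_hash"])
```

The hashes themselves are untouched; only the declared threshold changes, which is why every hash in the relabelled sketch can still be found in an s=1000 index.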
More crimes: I got a question about what these matches are, and going through SRA IDs manually is boring. But the search server only returns SRA IDs and containment, so how can I get the same data the web frontend returns? Like this =]
Prepare a request and send it to the web frontend:
$ curl -L -H "Content-Type: application/json" \
--data-binary @<(echo "{\"signatures\": `jq ".| tostring" fake.sig`}") \
https://branchwater.sourmash.bio/ > fake.json
(this wraps fake.sig, which is a k=21, s=100000 signature posing as an s=1000 signature, into the format expected by the branchwater-web API)
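The same wrapping can be sketched in Python. This only builds the request body and does not send anything. It assumes, based on the curl command above, that the API expects a JSON object whose `signatures` field is the signature file serialized as a string; `build_payload` is a hypothetical helper.

```python
import json


def build_payload(sig_records) -> str:
    """Wrap parsed signature JSON the way the curl + jq command does."""
    # jq's `tostring` turns the parsed file into one compact JSON string...
    sig_text = json.dumps(sig_records, separators=(",", ":"))
    # ...which then becomes the value of the "signatures" field.
    return json.dumps({"signatures": sig_text})


# Made-up minimal stand-in for the parsed contents of fake.sig.
toy_sig = [{"signatures": [{"ksize": 21, "mins": [1, 2, 3]}]}]
print(build_payload(toy_sig))
```

Note the double encoding: the signature is a JSON string inside a JSON object, not a nested object.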
Parsing JSON is also boring, so here is a long one-liner to read fake.json, sort by containment, keep only matches with containment >= 0.2, and then count the organism field to check where they are coming from:
$ jq '. | sort_by(.containment) | map(select(.containment >= 0.2)) | .[].organism' fake.json|sort | uniq -c|sort -nr
134 "bovine gut metagenome"
94 "gut metagenome"
39 "Bos taurus"
17 "metagenome"
10 "sheep gut metagenome"
2 "Cervus nippon"
2 "bovine metagenome"
So yeah, definitely rumen metagenomes. But not only cow: I also got sheep and Sika deer.
Finally, without filtering (all matches, even those that are only 0.3%):
$ jq '. | sort_by(.containment) | .[].organism' fake.json|sort | uniq -c|sort -nr
502 "bovine gut metagenome"
404 "gut metagenome"
148 "metagenome"
77 "Bos taurus"
26 "sheep gut metagenome"
17 "bovine metagenome"
14 "goat gut metagenome"
5 "Ovis aries"
5 "Bos indicus"
3 "Elaphurus davidianus"
3 "Dama dama"
3 "Cervus nippon"
3 "Cervus elaphus"
3 "Cervus albirostris"
3 "Capra hircus"
2 "Rusa unicolor"
2 "Axis porcinus"
1 "soil metagenome"
1 "lichen metagenome"
1 "bioreactor sludge metagenome"
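If jq is not at hand, the two tallies above can be approximated in Python. This is a sketch that assumes the response is a JSON array of objects with `containment` and `organism` fields, as the jq filters imply; `tally_organisms` is a hypothetical helper and the sample rows are made up for illustration.

```python
from collections import Counter


def tally_organisms(matches, min_containment=0.0):
    """Count matches per organism, keeping only containment >= min_containment."""
    counts = Counter(
        m["organism"] for m in matches if m["containment"] >= min_containment
    )
    return counts.most_common()  # most frequent organism first


# Made-up rows; with real data use: matches = json.load(open("fake.json"))
sample = [
    {"organism": "bovine gut metagenome", "containment": 0.5},
    {"organism": "Bos taurus", "containment": 0.3},
    {"organism": "bovine gut metagenome", "containment": 0.25},
    {"organism": "soil metagenome", "containment": 0.003},
]
for organism, count in tally_organisms(sample, min_containment=0.2):
    print(count, organism)
```

Calling it with the default `min_containment=0.0` corresponds to the unfiltered count.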
Hello, Thanks for developing such a great tool. I've been trying to run branchwater on some whole metagenome assemblies that are quite large (0.3G-1G). When I upload even the smaller ones and submit them I don't get any output. I've tried leaving a couple with the tab open for ~12 hours to no avail. If I leave them for long enough will they eventually complete? Thanks! Jenny