sourmash-bio / branchwater

Searching large collections of sequencing data with genome-scale queries
https://branchwater.sourmash.bio

running branchwater on large assemblies #14

Open jmattock5 opened 5 months ago

jmattock5 commented 5 months ago

Hello,

Thanks for developing such a great tool. I've been trying to run branchwater on some whole metagenome assemblies that are quite large (0.3G-1G). When I upload even the smaller ones and submit them, I don't get any output. I've tried leaving a couple with the tab open for ~12 hours, to no avail. If I leave them for long enough, will they eventually complete?

Thanks! Jenny

ctb commented 5 months ago

Hi @jmattock5, I don't know offhand what the timeout is before the branchwater backend gives up, but I can give you some insight into why it is taking so long:

You're running into the problem that the runtime for branchwater-web scales with the size of the query, roughly linearly. So a 10 MB genome will take about twice as long to query as a 5 MB genome. That's why it's slow.

(Interestingly, the database functionality underlying this scaling is what makes branchwater possible; handling large queries against a large database is much harder!)

jmattock5 commented 5 months ago

Ok, thanks for the explanation. I'll leave it running for longer and see if anything happens.

Would it be possible to run this locally instead? I have sourmash_plugin_branchwater installed, is the database that branchwater uses shareable?

Thanks, Jenny

ctb commented 5 months ago

IIRC, the on-disk index for branchwater is 1-2 TB. The raw data is in the 8-10 TB range (and that's something that we can search using the plugin). So, umm, probably a bit too large for download :).

@luizirber would the branchwater web site work faster if a signature with a higher scaled value were used? e.g. 100,000 rather than 1000? Does that even work? cc @bluegenes

luizirber commented 5 months ago

Couple of notes on this:

In fact, I tried this crime from the last item with a rumen metagenome I had around (98M originally, 285k after downsampling to scaled=100,000), and got 171,599 results back, 285 of them above 20% containment.

So yeah, this definitely works. We can change
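For readers unfamiliar with the scaled parameter: roughly speaking, a FracMinHash sketch keeps only the hashes that fall in the lowest 1/scaled fraction of the hash space, so going from scaled=1000 to scaled=100,000 just means discarding every hash above a smaller cutoff. The following is a minimal sketch of that idea, assuming a 64-bit hash space. The function names are illustrative and this is not sourmash's own code; the exact cutoff rounding in sourmash may differ.

```python
# Illustrative FracMinHash downsampling. Not sourmash's implementation.
MAX_HASH_SPACE = 2**64  # assumption: hashes are 64-bit unsigned integers


def max_hash_for_scaled(scaled: int) -> int:
    """Approximate largest hash value retained at a given scaled value."""
    if scaled < 1:
        raise ValueError("scaled must be >= 1")
    return MAX_HASH_SPACE // scaled


def downsample(hashes, old_scaled: int, new_scaled: int):
    """Keep only the hashes that would also be retained at new_scaled."""
    if new_scaled < old_scaled:
        # Hashes discarded at old_scaled cannot be recovered.
        raise ValueError("cannot downsample to a lower scaled value")
    cutoff = max_hash_for_scaled(new_scaled)
    return sorted(h for h in hashes if h <= cutoff)
```

Because the cutoff shrinks by a factor of 100 between scaled=1000 and scaled=100,000, the downsampled sketch is expected to hold about 1% of the original hashes, which is why the query gets so much smaller.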

luizirber commented 5 months ago

More crimes: I got a question about what these matches are, and going through SRA IDs manually is boring. But the search server only returns SRA IDs and containment, so how can I get the same data the web frontend returns? Like this =]

Prepare a request and send to the web frontend:

$ curl -L -H "Content-Type: application/json" \
    --data-binary @<(echo "{\"signatures\": $(jq '. | tostring' fake.sig)}") \
    https://branchwater.sourmash.bio/ > fake.json

(this is wrapping fake.sig, which is a k=21,s=100000 signature posing as a s=1000 signature, into the format expected by the branchwater-web API)
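The same wrapping can be written in Python, which may be easier to read than the nested shell quoting. This is a rough equivalent of the curl call above, not an official client: the helper names are made up for illustration, and the only details taken from this thread are the URL, the Content-Type header, and the "signatures" key holding the signature serialized as a string.

```python
import json
import urllib.request

# Endpoint taken from the curl command in this thread.
BRANCHWATER_URL = "https://branchwater.sourmash.bio/"


def build_payload(sig_text: str) -> bytes:
    """Wrap signature JSON as a *string* under the "signatures" key."""
    # Parse and re-serialize compactly, similar to `jq '. | tostring'`.
    sig_as_string = json.dumps(json.loads(sig_text), separators=(",", ":"))
    return json.dumps({"signatures": sig_as_string}).encode("utf-8")


def query_branchwater(sig_path: str, out_path: str) -> None:
    """POST the wrapped signature and save the raw response body."""
    with open(sig_path, encoding="utf-8") as fh:
        payload = build_payload(fh.read())
    request = urllib.request.Request(
        BRANCHWATER_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        out.write(response.read())
```

Calling `query_branchwater("fake.sig", "fake.json")` would then play the role of the curl command; `build_payload` can be checked on its own without touching the network.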

Parsing JSON is also boring, so here is a long one-liner that reads fake.json, sorts by containment, keeps only matches with containment >= 0.2, and then counts the organism field to check where they are coming from:

$ jq '. | sort_by(.containment) | map(select(.containment >= 0.2)) | .[].organism' fake.json | sort | uniq -c | sort -nr
    134 "bovine gut metagenome"
     94 "gut metagenome"
     39 "Bos taurus"
     17 "metagenome"
     10 "sheep gut metagenome"
      2 "Cervus nippon"
      2 "bovine metagenome"

So yeah, definitely rumen metagenomes. But not only cow: there are also sheep and sika deer (Cervus nippon).

Finally, without filtering (all matches, even those with only 0.3% containment):

$ jq '. | sort_by(.containment) | .[].organism' fake.json | sort | uniq -c | sort -nr
    502 "bovine gut metagenome"
    404 "gut metagenome"
    148 "metagenome"
     77 "Bos taurus"
     26 "sheep gut metagenome"
     17 "bovine metagenome"
     14 "goat gut metagenome"
      5 "Ovis aries"
      5 "Bos indicus"
      3 "Elaphurus davidianus"
      3 "Dama dama"
      3 "Cervus nippon"
      3 "Cervus elaphus"
      3 "Cervus albirostris"
      3 "Capra hircus"
      2 "Rusa unicolor"
      2 "Axis porcinus"
      1 "soil metagenome"
      1 "lichen metagenome"
      1 "bioreactor sludge metagenome"
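
For anyone who would rather not chain jq, sort, and uniq, both tallies can be approximated with a few lines of Python. This is a sketch that assumes fake.json is a JSON array of objects carrying the `containment` and `organism` fields used in the jq filters above; the function names are illustrative.

```python
import json
from collections import Counter


def load_results(path: str):
    """Read the saved response, assumed to be a JSON array of match records."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def tally_organisms(records, min_containment: float = 0.0) -> Counter:
    """Count the organism field for records at or above min_containment."""
    return Counter(
        rec["organism"]
        for rec in records
        if rec["containment"] >= min_containment
    )
```

`tally_organisms(load_results("fake.json"), 0.2).most_common()` corresponds to the filtered one-liner, and leaving `min_containment` at its default corresponds to the unfiltered one. Ties may come out in a different order than with `sort -nr`.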