running branchwater on large assemblies

jmattock5 commented 5 months ago

Hello, thanks for developing such a great tool. I've been trying to run branchwater on some whole-metagenome assemblies that are quite large (0.3G-1G). When I upload even the smaller ones and submit them, I don't get any output. I've tried leaving a couple running with the tab open for ~12 hours, to no avail. If I leave them for long enough, will they eventually complete? Thanks! Jenny

ctb commented 5 months ago

hi @jmattock5, I don't offhand know what the timeout is for when the branchwater backend will give up, but I can give you some insight into why it is taking so long:

You're running into the fact that the runtime for branchwater-web scales with the size of the query: a 10 MB genome will take about twice as long to query as a 5 MB genome. Your assemblies are far larger than that, which is why they are so slow.

(Interestingly, the database functionality underlying this scaling is what makes branchwater possible; handling large queries against a large database is much harder!)

jmattock5 commented 5 months ago

Ok, thanks for the explanation. I'll leave it running for longer and see if anything happens.

Would it be possible to run this locally instead? I have sourmash_plugin_branchwater installed, is the database that branchwater uses shareable?

Thanks, Jenny

ctb commented 5 months ago

IIRC, the on-disk index for branchwater is 1-2 TB. The raw data is in the 8-10 TB range (and that's something that we can search using the plugin). So, umm, probably a bit too large for download :).

@luizirber would the branchwater web site work faster if a signature with a higher scaled value were used? e.g. 100,000 rather than 1000? Does that even work? cc @bluegenes

luizirber commented 5 months ago

Couple of notes on this:

- There is a 5 MB limit on the size of the signature (here) when it is received by the search server. It was mostly set up to avoid abuse, but the web frontend gives no good feedback when a search fails because of it.
- The web frontend will send a larger signature even if it is guaranteed to fail. Maybe add a check here to see if the signature is > 5 MB? Or maybe before this code block.
- We can test the idea of running with a higher scaled value (100,000) with a very, very ugly hack: create a sig with s=100000, edit the sig file to use the max_hash value for s=1000, and the search should work (the only check is that k and scaled match the index, and all the data for s=100000 is contained in the s=1000 index).
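For anyone wanting to reproduce the hack from the last item, here is a minimal Python sketch. It assumes the usual sourmash JSON signature layout (a top-level list of records, each with a "signatures" list whose sketches carry a "max_hash" field), and it assumes max_hash is derived as round((2^64 - 1) / scaled). Both are assumptions worth checking against your sourmash version. `relabel_scaled` is a hypothetical helper name, not part of sourmash.

```python
import json

MAX_HASH_64 = 2**64 - 1  # largest value of a 64-bit hash


def max_hash_for_scaled(scaled: int) -> int:
    """Assumed scaled -> max_hash conversion: round(MAX_HASH_64 / scaled)."""
    return min(int(round(MAX_HASH_64 / scaled, 0)), MAX_HASH_64)


def relabel_scaled(sig_json: str, claimed_scaled: int) -> str:
    """Rewrite max_hash so the sketch claims `claimed_scaled`.

    The hashes ("mins") are left untouched; only the label changes.
    """
    records = json.loads(sig_json)
    for record in records:
        for sketch in record["signatures"]:
            # max_hash == 0 is assumed to mean a num-based (non-scaled) sketch
            if sketch.get("max_hash", 0) == 0:
                continue
            sketch["max_hash"] = max_hash_for_scaled(claimed_scaled)
    return json.dumps(records, separators=(",", ":"))
```

Usage would look like `open("fake.sig", "w").write(relabel_scaled(open("real.sig").read(), 1000))`, where `real.sig` is a hypothetical file sketched at s=100000. The result is deliberately mislabeled, so it is only useful for this kind of experiment.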

In fact, I tried the crime from the last item with a rumen metagenome I had around (98M originally, 285k after downsampling to scaled=100,000), and got 171,599 results back, 285 of those above 20% containment.

So yeah, this definitely works. We can change:

- the frontend, to allow selecting a scaled value when doing the sketching, and
- the backend, to either accept higher scaled values than the one the index was built with (because, again, we have the right data for doing these searches) or do downsampling on the fly. I prefer the first one =]
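To illustrate why the index already has "the right data", here is a rough Python sketch of what downsampling a scaled sketch amounts to: keeping only the hashes at or below the cutoff for the higher scaled value. The cutoff formula and the `<=` comparison are assumptions about how sourmash defines scaled sketches, and `downsample_hashes` is a hypothetical name. Where the backend would apply this is not decided in this thread.

```python
MAX_HASH_64 = 2**64 - 1  # largest value of a 64-bit hash


def max_hash_for_scaled(scaled: int) -> int:
    """Assumed scaled -> max_hash conversion: round(MAX_HASH_64 / scaled)."""
    return min(int(round(MAX_HASH_64 / scaled, 0)), MAX_HASH_64)


def downsample_hashes(hashes, from_scaled: int, to_scaled: int):
    """Keep only the hashes that a sketch at `to_scaled` would retain.

    Downsampling can only discard hashes, so `to_scaled` must not be
    lower than `from_scaled`.
    """
    if to_scaled < from_scaled:
        raise ValueError("cannot downsample to a lower scaled value")
    cutoff = max_hash_for_scaled(to_scaled)
    return [h for h in hashes if h <= cutoff]
```

Because every hash kept at s=100000 is also below the s=1000 cutoff, a sketch at s=100000 is a subset of the s=1000 sketch of the same data.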

luizirber commented 5 months ago

More crimes: I got a question about what these matches are, and going through SRA IDs manually is boring. But the search server only returns SRA IDs and containment, so how can I get the same data the web frontend returns? Like this =]

Prepare a request and send to the web frontend:

$ curl -L -H "Content-Type: application/json" \
    --data-binary @<(echo "{\"signatures\": `jq ".| tostring" fake.sig`}") \
    https://branchwater.sourmash.bio/ > fake.json

(this is wrapping fake.sig, which is a k=21,s=100000 signature posing as a s=1000 signature, into the format expected by the branchwater-web API)
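For those who would rather build the request in Python, a rough equivalent of the wrapping step might look like this. It mirrors what the curl/jq command above does: the signature JSON is serialized compactly into a string and placed under the "signatures" key. The function names `build_payload` and `post_query` are hypothetical, and redirect handling in `urllib` may differ from `curl -L`, so treat the request part as untested against the live server.

```python
import json
import urllib.request


def build_payload(sig_text: str) -> bytes:
    """Wrap a signature file's JSON as a string under "signatures"."""
    compact = json.dumps(json.loads(sig_text), separators=(",", ":"))
    return json.dumps({"signatures": compact}).encode("utf-8")


def post_query(sig_text: str, url: str = "https://branchwater.sourmash.bio/"):
    """Send the payload as JSON and return the decoded response."""
    request = urllib.request.Request(
        url,
        data=build_payload(sig_text),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

Python's `json` module keeps 64-bit integer hashes exact, so the hashes are not altered by the round trip.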

Parsing JSON is also boring, so here is a long one-liner to read fake.json, sort by containment, keep only matches with containment >= 0.2, and then count the organism field to check where they are coming from:

$ jq '. | sort_by(.containment) | map(select(.containment >= 0.2)) | .[].organism'  fake.json|sort | uniq -c|sort -nr
    134 "bovine gut metagenome"
     94 "gut metagenome"
     39 "Bos taurus"
     17 "metagenome"
     10 "sheep gut metagenome"
      2 "Cervus nippon"
      2 "bovine metagenome"

So yeah, definitely rumen metagenomes. But not only cow: there are also sheep and sika deer (Cervus nippon).

Finally, without filtering (all matches, even those with only 0.3% containment):

$ jq '. | sort_by(.containment) | .[].organism'  fake.json|sort | uniq -c|sort -nr 
    502 "bovine gut metagenome"
    404 "gut metagenome"
    148 "metagenome"
     77 "Bos taurus"
     26 "sheep gut metagenome"
     17 "bovine metagenome"
     14 "goat gut metagenome"
      5 "Ovis aries"
      5 "Bos indicus"
      3 "Elaphurus davidianus"
      3 "Dama dama"
      3 "Cervus nippon"
      3 "Cervus elaphus"
      3 "Cervus albirostris"
      3 "Capra hircus"
      2 "Rusa unicolor"
      2 "Axis porcinus"
      1 "soil metagenome"
      1 "lichen metagenome"
      1 "bioreactor sludge metagenome"
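The same tallies can be sketched in Python, assuming fake.json holds a list of records with "containment" and "organism" fields (as the jq filters above imply). `tally_organisms` is a hypothetical helper name.

```python
import json
from collections import Counter


def tally_organisms(results, min_containment: float = 0.0):
    """Count organisms among matches at or above a containment threshold.

    Returns (organism, count) pairs, most frequent first.
    """
    counts = Counter(
        record["organism"]
        for record in results
        if record["containment"] >= min_containment
    )
    return counts.most_common()


def tally_file(path: str, min_containment: float = 0.0):
    """Load a results file such as fake.json and tally it."""
    with open(path) as handle:
        return tally_organisms(json.load(handle), min_containment)
```

`tally_file("fake.json", 0.2)` corresponds to the filtered one-liner, and `tally_file("fake.json")` to the unfiltered one.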

sourmash-bio / branchwater

running branchwater on large assemblies #14