sourmash-bio / branchwater

Searching large collections of sequencing data with genome-scale queries
https://branchwater.sourmash.bio

match to a query is missing from branchwater Web site #25

Open ctb opened 1 month ago

ctb commented 1 month ago

The accession BK010471 is for a crAssphage that is ubiquitous in human gut metagenomes (link), and in particular is found in the 454 data set SRR073439.

When I do a containment search, I see:

% sourmash search --containment BK010471.fa.sig SRR073439.sig -k 31

selecting specified query k=31
loaded query: BK010471.fa... (k=31, DNA)
--
loaded 3 total signatures from 1 locations.
after selecting signatures compatible with search, 1 remain.

1 matches above threshold 0.080:
similarity   match
----------   -----
 59.0%       SRR073439

and the Venn diagram is pleasing:

[image: venn2]
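For readers unfamiliar with the number reported above: containment is the fraction of the query's k-mers that also occur in the match, so it is asymmetric, unlike Jaccard similarity. The sketch below is only an illustration under stated assumptions. It uses plain Python sets of k-mers instead of sourmash's MinHash sketches, the sequences are made up, and `kmers` and `containment` are hypothetical helper names, not part of the sourmash API.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(query, subject):
    """Fraction of the query's k-mers that also occur in the subject."""
    if not query:
        return 0.0
    return len(query & subject) / len(query)


# Toy sequences (invented for illustration), with a small k instead of k=31.
query = kmers("ACGTACGGT", k=4)
subject = kmers("TTACGTACGAAC", k=4)

# Containment of the query in the subject, and the reverse direction.
print(f"query in subject: {containment(query, subject):.1%}")
print(f"subject in query: {containment(subject, query):.1%}")
```

The two directions generally differ, which is why a small genome can be well contained in a large metagenome while the reverse containment is close to zero.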

However, the FASTA sequence does not have any matches when searched at https://branchwater.jgi.doe.gov/. Any ideas?

thanks!

SRR073439.k31.sig.zip BK010471.k31.sig.zip BK010471.fa.zip

luizirber commented 1 month ago

This dataset is not in the index, likely due to it having "amplicon" in the metadata (abstract):

... datasets of sequenced bacterial 16S rRNA gene amplicons and total fecal ...

https://trace.ncbi.nlm.nih.gov/Traces/index.html?view=run_browser&acc=SRR073439&display=metadata

which connects with the discussion in https://github.com/sourmash-bio/branchwater/issues/24#issuecomment-2067814713 =]

ctb commented 1 month ago

It's in wort,

/group/ctbrowngrp/irber/data/wort-data/wort-sra/sigs/SRR073439.sig

luizirber commented 1 month ago

> It's in wort,
>
> /group/ctbrowngrp/irber/data/wort-data/wort-sra/sigs/SRR073439.sig

Yes, see footnotes 1 and 2 here:

https://github.com/sourmash-bio/branchwater/issues/24#user-content-fn-1-16d83edf852b4e8c4fb59f87c826ec58
https://github.com/sourmash-bio/branchwater/issues/24#user-content-fn-2-16d83edf852b4e8c4fb59f87c826ec58