sourmash-bio / wort

A database for signatures of public genomic sources
https://wort.sourmash.bio
Other
17 stars 2 forks source link

running a gubaphage query #40

Open rchikhi opened 3 years ago

rchikhi commented 3 years ago

Hi Luiz, Titus,

I was wondering if you could please run a Wort query across all metagenomes for the Serratus people? I seem to have understood from your talk that the best way to ask for it is through Github issues, but let me know if you prefer another channel. The goal here is to search for gubaphages in metagenomes. I'm attaching both DNA and protein signatures: Gubaphage_genomes.sigs.zip that I computed for a collection (multifasta) of gubaphages as follows:

sourmash compute --dna Gubaphage_genomes.fa
mv Gubaphage_genomes.fa.sig Gubaphage_genomes.dna.sig
sourmash compute -k 6,9,15  --protein Gubaphage_genomes.fa
mv Gubaphage_genomes.fa.sig Gubaphage_genomes.protein.sig

thanks in advance! Rayan

bluegenes commented 3 years ago

Hi @rchikhi,

Happy to run some DNA queries first (we don't have protein running quite yet). However, this search requires scaled signatures, and yours seem to be Mash-compatible signatures (probably due to your reported python 3.6 installation issue, https://github.com/dib-lab/sourmash/issues/1561).

Could you try installing sourmash in an isolated environment, and then recalculating sigs, please?

installation: conda create -n sourmash4 -c conda-forge -c bioconda sourmash

activate conda environment: conda activate sourmash4

signature generation:

sourmash sketch dna -p k=21,k=31,k=51,scaled=1000,abund Gubaphage_genomes.fa -o Gubaphage_genomes.dna.sig
rchikhi commented 3 years ago

Great! oh for some reason I didn't get an email notification for this reply. Thanks for the detailed instructions, here are the updated sigs: Gubaphage_genomes.dna.sig.zip

bluegenes commented 3 years ago

Hi @rchikhi! I ran the query at k=21.

see https://github.com/bluegenes/2021-gubaphage-magsearch for code & a notebook where I did some light processing and filtration of the results. Click on the binder if you'd like to run the notebook interactively.

In that notebook, I selected results metagenome results that had at least 30% (output file: gubaphage.sra-search.k21-c30.csv) or at least 10% containment (output file: gubaphage.sra-search.k21-c10.csv) of the gubaphage query. These csvs can be found in the output.gubaphage-magsearch/processed_results folder, which you can see on the associated OSF repo. They can also be regenerated via the binder.

The processed results look like this:

search_genome,metagenome,containment
Gubaphage_genomes.fa,ERR2683220,0.506
Gubaphage_genomes.fa,ERR2683247,0.472
Gubaphage_genomes.fa,ERR2607412,0.439
Gubaphage_genomes.fa,ERR2592250,0.439
Gubaphage_genomes.fa,ERR2683203,0.432
Gubaphage_genomes.fa,ERR2683256,0.426
Gubaphage_genomes.fa,ERR2683152,0.419

...where containment is the fraction of your gubaphage query covered by the metagenome.

What other information would be helpful, or what questions can I answer about this run? I can also search at k=31 and k=51 if you'd like - I'm not sure what (if any) additional metagenomes would be recovered.

cheers,

Tessa

rchikhi commented 3 years ago

Thanks much Tessa! I'll have a look at the results and let you know if more info is needed on our side.

rchikhi commented 3 years ago

Hi Tessa, quick question: the query was a pangenome, i.e. a collection of all known genomes for that clade. I suspect there's also some redundancy, i.e. some genomes inside this pangenome are very similar. Would it be correct to say that a containment score of 0.5 means that.. essentially 50% of the data inside that query pangenome has a hit? (With the truth being somewhere between "50% of the entries inside that FASTA file having a full hit and 50% have no hit", and "100% of the entries match over half their length").