sourmash-bio / sourmash_plugin_containment_search

An improved `search --containment` for sourmash
BSD 3-Clause "New" or "Revised" License

sourmash_plugin_containment_search: improved containment search for genomes in metagenomes

This plugin provides two commands, sourmash scripts mgsearch and sourmash scripts mgmanysearch, that produce new and nicer output when searching for genomes in metagenomes. It is a plugin for the sourmash software.

Background

Reporting the presence and estimating the abundance of queries in data sets is a core requirement of many bioinformatics analyses - metagenomics in particular.

This plugin provides two commands that use k-mers to estimate the presence and abundance of queries in data sets. Use cases include:

The plugin uses FracMinHash-based estimation to calculate k-mer detection and estimate coverage based on k-mer multiplicity. These numbers correspond closely to mapping-based detection and coverage.
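As a rough illustration of the idea (not the plugin's actual code), the sketch below builds a FracMinHash-style sketch in plain Python and derives three numbers analogous to the p_genome, avg_abund, and p_metag columns. The hash function, the `SCALED` value, the function names, and the exact column formulas are all assumptions made for this example; sourmash itself uses MurmurHash over canonical k-mers, which is omitted here for brevity.

```python
import hashlib

K = 31                      # k-mer size (assumed for illustration)
SCALED = 1000               # keep roughly 1/SCALED of all hashes (assumed value)
MAX_HASH = 2**64 // SCALED  # hashes below this threshold are retained


def kmer_hash(kmer: str) -> int:
    """Map a k-mer to a 64-bit integer (stand-in for sourmash's MurmurHash)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def frac_minhash(seq: str, k: int = K, max_hash: int = MAX_HASH) -> dict:
    """Return {hash: abundance} for every k-mer whose hash is below max_hash."""
    counts = {}
    for i in range(len(seq) - k + 1):
        h = kmer_hash(seq[i:i + k])
        if h < max_hash:
            counts[h] = counts.get(h, 0) + 1
    return counts


def summarize(query: dict, metag: dict) -> tuple:
    """Estimate detection and coverage of a query sketch in a metagenome sketch.

    Returns (p_genome, avg_abund, p_metag), under this example's assumptions:
      p_genome  - fraction of query hashes found in the metagenome (detection)
      avg_abund - mean metagenome abundance of the shared hashes (coverage)
      p_metag   - abundance-weighted fraction of the metagenome that is shared
    """
    shared = query.keys() & metag.keys()
    if not query or not metag or not shared:
        return 0.0, 0.0, 0.0
    shared_abund = sum(metag[h] for h in shared)
    p_genome = len(shared) / len(query)
    avg_abund = shared_abund / len(shared)
    p_metag = shared_abund / sum(metag.values())
    return p_genome, avg_abund, p_metag
```

Because only hashes below a fixed threshold are kept, the same k-mer is retained (or dropped) in every sketch, which is what makes the overlap between a genome sketch and a metagenome sketch an estimate of the overlap between the full k-mer sets.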

The plugin outputs Average Nucleotide Identity estimates using the approach described in Rahman Hera et al., 2023 and implemented in sourmash.
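For intuition, the point estimate behind this approach can be sketched in a few lines: if each base matches independently with probability ANI, a k-mer survives intact with probability ANI^k, so ANI is roughly the k-th root of the containment. The function name below is hypothetical, and this minimal sketch leaves out the confidence intervals and sketch-size corrections that a full implementation would handle.

```python
def containment_to_ani(containment: float, ksize: int) -> float:
    """Point estimate of ANI from k-mer containment: ANI ~ containment ** (1/k).

    Illustrative only: no confidence interval, no correction for small sketches.
    """
    if not 0.0 <= containment <= 1.0:
        raise ValueError("containment must be between 0 and 1")
    if ksize < 1:
        raise ValueError("ksize must be a positive integer")
    return containment ** (1.0 / ksize)
```

Note how steep the relationship is: with k=31, a containment of about 20% still corresponds to an ANI of roughly 95%.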

Installation

To install this plugin, run:

pip install sourmash_plugin_containment_search

(This will install sourmash if you do not already have it installed.)

Usage

This plugin enables two commands, mgsearch and mgmanysearch.

mgsearch - search for a single query in many data sets

This command:

sourmash scripts mgsearch query.sig metagenome.sig [ metagenome2.sig ...] \
    [ -o output.csv ]

will search for the query genome query.sig in one or more metagenome.sig files, producing decent human-readable output and (optionally) useful CSV outputs.

For example,

sourmash scripts mgsearch ../sourmash/podar-ref/0.fa.sig ../sourmash/SRR606249.trim.k31.sig.gz

produces:

Loaded query signature: CP001472.1 Acidobacterium capsulatum ATCC 51196, com...

p_genome avg_abund   p_metag   metagenome name
-------- ---------   -------   ---------------
 100.0%    55.4         3.1%   SRR606249

This plugin will work with all the standard sourmash database types, too.

Note that the metagenomes must have been sketched with -p abund to enable the avg_abund and p_metag columns.

mgmanysearch - search for many queries in many data sets

This command:

sourmash scripts mgmanysearch --queries query1.sig [ query2.sig ... ] \
    --against metagenome.sig [ metagenome2.sig ...] \
    [ -o output.csv ]

will search for the queries query*.sig in one or more metagenome*.sig files, producing decent human-readable output and (optionally) useful CSV outputs.

Backstory: Why this command?

sourmash search supports sample x sample search broadly, perhaps too broadly, and its output formats aren't that helpful.

sourmash prefetch supports metagenome overlap search against many genomes, which is the reverse of this use case. Moreover, prefetch doesn't provide weighted results and its output isn't friendly.

sourmash gather has friendly and useful output, but can't be used to calculate the overlap between a single query genome and many subject metagenomes.

There is also some interest in reverse containment search.

The manysearch command of the sourmash branchwater plugin also does a nice containment search like this plugin, but it doesn't provide nice human-readable output and it also doesn't provide weighted results. (manysearch is, however, much lower memory & probably a fair bit faster because it's mostly in Rust.)

Advanced info: implementation details

This command is streaming, in the sense that it will load each metagenome, calculate the match, and then discard the metagenome. Hence its peak memory usage should be driven by the size of the query plus the size of the largest metagenome.
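The streaming pattern can be sketched as a generator that holds only one metagenome at a time. This is a simplified illustration, not the plugin's code: sketches are modeled as plain `{hash: abundance}` dicts, and `load_sketch` is a hypothetical callable standing in for whatever actually reads a signature file.

```python
from typing import Callable, Dict, Iterable, Iterator, Tuple

Sketch = Dict[int, int]  # {hash: abundance}; a stand-in for a real signature


def stream_search(
    query: Sketch,
    paths: Iterable[str],
    load_sketch: Callable[[str], Sketch],  # hypothetical loader, supplied by caller
) -> Iterator[Tuple[str, float]]:
    """Yield (path, containment of query in metagenome), one metagenome at a time."""
    for path in paths:
        metag = load_sketch(path)             # load a single metagenome
        shared = query.keys() & metag.keys()  # calculate the match
        containment = len(shared) / len(query) if query else 0.0
        del metag                             # discard before moving on
        yield path, containment
```

Since the query stays resident while each metagenome is loaded and released in turn, the working set at any moment is the query plus one metagenome, which matches the memory behavior described above.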

CSV output

Each row contains the following information.

Comparison details

Sketch information

Query (genome) information:

Match (metagenome) information:

Support

We suggest filing issues in the main sourmash issue tracker as that receives more attention!

Dev docs

containment_search is developed at https://github.com/ctb/sourmash_plugin_containment_search.

Generating a release

Bump version number in pyproject.toml and push.

Make a new release on GitHub.

Then pull, and:

python -m build

followed by twine upload dist/....