transcript / samsa2

SAMSA pipeline, version 2.0. An open-source metatranscriptomics pipeline for analyzing microbiome data, built around DIAMOND and customizable reference databases.
GNU General Public License v3.0

SEED results for a selected organism #58

Closed: elcega closed this issue 3 years ago

elcega commented 3 years ago

Is there a script to obtain the SEED Subsystems results for a specific organism, instead of for all the organisms present in the samples?

transcript commented 3 years ago

Hello, yes, this is possible!

First, you'll want to annotate your entire metatranscriptome against both the RefSeq and the Subsystems databases.

Next, the approach I've used is to run the DIAMOND_specific_organism_retriever.py tool (https://github.com/transcript/samsa2/blob/master/python_scripts/DIAMOND_specific_organism_retriever.py) on the RefSeq results, giving the name of the specific organism with the -SO flag. This should output a results file that contains only the hits to that organism.
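To illustrate what this filtering step does conceptually, here is a minimal Python sketch. It is not the code of DIAMOND_specific_organism_retriever.py; the function name `filter_hits_by_organism`, the example lines, and the assumption that each result line is tab-separated with the read ID first and the organism name somewhere in the subject fields are all hypothetical.

```python
def filter_hits_by_organism(lines, organism):
    """Keep only result lines whose subject fields mention the organism.

    Assumption (hypothetical layout): each line is tab-separated, the
    first field is the read ID, and the organism name appears somewhere
    in the remaining fields.
    """
    wanted = organism.lower()
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        subject_text = "\t".join(fields[1:]).lower()
        if wanted in subject_text:
            kept.append(line)
    return kept


# Made-up example lines, for illustration only.
refseq_hits = [
    "read_1\tWP_000001.1 DNA gyrase [Escherichia coli]\t98.0",
    "read_2\tWP_000002.1 ribosomal protein [Bacteroides fragilis]\t95.5",
    "read_3\tWP_000003.1 ATP synthase [Escherichia coli]\t99.1",
]
ecoli_hits = filter_hits_by_organism(refseq_hits, "Escherichia coli")
```

The match is a case-insensitive substring test, so a genus name alone (for example "Escherichia") would also select every species in that genus.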

Third, you would run db_results_swapper.py (https://github.com/transcript/samsa2/blob/master/python_scripts/db_results_swapper.py), providing the filtered-to-your-organism RefSeq file as the input with the -I flag, and the full Subsystems annotation results as the annotation file with the -A flag.

This db_results_swapper.py script should make a dictionary of all the hits in your organism-specific RefSeq results, and then write to the outfile each Subsystems annotation that belongs to one of those same original reads. You can then run the Subsystems analysis counter on this output to see the breakdown of different functions for that organism.
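The read-matching idea behind this swap can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual db_results_swapper.py: the function name `swap_annotations` and the example lines are hypothetical, both files are assumed to be tab-separated with the read ID in the first column, and a set is used for the lookup where the script is described as using a dictionary.

```python
def read_id(line):
    """Return the first tab-separated field, assumed to be the read ID."""
    return line.split("\t", 1)[0].strip()


def swap_annotations(organism_hits, subsystems_hits):
    """Return the Subsystems lines whose read ID also appears in the
    organism-filtered RefSeq lines, preserving Subsystems order."""
    organism_reads = {read_id(line) for line in organism_hits if line.strip()}
    return [
        line
        for line in subsystems_hits
        if line.strip() and read_id(line) in organism_reads
    ]


# Made-up example lines, for illustration only.
organism_refseq = [
    "read_1\tWP_000001.1 DNA gyrase [Escherichia coli]\t98.0",
    "read_3\tWP_000003.1 ATP synthase [Escherichia coli]\t99.1",
]
subsystems_all = [
    "read_1\tDNA topoisomerase subsystem entry\t97.0",
    "read_2\tRibosome subsystem entry\t94.0",
    "read_3\tATP synthase subsystem entry\t99.0",
]
organism_subsystems = swap_annotations(organism_refseq, subsystems_all)
```

Reads that hit the organism in RefSeq but have no Subsystems annotation simply drop out of the result, so the output can be shorter than the organism-filtered input.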

Best, Sam