transcript / samsa2

SAMSA pipeline, version 2.0. An open-source metatranscriptomics pipeline for analyzing microbiome data, built around DIAMOND and customizable reference databases.
GNU General Public License v3.0

Including locus tags along with gene names #34

Open microsud opened 5 years ago

microsud commented 5 years ago

Hi, in DIAMOND_analysis_counter.py the gene products are extracted and a *function.tsv file is written. However, because naming is highly inconsistent, gene names are sometimes truncated or missed, and all hypothetical proteins are lumped together as a single entry. Would it be possible to add an option that reports counts for each locus tag? This would likely show which locus tags and co-localized genes are actively used in a given genome, and it would make linking the output to custom databases by locus tag more flexible. The output TSV could, for instance, contain the following fields:

|-------------------|----------|----------|------------------|
| RelativeAbundance | RawCount | LocusTag | GeneName/Product |
|-------------------|----------|----------|------------------|
| 42.1377616129     | 2877037  | XX_0201  | Dehydrogenase    |
|-------------------|----------|----------|------------------|

The locus tag, for instance, can help in pathway enrichment analysis by linking to KEGG orthologs.
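
To make the request concrete, here is a minimal sketch of per-locus-tag counting. It is not taken from DIAMOND_analysis_counter.py. It assumes the DIAMOND results are in the default tabular format (read ID in column 1, reference sequence ID in column 2), that the first line seen for each read is the hit to count, and that a lookup from reference ID to locus tag and product is available. The function name `count_by_locus_tag` and the `annotations` dictionary are hypothetical, and relative abundance is taken here to mean percent of all counted reads.

```python
from collections import Counter


def count_by_locus_tag(diamond_lines, annotations):
    """Count reads per locus tag from DIAMOND tabular output lines.

    diamond_lines: iterable of tab-separated lines
                   (column 1 = read ID, column 2 = reference sequence ID).
    annotations:   hypothetical dict mapping reference ID -> (locus_tag, product).
    Returns rows of (relative_abundance_percent, raw_count, locus_tag, product),
    sorted by raw count, highest first.
    """
    counts = Counter()
    seen_reads = set()
    for line in diamond_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        read_id, subject_id = fields[0], fields[1]
        if read_id in seen_reads:
            continue  # assumption: only the first hit per read is counted
        seen_reads.add(read_id)
        # Fall back to the raw reference ID when no annotation is known.
        locus_tag, product = annotations.get(subject_id, (subject_id, "unknown"))
        counts[(locus_tag, product)] += 1

    total = sum(counts.values())
    rows = []
    for (locus_tag, product), raw in counts.most_common():
        rows.append((100.0 * raw / total, raw, locus_tag, product))
    return rows


if __name__ == "__main__":
    # Made-up example data, for illustration only.
    example_hits = [
        "read1\tref_A\t98.0",
        "read1\tref_B\t90.0",  # second hit for read1, ignored
        "read2\tref_A\t97.5",
        "read3\tref_A\t99.1",
        "read4\tref_B\t95.0",
    ]
    example_annotations = {
        "ref_A": ("XX_0201", "Dehydrogenase"),
        "ref_B": ("XX_0202", "hypothetical protein"),
    }
    for row in count_by_locus_tag(example_hits, example_annotations):
        print("\t".join(str(value) for value in row))
```

Keying the counter on the locus tag rather than on the product name keeps each hypothetical protein as a separate row, which is the main point of the request.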

Best wishes,
Sudarshan

Disclaimer: I am not a bioinformatician, so pardon me if this is a trivial request.