raw-lab / MetaCerberus

Python code for versatile Functional Ontology Assignments for Metagenomes searching via Hidden Markov Model (HMM) with environmental focus of shotgun metaomics data
BSD 3-Clause "New" or "Revised" License

contigs are being subdivided #19

Closed TJrogers86 closed 1 month ago

TJrogers86 commented 2 months ago

Hello, thanks for the great program! I only have one minor complaint and one possible suggestion. The minor complaint: I noticed that MetaCerberus looks for N repeats and removes them before it annotates. The issue is that I would like to use the .gff output to make a gene map of the viral contigs that I used as input to MetaCerberus. By removing the N repeats, my viral contigs are being fragmented into smaller contigs and given a number at the end of the name. When using the gff file to make gene maps with the gggenes R package, this causes the fragments to be plotted individually. For example, let's say I have a viral contig named vContig_000000000014||full. After the N repeats are removed, I am left with 3 individual contigs of varying lengths: vContig_000000000014||full_1 (4 kb long), vContig_000000000014||full_2 (30+ kb long), and vContig_000000000014||full_3 (12 kb long). When I go to plot these with gggenes, each is plotted on its own (see fig below for example). What I would like to be able to do is have just one gene map of the full contig, so that the original bp start and end points are preserved for all genes. Not sure if there is a possible fix for this or not.
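For GFF output that has already been split this way, the fragment coordinates can in principle be shifted back onto the original contig after the fact. The following is only a minimal sketch, not MetaCerberus code. It assumes that fragments are cut at every run of N in the input sequence, that they are numbered _1, _2, ... in order of appearance, and that the fragment name appears in column 1 of the GFF. The function names fragment_offsets and remap_gff_line, and the min_n_run parameter, are hypothetical; the real splitting threshold may differ.

```python
import re


def fragment_offsets(sequence, min_n_run=1):
    """Return the 0-based start of each non-N fragment, in order.

    Assumption: the sequence is cut at every run of at least
    `min_n_run` N characters, and empty fragments are dropped.
    """
    offsets = []
    pos = 0
    for gap in re.finditer(r"[Nn]{%d,}" % min_n_run, sequence):
        if gap.start() > pos:      # non-empty fragment before this N run
            offsets.append(pos)
        pos = gap.end()            # next fragment starts after the N run
    if pos < len(sequence):        # trailing fragment after the last N run
        offsets.append(pos)
    return offsets


def remap_gff_line(line, offsets_by_contig):
    """Shift one GFF feature line from fragment to contig coordinates.

    `offsets_by_contig` maps an original contig name to the list
    returned by fragment_offsets(). Lines that do not look like a
    numbered fragment of a known contig are returned unchanged.
    """
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return line
    fields = line.split("\t")
    if len(fields) < 5:
        return line
    parent, sep, index = fields[0].rpartition("_")
    if not sep or not index.isdigit() or parent not in offsets_by_contig:
        return line
    offsets = offsets_by_contig[parent]
    frag = int(index) - 1          # fragment suffixes are assumed 1-based
    if frag < 0 or frag >= len(offsets):
        return line
    fields[0] = parent             # restore the original contig name
    fields[3] = str(int(fields[3]) + offsets[frag])  # start
    fields[4] = str(int(fields[4]) + offsets[frag])  # end
    return "\t".join(fields)


# Toy example: three fragments starting at positions 0, 8 and 16.
seq = "ACGT" + "N" * 4 + "ACGTAC" + "NN" + "GG"
offsets = {"vContig_000000000014||full": fragment_offsets(seq)}
example = "vContig_000000000014||full_2\texample\tCDS\t1\t6\t.\t+\t0\tID=gene_1"
remapped = remap_gff_line(example, offsets)
print(remapped)
```

With the remapped lines written back to a file, all genes of one contig share a single sequence name in column 1, so a plotting tool that groups by that column should draw them on one map. It is worth checking the fragment boundaries against the real output first, since the offsets are only right if the assumed splitting rule matches what the program actually did.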

As for the suggestion: Would it be possible to have MetaCerberus create a data frame output that lists all the genes for each contig, with a column that says whether each gene is a viral Auxiliary Metabolic Gene when the original inputs were viral in origin? Just a thought.

[image]

decrevi commented 1 month ago

Hello, thank you for your suggestions!

I have made splitting the sequences on N repeats optional and added a flag, --remove-n-repeats, for anyone who wants to remove the N repeats and split the genomes on them. This is available from v1.3.1.

I am looking into your suggestion and have made a note for a future update, thanks for the feedback!

FYI, we are also pushing out v1.4.0, which replaces Ray with a custom-made multiprocessing and distributed processing library that loads faster and works better with MetaCerberus.

Thank you! -Jose

TJrogers86 commented 1 month ago

Thanks!