merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0

Filter HMM hits based on length #2201

Closed: meren closed this 5 months ago

meren commented 5 months ago

This PR adds a new parameter to anvi-get-sequences-for-hmm-hits, called --ignore-genes-longer-than. Here is the help menu entry for this parameter:

(...)
  --ignore-genes-longer-than MAX_LENGTH
                        In some cases the gene calling step can
                        identify open reading frames that span
                        extremely long stretches of genomes. Such
                        mistakes can lead to downstream issues,
                        including failure to align genes, especially
                        when the concatenate flag is used. This flag
                        allows anvi'o to ignore extremely long gene
                        calls to avoid unintended issues (e.g., during
                        phylogenomic analyses). If you use this flag,
                        please carefully examine the output messages
                        from the program to see which genes are
                        removed from the analysis. Please note that
                        the length parameter considers the nucleotide
                        length of the open reading frame, even if you
                        asked for amino acid sequences to be returned.
                        Setting this parameter to small values, such
                        as less than 10000 nucleotides, may lead to
                        the removal of genuine genes, so please use it
                        carefully.
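
To illustrate the idea behind the flag, here is a minimal Python sketch of a length filter that works on nucleotide coordinates and reports what it removes. It is not the code in this PR: the function name `filter_long_gene_calls`, the `gene_calls` dictionary layout, and the 0-based, half-open coordinates are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: gene id -> (start, stop) in nucleotides,
# assumed 0-based and half-open so that length = stop - start.
GeneCalls = Dict[str, Tuple[int, int]]


def filter_long_gene_calls(
    gene_calls: GeneCalls, max_length: int
) -> Tuple[GeneCalls, List[Tuple[str, int]]]:
    """Split gene calls into those kept and those removed for being too long.

    The length is always measured in nucleotides, regardless of whether
    amino acid sequences are reported later.
    """
    kept: GeneCalls = {}
    removed: List[Tuple[str, int]] = []

    for gene_id, (start, stop) in gene_calls.items():
        length = stop - start
        if length > max_length:
            # Keep track of removed genes so they can be reported to the user.
            removed.append((gene_id, length))
        else:
            kept[gene_id] = (start, stop)

    return kept, removed


# Made-up example data: gene_3 stands in for an erroneously long gene call.
gene_calls = {
    "gene_1": (0, 900),
    "gene_2": (1200, 2700),
    "gene_3": (3000, 48000),
}

kept, removed = filter_long_gene_calls(gene_calls, max_length=10000)

for gene_id, length in removed:
    print(f"Removed {gene_id}: {length} nts is longer than the limit")
```

Returning the removed genes alongside the kept ones mirrors the advice in the help text: whoever uses the filter should be able to see exactly which gene calls were dropped.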

Closes #2200.