hoelzer-lab / hypro

Extend hypothetical prokka protein annotations using additional homology searches against larger databases
GNU General Public License v3.0

mmseqs2 filtering #18

Closed hoelzer closed 4 years ago

hoelzer commented 4 years ago

I think we should at least do some basic filtering of the mmseqs2 results. I think the output is blast-like?

I suggest that we filter for

Once this is implemented, maybe run a quick test with ident 80% and aln-length 60% / 80%, and compare the number of assigned functional annotations.

What we want to achieve with this simple filter is mainly to avoid annotations that are based only on a partial hit of a few nucleotides (e.g. we have a hypo ORF of 100 nt and only find a hit of length 10 nt).
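To make the 100 nt / 10 nt case concrete, here is a minimal sketch of an alignment-length check. The function names and the choice to measure the aligned fraction relative to the query length are assumptions for illustration, not something already in hypro.

```python
def aligned_fraction(aln_len, query_len):
    """Fraction of the query covered by the alignment (hypothetical helper)."""
    if query_len <= 0:
        raise ValueError("query_len must be positive")
    return aln_len / query_len


def passes_length_filter(aln_len, query_len, min_fraction=0.6):
    """True if the hit covers at least min_fraction of the query.

    min_fraction=0.6 mirrors the 60% value suggested for testing;
    0.8 would be the stricter alternative.
    """
    return aligned_fraction(aln_len, query_len) >= min_fraction


# The example from above: a 100 nt ORF with a hit of length 10 nt.
print(passes_length_filter(10, 100))        # covers only 10% of the ORF
print(passes_length_filter(80, 100, 0.8))   # covers exactly 80% of the ORF
```

With either threshold, the 10 nt hit would be dropped, which is the kind of annotation the filter is meant to avoid.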

marlt commented 4 years ago

Yes, definitely. mmseqs2 comes with e-value and min-aln-len parameters anyway. The user should be able to hand over custom values for these and, if not, is informed about the defaults in the help message (those are evalue = 0.001 and aln-len = 0). Percent identity can be filtered afterwards from the table.
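Filtering percent identity from the table afterwards could look roughly like the sketch below. It assumes a tab-separated, BLAST-like table with the identity in the third column, given as a percentage; the column position, the scale, and the function name `filter_by_identity` are assumptions and should be checked against the actual mmseqs2 output format in use.

```python
import csv
import io


def filter_by_identity(table_text, min_ident=80.0, ident_col=2):
    """Keep rows whose identity value is at least min_ident.

    Assumes tab-separated rows and identity given in percent
    in column ident_col (0-based); both are assumptions.
    """
    reader = csv.reader(io.StringIO(table_text), delimiter="\t")
    return [row for row in reader if row and float(row[ident_col]) >= min_ident]


# Made-up rows for illustration: query, target, identity (%), alignment length.
example = (
    "orf1\tprotA\t95.0\t120\n"
    "orf2\tprotB\t42.5\t110\n"
)

kept = filter_by_identity(example)
print([row[0] for row in kept])
```

Keeping the threshold as a parameter would make it easy to compare the number of surviving annotations at different identity cutoffs.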

hoelzer commented 4 years ago

Ah nice, yeah, evalue and aln-len are already good params, and it's good to have them directly accessible in mmseqs2.