merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0
415 stars 142 forks

Fix `anvi-script-reformat-fasta` amino acid check #2175

Closed Ge0rges closed 8 months ago

Ge0rges commented 8 months ago

As discussed on Discord, the script was not using the correct AA alphabet. That is fixed here, along with another bug in which the presence of flags caused the script to skip over some reads. Before this is merged, we should also consider whether adding ambiguous characters (e.g. B, Z, J) to the AA alphabet is warranted.

meren commented 8 months ago

You're the best, @Ge0rges! Thank you very much for catching the bug and fixing it :)

Ge0rges commented 8 months ago

@meren what do you think about the ambiguous character question?

meren commented 8 months ago

Ah, sorry, I completely misread that sentence. I am not sure what to suggest for that. Under what circumstances does it become a necessity? When we convert single-letter alphabets to verbose names?

Ge0rges commented 8 months ago

The case I ran into is that I have protein sequences containing these characters (so they are valid, but not recognized as such). When I give them to `anvi-run-ncbi-cogs`, it also treats them as invalid, due (I think) to the use of `utils.utils.is_gene_sequence_clean` (and/or another part of the script). I think these characters should be deemed valid input for anvio programs.
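
To make the question concrete, here is a minimal sketch of an alphabet check that can optionally accept the ambiguous codes. It is an illustration only, not the code in `is_gene_sequence_clean` or in this PR; the names `STANDARD_AA`, `AMBIGUOUS_AA`, and `invalid_residues` are hypothetical and do not exist in anvio.

```python
# Hypothetical sketch; these names are NOT part of anvio.

# The 20 standard amino acids, single-letter codes.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")

# IUPAC ambiguity codes mentioned in this thread:
# B = Asp or Asn, Z = Glu or Gln, J = Leu or Ile.
AMBIGUOUS_AA = set("BZJ")


def invalid_residues(sequence, allow_ambiguous=True):
    """Return a sorted list of characters that fall outside the chosen alphabet."""
    alphabet = STANDARD_AA | AMBIGUOUS_AA if allow_ambiguous else STANDARD_AA
    return sorted(set(sequence.upper()) - alphabet)


if __name__ == "__main__":
    # Accepted when ambiguous codes are allowed.
    print(invalid_residues("MKTBZJ"))                         # []
    # Flagged under the strict 20-letter alphabet.
    print(invalid_residues("MKTBZJ", allow_ambiguous=False))  # ['B', 'J', 'Z']
```

Keeping the ambiguous set separate from the standard one would let each program decide whether to accept these codes, rather than changing the alphabet everywhere at once.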