merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0

Config Error: The entry name appears twice need --just-do-it #1206

Closed jvhagey closed 5 years ago

jvhagey commented 5 years ago

Hi anvi'o team, I am running anvi-run-hmms on a custom set of HMMs from the FOAM database (https://academic.oup.com/nar/article/42/19/e145/2902479). Currently, anvi'o (v5.5, installed via conda in its own environment) gives a config error that correctly points out that some entry names appear more than once in the genes.txt file.

Config Error: The entry name KO:K00362_1.7.1.4 appears twice in the TAB-delimited file '/share
              /tearlab/Maga/Jill/CDRF_MetaGenome/Assembly_2018/Anvio/FOAM_2018/Foam_Nitro/gene
              s.txt'. We don't think that you did that purposefully (if you think this should
              be Ok, then feel free to contact us).

The hmm file from the FOAM data has duplicate names like this:

NAME  KO:K00362_1.7.1.4
ACC   HMMsoil94843
--
NAME  KO:K00362_1.7.1.4
ACC   HMMsoil95976
--
NAME  KO:K00362_1.7.1.4
ACC   HMMsoil96110
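
For anyone who wants to check their own file first, here is a minimal sketch that lists the NAME values occurring more than once. It assumes the usual HMMER3 text layout shown above, where each profile has a header line starting with NAME; the function name duplicate_hmm_names is hypothetical and not part of anvi'o or HMMER.

```python
from collections import Counter


def duplicate_hmm_names(hmm_text):
    """Return {name: count} for every NAME value that occurs more than once.

    Assumes HMMER3-style header lines of the form 'NAME  <value>'.
    """
    names = []
    for line in hmm_text.splitlines():
        parts = line.split(None, 1)
        # Only header lines whose first field is exactly NAME are counted.
        if len(parts) == 2 and parts[0] == "NAME":
            names.append(parts[1].strip())
    return {name: n for name, n in Counter(names).items() if n > 1}
```

Reading the HMM file into a string and passing it to this function should return an empty dict when every name is unique.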

I had run this same custom HMM set with anvi'o 5.3 without this error, so it seems like a new guard rail. Since I did mean to do this, could we add a --just-do-it argument, or is that going to screw things up horribly?

Anvi'o version ...............................: margaret (v5.5)
Profile DB version ...........................: 31
Contigs DB version ...........................: 12
Pan DB version ...............................: 13
Genome data storage version ..................: 6
Auxiliary data storage version ...............: 2
Structure DB version .........................: 1

meren commented 5 years ago

Hi @jvhagey,

Apologies for the late reply. I just played with this a bit to turn that error into a warning for users who know what they're doing, but I realized that doing so causes another problem. This is probably associated with the new release of HMMER (for reference, I'm using v3.2).

When I have multiple HMM entries with the same NAME but different ACC properties, hmmpress gives me the following error:

Working...    SSI index construction failed:
  primary keys not unique: 'GENE_NAME' occurs more than once

Can you run hmmpress on your HMMs file and tell me whether you see the same error? If you don't, can you please share the HMMER version you're using?

Thanks,

jvhagey commented 5 years ago

I used HMMER v3.1b2 and didn't get that error when I ran hmmpress, but it's good to know for the future. Since the newest HMMER version rejects duplicate names anyway, there probably isn't a big reason to change this in anvi'o. To get around it for my purposes, I just wrote a short script that rewrites the HMM file with numbered names for duplicate entries.
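
The script itself was not posted, so the following is only a sketch of that kind of workaround, not the original. It assumes HMMER3-style 'NAME  <value>' header lines; the function name number_duplicate_names and the "-" separator are hypothetical choices. The names in genes.txt would presumably need the same renaming so that the two files still match.

```python
from collections import Counter


def _name_of(line):
    """Return the NAME value of an HMMER3 header line, or None for other lines."""
    parts = line.split(None, 1)
    if len(parts) == 2 and parts[0] == "NAME":
        return parts[1].strip()
    return None


def number_duplicate_names(lines, sep="-"):
    """Append a running number to every NAME that occurs more than once.

    Unique names and all non-NAME lines are returned unchanged.
    Note: this does not check whether a numbered name collides with
    a name that already exists in the file.
    """
    lines = list(lines)
    # First pass: count how often each name occurs.
    totals = Counter(n for n in map(_name_of, lines) if n is not None)
    seen = Counter()
    out = []
    # Second pass: rewrite only the duplicated NAME lines.
    for line in lines:
        name = _name_of(line)
        if name is not None and totals[name] > 1:
            seen[name] += 1
            end = "\n" if line.endswith("\n") else ""
            line = f"NAME  {name}{sep}{seen[name]}{end}"
        out.append(line)
    return out
```

A typical use would be to read the original file with readlines(), pass the list through number_duplicate_names, and write the result to a new file with writelines() before running hmmpress on it.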

meren commented 5 years ago

Thank you, Jill. I think your solution is the most reasonable approach given the rare need for this. Alternatively, anvi'o could give a warning instead of an error and fail gracefully if hmmpress returns an error, depending on the user's HMMER version, but I think that would be overkill :)

I am closing this issue for now and will update the current error message so it is clear to people why we are not allowing that.

Thanks for your patience.

Best wishes,