First off, thanks for developing this tool! I've been wanting to import IMG annotations into my anvi'o contigs databases but didn't have a method of doing so until now.
However, I ran into an issue when using the IMG source option. A couple of weeks ago I downloaded a test genome from my data, and it looks like IMG may have changed the source column: every row now has "img_core_v400" as its source. This differs from the IMG example in the test directory, where the source column contains values such as "Prodigal v2.6.3" or "GeneMark.hmm-2 v1.05". This caused an error:
Traceback (most recent call last):
  File "gff_parser.py", line 64, in <module>
    source, version = feature.source.split(SEP, 1)
ValueError: not enough values to unpack (expected 2, got 1)
It appears to break because the source and the version are now separated by '_' rather than ' '. As a temporary fix, I changed the code to match my specific use case:
    source = "_".join(feature.source.split("_")[:2])  # results in img_core
    version = feature.source.split("_")[2]  # results in v400
which led to source = "img_core" and version = "v400" for all my rows as expected.
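A slightly more general version of this workaround, as a minimal sketch: it tries to split the source string at the last space or underscore that is followed by something that looks like a version (an optional "v" and then a digit), so both the older "Prodigal v2.6.3" style and the newer "img_core_v400" style are handled. The helper name `split_source_version` and the "unknown" fallback are assumptions for illustration and are not part of gff_parser.py.

```python
import re

# Hypothetical helper (not part of gff_parser.py): separate a GFF source
# string into (source, version). The version is assumed to be the trailing
# token that starts with an optional "v" followed by a digit.
_VERSION_RE = re.compile(r"^(?P<source>.+?)[ _](?P<version>v?\d[\w.\-]*)$")


def split_source_version(raw):
    """Return (source, version); version is "unknown" if none is found."""
    text = raw.strip()
    match = _VERSION_RE.match(text)
    if match:
        return match.group("source"), match.group("version")
    # Assumed fallback: keep the whole string as the source.
    return text, "unknown"


# The three source strings mentioned in this thread.
for example in ("Prodigal v2.6.3", "GeneMark.hmm-2 v1.05", "img_core_v400"):
    print(example, "->", split_source_version(example))
```

Falling back to "unknown" instead of raising means a single unexpected source string does not stop the whole run, though whether that is the right behaviour for the parser is a judgment call.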
After running this updated code, the script finished but printed the following message:
Done. All 1699 have been processed succesfully. There were 0 coding sequences, 0 RNAs, and 0 unknown features.
I am not sure where this second issue lies, since other analyses show that there should be plenty of complete coding sequences. I am also not sure whether my fix had side effects that disrupted something downstream and led to this result.
Any suggestions on what I might be doing wrong would be greatly appreciated!
Thanks!
Oscar
Hi @kolaban, I stumbled upon your post because I was having the same issue. Have you tried using the --process-all flag? Otherwise, the parser skips all non-Prodigal sources.
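To check which sources and feature types a GFF file actually contains before deciding whether a source filter is the cause, something like the following sketch may help. It relies only on the standard GFF layout (nine tab-separated columns, with the source in column 2 and the feature type in column 3). The function name `tally_gff_columns` and the sample rows are made up for illustration.

```python
from collections import Counter


def tally_gff_columns(lines):
    """Count the values in the source (col 2) and type (col 3) columns."""
    sources, types = Counter(), Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment/directive lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # not a complete feature row
        sources[fields[1]] += 1
        types[fields[2]] += 1
    return sources, types


# Made-up rows in the style described in this thread.
sample = [
    "##gff-version 3",
    "contig_1\timg_core_v400\tCDS\t1\t300\t.\t+\t0\tID=gene_1",
    "contig_1\timg_core_v400\tCDS\t400\t900\t.\t-\t0\tID=gene_2",
    "contig_1\timg_core_v400\ttRNA\t950\t1020\t.\t+\t.\tID=gene_3",
]
sources, types = tally_gff_columns(sample)
print(dict(sources))
print(dict(types))
```

For a real file, the open file handle can be passed in place of `sample`. If every row reports a single non-Prodigal source, that would be consistent with all rows being skipped by a source filter.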