Closed ReverendCasy closed 3 years ago
Hi Yury,
Sorry for the delay in getting back to you. If you are hitting an error at lines 286-288 I think you might have an annotation that is completely duplicated both in ID and location.
There is actually a script included in Panaroo that can help convert annotations from NCBI for use with Panaroo. The script can be found at /scripts/convert_refseq_to_prokka_gff.py. If you run this on your GFF files before running Panaroo, it should hopefully remove the need to modify the code in the main pipeline.
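To check whether an annotation really is duplicated in both ID and location, a quick scan of the GFF file can help before running anything else. The snippet below is a minimal standalone sketch, not part of Panaroo; the function names `find_duplicate_ids` and `is_full_duplicate` are made up for illustration, and it assumes standard nine-column, tab-separated GFF3 lines with an `ID=` attribute.

```python
from collections import defaultdict


def find_duplicate_ids(gff_lines):
    """Return {feature_id: [(seqid, start, end, strand), ...]} for IDs seen more than once.

    Hypothetical helper for illustration; assumes tab-separated GFF3 with ID= attributes.
    """
    seen = defaultdict(list)
    for line in gff_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # the sequence section follows; no more feature lines
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        feature_id = attrs.get("ID")
        if feature_id is not None:
            seen[feature_id].append((cols[0], int(cols[3]), int(cols[4]), cols[6]))
    return {fid: locs for fid, locs in seen.items() if len(locs) > 1}


def is_full_duplicate(locations):
    """True when every occurrence of an ID has identical coordinates and strand."""
    return len(set(locations)) == 1
```

Calling `find_duplicate_ids(open("assembly.gff"))` (the file name is a placeholder) lists the repeated IDs, and `is_full_duplicate` separates exact repeats from IDs reused at different positions. Note that GFF3 allows one ID to span several lines for discontinuous features, so a repeated ID is not necessarily an annotation error.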
Hi Gerry, Thank you for the reply. We used the script as suggested but then ran into another error identical to the one described here: https://github.com/gtonkinhill/panaroo/issues/105. I wonder whether any solution for this one, other than removing the "erroneous" assemblies, has appeared since then. By the way, my initial question is likely itself a duplicate of https://github.com/gtonkinhill/panaroo/issues/73 - sorry for not spotting it sooner.
Best wishes, Yury
Hi Yury,
I'm afraid I was not able to reproduce that error as I did not have the input files. If you are able to send me a smallish example that reproduces the problem I would be happy to take a look. My guess is that it is an edge case in one of the annotations that we don't handle properly yet.
Don't stress about duplication. The documentation needs improving and I am hoping to add an FAQ section soon.
In case it helps, my email is gt4@sanger.ac.uk
Yes, I have a similar issue using GBK files downloaded from NCBI. For example: https://www.ncbi.nlm.nih.gov/assembly/GCF_000011705.1 Thanks
Hello, Working with a set of Salmonella genome annotations downloaded from NCBI Assembly, I stumbled upon the following problem at the GFF parsing step of Panaroo (v1.2.8, downloaded via Conda):
Looking at the troublemaker assembly, I found the following occurrences of the duplicate ID:
In this example, the duplicate IDs refer to the reading frame encoding a signal peptide for similar operons. Several other assemblies in my dataset that cause Panaroo to crash contain duplicates referring to pseudogenes and to multiple coding sequences arising from frameshifts. One workaround I found for this case is to explicitly add the merge_strategy='create_unique' argument to the gff.create_db() calls in the prokka.py and _findmissing.py scripts. With these changes, Panaroo works fine with my dataset; however, when I expand the dataset to ~1500 assemblies, _findmissing.py crashes at the duplicate check (the if statement at lines 286-288 in the current version). This can be remedied by simply commenting that check out, after which the tool completes the task smoothly (please let me know if you need an error log for this one as well). For now, I can either keep the results obtained with the modified code or re-annotate all the assemblies with Prokka, but I do not know whether the same problem would arise again later, so any help is appreciated.
Thanks in advance, Yury
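For anyone who would rather not patch the Panaroo sources, a similar effect could in principle be achieved by renaming repeated IDs in the GFF file itself before running the tool. The sketch below is only an illustration of that idea, loosely modelled on the numeric-suffix renaming that the 'create_unique' merge strategy is documented to apply; `make_ids_unique` is a hypothetical helper, not a Panaroo or gffutils function, and it has not been tested against Panaroo.

```python
from collections import Counter


def make_ids_unique(gff_lines):
    """Yield GFF lines in which repeated ID attributes are renamed ID_1, ID_2, ...

    Hypothetical pre-processing sketch. Assumptions: tab-separated GFF3, and no
    existing ID already ends in the suffix that gets generated. Parent=
    attributes are left untouched, so children keep pointing at the first copy.
    """
    counts = Counter()
    in_fasta = False
    for line in gff_lines:
        stripped = line.rstrip("\n")
        if stripped.startswith("##FASTA"):
            in_fasta = True  # everything after this marker is sequence data
        cols = stripped.split("\t")
        if in_fasta or stripped.startswith("#") or len(cols) != 9:
            yield stripped  # pass non-feature lines through unchanged
            continue
        fields = cols[8].split(";")
        for i, field in enumerate(fields):
            if field.startswith("ID="):
                feature_id = field[3:]
                n = counts[feature_id]  # how many times this ID was seen before
                counts[feature_id] += 1
                if n:
                    fields[i] = f"ID={feature_id}_{n}"
                break
        cols[8] = ";".join(fields)
        yield "\t".join(cols)
```

The first occurrence keeps its original ID and later ones get a numeric suffix. Whether this is biologically appropriate depends on why the IDs repeat (for instance, frameshifted coding sequences versus exact duplicate records), so the output is worth inspecting before use.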