Duplicate ID handling in NCBI-distributed annotations

ReverendCasy commented 3 years ago

Hello, Working with a set of Salmonella genome annotations downloaded from NCBI Assembly, I stumbled upon the following problem at the GFF parsing step of Panaroo (v1.2.8, downloaded via Conda):

"""
Traceback (most recent call last):
  File "/home/reverend_casy/anaconda3/lib/python3.8/site-packages/gffutils/create.py", line 589, in _populate_from_lines
    self._insert(f, c)
  File "/home/reverend_casy/anaconda3/lib/python3.8/site-packages/gffutils/create.py", line 530, in _insert
    cursor.execute(constants._INSERT, feature.astuple())
sqlite3.IntegrityError: UNIQUE constraint failed: features.id

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/reverend_casy/anaconda3/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 431, in _process_worker
    r = call_item()
  File "/home/reverend_casy/anaconda3/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 285, in __call__
    return self.fn(*self.args, **self.kwargs)
  File "/home/reverend_casy/anaconda3/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 595, in __call__
    return self.func(*args, **kwargs)
  File "/home/reverend_casy/anaconda3/lib/python3.8/site-packages/joblib/parallel.py", line 262, in __call__
    return [func(*args, **kwargs)
  File "/home/reverend_casy/anaconda3/lib/python3.8/site-packages/joblib/parallel.py", line 262, in <listcomp>
    return [func(*args, **kwargs)
  File "/home/reverend_casy/anaconda3/lib/python3.8/site-packages/panaroo/prokka.py", line 107, in get_gene_sequences
    parsed_gff = gff.create_db(clean_gff_string(split[0]),
  File "/home/reverend_casy/anaconda3/lib/python3.8/site-packages/gffutils/create.py", line 1292, in create_db
    c.create()
  File "/home/reverend_casy/anaconda3/lib/python3.8/site-packages/gffutils/create.py", line 507, in create
    self._populate_from_lines(self.iterator)
  File "/home/reverend_casy/anaconda3/lib/python3.8/site-packages/gffutils/create.py", line 591, in _populate_from_lines
    fixed, final_strategy = self._do_merge(f, self.merge_strategy)
  File "/home/reverend_casy/anaconda3/lib/python3.8/site-packages/gffutils/create.py", line 226, in _do_merge
    raise ValueError("Duplicate ID {0.id}".format(f))
ValueError: Duplicate ID SEN4356A
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/reverend_casy/anaconda3/lib/python3.8/site-packages/panaroo/prokka.py", line 244, in process_prokka_input
    gene_sequence_list = Parallel(n_jobs=n_cpu)(
  File "/home/reverend_casy/anaconda3/lib/python3.8/site-packages/joblib/parallel.py", line 1061, in __call__
    self.retrieve()
  File "/home/reverend_casy/anaconda3/lib/python3.8/site-packages/joblib/parallel.py", line 940, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/home/reverend_casy/anaconda3/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
    return future.result(timeout=timeout)
  File "/home/reverend_casy/anaconda3/lib/python3.8/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/home/reverend_casy/anaconda3/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
ValueError: Duplicate ID SEN4356A

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/reverend_casy/anaconda3/bin/panaroo", line 10, in <module>
    sys.exit(main())
  File "/home/reverend_casy/anaconda3/lib/python3.8/site-packages/panaroo/__main__.py", line 286, in main
    process_prokka_input(args.input_files, args.output_dir,
  File "/home/reverend_casy/anaconda3/lib/python3.8/site-packages/panaroo/prokka.py", line 256, in process_prokka_input
    raise RuntimeError("Error reading prokka input!")
RuntimeError: Error reading prokka input!

Looking at the troublemaker assembly, I found the following occurrences of the duplicate ID:

AM933172    GenBank gene    2   16  .   +   1   ID=SEN4356A;Name=thrL;locus_tag=SEN4356A
AM933172    GenBank gene    4685802 4685846 .   +   1   ID=SEN4356A;Name=thrL;locus_tag=SEN4356A
AM933172    GenBank mRNA    2   4685846 .   +   1   ID=SEN4356A.t01;Parent=SEN4356A
AM933172    GenBank signal_peptide  1236368 1236439 .   +   1   Parent=SEN4356A;Name=thrL;Note=Signal peptide predicted for SEN1146 by SignalP 2.0 HMM (Signal peptide probability 0.920) with cleavage site probability 0.337 between residues 24 and 25;locus_tag=SEN4356A
AM933172    GenBank CDS 2   16  .   +   1   ID=SEN4356A.p01;Parent=SEN4356A.t01;Dbxref=EnsemblGenomes-Gn:SEN4356A,EnsemblGenomes-Tr:CAR35910,GOA:B5R3D3,InterPro:IPR011720;Name=thrL;codon_start=1;locus_tag=SEN4356A;product=threonine operon leader peptide (artificial fragment);protein_id=CAR35910.2;transl_table=11;translation=length.20
AM933172    GenBank CDS 4685802 4685846 .   +   1   ID=SEN4356A.p01;Parent=SEN4356A.t01;Dbxref=EnsemblGenomes-Gn:SEN4356A,EnsemblGenomes-Tr:CAR35910,GOA:B5R3D3,InterPro:IPR011720;Name=thrL;codon_start=1;locus_tag=SEN4356A;product=threonine operon leader peptide (artificial fragment);protein_id=CAR35910.2;transl_table=11;translation=length.20
AM933172    GenBank exon    2   16  .   +   1   Parent=SEN4356A.t01
AM933172    GenBank exon    4685802 4685846 .   +   1   Parent=SEN4356A.t01

In this example, duplicate IDs refer to the reading frame encoding a signal peptide for similar operons. Several other assemblies from my dataset which cause Panaroo to crush contain duplicates referring to pseudogenes and multiple coding sequences arising from frameshifts. One workaround I found for this case is to explicitly add the merge_strategy='create_unique' argument to gff.create_db() functions in prokka.py and _findmissing.py scripts. With these changes, Panaroo works fine with my dataset; however, when I expand my dataset up to ~1500 assemblies, _findmissing.py crashes at the duplicate check (if-loop at lines 286-288 in the current version). This can be remedied by just commenting the loop off, and the tool completes the task smoothly (please write if you need an error log for this one as well). By now, I can either leave the results obtained with the modified (eh) code or re-annotate all the assemblies with Prokka, but I do not know whether the same problem would not arise further, hence help is appreciated.

Thanks in advance, Yury

gtonkinhill commented 3 years ago

Hi Yury,

Sorry for the delay in getting back to you. If you are hitting an error at lines 286-288 I think you might have an annotation that is completely duplicated both in ID and location.

There is actually a script included in Panaroo that can help in converting annotation from NCBI for use with Panaroo. The script can be found at /scripts/convert_refseq_to_prokka_gff.py. If you run this on your gff files prior to running panaroo it should hopefully prevent the need to modify the code in the main pipeline.

ReverendCasy commented 3 years ago

Hi Gerry, Thank you for the reply. We used the script as suggested but then bumped into another error identical to the one described here: https://github.com/gtonkinhill/panaroo/issues/105. I wonder whether any other solution for this one except for removing "erroneous" assemblies has appeared since then. By the way, it's likely that my initial question itself is a duplicate of https://github.com/gtonkinhill/panaroo/issues/73 - sorry for not spotting it timely.

Best wishes, Yury

gtonkinhill commented 3 years ago

Hi Yury,

I'm afraid I was not able to reproduce that error as I did not have the input files. If you are able to send me a smallish example that reproduces the problem I would be happy to take a look. My guess is that it is an edge case in one of the annotations that we don't handle properly yet.

Don't stress about duplication. The documentation needs improving and I am hoping to add an FAQ section soon.

In case it helps, my email is gt4@sanger.ac.uk

comingkms commented 2 years ago

Yes, I have the similar issue using download NCBI gbk files. for example, https://www.ncbi.nlm.nih.gov/assembly/GCF_000011705.1 Thanks

gtonkinhill / panaroo

Duplicate ID handling in NCBI-distributed annotations #131