sagnikbanerjee15 / Finder

A fully automated gene annotator from RNA-Seq expression data
MIT License
57 stars, 14 forks

codan fails and kills pipeline due to finding duplicate key(s) #76

Open laurabaxter21 opened 1 year ago

laurabaxter21 commented 1 year ago

I am running the latest run_finder-v1.1.0. Everything runs fine until the CodAn step (BRAKER has already completed), where a duplicate key is found and the pipeline is killed. Checking assemblies_psiclass_modified/combined/combined_split_transcripts_with_bad_SJ_redundancy_removed.fasta for duplicated sequence IDs, I find two: C2.27447_0_covsplit.0 and C7.149167_0_covsplit.0. In both cases the records sharing an ID have different sequences.

Could I just delete these from the FASTA and GTF files and continue from checkpoint 5?

assemblies_psiclass_modified/combined/cds_predict.error:

Traceback (most recent call last):
  File "/softwares/CODAN/CodAn-1.2/bin/codan.py", line 524, in <module>
    main()
  File "/softwares/CODAN/CodAn-1.2/bin/codan.py", line 506, in main
    _codanBOTH(options.transcripts, options.output_folder, options.model, options.cpu)
  File "/softwares/CODAN/CodAn-1.2/bin/codan.py", line 355, in _codanBOTH
    _retrieveORFBOTH(transcripts, outF+"minus.fa", outF)
  File "/softwares/CODAN/CodAn-1.2/bin/codan.py", line 147, in _retrieveORFBOTH
    record_dictP = SeqIO.index(transcripts, "fasta")
  File "/usr/lib/python3/dist-packages/Bio/SeqIO/__init__.py", line 979, in index
    return _IndexedSeqFileDict(
  File "/usr/lib/python3/dist-packages/Bio/File.py", line 350, in __init__
    raise ValueError("Duplicate key '%s'" % key)
ValueError: Duplicate key 'C2.27447_0_covsplit.0'
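For anyone hitting the same error, the duplicate-ID check described in this post can be sketched in a few lines of Python. This is only an illustration, not part of Finder or CodAn: the function name `duplicated_fasta_ids` is hypothetical, the example sequences are made up, and it assumes the record ID is the header text up to the first whitespace (which is how Biopython derives the key that `SeqIO.index` complains about).

```python
from collections import Counter


def duplicated_fasta_ids(lines):
    """Return FASTA record IDs that occur more than once, in first-seen order."""
    counts = Counter()
    for line in lines:
        if line.startswith(">"):
            # Assumption: the ID is the header text up to the first whitespace.
            parts = line[1:].split()
            counts[parts[0] if parts else ""] += 1
    return [seq_id for seq_id, n in counts.items() if n > 1]


# Made-up records that reuse one of the IDs reported in this issue.
example = [
    ">C2.27447_0_covsplit.0", "ACGTACGT",
    ">C1.1_0", "GGCCGGCC",
    ">C2.27447_0_covsplit.0", "TTTTAAAA",
]
print(duplicated_fasta_ids(example))  # → ['C2.27447_0_covsplit.0']
```

To check a real file, pass an open file handle instead of the list, for example `with open(path) as fh: print(duplicated_fasta_ids(fh))`.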

sagnikbanerjee15 commented 1 year ago

Hello @laurabaxter21,

Thank you very much for your interest in finder. We have decided to focus our attention on developing the 2nd version of the software. As of now, we do not have the capabilities to support the older version due to a lack of personnel and I sincerely apologize for that. If you want to follow up on this please email me at sagnikbanerjee15@gmail.com and I will do my best to help you out.

Thank you.

DrDoom-EvoGen commented 1 year ago

I am having the same issue. Did you figure out a solution?

laurabaxter21 commented 1 year ago

Hi, yes, as I recall I just deleted the offending duplicated sequences from the FASTA file and their corresponding entries from the GTF file (they didn't seem critically important). Then I re-ran Finder from checkpoint 5 and it completed OK.

Hope that helps, Laura
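The workaround described above (dropping the duplicated records from the FASTA file and the matching entries from the GTF file) could be scripted roughly as follows. This is a hedged sketch, not the exact steps that were run: the helper names `drop_fasta_records` and `drop_gtf_lines` are hypothetical, and it assumes the GTF lines carry a `transcript_id "..."` attribute whose value matches the FASTA record ID, which should be verified against the actual Finder output before use.

```python
import re

# Assumption: GTF entries reference the transcript via transcript_id "<ID>".
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def drop_fasta_records(lines, bad_ids):
    """Yield FASTA lines, skipping every record whose ID is in bad_ids."""
    keep = True
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            keep = (parts[0] if parts else "") not in bad_ids
        if keep:
            yield line


def drop_gtf_lines(lines, bad_ids):
    """Yield GTF lines, skipping those whose transcript_id is in bad_ids."""
    for line in lines:
        match = TRANSCRIPT_ID.search(line)
        if match and match.group(1) in bad_ids:
            continue
        yield line


# Made-up example data using one of the IDs reported in this issue.
bad = {"C2.27447_0_covsplit.0"}
fasta = [">C2.27447_0_covsplit.0", "ACGT", ">C1.1_0", "GGCC",
         ">C2.27447_0_covsplit.0", "TTTT"]
gtf = [
    'chr1\tsrc\texon\t1\t50\t.\t+\t.\tgene_id "g1"; transcript_id "C2.27447_0_covsplit.0";',
    'chr1\tsrc\texon\t90\t150\t.\t+\t.\tgene_id "g2"; transcript_id "C1.1_0";',
]
clean_fasta = list(drop_fasta_records(fasta, bad))
clean_gtf = list(drop_gtf_lines(gtf, bad))
```

Note that this removes every copy of a duplicated ID rather than keeping one of them, which matches the approach reported in this thread. Writing the filtered lines to new files and keeping the originals as a backup is advisable before restarting from checkpoint 5.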


DrDoom-EvoGen commented 1 year ago

That worked for me also.

Thank you!

Greg