nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

duplicate trinity sequence found #590

Closed HenrivdGeest closed 3 years ago

HenrivdGeest commented 3 years ago

Are you using the latest release? 1.8.8 from docker

Describe the bug: Trinity seems to produce transcripts which have the same names, and PASA cannot handle this.

What command did you issue?

funannotate train -i /dev/shm/atha/TAIR10_clean_sort_renamed_masked.fasta --out funannotate -l /dev/shm/atha/SRR13498486_1.renamed.fastq.gz -r /dev/shm/atha/SRR13498486_2.renamed.fastq.gz --cpus 32 --no_normalize_reads --no_trimmomatic --max_intronlen 20000 --pacbio_isoseq /dev/shm/atha/isoseq.fasta


[May 06 02:03 AM]: Running PASA alignment step using 44,629 transcripts
[May 06 02:03 AM]: CMD ERROR: /venv/opt/pasa-2.4.1/Launch_PASA_pipeline.pl -c /dev/shm/atha/with_isoseq/funannotate/training/pasa/alignAssembly.txt -r -C -R -g /dev/shm/atha/with_isoseq/funannotate/training/genome.fasta --IMPORT_CUSTOM_ALIGNMENTS /dev/shm/atha/with_isoseq/funannotate/training/trinity.alignments.gff3 -T -t /dev/shm/atha/with_isoseq/funannotate/training/trinity.long-reads.fasta.clean -u /dev/shm/atha/with_isoseq/funannotate/training/trinity.long-reads.fasta --stringent_alignment_overlap 30.0 --TRANSDECODER --ALT_SPLICE --MAX_INTRON_LENGTH 20000 --CPU 32 --ALIGNERS blat --trans_gtf /dev/shm/atha/with_isoseq/funannotate/training/funannotate_train.stringtie.gtf
[May 06 02:03 AM]: -connecting to SQLite db: /dev/shm/atha/with_isoseq/funannotate/training/pasa/TAIR10_clean_sort_renamed_masked_pasa
-*** Running PASA pipeine:
* [Thu May  6 02:03:11 2021] Running CMD: /venv/opt/pasa-2.4.1/scripts/create_sqlite_cdnaassembly_db.dbi -c /dev/shm/atha/with_isoseq/funannotate/training/pasa/alignAssembly.txt -S '/venv/opt/pasa-2.4.1/schema/cdna_alignment_sqliteschema' -r
* [Thu May  6 02:03:11 2021] Running CMD: samtools faidx /dev/shm/atha/with_isoseq/funannotate/training/genome.fasta
* [Thu May  6 02:03:11 2021] Running CMD: samtools faidx /dev/shm/atha/with_isoseq/funannotate/training/trinity.long-reads.fasta.clean
[W::fai_insert_index] Ignoring duplicate sequence "c15939_1_1162" at byte offset 62345034
* [Thu May  6 02:03:11 2021] Running CMD: /venv/opt/pasa-2.4.1/scripts/upload_transcript_data.dbi -M '/dev/shm/atha/with_isoseq/funannotate/training/pasa/TAIR10_clean_sort_renamed_masked_pasa' -t /dev/shm/atha/with_isoseq/funannotate/training/trinity.long-reads.fasta.clean  -f NULL 
DBD::SQLite::db do failed: UNIQUE constraint failed: cdna_info.cdna_acc at /venv/opt/pasa-2.4.1/PerlLib/DB_connect.pm line 221, <$filehandle> line 44005.
failed query: < insert into cdna_info (cdna_acc, is_assembly, is_fli, is_TDN, length, header) values (?,?,?,?,?,?) >    values: c15939_1_1162 0 0 0 1162 c15939_1_1162 isoform=c15939;full_length_coverage=1;isoform_length=1162
Errors: UNIQUE constraint failed: cdna_info.cdna_acc
 at /venv/opt/pasa-2.4.1/PerlLib/DB_connect.pm line 233, <$filehandle> line 44005.
        DB_connect::RunMod(DB_connect=HASH(0x558f301c6b30), " insert into cdna_info (cdna_acc, is_assembly, is_fli, is_TDN"..., "c15939_1_1162", 0, 0, 0, 1162, "c15939_1_1162 isoform=c15939;full_length_coverage=1;isoform_l"...) called at /venv/opt/pasa-2.4.1/scripts/upload_transcript_data.dbi line 97
Issuing rollback() due to DESTROY without explicit disconnect() of DBD::SQLite::db handle database=/dev/shm/atha/with_isoseq/funannotate/training/pasa/TAIR10_clean_sort_renamed_masked_pasa;host=localhost.

There seems to be a single duplicate in the trinity long reads file:

$ grep '>' /dev/shm/atha/with_isoseq/funannotate/training/trinity.long-reads.fasta.clean|sort|uniq -c |sort -rn|head
2 >c15939_1_1162 isoform=c15939;full_length_coverage=1;isoform_length=1162
1 >Trinity_GG_9_c0_g1_i1 len=1657 path=[0:0-1656]
1 >Trinity_GG_99_c0_g1_i1 len=1180 path=[0:0-1179]

Looking at the transcript names, the duplicated one does not have the Trinity prefix, so it did not come from the Trinity assembly. I checked my PacBio Iso-Seq sequence headers, and some of them are not unique. I have renamed all of them and am re-running funannotate train with this set.
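For anyone hitting the same thing, here is a minimal sketch of how the Iso-Seq headers could be checked and made unique before passing the file to `--pacbio_isoseq`. It is not what funannotate does internally; the function names and the `_dupN` suffix scheme are my own assumptions, and it treats the first whitespace-delimited token of each header as the sequence ID (which matches the `cdna_acc` value shown in the PASA error above).

```python
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif header is not None:
            seq.append(line.strip())
    if header is not None:
        yield header, "".join(seq)


def duplicate_ids(records):
    """Return {sequence_id: count} for IDs that occur more than once."""
    counts = Counter(h.split(None, 1)[0] for h, _ in records if h.strip())
    return {seq_id: n for seq_id, n in counts.items() if n > 1}


def make_unique(records):
    """Rename repeated IDs with a hypothetical _dupN suffix; keep the rest of the header."""
    records = list(records)
    used = {h.split(None, 1)[0] for h, _ in records if h.strip()}
    seen = set()
    fixed = []
    for header, seq in records:
        parts = header.split(None, 1)
        seq_id = parts[0] if parts else ""
        rest = parts[1] if len(parts) > 1 else ""
        if seq_id in seen:
            # Pick the first suffix that does not clash with any existing ID.
            n = 2
            while f"{seq_id}_dup{n}" in used:
                n += 1
            seq_id = f"{seq_id}_dup{n}"
            used.add(seq_id)
        seen.add(seq_id)
        fixed.append((f"{seq_id} {rest}".strip(), seq))
    return fixed


if __name__ == "__main__":
    # Tiny in-memory example with made-up sequences; real input would be the Iso-Seq FASTA.
    demo = [
        ">c15939_1_1162 isoform=c15939\n", "ACGT\n",
        ">c15939_1_1162 isoform=c15939\n", "ACGA\n",
    ]
    records = list(read_fasta(demo))
    print(duplicate_ids(records))
    for header, seq in make_unique(records):
        print(f">{header}\n{seq}")
```

To use it on a real file, open the FASTA with `open(path)` and pass the file handle to `read_fasta`, then write the records returned by `make_unique` to a new file.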

It's all public plant data from a 120 Mb Arabidopsis genome, so you could reproduce it yourself if you like.

hyphaltip commented 3 years ago

Yeah, this seems like a problem with the naming in your Iso-Seq file, not with Trinity's output; Trinity is what generates the Trinity_GG results from the short reads.

The error is coming from PASA in the load step, before any gene model refinement, so this seems like a data input error, not a bug in Trinity/Funannotate?
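As a rough illustration of why the load step fails, the sketch below reproduces the same class of SQLite error with Python's built-in `sqlite3` module. The table definition is a simplified stand-in I made up for the example, not PASA's actual schema; only the table and column names `cdna_info` and `cdna_acc` are taken from the error message in the log above.

```python
import sqlite3


def load_transcripts(accessions):
    """Insert (accession, length) pairs; return the error text if an insert fails, else None."""
    conn = sqlite3.connect(":memory:")
    # Simplified, hypothetical schema: only the uniqueness of cdna_acc matters here.
    conn.execute("CREATE TABLE cdna_info (cdna_acc TEXT UNIQUE, length INTEGER)")
    try:
        for acc, length in accessions:
            conn.execute(
                "INSERT INTO cdna_info (cdna_acc, length) VALUES (?, ?)",
                (acc, length),
            )
        conn.commit()
        return None
    except sqlite3.IntegrityError as err:
        # A second row with an already-loaded accession violates the UNIQUE constraint.
        conn.rollback()
        return str(err)
    finally:
        conn.close()


if __name__ == "__main__":
    print(load_transcripts([("c15939_1_1162", 1162), ("c15939_1_1162", 1162)]))
```

The second insert of the same accession is rejected by the database, so any FASTA with a repeated ID will stop the upload regardless of where the sequence came from.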

nextgenusfs commented 3 years ago

I recall somebody else having this issue of duplicate isoSeq headers -- I thought there was a check in the code for duplicate headers, but I assume after renaming you shouldn't have this error.

HenrivdGeest commented 3 years ago

Correct, after I renamed the Iso-Seq input the issue disappeared. Thanks for helping!

nextgenusfs commented 3 years ago

Okay, sounds like renaming worked, so I'll close this.