Closed HenrivdGeest closed 3 years ago
Yeah, this seems like a problem with the naming in your IsoSeq file, not with Trinity's output, which generates the Trinity_GG results from the short reads.
The error is coming from PASA in the load step, before gene model refinement, so this seems like a data input error, not a bug in Trinity/Funannotate?
I recall somebody else having this issue of duplicate IsoSeq headers. I thought there was a check in the code for duplicate headers, but I assume after renaming you shouldn't have this error.
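For anyone who wants to check an IsoSeq FASTA before running training, a minimal sketch of a duplicate-header check follows. It is not the check in Funannotate's code; `duplicate_fasta_ids` is a hypothetical helper, it assumes the sequence ID is the first whitespace-delimited token of each header, and the sequences in the example are placeholders (only the header lines come from this issue).

```python
from collections import Counter


def duplicate_fasta_ids(lines):
    """Return {id: count} for sequence IDs that occur more than once.

    Assumption: the ID is the first whitespace-delimited token after '>'.
    """
    ids = (
        line[1:].split()[0]
        for line in lines
        # skip sequence lines and headers that are empty after '>'
        if line.startswith(">") and len(line.strip()) > 1
    )
    counts = Counter(ids)
    return {seq_id: n for seq_id, n in counts.items() if n > 1}


# Headers taken from this issue; the sequences are placeholders.
example = [
    ">c15939_1_1162 isoform=c15939;full_length_coverage=1;isoform_length=1162",
    "ACGT",
    ">c15939_1_1162 isoform=c15939;full_length_coverage=1;isoform_length=1162",
    "TTGA",
    ">Trinity_GG_9_c0_g1_i1 len=1657 path=[0:0-1656]",
    "GGCC",
]
print(duplicate_fasta_ids(example))
```

To run it on a real file, pass an open file handle instead of the list, for example `duplicate_fasta_ids(open("isoseq.fasta"))`. An empty result means no ID is repeated.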
Correct, after I renamed the IsoSeq input the issue disappeared. Thanks for helping!
Okay, sounds like renaming worked, so I'll close this.
Are you using the latest release? 1.8.8 from docker
Describe the bug
Trinity seems to produce transcripts which have the same names, and PASA cannot handle this.
What command did you issue?
funannotate train -i /dev/shm/atha/TAIR10_clean_sort_renamed_masked.fasta --out funannotate -l /dev/shm/atha/SRR13498486_1.renamed.fastq.gz -r /dev/shm/atha/SRR13498486_2.renamed.fastq.gz --cpus 32 --no_normalize_reads --no_trimmomatic --max_intronlen 20000 --pacbio_isoseq /dev/shm/atha/isoseq.fasta
There seems to be a single duplicate in the trinity long reads file:
└─ $ ▶ grep '>' /dev/shm/atha/with_isoseq/funannotate/training/trinity.long-reads.fasta.clean|sort|uniq -c |sort -rn|head
      2 >c15939_1_1162 isoform=c15939;full_length_coverage=1;isoform_length=1162
      1 >Trinity_GG_9_c0_g1_i1 len=1657 path=[0:0-1656]
      1 >Trinity_GG_99_c0_g1_i1 len=1180 path=[0:0-1179]
Now that I am looking at the names of the transcripts, the duplicated one does not have the Trinity prefix. I checked my PacBio IsoSeq sequence headers, and some of them are not unique. I have fully renamed them and am re-running funannotate train with this set.
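One way such a renaming could be done is sketched below. This is not necessarily how the headers were renamed for this run: `make_ids_unique` and the `_dupN` suffix scheme are hypothetical, and the sketch assumes the ID is the first whitespace-delimited token of each header.

```python
def make_ids_unique(lines):
    """Yield FASTA lines with repeated sequence IDs made unique.

    The first occurrence of an ID is kept as-is; later occurrences get a
    hypothetical '_dup2', '_dup3', ... suffix. Lines are yielded without
    trailing newlines.
    """
    used = set()   # every ID emitted so far
    counts = {}    # how many times each original ID has been renamed
    for line in lines:
        line = line.rstrip("\n")
        if not line.startswith(">") or not line[1:].strip():
            yield line  # sequence line, or an empty header left untouched
            continue
        fields = line[1:].split(None, 1)
        seq_id = fields[0]
        desc = fields[1] if len(fields) > 1 else ""
        new_id = seq_id
        while new_id in used:
            counts[seq_id] = counts.get(seq_id, 1) + 1
            new_id = f"{seq_id}_dup{counts[seq_id]}"
        used.add(new_id)
        yield f">{new_id} {desc}".rstrip()


renamed = list(make_ids_unique([">a x", "AC", ">a x", "GT", ">b", "TT", ">a", "CC"]))
print(renamed)
```

Writing the result back out is then a matter of joining the yielded lines with newlines. Keeping the first occurrence unchanged means any existing references to that ID stay valid.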
It's all public plant data from a 120 Mb Arabidopsis genome, so you could reproduce it yourself if you like.