nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License
322 stars 86 forks source link

Funannotate train crash at Pasa step with iso-seq + RNA-seq #326

Closed eyalbenda closed 4 years ago

eyalbenda commented 5 years ago

I've installed the latest funannotate docker image (changed manually the path to 1.6.0 in the installation instructions). Am running the following command to assemble a genome using pacbio iso-seq and single end RNA seq as evidence:

funannotate train -i Dba.dovetail.masked.fasta -s *.fastq.gz --pacbio_isoseq DbaIsoSeq.fasta --species "Dunaliella" --cpus 2 --max_intronlen 30000 -o out

However, I'm getting a crash in Pasa with the following error:

Traceback (most recent call last): File "/home/linuxbrew/funannotate/bin/funannotate-train.py", line 967, in runPASAtrain(genome, trinity_transcripts, cleanTranscripts, os.path.abspath(trinityGFF3), stringtieGTF, args.stranded, args.max_intronlen, args.cpus, organism_name, PASA_tmp) File "/home/linuxbrew/funannotate/bin/funannotate-train.py", line 405, in runPASAtrain with open(os.path.join(folder, pasaDBname+'.pasa_assemblies_described.txt'), 'rU') as description: IOError: [Errno 2] No such file or directory: 'out/training/pasa/Dunaliella_bardawil.pasa_assemblies_described.txt'

nextgenusfs commented 5 years ago

Will be helpful to post the erroring part of the logfiles for train and for PASA.

eyalbenda commented 5 years ago

Of course, see attachment. funannotate-train.log

nextgenusfs commented 5 years ago

Looks like maybe transcripts didn't get indexed, i.e.

Thread 2 terminated abnormally: Error, no seek pos for acc: ScySeS1_10 at /home/linuxbrew/PASApipeline/PerlLib/Fasta_retriever.pm line 100 thread 2.
    Fasta_retriever::get_seq(Fasta_retriever=HASH(0x55bded17ad50), "ScySeS1_10") called at /home/linuxbrew/PASApipeline/scripts/assemble_clusters.dbi line 166 thread 2
    main::assemble_transcripts_on_scaffold("ScySeS1_10") called at /home/linuxbrew/PASApipeline/scripts/assemble_clusters.dbi line 135 thread 2
    eval {...} called at /home/linuxbrew/PASApipeline/scripts/assemble_clusters.dbi line 135 thread 2

Is this only a problem when you use Illumina and PacBio?

eyalbenda commented 5 years ago

I deleted the pasa directory and reran funannotate train, this time removing the iso-seq reads (don't know if that's enough, if not I can rerun the whole thing overnight):

funannotate train -i Dba.dovetail.masked.fasta -s *.fastq.gz --species "Dunaliella bardawil" --cpus 2 --max_intronlen 30000 -o out

The error file was too big, so I sent you a google-drive link by email

nextgenusfs commented 5 years ago

Okay thanks. Will try to take a look after work. And one other follow-up, the RNA-seq tests run without error on your system? ie

funannotate test -t rna-seq --cpus 2
eyalbenda commented 5 years ago

Thanks Jon! I'm running the test, the prediction is still going, but the training proceeded without error. I'll run the training overnight with only the illumina reads and see if it helps isolate the issue to including the iso-seq.

--

Eyal Ben-David, PhD Post-doctoral Scholar Lab of Prof. Leonid Kruglyak Department of Human Genetics David Geffen School of Medicine UCLA

On Mon, Sep 23, 2019 at 1:04 PM Jon Palmer notifications@github.com wrote:

Okay thanks. Will try to take a look after work. And one other follow-up, the RNA-seq tests run without error on your system? ie

funannotate test -i rna-seq --cpus 2

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/nextgenusfs/funannotate/issues/326?email_source=notifications&email_token=ACQHDYLOS6Y64R5MPNP2LWLQLEONPA5CNFSM4IZLK4YKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7MC6ZQ#issuecomment-534261606, or mute the thread https://github.com/notifications/unsubscribe-auth/ACQHDYJUT7OL6I5QS65QQ5DQLEONPANCNFSM4IZLK4YA .

nextgenusfs commented 5 years ago

The error in the log file seems to say an alignment exists but the sequence doesn't. This is probably due to re-running the command with different input, ie funannotate will re-use any data that is available, so if you ran with Illumina + PacBio (which failed) and then re-ran with only Illumina, it will still re-use the alignments if they are present.

Lets see if the only illumina works. The other thing to try would be just the PacBio -- it will help me to fix if I know that error only happens when you pass both RNA-seq data.

nextgenusfs commented 5 years ago

Any updates here? Did you find a solution that worked? I'm recalling something about Pacbio fasta headers that might causing some problems with PASA.

eyalbenda commented 5 years ago

Thanks Jon, I ended up just using the Illumina data and dropping the iso-seq, since that seemed to work. Please let me know if I should close the issue, or if there is anything I can try on my end, I'd be happy to help!

nextgenusfs commented 5 years ago

I'd like to figure out what was causing the error -- any chance you'd be able to sub sample the data to a smaller reproducible dataset that still causes the error? That is of course if you'd be willing to share this smaller test set with me confidentially so I can determine what the error is?

eyalbenda commented 5 years ago

I totally understand, and I would if I could. This isn't my own dataset, unfortunately. I will retry annotating a newer assembly version in a few weeks, please let me know if I should close the issue in the meantime.

--

Eyal Ben-David, PhD Post-doctoral Scholar Lab of Prof. Leonid Kruglyak Department of Human Genetics David Geffen School of Medicine UCLA

On Wed, Nov 6, 2019 at 2:45 PM Jon Palmer notifications@github.com wrote:

I'd like to figure out what was causing the error -- any chance you'd be able to sub sample the data to a smaller reproducible dataset that still causes the error? That is of course if you'd be willing to share this smaller test set with me confidentially so I can determine what the error is?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/nextgenusfs/funannotate/issues/326?email_source=notifications&email_token=ACQHDYOT7MXJTDPYOWOMGIDQSNCI7A5CNFSM4IZLK4YKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEDIIDCQ#issuecomment-550535562, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACQHDYLBLQPUOJBPRD4RW43QSNCI7ANCNFSM4IZLK4YA .

nextgenusfs commented 5 years ago

I'll leave this open for now -- but would still like to hear if still an issue with the latest release. thanks!

eyalbenda commented 4 years ago

Sorry for the late response. I asked but my collaborators are uncomfortable with sharing any data for debugging. I don't agree with their reluctance but it is what it is. I tried annotating again and funannotate train crashes in assembling pasa clusters. This is the relevant logfiles - I tried manually the command from the pasa log and I'm parsing below the error output.

pasa /home/multivac/NanoSystem/anaconda37/envs/funannotate/opt/pasa-2.4.1/scripts/assembly_db_loader.dbi -M '/home/multivac/NanoSystem/Dunaliella/D.bardawil/fun/training/pasa/Dunaliella_bardawil' > pasa_run.log.dir/alignment_assembly_loading.out
-connecting to SQLite db: /home/multivac/NanoSystem/Dunaliella/D.bardawil/fun/training/pasa/Dunaliella_bardawil
WARNING: previous assemblies have been loaded...  Purging them first before attempting a re-load
-cleaning previous asmbls from alignment table
-cleaning previous asmbls from align_link table
-cleaning previous asmbls from cdna_info table
-purging asmbl_link
done cleaning.

Now, on to re-loading.

Committed 10000 PASA assemblies       DBD::SQLite::db do failed: UNIQUE constraint failed: asmbl_link.asmbl_acc, asmbl_link.cdna_acc at /home/multivac/NanoSystem/anaconda37/envs/funannotate/opt/pasa-2.4.1/PerlLib/DB_connect.pm line 221.
failed query: <insert into asmbl_link (asmbl_acc, cdna_acc) values (?,?)>       values: asmbl_10356 transcript
Errors: UNIQUE constraint failed: asmbl_link.asmbl_acc, asmbl_link.cdna_acc
 at /home/multivac/NanoSystem/anaconda37/envs/funannotate/opt/pasa-2.4.1/PerlLib/DB_connect.pm line 233.
        DB_connect::RunMod(DB_connect=HASH(0x559c56423ff8), "insert into asmbl_link (asmbl_acc, cdna_acc) values (?,?)", "asmbl_10356", "transcript") called at /home/multivac/NanoSystem/anaconda37/envs/funannotate/opt/pasa-2.4.1/scripts/assembly_db_loader.dbi line 153
Issuing rollback() due to DESTROY without explicit disconnect() of DBD::SQLite::db handle database=/home/multivac/NanoSystem/Dunaliella/D.bardawil/fun/training/pasa/Dunaliella_bardawil;host=localhost.
eyalbenda commented 4 years ago

Another small issue, it's possible that because the output of the Isoseq QC pipeline is fasta, and because funannotate train is importing it as if it were fastq, I lose half the reads (at least it reports half the reads in the log files).

nextgenusfs commented 4 years ago

Looks like the same problem -- what does a grep of the sequence headers for asmbl_10356 result in? Looks like it is parsing the headers as asmbl_10356 is the asmbl_acc and transcript as the cdna_acc. So likely the transcript is causing the problem.

eyalbenda commented 4 years ago

Thanks for the help. I've run it again without the Isoseq data but when I finish I'll go back and try to debug it looking at that transcript.

eyalbenda commented 4 years ago

This is the output of alignment_assembly_loading.out for this cluster. I couldn't find any other reference to asmbl_16018 (the number has changed since I had to rerun it):

// assembly asmbl_16018 in cluster: 3358 blat.proc26359.chain_6280 transcript 2537 blat.proc26359.chain_6281 transcript 3505 orient(a-/s-) align: 7330984(1635)-7331986(633)>CT....AC<7332460(632)-7332505(587)>CT....AC<7332749(586)-7332816(519)>CT....AC<7333239(518)-7333293(464)>CT....AC<7333740(463)-7333815(388)>CT....AC<7333975(387)-7334038(324)>CT....AC<7334693(323)-7334747(269)>CT....AC<7335182(268)-7335260(190)>CT....AC<7337081(189)-7337136(134)>CT....AC<7337275(133)-7337407(1)

eyalbenda commented 4 years ago

Don't know if this helps, but searching for these chains in the blat gff yields the following lines:

chr05 BLAT cDNA_match 7337275 7337407 100.00 - . ID=blat.proc26359.chain_6280;Target=transcript/2537 1 133 + chr05 BLAT cDNA_match 7337081 7337136 100.00 - . ID=blat.proc26359.chain_6280;Target=transcript/2537 134 189 + chr05 BLAT cDNA_match 7335182 7335260 100.00 - . ID=blat.proc26359.chain_6280;Target=transcript/2537 190 268 + chr05 BLAT cDNA_match 7334693 7334747 100.00 - . ID=blat.proc26359.chain_6280;Target=transcript/2537 269 323 + chr05 BLAT cDNA_match 7333975 7334038 100.00 - . ID=blat.proc26359.chain_6280;Target=transcript/2537 324 387 + chr05 BLAT cDNA_match 7333740 7333815 100.00 - . ID=blat.proc26359.chain_6280;Target=transcript/2537 388 463 + chr05 BLAT cDNA_match 7333239 7333293 100.00 - . ID=blat.proc26359.chain_6280;Target=transcript/2537 464 518 + chr05 BLAT cDNA_match 7332749 7332816 100.00 - . ID=blat.proc26359.chain_6280;Target=transcript/2537 519 586 + chr05 BLAT cDNA_match 7332460 7332505 100.00 - . ID=blat.proc26359.chain_6280;Target=transcript/2537 587 632 + chr05 BLAT cDNA_match 7330984 7331986 100.00 - . ID=blat.proc26359.chain_6280;Target=transcript/2537 633 1635 + chr05 BLAT cDNA_match 7337275 7337389 99.13 - . ID=blat.proc26359.chain_6281;Target=transcript/3505 1 114 + chr05 BLAT cDNA_match 7337081 7337136 100.00 - . ID=blat.proc26359.chain_6281;Target=transcript/3505 115 170 + chr05 BLAT cDNA_match 7335182 7335260 100.00 - . ID=blat.proc26359.chain_6281;Target=transcript/3505 171 249 + chr05 BLAT cDNA_match 7334693 7334747 100.00 - . ID=blat.proc26359.chain_6281;Target=transcript/3505 250 304 + chr05 BLAT cDNA_match 7333975 7334038 100.00 - . ID=blat.proc26359.chain_6281;Target=transcript/3505 305 368 + chr05 BLAT cDNA_match 7333740 7333815 100.00 - . ID=blat.proc26359.chain_6281;Target=transcript/3505 369 444 + chr05 BLAT cDNA_match 7333239 7333293 100.00 - . ID=blat.proc26359.chain_6281;Target=transcript/3505 445 499 + chr05 BLAT cDNA_match 7332749 7332816 100.00 - . ID=blat.proc26359.chain_6281;Target=transcript/3505 500 567 + chr05 BLAT cDNA_match 7332460 7332505 100.00 - . ID=blat.proc26359.chain_6281;Target=transcript/3505 568 613 + chr05 BLAT cDNA_match 7331145 7331986 100.00 - . ID=blat.proc26359.chain_6281;Target=transcript/3505 614 1455 +

transcript/3505 (from __all_transcripts fasta):

transcript/3505 ATTATTTCCTTTTCAATATCTTTTCTGTTAGGTTTACATCCAGAACCACAACACGGCAAG ATGACGCAGCAACCAGCAAACGAAATGGCCATGGACCGCAACGCCTACACCAAGGCCATG TTTGGAGCTGACAAAGCCTTTGGTACACTTGTGGCTGCGGACAAGAAGCGGGTTGCAGAC AAGATGGTGGAGGACGAGATCAAGTCTGCTGGCTCCCTTGTCCCAGGATGTGAAGCTGCA CTCGCCAAGGCGGCAGCGGATGCCAGCTTGGAGGCCAAGGAAAGGGACGAAAACTTTGAC AGGGATGACTACTCCAAGCGGATTTTTGGACTGCCCTTCAAGCAGCTGTCTGCTGAGCAC AAGGAGCGTGTGGTCAATCAGATGATGATGGACGAGATGCATGCTGGTGGCCACGTGATT TCAGAGGAGACAGCGCAAGCAAAGGCCATAGCTGATGCCGAATACAAAGCGAGAAACATG CACGCGCCGCCTCCTTCAGGTCATTTCATGGTTGGTGGGCATGATCTCCGGAGCCCTGCA ACTCCTGGAGGAACAGCTGAAGCCAAGGGGGCTTCTGAGGAGGAGAAGAACAAGCAAAGC ACACCGCAAGCAGGGCACATGTTTGTCGCTGGGCACGATCTCAGGACCCCAGGCTCAGGA GGCGCAGGTGTCAGACCTGACGAGCTGAACGAAAAAGAGATGAATAAAGCTTTCTAGATC TGGCTTGTTGGTAAATTGGTGAAGGGGTGCACCTTTGCCTGCGCTATGTGCTTTGGCTAA CATGTGGTGAGGGGTCTGGCGTTTACGAGGAAGGCCAAGAAGGACAGCCTTCAATCTGTG CATTCTTGATTCACTTCTTGCATGTTTGCATTCTGAGTGCACAATTTCAGCACTCTTGTA CATTCAGTCAATACATGCTTTGGCTAGCAGGTGGAAGTATGCAACCAGTGCACTTGAAGA CAAATTCCAGTTAAGCAATTTGACGTACAAGGAAATCCAAAGAGAAGAGCCTTTGAATTT TGCACTCTTGATTTACTTCTTGCATGTTTGCATTTCATGCAAATTTGGATCACCCTCTTG TATATGACTTTGGTATGTGCTTTGGCTAGCAGGTGGGAGGATGCAAGGAATGAGTGTGTT TACGGTCCCACTGCCTTGTTGAGAGGAATAGGCCATCAGCGGCTGCCATTGGTCCCACTG CCTTGTTGAGAGGAATAGGCCATCAGCGGCTGCCATTGGGTGTGGAAATCAATAATTTTC TGTGGACGCATCAGGGCCATTTTTGTAGAGGTGCAAGGCAACCAGTATACTTGAAGACAT GTACCAAATAAGCAAGCATTTGATCCTTTTACAGATCTGAAACAAGTCTAAGGGCTTCTG CATTTTTACTTCTTTCATTTTGCATCTCAAACGCAAAATCCCAGCAACCTCTTCAACATT AAGTCAATGTTTGGC

transcript/2537 is the following:

transcript/2537 AAAGATAATTGAACAGGTATTATTTCCTTTTCAATATTCTTTTCTGTTAGGTTTACATCC AGAACCACAACACGGCAAGATGACGCAGCAACCAGCAAACGAAATGGCCATGGACCGCAA CGCCTACACCAAGGCCATGTTTGGAGCTGACAAAGCCTTTGGTACACTTGTGGCTGCGGA CAAGAAGCGGGTTGCAGACAAGATGGTGGAGGACGAGATCAAGTCTGCTGGCTCCCTTGT CCCAGGATGTGAAGCTGCACTCGCCAAGGCGGCAGCGGATGCCAGCTTGGAGGCCAAGGA AAGGGACGAAAACTTTGACAGGGATGACTACTCCAAGCGGATTTTTGGACTGCCCTTCAA GCAGCTGTCTGCTGAGCACAAGGAGCGTGTGGTCAATCAGATGATGATGGACGAGATGCA TGCTGGTGGCCACGTGATTTCAGAGGAGACAGCGCAAGCAAAGGCCATAGCTGATGCCGA ATACAAAGCGAGAAACATGCACGCGCCGCCTCCTTCAGGTCATTTCATGGTTGGTGGGCA TGATCTCCGGAGCCCTGCAACTCCTGGAGGAACAGCTGAAGCCAAGGGGGCTTCTGAGGA GGAGAAGAACAAGCAAAGCACACCGCAAGCAGGGCACATGTTTGTCGCTGGGCACGATCT CAGGACCCCAGGCTCAGGAGGCGCAGGTGTCAGACCTGACGAGCTGAACGAAAAAGAGAT GAATAAAGCTTTCTAGATCTGGCTTGTTGGTAAATTGGTGAAGGGGTGCACCTTTGCCTG CGCTATGTGCTTTGGCTAACATGTGGTGAGGGGTCTGGCGTTTACGAGGAAGGCCAAGAA GGACAGCCTTCAATCTGTGCATTCTTGATTCACTTCTTGCATGTTTGCATTCTGAGTGCA CAATTTCAGCACTCTTGTACATTCAGTCAATACATGCTTTGGCTAGCAGGTGGAAGTATG CAACCAGTGCACTTGAAGACAAATTCCAGTTAAGCAATTTGACGTACAAGGAAATCCAAA GAGAAGAGCCTTTGAATTTTGCACTCTTGATTTACTTCTTGCATGTTTGCATTTCATGCA AATTTGGATCACCCTCTTGTATATGACTTTGGTATGTGCTTTGGCTAGCAGGTGGGAGGA TGCAAGGAATGAGTGTGTTTACGGTCCCACTGCCTTGTTGAGAGGAATAGGCCATCAGCG GCTGCCATTGGTCCCACTGCCTTGTTGAGAGGAATAGGCCATCAGCGGCTGCCATTGGGT GTGGAAATCAATAATTTTCTGTGGACGCATCAGGGCCATTTTTGTAGAGGTGCAAGGCAA CCAGTATACTTGAAGACATGTACCAAATAAGCAAGCATTTGATCCTTTTACAGATCTGAA ACAAGTCTAAGGGCTTCTGCATTTTTACTTCTTTCATTTTGCATCTCAAACGCAAAATCC CAGCAACCTCTTCAACATTAAGTCAATGTTTGGCACTTGCGCGATGGATCAAGCTCCCAT GACATGAGCGGCGAAGCCAGCATCCAATTACAAGGAAGCAGGATGTGTGGTGTTTCAATT CAGATTGGCAATTCAGTTTGAAAGAGCTGAAGCACCTCATGCTCTTTCAAACAACCTGTA ACAGTGGAAAATGAG

nextgenusfs commented 4 years ago

Are there forward slashes in the fasta header?? ie transcript/3505? That could certainly be causing a problem somewhere.

eyalbenda commented 4 years ago

yes, based on __all_transcripts fasta there are 97 transcripts with a forward slash in them. Running the training pipeline without the iso-seq data doesn't result in any transcripts with that naming convention.

nextgenusfs commented 4 years ago

I would just rename the isoseq data with something simple and it will likely work.

eyalbenda commented 4 years ago

Thank you, I will try and report here back. Fyi, I just used the fasta output file from the standard iso-seq3 pipeline from pac-bio, so if this is indeed the issue I'd recommend either adding a preprocessing step or documentation about this.

On Tue, Dec 24, 2019, 11:00 Jon Palmer notifications@github.com wrote:

I would just rename the isoseq data with something simple and it will likely work.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/nextgenusfs/funannotate/issues/326?email_source=notifications&email_token=ACQHDYLME4DN7LCWOTY3PWLQ2JL4ZA5CNFSM4IZLK4YKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEHTRFGA#issuecomment-568791704, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACQHDYI6JWKPFIGASLA7DGDQ2JL4ZANCNFSM4IZLK4YA .

nextgenusfs commented 4 years ago

Yeah if that fixes it then I can just add a fasta header rename step. There is one already but it must not be working properly.

eyalbenda commented 4 years ago

After replacing the forward slashes with underscores in the Iso-seq fastq file, funannotate train completes successfully. Thank you for your support, closing the issue.

nextgenusfs commented 4 years ago

Okay, added this via https://github.com/nextgenusfs/funannotate/commit/1ee07cb06fcaa595dc59f1ffcfa32aa8cff61598. Also updated the parser to something simpler, so should be a little faster in processing the long reads.