nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

Malformed GFF. #278

olekto closed this issue 5 years ago

olekto commented 5 years ago

Are you using the latest release? Yes, using 1.5.2

Describe the bug I am inputting PASA output via --pasa_gff, and getting an error.

What command did you issue?

funannotate predict -i gadMor2_combined.fasta.masked -o div -s "Gadus morhua tr_is_st_pasa1" --busco_db actinopterygii --organism other \
    --cpus 32 --busco_seed_species zebrafish --max_intronlen 100000 --header_length 16 \
    --stringtie cod2_combined_stringtie_code_trim.gtf \
    --rna_bam cod2_combined_hisat2_code_trim.sort.bam \
    --transcript_evidence Trinity.fasta polished.hq.fasta \
    --protein_alignments gadMor2_combined_proteins_gth_evm_ready.gff3 \
    --pasa_gff pasa_trinity_isoseq.sqlite.pasa_assemblies.gff3 \
    --repeats2evm

Logfiles

Error, can't find ID or Parent. Malformed GFF file.
LG01 pasa_pred cDNA_match 1000 3019 . - . ID=align_1757530;Target=asmbl_1 38 2057 +

OS/Install Information

You are running Perl v 5.026002.
Now checking perl modules...
Bio::Perl: 1.007002
Carp: 1.50
Clone: 0.41
DBD::SQLite: 1.60
DBD::mysql: 4.046
DBI: 1.642
DB_File: 1.843
Data::Dumper: 2.173
File::Basename: 2.85
File::Which: 1.23
Getopt::Long: 2.5
Hash::Merge: 0.300
JSON: 4.00
LWP::UserAgent: 6.36
Logger::Simple: 2.0
POSIX: 1.76
Parallel::ForkManager: 2.02
Pod::Usage: 1.69
Scalar::Util::Numeric: 0.40
Storable: 3.11
Text::Soundex: 3.05
Thread::Queue: 3.13
Tie::File: 1.02
URI::Escape: 3.31
YAML: 1.27
threads: 2.21
threads::shared: 1.59
All 27 Perl modules installed

Checking external dependencies...
rmblastn: error while loading shared libraries: libblastinput.so: cannot open shared object file: No such file or directory
RepeatMasker: RepeatMasker version development-$Id: RepeatMasker,v 1.332 2017/04/17 19:01:11 rhubley Exp $
RepeatModeler: RepeatModeler version DEV
Trinity: 2.6.6
augustus: 3.2.3
bamtools: bamtools 2.4.1
bedtools: bedtools v2.27.1
blat: BLAT v36
diamond: diamond 0.9.24
emapper.py: emapper-0.12.7
ete3: 3.1.1
exonerate: exonerate 2.4.0
gmap: 2018-07-04
hisat2: 2.1.0
hmmscan: HMMER 3.2.1 (June 2018)
hmmsearch: HMMER 3.2.1 (June 2018)
java: 11.0.1
kallisto: 0.44.0
makeblastdb: makeblastdb 2.2.31+
minimap2: 2.15-r905
nucmer: 3.1
pslCDnaFilter: no way to determine
samtools: samtools 1.9
stringtie: 1.3.4d
tRNAscan-SE: 2.0 (December 2017)
tbl2asn: unknown, likely 25.3
tblastn: tblastn 2.2.31+
trimal: trimAl v1.4.rev15 build[2013-12-17]
ERROR: CodingQuarry not installed
ERROR: fasta not installed
ERROR: gmes_petap.pl not installed
ERROR: mafft not installed
ERROR: rmblastn not installed

Checking Environmental Variables...
$FUNANNOTATE_DB=/projects/cees/bin/funannotate/db
$PASAHOME=/usit/abel/u1/olekto/miniconda2/opt/pasa-2.3.3
$TRINITYHOME=/usit/abel/u1/olekto/miniconda2/opt/trinity-2.6.6
$EVM_HOME=/usit/abel/u1/olekto/miniconda2/opt/evidencemodeler-1.1.1
$AUGUSTUS_CONFIG_PATH=/usit/abel/u1/olekto/miniconda2/config
$GENEMARK_PATH=/projects/cees/bin/genemark/4.38/gm_et_linux_64/gmes_petap
$BAMTOOLS_PATH=/usit/abel/u1/olekto/miniconda2/bin/
All 7 environmental variables are set

I don't understand this error. From the source code, it looks like the ID should have been found, but apparently it is not.
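As a quick way to inspect the record named in the error, the ninth GFF3 column can be split into its key/value attributes. The following is only a minimal sketch: `parse_gff3_attributes` is a hypothetical helper written for illustration, not funannotate's own parser, and the record is rebuilt here with tab-separated columns on the assumption that the original file was tab-delimited.

```python
def parse_gff3_attributes(column9):
    """Split the ninth GFF3 column into a dict of key -> value."""
    attributes = {}
    for field in column9.strip().split(";"):
        if not field:
            continue
        key, _, value = field.partition("=")
        attributes[key.strip()] = value.strip()
    return attributes


# The record quoted in the error message, rebuilt with tab-separated columns
# (assumption: the original file is tab-delimited, as GFF3 requires).
record = "\t".join([
    "LG01", "pasa_pred", "cDNA_match", "1000", "3019", ".", "-", ".",
    "ID=align_1757530;Target=asmbl_1 38 2057 +",
])

columns = record.split("\t")
feature_type = columns[2]
attrs = parse_gff3_attributes(columns[8])

print("feature type:", feature_type)
print("attribute keys:", sorted(attrs))
print("has Parent:", "Parent" in attrs)
```

For this record the ID attribute is present and there is no Parent, and the feature type is `cDNA_match`, which marks it as an alignment record and not a gene model.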

Should the PASA input file be processed in a certain way? Should it be the output described here, for instance? https://github.com/PASApipeline/PASApipeline/wiki/PASA_abinitio_training_sets

Also, the StringTie input is not used in any manner, as far as I can see. Is this correct?

Thank you.

Ole

nextgenusfs commented 5 years ago

It is expecting the transdecoder GFF3 from PASA. This is the output that has been filtered for the best models by PASA.
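One rough way to tell the two kinds of PASA output apart is to count the feature types in column 3. This is a sketch under assumptions: the heuristic "has both mRNA and CDS features" is an illustrative choice, not the check funannotate performs, and the gene-model records below are made-up examples, not real TransDecoder output.

```python
from collections import Counter


def count_feature_types(gff3_text):
    """Count the values of column 3 over all nine-column GFF3 records."""
    counts = Counter()
    for line in gff3_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) >= 9:
            counts[columns[2]] += 1
    return counts


def looks_like_gene_models(counts):
    """Heuristic (assumption): gene-model files carry mRNA and CDS features."""
    return counts["mRNA"] > 0 and counts["CDS"] > 0


# Alignment-style record, taken from the error message in this issue.
alignments = "\t".join([
    "LG01", "pasa_pred", "cDNA_match", "1000", "3019", ".", "-", ".",
    "ID=align_1757530;Target=asmbl_1 38 2057 +",
])

# Hypothetical gene-model records, invented for illustration only.
gene_models = "\n".join("\t".join(row) for row in [
    ["LG01", "transdecoder", "gene", "1000", "3019", ".", "-", ".", "ID=gene1"],
    ["LG01", "transdecoder", "mRNA", "1000", "3019", ".", "-", ".", "ID=mrna1;Parent=gene1"],
    ["LG01", "transdecoder", "exon", "1000", "3019", ".", "-", ".", "ID=exon1;Parent=mrna1"],
    ["LG01", "transdecoder", "CDS", "1200", "2999", ".", "-", "0", "ID=cds1;Parent=mrna1"],
])

print(looks_like_gene_models(count_feature_types(alignments)))
print(looks_like_gene_models(count_feature_types(gene_models)))
```

To run the same check on a real file, read it with `open(path).read()` and pass the text to `count_feature_types`.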

olekto commented 5 years ago

Great, then I'll use that instead. Is it mentioned in the documentation? I might have missed it.

Thank you.

Ole

On Sun, 24 Mar 2019 at 17:11, Jon Palmer notifications@github.com wrote:

It is expecting the transdecoder GFF3 from PASA. This is the output that has been filtered for the best models by PASA.


nextgenusfs commented 5 years ago

It probably isn’t explicit enough. If you run funannotate train first it will do all this for you automatically.

olekto commented 5 years ago

Ok, thank you.

Can I do everything that funannotate train does automatically on my own, outside of funannotate? This is a larger genome (700 Mbp) with lots of different data (substantial RNA-seq, but also IsoSeq, 454 RNA sequencing and Sanger), so I am testing different approaches to find what seems to work best. One example is GenomeThreader mapping of proteins, which seems to give many more mapped sequences than Diamond + Exonerate.

On Sun, 24 Mar 2019 at 17:21, Jon Palmer notifications@github.com wrote:

It probably isn’t explicit enough. If you run funannotate train first it will do all this for you automatically.


nextgenusfs commented 5 years ago

Protein alignments generally aren't as informative for training Augustus as transcript alignments -- if you have RNA-seq data, what the protein alignments add is probably negligible (I haven't tested this specifically). What you need to train Augustus accurately are the intron/exon boundaries. Protein alignments place these less precisely, which is why I use exonerate: it lets you be a little more specific about the splice sites (I've never used GenomeThreader, so I can't comment on that specifically either).
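To make "intron/exon boundaries" concrete: given the exon coordinates of one aligned transcript, the introns are the gaps between consecutive exons. The sketch below assumes 1-based inclusive coordinates as in GFF3; `introns_from_exons` and the exon coordinates are hypothetical and are not taken from funannotate or Augustus.

```python
def introns_from_exons(exons):
    """Return intron intervals (1-based, inclusive) between sorted exons."""
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        # Only a real gap between two exons counts as an intron.
        if next_start > prev_end + 1:
            introns.append((prev_end + 1, next_start - 1))
    return introns


# Hypothetical exon coordinates for a single transcript.
exons = [(1000, 1200), (1500, 1650), (2400, 3019)]
print(introns_from_exons(exons))  # → [(1201, 1499), (1651, 2399)]
```

An alignment that is off by even a few bases at an exon edge shifts these intervals, which is why evidence with precise splice sites matters for training.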

The StringTie input is used only if you have CodingQuarry installed -- although I don't think I would use CodingQuarry with your genome. It was written specifically for fungi and tends to over-predict (somewhat fragmented) gene models for larger genomes.

But yes, you can run the commands manually.