nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License
322 stars 85 forks source link

funannotate-p2g.py failed (?) #117

Closed gskrasnov closed 6 years ago

gskrasnov commented 6 years ago

Dear Jon,

I looked thoroughly at my previous funannotate results (derived a month ago) and found that funannotate-p2g.py seems to fail. As a result, finally, out of 500,000 proteins, tblastn had pre-filtered only found 349 potential alignments, that doesn't seem right, as you have mentioned previously.

Look at the file protein_alignments.gff3:

Only ten scaffold are mapped to UniProt (complete) database. jcf7180002610509, jcf7180002620504, jcf7180002551236, jcf7180002626442, jcf7180002625814, jcf7180002625434, jcf7180002623204, jcf7180002625879, jcf7180002625719

They are mapped to: 14332_ARATH, 14332_ORYSJ, 14331_ARATH, 14332_MAIZE, 14331_SOLLC, 14310_SOLLC, 14331_SOLTU, 14331_MAIZE The targets are expected (a.thaliana, oryza sativa, maize,... - I'm analyzing flax genome)

But this is surprising since all these entries are occurring in the very beginning of UniProt data file. An output seems to be truncated somewhere…

exonerate.out, p2g.tblastn.out, hints.P.gff, protein_alignments.gff3 seem to be incomplete.

I looked at funannotate-p2g.log and found:

2017-10-18 22:13:11,031: /home/linuxbrew/.linuxbrew/Cellar/funannotate/0.7.2/libexec/bin/funannotate-p2g.py -p 901.fun_out/predict_misc/proteins.combined.fa -g /mnt/raid/illumina/geo/Flax-2017.Annotation/901.fun_out/predict_misc/genome.softmasked.fa -o 901.fun_out/predict_misc/exonerate.out --maxintron 20000 --cpus 60 --ploidy 2 -f tblastn --tblastn_out 901.fun_out/predict_misc/p2g.tblastn.out --logfile 901.fun_out/logfiles/funannotate-p2g.log

2017-10-18 22:13:11,741: Using 543,411 proteins as queries
2017-10-18 22:13:11,742: BLAST v2.6.0+; Exonerate v2.2.0
2017-10-18 22:13:11,742: Running pre-filter tBLASTN step
2017-10-18 22:13:11,743: dustmasker -in /mnt/raid/illumina/geo/Flax-2017.Annotation/901.fun_out/predict_misc/genome.softmasked.fa -infmt fasta -parse_seqids -outfmt maskinfo_asn1_bin -out genome_dust.asnb
2017-10-18 22:13:54,066: makeblastdb -in /mnt/raid/illumina/geo/Flax-2017.Annotation/901.fun_out/predict_misc/genome.softmasked.fa -dbtype nucl -parse_seqids -mask_data genome_dust.asnb -out genome
2017-10-18 22:14:08,167: 

Building a new DB, current time: 10/18/2017 22:13:54
New DB name:   /mnt/raid/illumina/geo/Flax-2017.Annotation/p2g_38002/genome
New DB title:  /mnt/raid/illumina/geo/Flax-2017.Annotation/901.fun_out/predict_misc/genome.softmasked.fa
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Mask file: genome_dust.asnb
Adding sequences from FASTA; added 76180 sequences in 12.9544 seconds.

2017-10-18 22:14:08,167: tblastn -num_threads 60 -db genome -query /mnt/raid/illumina/geo/Flax-2017.Annotation/901.fun_out/predict_misc/proteins.combined.fa -max_target_seqs 10 -db_soft_mask 11 -threshold 999 -max_intron_length 20000 -evalue 1e-10 -outfmt 6 -out filter.tblastn.tab
2017-10-18 22:14:14,125: Found 349 preliminary alignments
2017-10-18 22:14:37,092: Exonerate finished: found 19 alignments

Only 19 alignment have been found. Very strange. You can find exonerate.out here:

I decided to run the command above once more and got 137 alignments instead of 19.... :

2017-12-05 00:02:04,691: /home/linuxbrew/.linuxbrew/Cellar/funannotate/0.7.2/libexec/bin/funannotate-p2g.py -p 901.fun_out/predict_misc/proteins.combined.fa -g /mnt/raid/illumina/geo/Flax-2017.Annotation/901.fun_out/predict_misc/genome.softmasked.fa -o 901.fun_out/predict_misc/exonerate.out --maxintron 20000 --cpus 60 --ploidy 2 -f tblastn --tblastn_out 901.fun_out/predict_misc/p2g.tblastn.out --logfile 901.fun_out/logfiles/funannotate-p2g.log

2017-12-05 00:02:05,448: Using 543,411 proteins as queries
2017-12-05 00:02:05,449: BLAST v2.6.0+; Exonerate v2.2.0
2017-12-05 00:02:05,449: Running pre-filter tBLASTN step
2017-12-05 00:02:05,449: dustmasker -in /mnt/raid/illumina/geo/Flax-2017.Annotation/901.fun_out/predict_misc/genome.softmasked.fa -infmt fasta -parse_seqids -outfmt maskinfo_asn1_bin -out genome_dust.asnb
2017-12-05 00:02:47,880: makeblastdb -in /mnt/raid/illumina/geo/Flax-2017.Annotation/901.fun_out/predict_misc/genome.softmasked.fa -dbtype nucl -parse_seqids -mask_data genome_dust.asnb -out genome
2017-12-05 00:03:01,509: 

Building a new DB, current time: 12/05/2017 00:02:47
New DB name:   /mnt/raid/illumina/geo/Flax-2017.Annotation/p2g_33211/genome
New DB title:  /mnt/raid/illumina/geo/Flax-2017.Annotation/901.fun_out/predict_misc/genome.softmasked.fa
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Mask file: genome_dust.asnb
Adding sequences from FASTA; added 76180 sequences in 12.5124 seconds.

2017-12-05 00:03:01,509: tblastn -num_threads 60 -db genome -query /mnt/raid/illumina/geo/Flax-2017.Annotation/901.fun_out/predict_misc/proteins.combined.fa -max_target_seqs 10 -db_soft_mask 11 -threshold 999 -max_intron_length 20000 -evalue 1e-10 -outfmt 6 -out filter.tblastn.tab
2017-12-05 00:03:14,306: Found 1,541 preliminary alignments
2017-12-05 00:03:41,508: Exonerate finished: found 137 alignments

You can find exonerate output file here:

Thanks in advance!

gskrasnov commented 6 years ago

Surprisingly, I reduced thread count from 60 to 8 and got 178 alignments instead of 137:

2017-12-05 00:20:07,119: /home/linuxbrew/.linuxbrew/Cellar/funannotate/0.7.2/libexec/bin/funannotate-p2g.py -p 901.fun_out/predict_misc/proteins.combined.fa -g /mnt/raid/illumina/geo/Flax-2017.Annotation/901.fun_out/predict_misc/genome.softmasked.fa -o 901.fun_out/predict_misc/exonerate.out --maxintron 20000 --cpus 8 --ploidy 2 -f tblastn --tblastn_out 901.fun_out/predict_misc/p2g.tblastn.out --logfile 901.fun_out/logfiles/funannotate-p2g.log

2017-12-05 00:20:07,876: Using 543,411 proteins as queries
2017-12-05 00:20:07,877: BLAST v2.6.0+; Exonerate v2.2.0
2017-12-05 00:20:07,877: Running pre-filter tBLASTN step
2017-12-05 00:20:07,877: dustmasker -in /mnt/raid/illumina/geo/Flax-2017.Annotation/901.fun_out/predict_misc/genome.softmasked.fa -infmt fasta -parse_seqids -outfmt maskinfo_asn1_bin -out genome_dust.asnb
2017-12-05 00:20:50,415: makeblastdb -in /mnt/raid/illumina/geo/Flax-2017.Annotation/901.fun_out/predict_misc/genome.softmasked.fa -dbtype nucl -parse_seqids -mask_data genome_dust.asnb -out genome
2017-12-05 00:21:04,276: 

Building a new DB, current time: 12/05/2017 00:20:50
New DB name:   /mnt/raid/illumina/geo/Flax-2017.Annotation/p2g_38431/genome
New DB title:  /mnt/raid/illumina/geo/Flax-2017.Annotation/901.fun_out/predict_misc/genome.softmasked.fa
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Mask file: genome_dust.asnb
Adding sequences from FASTA; added 76180 sequences in 12.7572 seconds.

2017-12-05 00:21:04,277: tblastn -num_threads 8 -db genome -query /mnt/raid/illumina/geo/Flax-2017.Annotation/901.fun_out/predict_misc/proteins.combined.fa -max_target_seqs 10 -db_soft_mask 11 -threshold 999 -max_intron_length 20000 -evalue 1e-10 -outfmt 6 -out filter.tblastn.tab
2017-12-05 00:26:41,304: Found 9,512 preliminary alignments
2017-12-05 00:29:31,264: Exonerate finished: found 178 alignments
nextgenusfs commented 6 years ago

tblastn multithreaded is broken in most blast+ versions passed 2.2.31 (or something around there), surprisingly they still haven't fixed in 2.6.0 apparently. If you run it in single threaded mode it will work, but takes much too long. Solution is to install older version of tblastn and have it in $PATH before this 2.6.0 version. In the upcoming release I've changed all these searches to use Diamond, you can use diamond by passing --use_diamond.

gskrasnov commented 6 years ago

Thank you, I will try it

gskrasnov commented 6 years ago

I tried the most recent version of funannotate predict (1.0.0) and almost everything went OK. Funannotate suggests me to run PASA pipeline:

Your next step to capture UTRs and update annotation using PASA:

funannotate update -i 901.fun_out.wo.RepMod --cpus 60 \
        --left illumina_forward_RNAseq_R1.fastq.gz \
        --right illumina_forward_RNAseq_R2.fastq.gz \
        --jaccard clip

However, I have already run PASA previously and supplied "--pasa_gff ${DB_NAME}.assemblies.fasta.transdecoder.genome.gff3" to funannotate predict

Should I launch funannotate update now? Maybe I don't understand something...

Here is the complete log:

2017-12-21 11:44:16,857: /home/linuxbrew/.linuxbrew/Cellar/funannotate/0.7.2/libexec/bin/funannotate-predict.py -i 901.masurca.scf.longonly.fasta -o 901.fun_out.wo.RepMod --species Linum usitatissimum --pasa_gff Flax_901_PASA6.assemblies.fasta.transdecoder.genome.gff3 --rna_bam STAR.Flax_27_29_30.mapped.on.901.masurca.scf.longonly.sorted.bam --transcript_evidence Flax_27_29_30.901.merged.clean.fasta --cpus 60 --repeatmasker_species arabidopsis --busco_seed_species arabidopsis --busco_db embryophyta --organism other --ploidy 2 --max_intronlen 20000 --database /mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/funannotate.DB --SeqCenter EIMB_CCU --name LU_

2017-12-21 11:44:16,874: OS: linux2, 64 cores, ~ 1057 GB RAM. Python: 2.7.12
2017-12-21 11:44:17,038: Running funannotate v1.0.0
2017-12-21 11:44:18,173: Augustus training set for linum_usitatissimum already exists, using existing parameters.
        If you want to re-train, provide a unique name for the --augustus_species argument
2017-12-21 11:44:19,792: AUGUSTUS (3.2.2) detected, version seems to be compatible with BRAKER1 and BUSCO
2017-12-21 11:44:22,946: /home/linuxbrew/.linuxbrew/opt/evidencemodeler/EvmUtils/gff3_gene_prediction_file_validator.pl 901.fun_out.wo.RepMod/predict_misc/pasa_predictions.gff3
2017-12-21 11:44:58,432: Soft-masking: running RepeatMasker using arabidopsis species
2017-12-21 12:03:39,483: rmOutToGFF3.pl genome.fasta.out
2017-12-21 12:04:09,755: Masked genome: 4,836 scaffolds; 310,807,115 bp; 5.60% repeats masked
2017-12-21 12:04:10,481: Aligning transcript evidence to genome with GMAP
2017-12-21 12:17:04,550: 214,063 transcripts aligned with GMAP
2017-12-21 12:17:04,551: Aligning transcript evidence to genome with BLAT
2017-12-21 12:17:04,551: blat -noHead -minIdentity=80 -maxIntron=20000 /mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/901.fun_out.wo.RepMod/predict_misc/genome.softmasked.fa 901.fun_out.wo.RepMod/predict_misc/transcripts.combined.fa 901.fun_out.wo.RepMod/predict_misc/blat.psl
2017-12-21 14:35:37,743: Loaded 310807115 letters in 4836 sequences
Searched 268602224 bases in 216742 sequences

2017-12-21 14:35:37,744: pslCDnaFilter -minId=0.9 -localNearBest=0.005 -ignoreNs -bestOverlap 901.fun_out.wo.RepMod/predict_misc/blat.psl 901.fun_out.wo.RepMod/predict_misc/blat.filt.psl
2017-12-21 14:35:47,576:                        seqs    aligns
             total: 214140  1670218
        weird over: 102 252
     drop minIdent: 66356   474795
      drop overlap: 8072    39428
    drop localBest: 187037  894837
        kept weird: 102 120
              kept: 214007  261158

2017-12-21 14:35:47,577: sort -n -k 16,16 901.fun_out.wo.RepMod/predict_misc/blat.filt.psl
2017-12-21 14:35:48,073: sort -s -k 14,14 901.fun_out.wo.RepMod/predict_misc/blat.sort.tmp.psl
2017-12-21 14:35:49,210: /home/linuxbrew/.linuxbrew/opt/augustus/libexec/scripts/blat2hints.pl --in=901.fun_out.wo.RepMod/predict_misc/blat.sort.psl --out=901.fun_out.wo.RepMod/predict_misc/hints.E.gff --minintronlen=20 --trunkSS
2017-12-21 14:36:34,107: 
processed line 1
...........
processed line 261001

2017-12-21 14:36:34,172: 261,158 filtered BLAT alignments
2017-12-21 14:36:46,969: Mapping proteins to genome using Diamond blastx/Exonerate
2017-12-21 15:52:33,869: /home/linuxbrew/.linuxbrew/opt/augustus/libexec/scripts/exonerate2hints.pl --in=901.fun_out.wo.RepMod/predict_misc/exonerate.out --out=901.fun_out.wo.RepMod/predict_misc/hints.P.gff --minintronlen=10 --maxintronlen=20000
2017-12-21 15:52:36,113: Now launching BRAKER to train GeneMark and Augustus
2017-12-21 15:52:36,114: braker.pl --workingdir 901.fun_out.wo.RepMod/predict_misc --cores 60 --AUGUSTUS_CONFIG_PATH=/home/linuxbrew/.linuxbrew/opt/augustus/libexec/config --BAMTOOLS_PATH=/home/linuxbrew/.linuxbrew/opt/bamtools/bin --GENEMARK_PATH=/usr/local/gmes_petap --gff3 --softmasking 1 --genome=/mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/901.fun_out.wo.RepMod/predict_misc/genome.softmasked.fa --species=linum_usitatissimum --bam=/mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/STAR.Flax_27_29_30.mapped.on.901.masurca.scf.longonly.sorted.bam --useexisting
2017-12-21 17:13:24,996: perl /home/linuxbrew/.linuxbrew/opt/evidencemodeler/EvmUtils/misc/augustus_GTF_to_EVM_GFF3.pl 901.fun_out.wo.RepMod/predict_misc/braker/linum_usitatissimum/augustus.gff
2017-12-21 17:13:52,021: /home/linuxbrew/.linuxbrew/Cellar/funannotate/0.7.2/libexec/util/genemark_gtf2gff3.pl 901.fun_out.wo.RepMod/predict_misc/braker/linum_usitatissimum/GeneMark-ET/genemark.gtf
2017-12-21 17:14:01,677: perl /home/linuxbrew/.linuxbrew/opt/evidencemodeler/EvmUtils/misc/augustus_GFF3_to_EVM_GFF3.pl 901.fun_out.wo.RepMod/predict_misc/genemark.gff
2017-12-21 17:14:28,422: Pulling out high quality Augustus predictions
2017-12-21 17:14:31,155: Found 22,099 high quality predictions from Augustus (>90% exon evidence)
2017-12-21 17:17:34,859: Summary of gene models passed to EVM (weights):
-------------------------------------------------------
Augustus models (1):    70,522 
GeneMark models (1):    68,286
Hi-Q models (5):    25,128
PASA gene models (10):  64,232
Other gene models (1):  0
Total gene models:  228,168
-------------------------------------------------------
2017-12-21 17:46:43,691: 74,668 total gene models from EVM
2017-12-21 17:46:43,980: Generating protein fasta files from 74,668 EVM models
2017-12-21 17:46:43,980: /home/linuxbrew/.linuxbrew/opt/evidencemodeler/EvmUtils/gff3_file_to_proteins.pl /mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/901.fun_out.wo.RepMod/predict_misc/evm.round1.gff3 /mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/901.fun_out.wo.RepMod/predict_misc/genome.softmasked.fa
2017-12-21 17:48:02,357: now filtering out bad gene models (< 50 aa in length, transposable elements, etc).
2017-12-21 17:48:02,357: diamond blastp --sensitive --query 901.fun_out.wo.RepMod/predict_misc/evm.round1.proteins.fa --threads 60 --out 901.fun_out.wo.RepMod/predict_misc/repeats.xml --db /mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/funannotate.DB/repeats.dmnd --evalue 1e-10 --max-target-seqs 1 --outfmt 5
2017-12-21 17:48:23,253: diamond v0.9.14.115 | by Benjamin Buchfink <buchfink@gmail.com>
Licensed under the GNU AGPL <https://www.gnu.org/licenses/agpl.txt>
Check http://github.com/bbuchfink/diamond for updates.

#CPU threads: 60
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
#Target sequences to report alignments for: 1
Temporary directory: 901.fun_out.wo.RepMod/predict_misc
Opening the database...  [3.8e-05s]
Opening the input file...  [3.2e-05s]
Opening the output file...  [2.7e-05s]
Loading query sequences...  [0.15264s]
Masking queries...  [0.593554s]
Building query seed set...  [0.000486s]
Algorithm: Double-indexed
Building query histograms...  [0.300435s]
Allocating buffers...  [0.003775s]
Loading reference sequences...  [0.025333s]
Building reference histograms...  [0.121616s]
Allocating buffers...  [0.003702s]
Initializing temporary storage...  [0.014263s]
Processing query chunk 0, reference chunk 0, shape 0, index chunk 0.
Building reference index...  [0.125995s]
Building query index...  [0.639825s]
Building seed filter...  [0.032113s]
Searching alignments...  [1.36441s]
.............
Building reference index...  [0.048288s]
Building query index...  [0.059877s]
Building seed filter...  [0.023934s]
Searching alignments...  [0.063656s]
Deallocating buffers...  [0.000307s]
Computing alignments...  [3.53125s]
Deallocating reference...  [5e-06s]
Loading reference sequences...  [2.2e-05s]
Deallocating buffers...  [0.000789s]
Deallocating queries...  [0.000269s]
Loading query sequences...  [1.5e-05s]
Closing the input file...  [8e-06s]
Closing the output file...  [5.6e-05s]
Closing the database file...  [2e-06s]
Total time = 20.8472s
Reported 10348 pairwise alignments, 10586 HSPs.
10348 queries aligned.

2017-12-21 17:48:33,436: bedtools intersect -f 0.9 -a /mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/901.fun_out.wo.RepMod/predict_misc/evm.round1.gff3 -b /mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/901.fun_out.wo.RepMod/predict_misc/repeatmasker.gff3
2017-12-21 17:48:35,458: Found 10,391 gene models to remove: 1 too short; 61 span gaps; 10,348 transposable elements
2017-12-21 17:48:38,856: 64,277 gene models remaining
2017-12-21 17:48:38,857: Predicting tRNAs
2017-12-21 17:48:38,857: tRNAscan-SE -o 901.fun_out.wo.RepMod/predict_misc/tRNAscan.out /mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/901.fun_out.wo.RepMod/predict_misc/genome.softmasked.fa
2017-12-21 18:13:46,033: Use of assignment to $[ is deprecated at /home/linuxbrew/.linuxbrew/bin/tRNAscan-SE line 3563.

tRNAscan-SE v.1.23 (April 2002) - scan sequences for transfer RNAs

  Please cite: 
    Lowe, T.M. & Eddy, S.R. (1997) "tRNAscan-SE: A program for
    improved detection of transfer RNA genes in genomic sequence"
    Nucl. Acids Res. 25: 955-964.

  This program uses a modified, optimized version of tRNAscan v1.3
  (Fichant & Burks, J. Mol. Biol. 1991, 220: 659-671),
  a new implementation of a multistep weight matrix algorithm
  for identification of eukaryotic tRNA promoter regions
  (Pavesi et al., Nucl. Acids Res. 1994, 22: 1247-1256),
  as well as the RNA covariance analysis package Cove v.2.4.2
  (Eddy & Durbin, Nucl. Acids Res. 1994, 22: 2079-2088).

------------------------------------------------------------
Sequence file(s) to search:  /mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/901.fun_out.wo.RepMod/predict_misc/genome.softmasked.fa
Search Mode:                 Eukaryotic
Results written to:          901.fun_out.wo.RepMod/predict_misc/tRNAscan.out
Output format:               Tabular
Searching with:              tRNAscan + EufindtRNA -> Cove
Covariance model:            TRNA2-euk.cm
tRNAscan parameters:         Strict
EufindtRNA parameters:       Relaxed (Int Cutoff= -32.1)
------------------------------------------------------------

2017-12-21 18:13:46,127: Found 1,127 tRNA gene models
2017-12-21 18:13:46,128: bedtools intersect -v -a 901.fun_out.wo.RepMod/predict_misc/trnascan.gff3 -b 901.fun_out.wo.RepMod/predict_misc/evm.cleaned.gff3
2017-12-21 18:13:47,858: 978 tRNAscan models are valid (non-overlapping)
2017-12-21 18:13:47,858: Generating GenBank tbl annotation file
2017-12-21 18:13:59,188: Converting to final Genbank format
2017-12-21 18:13:59,366: tbl2asn -y "Annotated using funannotate v1.0.0" -N 1 -p 901.fun_out.wo.RepMod/predict_misc/tbl2asn -t /home/linuxbrew/.linuxbrew/Cellar/funannotate/0.7.2/libexec/lib/test.sbt -M n -Z 901.fun_out.wo.RepMod/predict_results/Linum_usitatissimum.discrepency.report.txt -j "[organism=Linum usitatissimum]" -V b -c fx -T -a r10u -l paired-ends
2017-12-21 18:27:40,095: [tbl2asn] This copy of tbl2asn is more than a year old.  Please download the current version.
[tbl2asn] Flatfile genome

[tbl2asn] Validating genome

2017-12-21 18:28:18,121: Collecting final annotation files for 65,255 total gene models
2017-12-21 18:28:18,122: Funannotate predict is finished, output files are in the 901.fun_out.wo.RepMod/predict_results folder
2017-12-21 18:28:18,125: There are 4 gene models that need to be fixed.
2017-12-21 18:28:18,126: Manually edit the tbl file 901.fun_out.wo.RepMod/predict_results/Linum_usitatissimum.tbl, then run:

funannotate fix -i 901.fun_out.wo.RepMod/predict_results/Linum_usitatissimum.gbk -t 901.fun_out.wo.RepMod/predict_results/Linum_usitatissimum.tbl

2017-12-21 18:28:18,126: After the problematic gene models are fixed, you can proceed with functional annotation.
2017-12-21 18:28:18,126: Your next step to capture UTRs and update annotation using PASA:

funannotate update -i 901.fun_out.wo.RepMod --cpus 60 \
        --left illumina_forward_RNAseq_R1.fastq.gz \
        --right illumina_forward_RNAseq_R2.fastq.gz \
        --jaccard clip
nextgenusfs commented 6 years ago

Thanks for giving this a try! Happy to hear it looks like it is working. I'm trying to push a release but got caught up for nearly a week trying to get a docker container setup to run funannotate (which is a nightmare and far worse than manually installing all the tools locally... but I was trying to make it more accessible to users...). At any rate, it looks like all is working as it should.

The end message in funannotate predict changes depending on the input data, since you input a BAM file, it knows you have RNAseq data and thus is suggesting that you run funannotate update. The funannotate update script is effectively a wrapper for PASA mediated annotation comparison and subsequent updating. The consensus gene models predicted by Evidence Modeler typically do not have UTR information, since you have RNAseq, you likely have that information and thus funannotate update will use the RNAseq data to extend UTRs and fix any gene models that are in disagreement with the RNAseq data. The script can also update annotation directly from an NCBI gen bank flat file, i.e. say you have previously published an annotated genome and now later on you have RNAseq data --> you can feed those data to funannotate update to fix/remove/add gene models supported by the RNA data.

The corresponding command funannotate train is meant to be run prior to predict, i.e. train -> predict -> update if you were starting from scratch. However, since you already ran Trinity/PASA prior to predict, you can run funannotate update like so to use your pre-existing PASA database:

funannotate update -i 901.fun_out.wo.RepMod --cpus 60 \
        --left illumina_forward_RNAseq_R1.fastq.gz \
        --right illumina_forward_RNAseq_R2.fastq.gz \
        --pasa_conf yourAlignAssemblyPASA.txt
        --trinity Flax_27_29_30.901.merged.clean.fasta
        --no_normalize_reads --no_trimmomatic
        --max_intronlen 20000 

So you need to provide the RNAseq data, because after the script runs 2X rounds of PASA comparison pipeline it will output updated gene models which is in the form of typically several transcripts per each locus. To pick the transcript that is the "best" it uses Kallisto to map the RNAseq reads to all of the transcripts produced by PASA and then uses the relative mapping values to choose the transcript model with most reads mapped to it. If the data is already quality trimmed then pass --no_trimmomatic and likely you don't want to run Trinity read normalization at this point so pass --no_normalize_reads. The script will parse your PASA configuration file and pull out the MySQL database name, it will then validate that the fasta headers in the database match those on your genome, if they match it will use that existing dataset. If they don't match, it will re-run the alignment PASA steps (which will take some time).

Again thanks for giving it a try and happy to hear that it seems to be working. Let me know if you have any more problems or questions. I'm working on a manual/docs so hopefully will be someplace to look this stuff up.

Jon

gskrasnov commented 6 years ago

Thank you very much, Jon Now it's all clear to me. I'll try to launch funannotate update with these options and report the run result.

George

gskrasnov commented 6 years ago

funannotate update has suddenly failed. This is the error description:

Traceback (most recent call last):
  File "/home/linuxbrew/.linuxbrew/Cellar/funannotate/0.7.2/libexec/bin/funannotate-update.py", line 1458, in <module>
    GFF2tblCombined(BestModelGFF, fastaout, cleanTRNA, finalProts, locustag, genenumber, justify, args.SeqCenter, args.S
eqAccession, TBLFile)
  File "/home/linuxbrew/.linuxbrew/Cellar/funannotate/0.7.2/libexec/bin/funannotate-update.py", line 861, in GFF2tblComb
ined
    codon_start = int(v['phase'][indexStart[0]]) + 1
ValueError: invalid literal for int() with base 10: '.'

here is funannotate-update.log :

2017-12-22 00:45:37,447: /home/linuxbrew/.linuxbrew/Cellar/funannotate/0.7.2/libexec/bin/funannotate-update.py -i 901.fun_out.wo.RepMod --cpus 60 --single Flax_27_29_30.fastq --jaccard_clip --pasa_gff Flax_901_PASA6.assemblies.fasta.transdecoder.genome.gff3 --sbt template_flax_genome_2017.11.16.mod.sbt --stranded R --trinity Flax_27_29_30.901.merged.clean.fasta --no_normalize_reads --no_trimmomatic --memory 256G --max_intronlen 15000 --SeqCenter EIMB_CCU --name LU_ --species Linum usitatissimum --pasa_config pasa.alignAssembly.2nd.round.txt

2017-12-22 00:45:37,464: OS: linux2, 64 cores, ~ 1057 GB RAM. Python: 2.7.12
2017-12-22 00:45:37,628: Running funannotate v1.0.0
2017-12-22 00:47:12,155: Reannotating Linum usitatissimum, NCBI accession: None
2017-12-22 00:47:12,357: Previous annotation consists of: 64,277 protein coding gene models and 978 non-coding gene models
2017-12-22 00:47:12,358: (None, None, '/mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/Flax_27_29_30.fastq')
2017-12-22 00:47:12,358: Trimmomatic will be skipped
2017-12-22 00:47:12,358: (None, None, '/mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/Flax_27_29_30.fastq')
2017-12-22 00:47:12,358: Read normalization will be skipped
2017-12-22 00:47:12,358: (None, None, '/mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/Flax_27_29_30.fastq')
2017-12-22 00:47:21,352: Clipped 741 poly-A tails from transcripts
2017-12-22 00:47:21,586: Using Kallisto TPM data to determine which PASA gene models to select at each locus
2017-12-22 00:47:21,587: Building Kallisto index
2017-12-22 00:47:21,587: /mnt/raid/illumina/geo/progs/PASApipeline-pasa-v2.2.0/misc_utilities/gff3_file_to_proteins.pl 901.fun_out.wo.RepMod/update_misc/pasa_final.gff3 901.fun_out.wo.RepMod/update_misc/genome.fa cDNA
2017-12-22 00:48:48,925: kallisto index -i 901.fun_out.wo.RepMod/update_misc/getBestModel/bestModel 901.fun_out.wo.RepMod/update_misc/getBestModel/transcripts.fa
2017-12-22 00:51:28,728: 
[build] loading fasta file 901.fun_out.wo.RepMod/update_misc/getBestModel/transcripts.fa
[build] k-mer length: 31
[build] warning: replaced 8406 non-ACGUT characters in the input sequence
        with pseudorandom nucleotides
[build] counting k-mers ... done.
[build] building target de Bruijn graph ...  done 
[build] creating equivalence classes ...  done
[build] target de Bruijn graph has 660399 contigs and contains 46649520 k-mers 

2017-12-22 00:51:28,729: Mapping reads using pseudoalignment in Kallisto
2017-12-22 00:51:28,729: kallisto quant -i 901.fun_out.wo.RepMod/update_misc/getBestModel/bestModel -o 901.fun_out.wo.RepMod/update_misc/getBestModel/kallisto --plaintext -t 60 --single -l 200 -s 20 /mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/Flax_27_29_30.fastq
2017-12-22 00:52:50,489: 
[quant] fragment length distribution is truncated gaussian with mean = 200, sd = 20
[index] k-mer length: 31
[index] number of targets: 64,232
[index] number of k-mers: 46,649,520
[index] number of equivalence classes: 136,619
[quant] running in single-end mode
[quant] will process file 1: /mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/Flax_27_29_30.fastq
[quant] finding pseudoalignments for the reads ... done
[quant] processed 76,335,403 reads, 62,407,757 reads pseudoaligned
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 1,216 rounds

2017-12-22 00:52:51,519: Parsing Kallisto results and selecting best gene model
2017-12-22 00:52:51,519: /mnt/raid/illumina/geo/progs/PASApipeline-pasa-v2.2.0/misc_utilities/gff3_file_to_proteins.pl 901.fun_out.wo.RepMod/update_misc/pasa_final.gff3 901.fun_out.wo.RepMod/update_misc/genome.fa prot
2017-12-22 00:54:17,844: 0 models less than minimum protein length
2017-12-22 00:54:27,235: Wrote 64,232 gene models to 901.fun_out.wo.RepMod/update_misc/bestmodels.gff3
2017-12-22 00:54:27,258: /mnt/raid/illumina/geo/progs/PASApipeline-pasa-v2.2.0/misc_utilities/gff3_file_to_proteins.pl 901.fun_out.wo.RepMod/update_misc/bestmodels.gff3 901.fun_out.wo.RepMod/update_misc/genome.fa prot
2017-12-22 00:55:51,203: bedtools intersect -v -a 901.fun_out.wo.RepMod/update_misc/genome.trna.gff3 -b 901.fun_out.wo.RepMod/update_misc/bestmodels.gff3

I believe my report will improve funannotate

nextgenusfs commented 6 years ago

Great thanks. I'll have a look. So you have only single reads and it seems like the PASA steps got skipped, probably a logic error with only single reads. And to confirm, you have the current tip master repository?

gskrasnov commented 6 years ago

I downloaded funannotate from the master repository two days ago. To confirm, I shared libexec/bin folder with funannotate pythons here: http://194.226.21.15/temp/libexec/bin/

I'm using this version

nextgenusfs commented 6 years ago

Okay, so the --pasa_gff you are specifying here isn't what this script is expecting:

funannotate-update.py -i 901.fun_out.wo.RepMod --cpus 60 --single Flax_27_29_30.fastq \
   --jaccard_clip --pasa_gff Flax_901_PASA6.assemblies.fasta.transdecoder.genome.gff3 \
   --sbt template_flax_genome_2017.11.16.mod.sbt --stranded R \
   --trinity Flax_27_29_30.901.merged.clean.fasta --no_normalize_reads \
   --no_trimmomatic --memory 256G --max_intronlen 15000 \
   --SeqCenter EIMB_CCU --name LU_ --species Linum usitatissimum \
   --pasa_config pasa.alignAssembly.2nd.round.txt

It bypassed gene updating because you passed that variable. I had included that largely for testing purposes. Can you try again and leave --pasa_gff option off?

nextgenusfs commented 6 years ago

I will remove those options from the menu so that it isn't confusing (I realize that is same option as in predict and would make sense to add it, but the script will use the existing mysql database and get that info from the --pasa_config variable). It was an option I added while writing the downstream code from that step, same with the --kallisto option.

gskrasnov commented 6 years ago

I removed --stranded R (because it's not true) and --pasa_gff options. But unfortunately the same error has occurred:

Traceback (most recent call last):
  File "/home/linuxbrew/.linuxbrew/Cellar/funannotate/0.7.2/libexec/bin/funannotate-update.py", line 1458, in <module>
    GFF2tblCombined(BestModelGFF, fastaout, cleanTRNA, finalProts, locustag, genenumber, justify, args.SeqCenter, args.S
eqAccession, TBLFile)
  File "/home/linuxbrew/.linuxbrew/Cellar/funannotate/0.7.2/libexec/bin/funannotate-update.py", line 861, in GFF2tblComb
ined
    codon_start = int(v['phase'][indexStart[0]]) + 1
ValueError: invalid literal for int() with base 10: '.'

Should I use here just the same pasa_config file as I used for the 'first pass' PASA ? I mean that mysql database name should match the previous one (Flax_901_PASA6) or it will be created de novo and I should provide the new DB name?

gskrasnov commented 6 years ago

I removed folders with incomplete previous analysis (update_results and update_misc) and got the following:

funannotate update -i $FUN_OUT_DIR --cpus 60 --single $fastq_file -
-jaccard_clip --sbt template_flax_genome_2017.11.16.mod.sbt --trinity $TRANSCRIPTS_SEQ  --no_normalize_reads --no_trimmo
matic --memory 256G --max_intronlen 15000 --SeqCenter EIMB_CCU --name LU_ --species "Linum usitatissimum" --pasa_config
pasa.alignAssembly.2nd.round.txt
-------------------------------------------------------
[01:41:13 AM]: OS: linux2, 64 cores, ~ 1057 GB RAM. Python: 2.7.12
[01:41:13 AM]: Running funannotate v1.0.0
[01:42:50 AM]: Reannotating Linum usitatissimum, NCBI accession: None
[01:42:50 AM]: Previous annotation consists of: 64,277 protein coding gene models and 978 non-coding gene models
[01:42:50 AM]: Trimmomatic will be skipped
[01:42:50 AM]: Read normalization will be skipped
[01:42:59 AM]: Clipped 741 poly-A tails from transcripts
[01:43:00 AM]: Existing PASA database contains 0 gene models, validated FASTA headers match
[01:43:01 AM]: Running PASA annotation comparison step 1
[01:43:01 AM]: PASA failed, check log, exiting

Now I supplied funannotate with just previous PASA MySQL DB name and it seems that funannotate is still running fine. This thep has passed

gskrasnov commented 6 years ago

I think that funannotate would benefit if you provide --cpus option separately for computing steps (that do not intensively read/write to disk; e.g "CPU threads") and IO-consuming steps (e.g. "IO threads"). The second one should be significantly lower. For example, now it's running cDNA_annotation_comparer from PASA which is supplied with --CPU 60 option. However, htop demonstrates very low load of each cpu core (less than 10%) with total CPU usage about 240% instead of 6000%+. Status of cDNA_annotation_comparer is either S or D. I believe that parallel 60 IO operations (or MySQL queries) causes this.

Maybe I'm wrong here - there are only 10-12 cDNA_annotation_comparer.dbi threads/processes and total disk write/read is only about 500-800 K/s

Separating IO and CPU threads is great when sorting BAM files, for example

nextgenusfs commented 6 years ago

Yes, so in first run it was using the PASA files it found in the update_misc directory hence the same error. There should be some more info about the PASA failure in the log file. So the --pasa_config you pass should have the correctly database name in the MYSQLDB= field of the config file. Note that if you run this in the way funannotate intends it is much easier, i.e. train, predict, and then the command for update is simply just funannotate update -i youroutputdir as it will automatically find the relevant files from the output of funannotate train.

About the CPUs, yes I think that PASA has a limitation of about 6-12 CPUs and after that you won't get much performance improvement. I could throttle down those script calls to i.e. 12 if more CPUs is actually causing it to slow down (it might be??).

nextgenusfs commented 6 years ago

And the more I'm looking at this more closely i think that the alignment step should be CPUS/2 because PASA splits the threads between the two aligners.

gskrasnov commented 6 years ago

Exactly, PASA does align with two aligners. And one should supply PASA with cpu_count/2.

The --ALIGNERS can take values gmap, blat, or gmap,blat, in which case both aligners will be executed in parallel. The CPU setting determines the number of threads to be used for each process. This is passed on to GMAP to indicate the thread count. In the case of BLAT, the transcript database is split into CPU number of partitions and each partition is searched separately and in parallel using BLAT. Also, note that if gmap,blat is specified, then you may have up to 2*CPU number of processes running simultaneously. // source

funannotate update has successfully finished. Thank you for your pipeline! However, I'm a bit confisung with errors like "[tbl2asn] ERROR: Qualifier 'note' has no value on tRNA feature at lcl|jcf7180002626458:<2404879->2404950, relative line 1031368". There are about 1000 such errors. Here is the funnanotate update log file:

2017-12-22 01:44:16,781: /home/linuxbrew/.linuxbrew/Cellar/funannotate/0.7.2/libexec/bin/funannotate-update.py -i 901.fun_out.wo.RepMod --cpus 60 --single Flax_27_29_30.fastq --jaccard_clip --sbt template_flax_genome_2017.11.16.mod.sbt --trinity Flax_27_29_30.901.merged.clean.fasta --no_normalize_reads --no_trimmomatic --memory 256G --max_intronlen 15000 --SeqCenter EIMB_CCU --name LU_ --species Linum usitatissimum --pasa_config pasa.alignAssembly.txt

2017-12-22 01:44:16,798: OS: linux2, 64 cores, ~ 1057 GB RAM. Python: 2.7.12
2017-12-22 01:44:16,962: Running funannotate v1.0.0
2017-12-22 01:45:51,280: Reannotating Linum usitatissimum, NCBI accession: None
2017-12-22 01:45:51,481: Previous annotation consists of: 64,277 protein coding gene models and 978 non-coding gene models
2017-12-22 01:45:51,482: (None, None, '/mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/Flax_27_29_30.fastq')
2017-12-22 01:45:51,482: Trimmomatic will be skipped
2017-12-22 01:45:51,482: (None, None, '/mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/Flax_27_29_30.fastq')
2017-12-22 01:45:51,482: Read normalization will be skipped
2017-12-22 01:45:51,482: (None, None, '/mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/Flax_27_29_30.fastq')
2017-12-22 01:46:00,662: Clipped 741 poly-A tails from transcripts
2017-12-22 01:46:00,664: /mnt/raid/illumina/geo/progs/PASApipeline-pasa-v2.2.0/scripts/pasa_asmbl_genes_to_GFF3.dbi -M Flax_901_PASA6:localhost -p george:LJT5pvaVvOff8efc
2017-12-22 01:47:21,489: Existing PASA database contains 93,635 gene models, validated FASTA headers match
2017-12-22 01:47:21,490: /mnt/raid/illumina/geo/progs/PASApipeline-pasa-v2.2.0/bin/cdbfasta 901.fun_out.wo.RepMod/update_misc/genome.fa
2017-12-22 01:47:22,488: 4836 entries from file 901.fun_out.wo.RepMod/update_misc/genome.fa were indexed in file 901.fun_out.wo.RepMod/update_misc/genome.fa.cidx

2017-12-22 01:47:22,489: Running PASA annotation comparison step 1
2017-12-22 01:47:22,490: /mnt/raid/illumina/geo/progs/PASApipeline-pasa-v2.2.0/scripts/Launch_PASA_pipeline.pl -c /mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/901.fun_out.wo.RepMod/update_misc/pasa/annotCompare.txt -g /mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/901.fun_out.wo.RepMod/update_misc/genome.fa -t /mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/901.fun_out.wo.RepMod/update_misc/trinity.fasta -A -L --annots_gff3 /mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/901.fun_out.wo.RepMod/update_misc/genome.gff3 --CPU 60
2017-12-22 05:35:04,068: 

## Processing CMD: 
01:47:22    CMD: /mnt/raid/illumina/geo/progs/PASApipeline-pasa-v2.2.0/scripts/Load_Current_Gene_Annotations.dbi -c /mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/901.fun_out.wo.RepMod/update_misc/pasa/annotCompare.txt -g /mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/901.fun_out.wo.RepMod/update_misc/genome.fa -P /mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/901.fun_out.wo.RepMod/update_misc/genome.gff3  > pasa_run.log.dir/output.annot_loading.2975.out

## Processing CMD: 
01:49:38    CMD: /mnt/raid/illumina/geo/progs/PASApipeline-pasa-v2.2.0/scripts/cDNA_annotation_comparer.dbi -G /mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/901.fun_out.wo.RepMod/update_misc/genome.fa --CPU 60 -M Flax_901_PASA6  > pasa_run.log.dir/Flax_901_PASA6.annotation_compare.2975.out

## Processing CMD: 
05:28:33    CMD: /mnt/raid/illumina/geo/progs/PASApipeline-pasa-v2.2.0/scripts/dump_valid_annot_updates.dbi -M Flax_901_PASA6 -V -R -g /mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/901.fun_out.wo.RepMod/update_misc/genome.fa > Flax_901_PASA6.gene_structures_post_PASA_updates.2975.gff3

## Processing CMD: 
05:34:26    CMD: /mnt/raid/illumina/geo/progs/PASApipeline-pasa-v2.2.0/scripts/../misc_utilities/gff3_file_to_bed.pl Flax_901_PASA6.gene_structures_post_PASA_updates.2975.gff3 > Flax_901_PASA6.gene_structures_post_PASA_updates.2975.bed

##########################################################################
Finished.  Please visit the Assembly and Annotation Comparison results at:
http://pasa-dev.tigr.org/cgi-bin//status_report.cgi?db=Flax_901_PASA6
##########################################################################

2017-12-22 05:35:04,069: 
Warning, no genes retrieved for contig_id: jcf7180002551065
.......................
Warning, no genes retrieved for contig_id: jcf7180002626471
Warning, no genes retrieved for contig_id: jcf7180002626472

2017-12-22 05:35:04,069: Running PASA annotation comparison step 2
2017-12-22 05:35:04,070: /mnt/raid/illumina/geo/progs/PASApipeline-pasa-v2.2.0/scripts/Launch_PASA_pipeline.pl -c /mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/901.fun_out.wo.RepMod/update_misc/pasa/annotCompare.txt -g /mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/901.fun_out.wo.RepMod/update_misc/genome.fa -t /mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/901.fun_out.wo.RepMod/update_misc/trinity.fasta -A -L --annots_gff3 /mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/901.fun_out.wo.RepMod/update_misc/pasa/Flax_901_PASA6.gene_structures_post_PASA_updates.2975.gff3 --CPU 60
2017-12-22 10:20:01,673: 

## Processing CMD: 
05:35:04    CMD: /mnt/raid/illumina/geo/progs/PASApipeline-pasa-v2.2.0/scripts/Load_Current_Gene_Annotations.dbi -c /mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/901.fun_out.wo.RepMod/update_misc/pasa/annotCompare.txt -g /mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/901.fun_out.wo.RepMod/update_misc/genome.fa -P /mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/901.fun_out.wo.RepMod/update_misc/pasa/Flax_901_PASA6.gene_structures_post_PASA_updates.2975.gff3  > pasa_run.log.dir/output.annot_loading.10029.out

## Processing CMD: 
05:37:41    CMD: /mnt/raid/illumina/geo/progs/PASApipeline-pasa-v2.2.0/scripts/cDNA_annotation_comparer.dbi -G /mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/901.fun_out.wo.RepMod/update_misc/genome.fa --CPU 60 -M Flax_901_PASA6  > pasa_run.log.dir/Flax_901_PASA6.annotation_compare.10029.out

## Processing CMD: 
10:12:54    CMD: /mnt/raid/illumina/geo/progs/PASApipeline-pasa-v2.2.0/scripts/dump_valid_annot_updates.dbi -M Flax_901_PASA6 -V -R -g /mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/901.fun_out.wo.RepMod/update_misc/genome.fa > Flax_901_PASA6.gene_structures_post_PASA_updates.10029.gff3

## Processing CMD: 
10:19:23    CMD: /mnt/raid/illumina/geo/progs/PASApipeline-pasa-v2.2.0/scripts/../misc_utilities/gff3_file_to_bed.pl Flax_901_PASA6.gene_structures_post_PASA_updates.10029.gff3 > Flax_901_PASA6.gene_structures_post_PASA_updates.10029.bed

##########################################################################
Finished.  Please visit the Assembly and Annotation Comparison results at:
http://pasa-dev.tigr.org/cgi-bin//status_report.cgi?db=Flax_901_PASA6
##########################################################################

2017-12-22 10:20:01,674: 
Warning, no genes retrieved for contig_id: jcf7180002551065
........................................
Warning, no genes retrieved for contig_id: jcf7180002626472

2017-12-22 10:20:01,675: copying final PASA GFF3 to output: 901.fun_out.wo.RepMod/update_misc/pasa/Flax_901_PASA6.gene_structures_post_PASA_updates.10029.gff3
2017-12-22 10:20:01,896: Using Kallisto TPM data to determine which PASA gene models to select at each locus
2017-12-22 10:20:01,897: Building Kallisto index
2017-12-22 10:20:01,897: /mnt/raid/illumina/geo/progs/PASApipeline-pasa-v2.2.0/misc_utilities/gff3_file_to_proteins.pl 901.fun_out.wo.RepMod/update_misc/pasa_final.gff3 901.fun_out.wo.RepMod/update_misc/genome.fa cDNA
2017-12-22 10:21:27,956: kallisto index -i 901.fun_out.wo.RepMod/update_misc/getBestModel/bestModel 901.fun_out.wo.RepMod/update_misc/getBestModel/transcripts.fa
2017-12-22 10:24:52,970: 
[build] loading fasta file 901.fun_out.wo.RepMod/update_misc/getBestModel/transcripts.fa
[build] k-mer length: 31
[build] warning: replaced 1402 non-ACGUT characters in the input sequence
        with pseudorandom nucleotides
[build] counting k-mers ... done.
[build] building target de Bruijn graph ...  done 
[build] creating equivalence classes ...  done
[build] target de Bruijn graph has 991286 contigs and contains 66128426 k-mers 

2017-12-22 10:24:52,970: Mapping reads using pseudoalignment in Kallisto
2017-12-22 10:24:52,971: kallisto quant -i 901.fun_out.wo.RepMod/update_misc/getBestModel/bestModel -o 901.fun_out.wo.RepMod/update_misc/getBestModel/kallisto --plaintext -t 60 --single -l 200 -s 20 /mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/Flax_27_29_30.fastq
2017-12-22 10:26:23,332: 
[quant] fragment length distribution is truncated gaussian with mean = 200, sd = 20
[index] k-mer length: 31
[index] number of targets: 76,291
[index] number of k-mers: 66,128,426
[index] number of equivalence classes: 186,651
[quant] running in single-end mode
[quant] will process file 1: /mnt/raid/illumina/geo/Flax-2017.Annotation/funannotate/Flax_27_29_30.fastq
[quant] finding pseudoalignments for the reads ... done
[quant] processed 76,335,403 reads, 67,509,564 reads pseudoaligned
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 1,050 rounds

2017-12-22 10:26:24,233: Parsing Kallisto results and selecting best gene model
2017-12-22 10:26:24,233: /mnt/raid/illumina/geo/progs/PASApipeline-pasa-v2.2.0/misc_utilities/gff3_file_to_proteins.pl 901.fun_out.wo.RepMod/update_misc/pasa_final.gff3 901.fun_out.wo.RepMod/update_misc/genome.fa prot
2017-12-22 10:27:51,930: 0 models less than minimum protein length
2017-12-22 10:28:00,451: Wrote 64,104 gene models to 901.fun_out.wo.RepMod/update_misc/bestmodels.gff3
2017-12-22 10:28:00,470: /mnt/raid/illumina/geo/progs/PASApipeline-pasa-v2.2.0/misc_utilities/gff3_file_to_proteins.pl 901.fun_out.wo.RepMod/update_misc/bestmodels.gff3 901.fun_out.wo.RepMod/update_misc/genome.fa prot
2017-12-22 10:29:13,348: bedtools intersect -v -a 901.fun_out.wo.RepMod/update_misc/genome.trna.gff3 -b 901.fun_out.wo.RepMod/update_misc/bestmodels.gff3
2017-12-22 10:29:28,292: Converting to Genbank format
2017-12-22 10:29:28,487: tbl2asn -y "Annotated using funannotate v1.0.0" -N 1 -p 901.fun_out.wo.RepMod/update_misc/tbl2asn -t template_flax_genome_2017.11.16.mod.sbt -M n -Z 901.fun_out.wo.RepMod/update_results/Linum_usitatissimum.discrepency.report.txt -j "[organism=Linum usitatissimum]" -V b -c fx -T -a r10u -l paired-ends
2017-12-22 10:43:11,888: [tbl2asn] This copy of tbl2asn is more than a year old.  Please download the current version.
[tbl2asn] ERROR: Qualifier 'note' has no value on tRNA feature at lcl|jcf7180002550916:c>86499-<86420, relative line 307
[tbl2asn] ERROR: Qualifier 'note' has no value on tRNA feature at lcl|jcf7180002550919:<70429->70501, relative line 1784
......................................
[tbl2asn] ERROR: Qualifier 'note' has no value on tRNA feature at lcl|jcf7180002626458:<2404879->2404950, relative line 1031368
[tbl2asn] ERROR: Qualifier 'note' has no value on tRNA feature at lcl|jcf7180002626458:<2460565->2460638, relative line 1031537
[tbl2asn] ERROR: Qualifier 'note' has no value on tRNA feature at lcl|jcf7180002626458:c>3017198-<3017125, relative line 1033227
[tbl2asn] Flatfile genome

[tbl2asn] Validating genome

2017-12-22 10:43:12,758: Collecting final annotation files
2017-12-22 10:43:50,921: Parsing GenBank files...comparing annotation
2017-12-22 10:45:39,742: Updated annotation complete:
-------------------------------------------------------
Total Gene Models:  65,040
New Gene Models:    0
No Change in Model: 36,466
Update UTRs:        24,960
Exons Changed:      22
Exons/CDS Changed:  3,237
Exon Changed/Prot same: 78
Dropped Models:     215
-------------------------------------------------------
2017-12-22 10:45:40,261: Funannotate update is finished, output files are in the 901.fun_out.wo.RepMod/update_results folder
2017-12-22 10:45:40,263: Your next step might be functional annotation, suggested commands:
-------------------------------------------------------
Run InterProScan (Docker required): 
/home/linuxbrew/.linuxbrew/Cellar/funannotate/0.7.2/libexec/util/interproscan_docker.sh -i=901.fun_out.wo.RepMod/update_results/Linum_usitatissimum.proteins.fa -c=60

Run antiSMASH: 
funannotate remote -i 901.fun_out.wo.RepMod -m antismash -e youremail@server.edu

Annotate Genome: 
funannotate annotate -i 901.fun_out.wo.RepMod --iprscan 901.fun_out.wo.RepMod/update_results/Linum_usitatissimum.proteins.fa.xml --cpus 60 --sbt yourSBTfile.txt
-------------------------------------------------------

There is version 0.7.2 mentioned in log files. Don' worry about this - I just replaced all the content of libexec/bin, libexec/lib etc. folders with the most recent content from github.

nextgenusfs commented 6 years ago

I have not seen this error before:

[tbl2asn] ERROR: Qualifier 'note' has no value on tRNA feature at lcl|jcf7180002626458:<2404879->2404950, relative line 1031368

Are you able to show me the gene models that are causing the error in the GenBank tbl file? There could be a problem with the format of the tbl file I guess.

gskrasnov commented 6 years ago

I've opened access to the analysis dir these models have already appeared during funannotate predict run in the Linum_usitatissimum.validation.txt file. There are about 2000 such cases.

gskrasnov commented 6 years ago

I think this issue may come from the fact that I did not run funannotate fix before funannotate update. I'm trying to run funannotate fix now for another genome assembly and I will report if these errors occur againg

gskrasnov commented 6 years ago

No. Even if I have run funannotate fix before funannotate update, I get the same errors (funannotate-update.log). For example:

[tbl2asn] ERROR: Qualifier 'note' has no value on tRNA feature at lcl|jcf7180002355145:<22273->22345, relative line 10326
....

and so on.

./update_results/Linum_usitatissimum.tbl:

<22273  >22345  gene
            locus_tag   LX_000642
<22273  >22345  tRNA
            product tRNA-Asp
            note    

./update_results/Linum_usitatissimum.gff3

jcf7180002355145    GenBank gene    22273   22345   .   +   .   ID=LX_000642;
jcf7180002355145    GenBank tRNA    22273   22345   .   +   .   ID=LX_000642;Parent=LX_000642;product=tRNA-Asp;
jcf7180002355145    GenBank exon    22273   22345   .   +   .   ID=LX_000642.exon1;Parent=LX_000642;

Or another error:

[tbl2asn] ERROR: Qualifier 'note' has no value on tRNA feature at lcl|jcf7180002390531:c>34725-<34653, relative line 85921

For some errors like these ones, I did not find the corresponding gene models...

nextgenusfs commented 6 years ago

Ok thanks. Looks like an empty note annotation is the error, I'll see if I can find out where its coming from. To confirm, it is only happening on tRNA gene models? And did this error happening with funannotate predict results or only after funannotate update?

nextgenusfs commented 6 years ago

I'm traveling for the holidays, so I don't have a lot of time to test, but should later this week. I think these fixes should solve the issues, https://github.com/nextgenusfs/funannotate/commit/96f6fe5ebd9e3af623eb9d2908efd06252cc2bfe. Basically the update tbl generating script was outputting an empty "note" section causing the error, although I don't think it is a fatal error. I then also tried to add an internal stop check to the funannotate update script to prevent some of the other errors you had with interproscan, i.e. I don't want the script to choose models with internals stops which will cause all sorts of downstream problems. Let me know if this works.

gskrasnov commented 6 years ago

Yes, it's only happening for tRNA models. I will test these updates I posted a comment in the 96f6fe5 fix concerning Interproscan

Thank you. Merry Christmas!

gskrasnov commented 6 years ago

I have run the most recent versions of funaannotate predict, funaannotate fix and funannotate update. They are fixed (96f6fe5 fix). Unfortunately, the bug did not disappeared. I pointed that file all_proteins.fasta does not perfectly correspond to Linum_usitatissimum.proteins.fa file. All 'normal' proteins which begins with methyonine, are OK. But look at two similar records in these files. all_proteins.fasta:

>FLL_000151-T1 FLL_000151   jcf7180002550921:23-2610(+)
VKGRLWANVFKFTLDLNRFYDAYCAIISNPDEESKQICLRRFIIVLYERGAMKVLCDGKL
PFTGLVEKIESELAWKAERADIMAKPNLYKLLYAFEMHRHNWRKAAGYIYEYSFRLKTEV
LVKDQKHSSLFLQERLNGISAAINALHLVHPAYAWIDPRSEKTSHFLEVAHGQSQKAQEQ
SYIDLDKLEDEFVLTSAEYTLSLANIEWSYTGKEAPSVLVNTLVQANLYDMAFTVILKFW
KGSGLRRELERAFSAMSLKCFPSKFSSTSTELNQLLLTSSEENNQCSPGMGSMYQQPKGN
SHWETLEFYLEKYKTFHAGLPVTVAETLLRTDSLIELPLWLVQMFKDGRRDRTLGMTGQE
SNAASLLRLYVNCGRHMEAVNLLIEYIDSFATARSSNLVKCKRPLSSWFPYSTVELLWCQ
LEDSIRAGHLVDQCEKLKTLLHGALHKHLKLMKVDSEDAMAAAS

Linum_usitatissimum.proteins.fa:

>FLL_000151 ncbi:FLL_000151-T1
CQGTIVG*CFQVHFRPESLL*CILCNNFKSR*RKQTDLPPAFYHSTL*AWSNEGSM*WEIAVYRFS*EDRERACLEGRACRYNGKA*PL*VALCI*NA*T*LAEGSRLYIRVFIPIKDRSVG*RSKALVTVSAREAEWYLCCN*RIASCTPSICLD*SPVRENVAFS*SSSWSVSESARTILYRS**T*G*VCADFS*IHTVTSKY*MELHRKRSTVSFGQYIGPGQPV*YGIYSYFEILEGIRAEEGTRKSLLCYVIEMLSK*IQLHLYRAEPTSFDIL*RKQSMLTWNGVHVSAT*GEFSLGNSRVLS*EV*NLPCWIASDCC*DFAPNRFPY*VASLACTNVQGWSPGQDIRNDWTRIKRCFLAPTLCKLWSAYGSCKSPDRVYRFICHSEIIKSSEM*KATIFMVPIFDGRAAVVPT*GFN*SWSSGRSMREAKNSVTWSFTQASEAYEGGF*RCNGCCFL

Indeed, all_proteins.fasta does not contain any stop-codon

These proteins have almost similar length (465 and 466). I think, the translation start is shifted in Linum_usitatissimum.proteins.fa. These proteins (and genes) seem to be incomplete (truncated at N-terminus). This happens in 984 cases of 76272 proteins, e.g. 1.3% proteins. Maybe one should exclude these incomplete genes, or make a note concerning its incompleteness and indicate the correct translation frame in gbk file...

nextgenusfs commented 6 years ago

Okay. So I'm concerned that there might be a bug somewhere. The all.proteins.fa file is generated from EVM scripts and thus should be "correct" in the sense that the script was written by Brian Haas who also wrote Trinity/PASA (so I'm fairly certain there isn't an error in the GFF3 format). The species_name.proteins.fa is generated from the GenBank file, that is the conversion from GFF to tbl format (leading to genbank flat file via tbl2asn). So there might be a problem with some of the GFF3 CDS phases (i.e. the GFF3 format is not being converted correctly). The phase in GFF3 files is kind of a stupid column and is hard to deal with, i.e. a CDS phase of a particular gene depends on the previous phase being correct. It is certainly possible that I missed something and phases aren't being recorded correctly for all gene models.

Are these data available from those links you posted previously? I think I need to look at these problematic gene models through the pipeline so I can find out where the errors are coming from.

gskrasnov commented 6 years ago

Yes, the links are working now. Here is the most recent run (see the 901.fun_out.wo.RepMod/ folder).

Concerning tRNA 'notes', it is Okay now. Thank you!

nextgenusfs commented 6 years ago

Ok thanks. I've looked at the files and it seems to only be happening to models that are 5' partial and in the GFF file have their first CDS phase be 0. So right now funannotate interprets that as a codon start of 1 (tbl format is typically GFF3 phase+1). However, these models should actually be a different codon start. I'm trying to figure out how I can tell which phase/codon_start to assign when gene models are 5' partial.

nextgenusfs commented 6 years ago

Thanks George. I think I found the bug, should be fixed here https://github.com/nextgenusfs/funannotate/commit/f6898e1e0facdac22cdc595ceee79c074e04dcf9. Basically I think the error was in the funannotate update script when it was generating the GFF3 file from the GBK file it was not grabbing the correct phase information into the GFF3, thus those models where PASA did not update their locations AND were 5' partial, had the wrong starting phase of 0 and thus the wrong translation. You will have to re-run the funannotate update script from scratch I believe, as the GFF3 file being fed to PASA will change slightly. Easiest would be to just delete the update_misc and update_results folders from the output folder and then re-issue the funannotate update command.

gskrasnov commented 6 years ago

Great thanks, Jon! I have just launched funannotate update after deleting update_* folders in the results directory.

gskrasnov commented 6 years ago

Yes, it's okay now! No stop codons and '-' symbols in the Linum_usitatissimum.proteins.fa are present.

nextgenusfs commented 6 years ago

Awesome. Let me know if you have anymore problems. I'm going to try to get a release out in next few days. So if you have any issues with functional annotation let me know. I noticed you were trying to use the antiSMASH remote script, that is currently only working with fungal genomes. They have a plantismash server, however, the API is different and I haven't figured it out yet so the command line remote plantismash is not working. You can however, submit your GBK file from funannotate update to the plantismash webserver, you then just need to supply funannotate annotate with the secondary metabolite annotated GBK file to the --antismash option.

gskrasnov commented 6 years ago

Exactly, I tried funannotate remote with antismash but got an error. I have already uploaded gbk files to the plantismash server and received the results. Thank you. Interproscan is running now (I think it will be running for the next few days). Parallelly, I plan to launch eggnogg-mapper for strNOG. And then, run funannotate annotate and funannotate compare in order to compare these two genomes currently being annotated.

nextgenusfs commented 6 years ago

Yes, interproscan is slow. Both Eggnog database and interproscan database are too large for me to try to manage within funannotate, but I would recommend running both as they generate a lot of functional annotation (GO terms, IPR terms, Gene Names, product deflines, etc). After you run annotate, you are certainly going to get a message at the end with gene name/product definitions that need to be fixed. You will also get a list of names that pass NCBI requirements and a message saying that "I need your help" to do a PR for the gene name/product definition database I'm using. Its located here. https://github.com/nextgenusfs/gene2product. Basically in previous versions of funannotate I did not pull gene names and product definitions as there wasn't a way to curate them and you end up having to spend hours and hours manually editing them. What I've done with v1.0.0 is to parse UniProt/Eggnog annotations for gene names and definitions, it then does a lookup in the gene2product database where I hope to keep the "curated" product definitions. Eventually this database will be populated and well curated and should result in very minimal manual curation when submitting to NCBI.

If you have eggnog-mapper installed locally, funannotate annotate will run it if you don't give it pre-computed results via --eggnog. In communications with eggnog-mapper developer, they claim that they diamond search method was much faster and more sensitive, so that is what funannotate annotate will run. You just have to make sure that the diamond database distributed with eggnog-mapper has been constructed with the same version of diamond, the distrubuted version was formatted with diamond v0.8.xx and is actually incompatible with diamond versions > 0.9.x, so the solution is to reformat the database. See more here: https://github.com/jhcepas/eggnog-mapper/issues/48. Simplest way is to delete the downloaded version and then just do it manually, like this:

#download protein models
wget http://eggnogdb.embl.de/download/eggnog_4.5/eggnog-mapper-data/eggnog4.clustered_proteins.fa.gz

#make diamond database
diamond makedb --in eggnog4.clustered_proteins.fa.gz --db eggnog_proteins

The HMM mediated eggnog-mapper is also fine to run, perhaps the --use-mem option is fairly fast, although I couldn't get it to work on my mac.

nextgenusfs commented 6 years ago

Thanks for all your help George, I'm going to close this thread as I've released v1.0.0. If you have more problems, please open a new issue.