nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License
322 stars 85 forks source link

question about PASA step #796

Open tothuhien opened 2 years ago

tothuhien commented 2 years ago

Hi, I've tried to annotate a fungi assembly using an available RNA-seq data. For the first run, for some reason PASA failed, but the program still run and gave me about 10k genes. For a new run, I have PASA worked, but now it gave me only about 3k genes. I tried to run busco on the generated proteins sequences, the first one gave me 97% completed score while the second one only gave me 43% completed. Do you know if this behaviour is normal, and how should I run the program in this case? Thank you for any advice. Hien

hyphaltip commented 2 years ago

Are you running, train, then predict ?

Would be helpful to look at the log files and report any errors. the logfiles folder has a funannotate-train.log file which may have errors recorded. Any errors in there to indicate the RNAseq could not be incorporated? Are you giving unassembled raw reads with a fwd and reverse file?

Does the predict step indicate it is using the RNAseq something like this where FOLDER is the name of your top level output folder you gave to funannotate?

05/06/21 22:06:48]: Found training files, will re-use these files:
  --rna_bam FOLDER/training/funannotate_train.coordSorted.bam
  --pasa_gff FOLDER/training/funannotate_train.pasa.gff3
  --stringtie FOLDER/training/funannotate_train.stringtie.gtf
  --transcript_alignments FOLDER/training/funannotate_train.transcripts.gff

In the partially complete gene predictions, are genes predicted on every contig but just sparsely or did it fail part of the way and so only a few contigs have genes? you can check with looking at the predicted genes gff file with grep -P "\tgene\t" GFF_FILE | cut -f1 | sort | uniq -c which will show you the distribution of number of genes found on each contig.

tothuhien commented 2 years ago

Hi, thanks for your respond. Yes I run train and then predict, and yes I gave unassembled raw reads with forward and reverse file. For the first run, I see this message:

[Jul 01 03:13 PM]: Found training files, will re-use these files: --stringtie fun/training/funannotate_train.stringtie.gtf [Jul 01 03:13 PM]: Skipping CodingQuarry as no --rna_bam passed [Jul 01 03:13 PM]: Parsed training data, run ab-initio gene predictors as follows: Program Training-Method augustus busco
glimmerhmm busco
snap busco [Jul 01 03:13 PM]: Loading genome assembly and parsing soft-masked repetitive sequences [Jul 01 03:13 PM]: Genome loaded: 35 scaffolds; 37,569,994 bp; 4.85% repeats masked [Jul 01 03:13 PM]: Mapping 554,696 proteins to genome using diamond and exonerate [Jul 01 08:00 PM]: Found 428,841 preliminary alignments --> aligning with exonerate [Jul 02 02:57 AM]: Exonerate finished: found 2,747 alignments [Jul 02 02:57 AM]: Running BUSCO to find conserved gene models for training ab-initio predictors [Jul 02 03:45 AM]: 865 valid BUSCO predictions found, validating protein sequences [Jul 02 03:47 AM]: 856 BUSCO predictions validated [Jul 02 03:47 AM]: Training Augustus using BUSCO gene models [Jul 02 03:47 AM]: Augustus initial training results: Feature Specificity Sensitivity nucleotides 96.4% 80.2%
exons 64.8% 62.1%
genes 39.0% 24.1%
[Jul 02 03:47 AM]: Accuracy seems low, you can try to improve by passing the --optimize_augustus option. [Jul 02 03:47 AM]: Running Augustus gene prediction using myfungi parameters [Jul 02 04:03 AM]: 8,654 predictions from Augustus [Jul 02 04:03 AM]: Pulling out high quality Augustus predictions [Jul 02 04:03 AM]: Found 31 high quality predictions from Augustus (>90% exon evidence) [Jul 02 04:03 AM]: Running SNAP gene prediction, using training data: fun/predict_misc/busco.final.gff3 [Jul 02 04:05 AM]: 12,913 predictions from SNAP [Jul 02 04:05 AM]: Running GlimmerHMM gene prediction, using training data: fun/predict_misc/busco.final.gff3 [Jul 02 04:11 AM]: 10,728 predictions from GlimmerHMM [Jul 02 04:11 AM]: Summary of gene models passed to EVM (weights): [Jul 02 04:11 AM]: EVM: partitioning input to ~ 35 genes per partition using min 1500 bp interval [Jul 02 05:43 AM]: Converting to GFF3 and collecting all EVM results Source Weight Count Augustus 1 8623 Augustus HiQ 2 31
GlimmerHMM 1 10728 snap 1 12913 Total - 32295 [Jul 02 05:43 AM]: 10,682 total gene models from EVM [Jul 02 05:43 AM]: Generating protein fasta files from 10,682 EVM models [Jul 02 05:43 AM]: now filtering out bad gene models (< 50 aa in length, transposable elements, etc). [Jul 02 05:44 AM]: Found 637 gene models to remove: 1 too short; 0 span gaps; 636 transposable elements [Jul 02 05:44 AM]: 10,045 gene models remaining [Jul 02 05:44 AM]: Predicting tRNAs [Jul 02 05:45 AM]: 245 tRNAscan models are valid (non-overlapping) [Jul 02 05:45 AM]: Generating GenBank tbl annotation file [Jul 02 05:45 AM]: Collecting final annotation files for 10,290 total gene models [Jul 02 05:45 AM]: Converting to final Genbank format [Jul 02 05:46 AM]: Funannotate predict is finished, output files are in the fun/predict_results folder

For the second run, I have this message:

[Jul 07 09:59 AM]: Found training files, will re-use these files: --rna_bam funannotate/training/funannotate_train.coordSorted.bam --pasa_gff funannotate/training/funannotate_train.pasa.gff3 --stringtie funannotate/training/funannotate_train.stringtie.gtf --transcript_alignments funannotate/training/funannotate_train.transcripts.gff3 [Jul 07 09:59 AM]: Parsed training data, run ab-initio gene predictors as follows: [4mProgram Training-Method[0m augustus pasa
codingquarry rna-bam
glimmerhmm pasa
snap pasa
[Jul 07 09:59 AM]: Loading genome assembly and parsing soft-masked repetitive sequences [Jul 07 10:00 AM]: Genome loaded: 35 scaffolds; 37,569,994 bp; 4.85% repeats masked [Jul 07 10:00 AM]: Parsed 1,762 transcript alignments from: funannotate/training/funannotate_train.transcripts.gff3 [Jul 07 10:00 AM]: Creating transcript EVM alignments and Augustus transcripts hintsfile [Jul 07 10:00 AM]: Extracting hints from RNA-seq BAM file using bam2hints [Jul 07 10:00 AM]: Mapping 554,696 proteins to genome using diamond and exonerate [Jul 07 10:04 AM]: Found 328,330 preliminary alignments with diamond in 0:03:25 --> generated FASTA files for exonerate in 0:00:45 [Jul 07 11:20 AM]: Exonerate finished in 1:16:07: found 2,547 alignments [Jul 07 11:20 AM]: Filtering PASA data for suitable training set [Jul 07 11:21 AM]: 461 of 763 models pass training parameters [Jul 07 11:21 AM]: Training Augustus using PASA gene models [Jul 07 11:21 AM]: Augustus initial training results: [4mFeature Specificity Sensitivity[0m nucleotides 97.4% 71.6%
exons 63.6% 53.4%
genes 20.7% 19.2%
[Jul 07 11:21 AM]: Accuracy seems low, you can try to improve by passing the --optimize_augustus option. [Jul 07 11:21 AM]: Running Augustus gene prediction using myfungi parameters [Jul 07 11:29 AM]: 3,538 predictions from Augustus [Jul 07 11:29 AM]: Pulling out high quality Augustus predictions [Jul 07 11:29 AM]: Found 413 high quality predictions from Augustus (>90% exon evidence) [Jul 07 11:29 AM]: Running CodingQuarry prediction using stringtie alignments [Jul 07 11:29 AM]: CMD ERROR: CodingQuarry -p 10 -f funannotate/predict_misc/genome.softmasked.fa -t funannotate/predict_misc/stringtie.gff3

This one was stuck on the step CodingQuarry but I just skip it and still have my gff3 file.

I used the command you suggested above on the 2 gff3 files and it seems that there's no obvious bias on some contigs: The first one:

1222 scaffold_1 523 scaffold_10 478 scaffold_11 501 scaffold_12 363 scaffold_13 295 scaffold_14 240 scaffold_15 182 scaffold_16 125 scaffold_17 113 scaffold_18 28 scaffold_19 1010 scaffold_2 27 scaffold_20 48 scaffold_21 9 scaffold_22 2 scaffold_23 2 scaffold_24 2 scaffold_25 1 scaffold_27 927 scaffold_3 1 scaffold_31 2 scaffold_32 853 scaffold_4 800 scaffold_5 777 scaffold_6 670 scaffold_7 547 scaffold_8 542 scaffold_9

The second one:

480 scaffold_1 166 scaffold_10 175 scaffold_11 174 scaffold_12 112 scaffold_13 99 scaffold_14 73 scaffold_15 54 scaffold_16 46 scaffold_17 45 scaffold_18 10 scaffold_19 393 scaffold_2 5 scaffold_20 56 scaffold_21 4 scaffold_22 1 scaffold_23 357 scaffold_3 317 scaffold_4 292 scaffold_5 272 scaffold_6 247 scaffold_7 207 scaffold_8 186 scaffold_9

Do you have any ideas? We expected to have ~ 10k genes like in the first run, but that one failed on pasa step. So I don't know which one should we take. Thanks.

hyphaltip commented 2 years ago

Sorry - I'm not really sure from this information - but clearly augustus is dropping genes with the data provided but both times it complains about low accuracy of the gene models predicted -I'm not sure where in this it is a PASA issue per-se but perhaps the evidence is causing conflicting signals in your predict data. It does suggest running augustus optimize - you could try to generate a better augustus training set and prediction parameters to get that optimized.

Parsed 1,762 transcript alignments from: funannotate/training/funannotate_train.transcripts.gff3 looks like reasonable number of transcript alignments but still not a ton (if you have 10k genes this means ~10% of genes are expressed and have an aligned transcript in your dataset.

so you are sure the RNAseq you are using is from the same individual or close strain to the one with the genome you are annotating?

you could examine how many of the predicted transcripts in your first annotation that produced 10k genes - how many of them have expression (you can do this quickly with kallisto).

If you look at your data and the evidence in a genome browser are there any obvious trends in why genes were called in previous version vs current one?

tothuhien commented 2 years ago

Hi. Thanks for your suggestion. I think in the first run, since pasa failed and there was no bam file neither, the program didn't use the RNAseq to predict genes. It used busco instead of pasa gene models. I rerun the program again without providing RNAseq (only predict step) and I got approximately the same genes number:

augustus busco
genemark selftraining
glimmerhmm busco
snap busco

Source Weight Count Augustus 1 8036 Augustus HiQ 2 31 GeneMark 1 12784 GlimmerHMM 1 10557 snap 1 11549 Total - 42957 11,395 total gene models from EVM 10,673 gene models remaining after removing bad gene models

And when I rerun the program with RNAseq (training + predict)

augustus pasa
genemark selftraining
glimmerhmm pasa
snap pasa

Source Weight Count Augustus 1 2816 Augustus HiQ 2 659 GeneMark 1 12832 GlimmerHMM 1 3381 pasa 6 871 snap 1 2481 Total - 23040 3,554 total gene models from EVM 3,524 gene models remaining after removing bad gene models

The RNAseq comes from the same species, not sure if same strain, and does not come from the same individual. So do you think that it's because the RNAseq do not express much or may came from different strain, and in this case is it better to go without RNAseq?

I have tried to see the gff3 files with IGV and here's a small region (the final gff3 files are myfungi_withRNA and myfungi_noRNA). I'm not sure if there's anything wrong.

Screenshot 2022-10-07 at 13 24 17

Thanks.

hyphaltip commented 2 years ago

I just upgraded augustus to 3.5.0 in bioconda - I would also try to run this as it may have impacted training with problems parsing PP files -- I would remove all the training files as I don't think it forces re-reunning these, you can also run augustus with --optimize options to improve training from the RNASeq training data?

tothuhien commented 2 years ago

Hi, I did use option --optimize_augustus when I rerun them again in the previous post. I just tried the new bioconda version you updated but I got the error at Augustus step:

augustus: ERROR Could not locate command line parameters file: fun_withRNA/predict_misc/ab_initio_parameters/augustus/parameters/aug_cmdln_parameters.json.

Not sure if I did something wrong but do you know about that error? Thanks.

hyphaltip commented 2 years ago

I've never seen that error before - maybe remove this folder fun_withRNA/predict_misc/ab_initio_parameters/augustus/ and try again?

tothuhien commented 2 years ago

Thanks, I tried that but still had the same error. I saw someone has the same problem here https://github.com/nextgenusfs/funannotate/issues/776 but I don't know how to fix in my case where I install with mamba.