Too many genes predicted

LemoAlex commented 3 years ago

Hello,

I went through the whole pipeline of funannotate and was able to obtain the Json file. The file output is :

{ "assembly": { "num_contigs": 9557, "length": 994064042, "mean_length": 104014.23, "N50": 247983, "L50": 1015, "N90": 48506, "L90": 4506, "GC_content": 40.72 }, "annotation": { "genes": 74221, "common_name": 10466, "mRNA": 66829, "tRNA": 8688, "ncRNA": 0, "rRNA": 0, "avg_gene_length": 4108.57, "transcript-level": { "CDS_transcripts": 66829, "CDS_five_utr": 0, "CDS_three_utr": 9344, "CDS_no_utr": 47625, "CDS_five_three_utr": 9860, "CDS_complete": 63706, "CDS_no-start": 1507, "CDS_no-stop": 1457, "CDS_no-start_no-stop": 159, "total_exons": 311257, "total_cds_exons": 272674, "multiple_exon_transcript": 43222, "single_exon_transcript": 23607, "avg_exon_length": 197.74, "avg_protein_length": 248.97, "functional": { "go_terms": 0, "interproscan": 0, "eggnog": 0, "pfam": 21137, "cazyme": 271, "merops": 1198, "busco": 888, "secretion": 0 }, "pct_exon_overlap_protein_evidence": 0.56, "pct_exon_overlap_transcript_evidence": 32.57 } } }

One odd thing for me is the number of genes here, which is 74k ! Other related organisns are more around 20-30K protein coding genes in general, which makes this results suspicious. Also, the 888 busco genes seem low to me as when I ran the Busco analysis separately, I had around 3k BUSCO (85%).

As input, I used the softmasked Assembly (funannotate mask using tantan), transcriptome information (using trinity / PASA through funannotate train), and protein evidence of somehow close species downloaded from uniprot. Basically that's my "pipeline".

funannotate mask -i ~/input.fasta -o ~/output.masked
funannotate train -i output.masked.fa -s RawRNAreads.fq -o fun.train.1
funannotate predict -i ~/output.masked.fa -o un.train.1/ -s "Ancistrus Triradiatus" --optimize_augustus --protein_evidence uniprot-catfish-reviewed.fasta uniprot-zebrafish-reviewed.fasta --organism other --rna_bam ~/funannotate/transcriptome.genome.alignment.bam --busco_db actinopterygii --weights codingquarry:1 --cpus 6
funannotate update -i /home/alexandre/funannotate/train/fun.train.1/ --cpus 6
.funannotate annotate -i /fun.train.1/ --cpus 6

Then what could cause this numberof genes ? Could it be that I need to mask the genome differently ? Or should I potentially filter my initial assembly for a minimum length of scaffolds, as relatively small scaffolds could potentially bias the results?

Thanks for your help,

Alexandre

nextgenusfs commented 3 years ago

Look at the predict log file and determine which ab initial predictor is yielding the high number of gene predictions, most likely fragmented. When you pass --organism other it turns off coding quarry by default as this has been shown by others to not work well for non fungi, so since you passed --weights codingquarry:1, this turned on coding quarry. My guess is that coding quarry is predicting many more genes than should be there causing fragmented models. You can simply just re-run the predict command if that is the case and change this to --weights codingquarry:0.

LemoAlex commented 3 years ago

my log file says : 134,474 predictions from CodingQuarry

I guess that's too much ! I'll run funannotate predict again!

Thanks, Alexandre

nextgenusfs commented 3 years ago

The other place you might want to spend more time would be in the masking of repeats, 40% GC content means you have probably some large AT rich regions and perhaps significant repeat regions. For various reasons (mostly though license), the default method is to run a very simple tantan repeat masking, you'd probably be better off running something more thorough if you are seeing lots of genes showing up in/around repeat regions. RepeatMasker RepBase libraries are no longer open source, so I'm not up to date with the current best practice on this.

LemoAlex commented 3 years ago

Hello again,

I am again running the pipeline, without coding quarry and with my assembly masked through repeatmasker this time. I still have some concerns about the gene predictions however, It seems I still have too many protein coding genes. When looking at the log files, I see:

Summary of gene models: {'total': 287084, 'Augustus': 45881, 'HiQ': 6125, 'GlimmerHMM': 126589, 'pasa': 28513, 'snap': 79976}

I might be wrong, but it seems glimmerHMM and snap tend to really overstimate the predicted genes? Could there be a specific reason why? They already have a weight of 1, so downscaling this would simply turn them off right?

I also planned on running the predictions with the --repeats2evm options activated. Could that help? Cheers,

Alexandre

LemoAlex commented 3 years ago

Hello,

I ended up putting glimmerHMM and Snap weights to 0 as they were predicting crazy ammount of genes (120k and 70k).

One last and stupid question, in the obtained JSon file, what would correspont to the "protein coding genes" which is often reported in genome assemblies ? Would it be it the number after "genes"? "annotation": { "genes": 30331, "common_name": 4217, "mRNA": 21392, "tRNA": 8939, "ncRNA": 0, "rRNA": 0, "avg_gene_length": 3761.26, "transcript-level": { "CDS_transcripts": 21392, "CDS_five_utr": 0, "CDS_three_utr": 0, "CDS_no_utr": 21392, "CDS_five_three_utr": 0, "CDS_complete": 19699, "CDS_no-start": 739, "CDS_no-stop": 851, "CDS_no-start_no-stop": 103, "total_exons": 113811, "total_cds_exons": 113811, "multiple_exon_transcript": 20625, "single_exon_transcript": 767, "avg_exon_length": 165.62, "avg_protein_length": 300.5, "functional": { "go_terms": 0, "interproscan": 0, "eggnog": 0, "pfam": 8330, "cazyme": 134, "merops": 553, "busco": 389, "secretion": 1765

Thank you so much for your help,

Best Regards, Alexandre

nextgenusfs commented 3 years ago

Protein coding is the 21k number. Seems like a lot of tRNA predictions at nearly 9k, is that expected?

LemoAlex commented 3 years ago

Related species have around 20-25k protein coding genes and 8-9k tRNA genes annotated, so I guess it makes sense.

Thanks for the answers!

nextgenusfs commented 3 years ago

It's always a difficult to answer question as there is no "right answer". There is some bias towards which software was used for predictions, ie augustus and genemark historically were used by themselves for lots of annotation projects. So if a diff software or method yields different results the assumption is that it is wrong, but that likely isn't completely true. Using RNA-seq from as many different conditions as possible is pretty much the only real experimental data that could shed light on the "true" number of protein coding genes.

nextgenusfs / funannotate

Too many genes predicted #537