Closed igwill closed 1 year ago
I'd interpret this to be because too few of the cross-species read align to the genome.
You can attempt to do a de-novo trinity assembly of the transcripts outside of funannotate and then give the transcript assembly fasta file to funannotate predict with the --transcript_evidence trinity.fasta
option. This would maybe rescue a few more genes in that exonerate may be able to align the longer full transcripts than the short-reads can be aligned with hisat initially to make the alignments trinity-gg uses to build local transcript alignments and assembly.
But generally if train fails with the raw RNAseq like this, it is because the short reads fail to align well enough between the species - I found that sometimes a misidentified species this was a diagnostic of a problem there, or that the RNAseq was not from right species, but either way this is caused because the RNA reads come from a species that is too distant.
If you look at the hisat2.coordSorted.bam
file in the training folder with samtools samtools flagstat hisat2.coordSorted.bam
you can see what the % mapping was for the reads. I'd predict it was really low and only you got a few really conserved genes picked up.
there are other tools designed for cross-species RNAseq alignments but in this case trinity-gg is trying to optimize alignments for within species. If you have cross-species I would focus on providing that as transcript evidence to predict step. You can also achieve some more specificity if you provide cross-species proteins as input, but neither will really help for UTR updates when you go to the 'update' step - for that you really need within-species RNAseq datasets I think.
Yes, indeed looking at hisat2.coordSorted.bam shows <2% mapped. yeesh. For now, I'll see how much worse my ab initio test is compared to the on-species RNAseq trained and see if taking extra steps outside funannotate would be worth it. But since that would be outside the scope of only this pipeline, I'll close the issue for now. Thanks for the fast response!!
Hi,
I am trying to annotate fungal genome assemblies where there is no transcriptomic data available for the exact species. My understanding is that RNAseq from a related species (~same genus or even beyond) can be better than going full ab initio. I am first piloting/benchmarking approaches with a species where I do have RNAseq available and comparing outputs when given no, same-genus, or same-order reads. for example running:
funannotate train --input softmasked.assembly.fasta --out congener_test --left SRR11547909_1_new.fastq SRR11547912_1_new.fastq SRR11547913_1_new.fastq --right SRR11547909_2_new.fastq SRR11547912_2_new.fastq SRR11547913_2_new.fastq --species "My fungus" --isolate "A" --cpus 16
When using the congener or fellow Hypocrelean fungi, I've been running into a problem with
funannotate train
getting this error on the PASA step:CMD ERROR: Launch_PASA_pipeline.pl -c training/pasa/alignAssembly.txt -r -C -R -g training/genome.fasta --IMPORT_CUSTOM_ALIGNMENTS training/trinity.alignments.gff3 -T -t training/trinity.fasta.clean -u training/trinity.fasta --stringent_alignment_overlap 30.0 --TRANSDECODER --ALT_SPLICE --MAX_INTRON_LENGTH 3000 --CPU 16 --ALIGNERS blat --trans_gtf training/funannotate_train.stringtie.gtf
Up to this point in the pipeline processing,
funannotate_train.stringtie.gtf*
has been made.trinity.fasta
is there in /training/, but notfunannotate_train.trinity-GG.fasta
. (and of course the funanntoate pasa.gff3 and coordSorted.bam are not there yet either).I can successfully complete funannotate train when using RNAseq from the same species in all three ways I tested:
The error only happens on fastq data when going across species (all data from SRA). fastq formatting from SRA: initially Trinity did not like them:
sed cleaned and used in funannotate train:
I feel like the issue is some related to low/poor mapping of the transcripts? e.g. with the congener:
Pushing it with the other fungus in the Hypocreales reduces "transcripts derived from Trinity" (although that second number is lager):
As I thought cross-species RNAseq model training was somewhat standard, I'm a little surprised that the results are so bad that PASA can't complete? Could there be some parameter I need to set in funannanote train for it to pass along to PASA or other steps?
Thank you!