nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License
320 stars 85 forks source link

funannotate train: PASA fail with congener RNAseq reads #930

Closed igwill closed 1 year ago

igwill commented 1 year ago

Hi,

I am trying to annotate fungal genome assemblies where there is no transcriptomic data available for the exact species. My understanding is that RNAseq from a related species (~same genus or even beyond) can be better than going full ab initio. I am first piloting/benchmarking approaches with a species where I do have RNAseq available and comparing outputs when given no, same-genus, or same-order reads. for example running: funannotate train --input softmasked.assembly.fasta --out congener_test --left SRR11547909_1_new.fastq SRR11547912_1_new.fastq SRR11547913_1_new.fastq --right SRR11547909_2_new.fastq SRR11547912_2_new.fastq SRR11547913_2_new.fastq --species "My fungus" --isolate "A" --cpus 16

When using the congener or fellow Hypocrelean fungi, I've been running into a problem with funannotate train getting this error on the PASA step: CMD ERROR: Launch_PASA_pipeline.pl -c training/pasa/alignAssembly.txt -r -C -R -g training/genome.fasta --IMPORT_CUSTOM_ALIGNMENTS training/trinity.alignments.gff3 -T -t training/trinity.fasta.clean -u training/trinity.fasta --stringent_alignment_overlap 30.0 --TRANSDECODER --ALT_SPLICE --MAX_INTRON_LENGTH 3000 --CPU 16 --ALIGNERS blat --trans_gtf training/funannotate_train.stringtie.gtf

Up to this point in the pipeline processing, funannotate_train.stringtie.gtf* has been made. trinity.fasta is there in /training/, but not funannotate_train.trinity-GG.fasta. (and of course the funanntoate pasa.gff3 and coordSorted.bam are not there yet either).

I can successfully complete funannotate train when using RNAseq from the same species in all three ways I tested:

The error only happens on fastq data when going across species (all data from SRA). fastq formatting from SRA: initially Trinity did not like them:

@1/1
NGATCTGAATCTTGCCGTTGGTGTCGAC
+SRR11547909.1 1 length=151
#FFFFFFFFFFFFFFFFFFFFFFFFFFF

sed cleaned and used in funannotate train:

@1/1
NGATCTGAATCTTGCCGTTGGTGTCGAC
+
#FFFFFFFFFFFFFFFFFFFFFFFFFFF

I feel like the issue is some related to low/poor mapping of the transcripts? e.g. with the congener:

## from the stdout
[Jun 30 07:40 PM]: Assembling 131 Trinity clusters using 15 CPUs
[Jun 30 07:43 PM]: 143 transcripts derived from Trinity # whereas my on-species tests had this number at >20k
     Progress: 131 complete, 0 failed, 0 remaining

## from funannotate-train.log
# trimmomatic info, we got plenty of reads
ILLUMINACLIP: Using 1 prefix pairs, 0 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Input Read Pairs: 72794902 Both Surviving: 53902809 (74.05%) Forward Only Surviving: 18891456 (25.95%) Reverse Only Surviving: 0 (0.00%) Dropped: 637 (0.00%)
TrimmomaticPE: Completed successfully

# a little later in the logfile
 Launching actual cleaning process:
 psx -p 16  -n 1000  -i trinity.fasta -d cleaning -C trinity.fasta:ANLMS100:::11:0 -c pasa-2.5.2/bin/seqclean.psx
Collecting cleaning reports

**************************************************
Sequences analyzed:       143
-----------------------------------
                   valid:       143  (0 trimmed)
                 trashed:         0
**************************************************
# end of logfile
[06/30/23 19:43:48]: Converting transcript alignments to GFF3 format
[06/30/23 19:43:48]: Converting Trinity transcript alignments to GFF3 format
[06/30/23 19:43:49]: Running PASA alignment step using 143 transcripts
# ...
[06/30/23 19:44:28]: CMD ERROR: # .... as the error above in the post

Pushing it with the other fungus in the Hypocreales reduces "transcripts derived from Trinity" (although that second number is lager):

[Jun 30 05:47 PM]: Assembling 619 Trinity clusters using 15 CPUs
[Jun 30 05:54 PM]: 58 transcripts derived from Trinity
     Progress: 619 complete, 0 failed, 0 remaining

As I thought cross-species RNAseq model training was somewhat standard, I'm a little surprised that the results are so bad that PASA can't complete? Could there be some parameter I need to set in funannanote train for it to pass along to PASA or other steps?

Thank you!

hyphaltip commented 1 year ago

I'd interpret this to be because too few of the cross-species read align to the genome.

You can attempt to do a de-novo trinity assembly of the transcripts outside of funannotate and then give the transcript assembly fasta file to funannotate predict with the --transcript_evidence trinity.fasta option. This would maybe rescue a few more genes in that exonerate may be able to align the longer full transcripts than the short-reads can be aligned with hisat initially to make the alignments trinity-gg uses to build local transcript alignments and assembly.

But generally if train fails with the raw RNAseq like this, it is because the short reads fail to align well enough between the species - I found that sometimes a misidentified species this was a diagnostic of a problem there, or that the RNAseq was not from right species, but either way this is caused because the RNA reads come from a species that is too distant.

If you look at the hisat2.coordSorted.bam file in the training folder with samtools samtools flagstat hisat2.coordSorted.bam you can see what the % mapping was for the reads. I'd predict it was really low and only you got a few really conserved genes picked up.

there are other tools designed for cross-species RNAseq alignments but in this case trinity-gg is trying to optimize alignments for within species. If you have cross-species I would focus on providing that as transcript evidence to predict step. You can also achieve some more specificity if you provide cross-species proteins as input, but neither will really help for UTR updates when you go to the 'update' step - for that you really need within-species RNAseq datasets I think.

igwill commented 1 year ago

Yes, indeed looking at hisat2.coordSorted.bam shows <2% mapped. yeesh. For now, I'll see how much worse my ab initio test is compared to the on-species RNAseq trained and see if taking extra steps outside funannotate would be worth it. But since that would be outside the scope of only this pipeline, I'll close the issue for now. Thanks for the fast response!!