nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

Process of utilizing the isoseq or cDNA longreads for training. #572

Closed HenrivdGeest closed 3 years ago

HenrivdGeest commented 3 years ago

Are you using the latest release? funannotate v1.8.4 # singularity image

Describe the bug my long read evidence does not seem to be mapped to the reference properly

What command did you issue? funannotate_v184.simg funannotate train -i /genome.fasta -l 1.renamed.fastq.gz -r 2.renamed.fastq.gz --cpus 128 --out train_part8 --no_normalize_reads --no_trimmomatic --max_intronlen 3000 --aligners minimap2 --pacbio_isoseq isoseq.fastq.gz --out train_part8

Logfiles

OS/Install Information

 -  output of `funannotate check --show-versions`
Checking dependencies for 1.8.4
-------------------------------------------------------
You are running Python v 3.7.9. Now checking python packages...
biopython: 1.78
goatools: 1.0.15
matplotlib: 3.3.3
natsort: 7.1.0
numpy: 1.19.5
pandas: 1.2.1
psutil: 5.8.0
requests: 2.25.1
scikit-learn: 0.24.1
scipy: 1.5.3
seaborn: 0.11.1
All 11 python packages installed

You are running Perl v b'5.026002'. Now checking perl modules...
Bio::Perl: 1.007002
Carp: 1.38
Clone: 0.42
DBD::SQLite: 1.64
DBD::mysql: 4.046
DBI: 1.642
DB_File: 1.855
Data::Dumper: 2.173
File::Basename: 2.85
File::Which: 1.23
Getopt::Long: 2.5
Hash::Merge: 0.300
JSON: 4.02
LWP::UserAgent: 6.39
Logger::Simple: 2.0
POSIX: 1.76
Parallel::ForkManager: 2.02
Pod::Usage: 1.69
Scalar::Util::Numeric: 0.40
Storable: 3.15
Text::Soundex: 3.05
Thread::Queue: 3.12
Tie::File: 1.02
URI::Escape: 3.31
YAML: 1.29
threads: 2.15
threads::shared: 1.56
All 27 Perl modules installed

Checking Environmental Variables...
$FUNANNOTATE_DB=/opt/databases
$PASAHOME=/venv/opt/pasa-2.4.1
$TRINITYHOME=/venv/opt/trinity-2.8.5
$EVM_HOME=/venv/opt/evidencemodeler-1.1.1
$AUGUSTUS_CONFIG_PATH=/blabla//augustus/config
    ERROR: GENEMARK_PATH not set. export GENEMARK_PATH=/path/to/dir
-------------------------------------------------------
Checking external dependencies...
Traceback (most recent call last):
  File "/venv/bin/ete3", line 6, in <module>
    from ete3.tools.ete import main
  File "/venv/lib/python3.7/site-packages/ete3/tools/ete.py", line 55, in <module>
    from . import (ete_split, ete_expand, ete_annotate, ete_ncbiquery, ete_view,
  File "/venv/lib/python3.7/site-packages/ete3/tools/ete_view.py", line 48, in <module>
    from .. import (Tree, PhyloTree, TextFace, RectFace, faces, TreeStyle, CircleFace, AttrFace,
ImportError: cannot import name 'TextFace' from 'ete3' (/venv/lib/python3.7/site-packages/ete3/__init__.py)
PASA: 2.4.1
CodingQuarry: 2.0
Trinity: 2.8.5
augustus: 3.3.3
bamtools: bamtools 2.5.1
bedtools: bedtools v2.30.0
blat: BLAT v36
diamond: 2.0.6
exonerate: exonerate 2.4.0
fasta: no way to determine
glimmerhmm: 3.0.4
gmap: 2017-11-15
hisat2: 2.2.1
hmmscan: HMMER 3.3.1 (Jul 2020)
hmmsearch: HMMER 3.3.1 (Jul 2020)
java: 11.0.8-internal
kallisto: 0.46.1
mafft: v7.475 (2020/Nov/23)
makeblastdb: makeblastdb 2.2.31+
minimap2: 2.17-r941
proteinortho: 6.0.16
pslCDnaFilter: no way to determine
salmon: salmon 0.14.1
samtools: samtools 1.10
snap: 2006-07-28
stringtie: 2.1.4
tRNAscan-SE: 2.0.7 (Oct 2020)
tantan: tantan 13
tbl2asn: no way to determine, likely 25.X
tblastn: tblastn 2.2.31+
trimal: trimAl v1.4.rev15 build[2013-12-17]
trimmomatic: 0.39
    ERROR: emapper.py not installed
    ERROR: ete3 not installed
    ERROR: gmes_petap.pl not installed
    ERROR: signalp not installed

I am wondering how the isoseq/nanopore cDNA data is used in the process. I see a BAM file, for example isoseq_coordSorted.bam, which I thought contains the isoseq data. However, only very few reads are aligned in that BAM file. From your code I did see that you use minimap2 -x splice, which should be okay. Do you do any filtering on the BAM, or is there something off in my case? If I map the isoseq reads myself I do see much more alignment data, but so far I only see them at locations that also have Illumina RNA-seq data present, so it seems that he…
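One way to sanity-check how many long reads actually received primary alignments is to tally SAM FLAG bits yourself (a minimal sketch on made-up records; in practice you would pipe `samtools view isoseq_coordSorted.bam` into it, and `samtools flagstat` reports the same numbers):

```python
# Count primary alignments per read in SAM text lines.
# FLAG bits: 0x4 = unmapped, 0x100 = secondary, 0x800 = supplementary.

def count_primary_alignments(sam_lines):
    """Return {read_name: primary_alignment_count}."""
    counts = {}
    for line in sam_lines:
        if line.startswith("@"):          # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        if flag & (0x4 | 0x100 | 0x800):  # drop unmapped/secondary/supplementary
            continue
        counts[name] = counts.get(name, 0) + 1
    return counts

# Tiny example: read_a has one primary + one secondary hit, read_b is unmapped.
sam = [
    "@SQ\tSN:chr1\tLN:1000",
    "read_a\t0\tchr1\t100\t60\t50M\t*\t0\t0\t*\t*",
    "read_a\t256\tchr1\t500\t0\t50M\t*\t0\t0\t*\t*",
    "read_b\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
print(count_primary_alignments(sam))  # {'read_a': 1}
```

If the count here is far below the number of input IsoSeq reads, the reads were filtered before alignment rather than failing to map.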

[Mar 12 01:54 PM]: Clustering of reads from BAM and preparing assembly commands
[Mar 12 02:46 PM]: Assembling 22,815 Trinity clusters using 127 CPUs
[Mar 12 03:02 PM]: 41,171 transcripts derived from Trinity
[Mar 12 03:02 PM]: Running StringTie on Hisat2 coordsorted BAM
[Mar 12 03:03 PM]: Removing poly-A sequences from trinity transcripts using seqclean
[Mar 12 03:03 PM]: Aligning long reads to genome with minimap2
[Mar 12 03:05 PM]: Adding 34 unique long-reads to Trinity assemblies
[Mar 12 03:07 PM]: Merging BAM files: train_part8/training/nano_cDNA.coordSorted.bam, train_part8/training/trinity.alignments.bam

So I am wondering whether the alignment of the long reads is performed properly in my case. If it is okay, it would be good to have the 'untouched' isoseq BAM file, so that I can verify that the isoseq reads are used properly. Second, is there any way to give the full-length cDNA reads (the complete isoseq or nanopore cDNA set) more weight than the Trinity transcripts? Or do you think this does not matter for the training? Third, not really related to this post, but does Trinity use the paired-end information for the genome-guided assembly process? In the BAM files my IGV does not seem to visualize the paired-end information properly, although the SAM flags are set to paired mapping (this could also be a HISAT2-related issue).

[screenshot: region where there is overlap with Illumina RNA-seq, so isoseq.coordSorted.bam is empty]

[screenshot: region where there is no overlap with Illumina data, so isoseq.coordSorted.bam is NOT empty. Minor: it seems the isoseq read is used twice by accident; the fragment names are also identical]
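To confirm whether the same IsoSeq fragment really appears twice in the BAM, you can tally read names among primary alignments (a sketch with made-up records; with the real file you would feed it `samtools view` output):

```python
from collections import Counter

def duplicated_read_names(sam_lines):
    """Read names occurring more than once as primary alignments."""
    names = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue
        name, flag = line.split("\t")[:2]
        if int(flag) & (0x4 | 0x100 | 0x800):  # skip unmapped/secondary/supplementary
            continue
        names[name] += 1
    return [n for n, c in names.items() if c > 1]

# Example: the same read placed twice as a primary alignment.
sam = [
    "isoseq_read1\t0\tchr2\t100\t60\t80M\t*\t0\t0\t*\t*",
    "isoseq_read1\t0\tchr2\t100\t60\t80M\t*\t0\t0\t*\t*",
    "isoseq_read2\t0\tchr2\t900\t60\t80M\t*\t0\t0\t*\t*",
]
print(duplicated_read_names(sam))  # ['isoseq_read1']
```

A non-empty result would back up the observation that a read was merged in twice (e.g. once from the long-read BAM and once via a Trinity alignment with the same name).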

nextgenusfs commented 3 years ago

If you provide both Illumina and long reads, the current method does this:

  1. uses Trinity genome guided for assembly of Illumina data (if reads are PE it will pass them that way to Trinity)
  2. Maps the long reads to the Trinity assemblies and keeps all reads that do not map to Trinity (the idea here is that there are errors in the long reads, and in order to save some time in the PASA step, this essentially reduces the complexity and number of alignments passed to PASA). I realize this will reduce some complexity and could potentially miss some alternatively spliced transcripts -- but the goal here for train is to identify high-quality evidence-based models, and we will only pick the top model at each locus for training. Alternatively spliced transcripts are added back later at the funannotate update step.
  3. The minimap2 alignments from the Trinity + unique long reads are then passed to PASA as alignments, and it will also run BLAT alignments of those data. PASA is not very fast due to the SQL database steps (in SQLite mode it's actually only single threaded, so it can be really slow if you give it all the data).
  4. PASA results are then run through TransDecoder to pick the most likely coding genes
  5. Raw Illumina data is mapped back to the TransDecoder selected PASA models using Kallisto to estimate the highest expressed transcript at each locus. This is why it is beneficial to give funannotate train the raw Illumina data, let it q-trim, and normalize for assembly -- but then have access to the raw data to determine which transcript at each locus is the most highly expressed.
  6. Best PASA models, RNA-seq alignments, StringTie output, etc. are all then passed to funannotate predict to do the PASA-mediated training, generate hints for Augustus, etc.

So there is likely room for improvement here -- the long read data was added to the pipeline as sort of supplemental data. Gene model predictions are only made from the ab initio predictors -- i.e. gene models are never called directly from transcript or protein alignments -- so the focus of this step is to try to provide the best training set you can, as that is where the primary gene models will come from.
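Step 2 above (keep only the long reads that are absent from the Trinity assemblies) can be sketched like this; the helper name and inputs are hypothetical illustrations, not funannotate's actual code:

```python
def filter_unmapped_long_reads(fastq_records, mapped_ids):
    """Keep long reads whose IDs did NOT map to the Trinity transcripts.

    fastq_records: iterable of (read_id, sequence, quality) tuples
    mapped_ids:    set of read IDs with alignments to the Trinity
                   assemblies (e.g. collected from a minimap2 run
                   against the assembled transcripts)
    """
    return [rec for rec in fastq_records if rec[0] not in mapped_ids]

# Toy data: two of three long reads are already represented by Trinity.
long_reads = [
    ("iso1", "ACGT" * 10, "I" * 40),
    ("iso2", "TTGA" * 10, "I" * 40),
    ("iso3", "GGCC" * 10, "I" * 40),
]
mapped_to_trinity = {"iso1", "iso3"}

unique = filter_unmapped_long_reads(long_reads, mapped_to_trinity)
print([r[0] for r in unique])  # ['iso2'] -> only these are aligned to the genome for PASA
```

This explains the user's observation: where Illumina coverage (and thus Trinity transcripts) exists, most IsoSeq reads are filtered out, leaving the genome-aligned long-read BAM nearly empty there.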

HenrivdGeest commented 3 years ago

Thanks for the clear writeup. With point 2, mapping the long reads to the Trinity assembly, do you mean the long reads are mapped to the assembled transcripts, or mapped to the genome and checked for overlap on coordinates? Regarding point 5, I had difficulties running Trinity and Trimmomatic directly from the pipeline (singularity image issues, I think), therefore I performed these two steps outside funannotate. I see now why it could help to also give the raw data. Nonetheless, since it's for training, it might not be such a big issue.

nextgenusfs commented 3 years ago

Long reads are mapped to the assembled Trinity transcripts first; unmapped reads (those not found in the Trinity assemblies) are then aligned to the genome and passed to PASA. If you have good Trinity assemblies and the long reads come from similar growth conditions/time points, then most long reads are already represented by Trinity, so few of them will show up in the genome alignment.