Closed HenrivdGeest closed 3 years ago
The current method, if you provide both Illumina and long reads, does this:
[...] train is to identify high-quality, evidence-based models, and we will only pick the top model at each locus for training. Alternatively spliced transcripts are added back later at the funannotate update step.
[...] blat alignments of those data. PASA is not very fast due to the SQL database steps (in SQLite mode it's actually only single-threaded, so it can be really slow if you give it all the data).
[...] funannotate train the raw Illumina data, let it q-trim, and normalize for assembly, but then have access to the raw data to determine which transcript at each locus is the most highly expressed.
[...] funannotate predict to do the PASA-mediated training, generate hints for Augustus, etc.
So there is likely room for improvement here: the long-read data was added to the pipeline as sort of supplemental data. Gene model predictions are only made from the ab initio predictors, i.e. gene models are never called directly from transcript or protein alignments, so the focus of this step is to provide the best training set you can, as that is where the primary gene models will come from.
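To make the "top model at each locus" idea concrete, here is a minimal sketch in Python. It is not funannotate's code: the function name, the record layout (locus, transcript ID, expression value) and the example values are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Tuple


def top_transcript_per_locus(
    records: Iterable[Tuple[str, str, float]]
) -> Dict[str, str]:
    """Return {locus: transcript_id}, keeping the most highly expressed
    transcript at each locus. The record layout is hypothetical."""
    best: Dict[str, Tuple[str, float]] = {}
    for locus, transcript_id, expression in records:
        current = best.get(locus)
        # Replace only on a strictly higher value, so the first
        # transcript seen wins a tie.
        if current is None or expression > current[1]:
            best[locus] = (transcript_id, expression)
    return {locus: tid for locus, (tid, _) in best.items()}


# Hypothetical expression values (e.g. TPM) for three transcripts.
example = [
    ("locus1", "t1.1", 12.5),
    ("locus1", "t1.2", 80.0),
    ("locus2", "t2.1", 3.0),
]
print(top_transcript_per_locus(example))
```

Only one transcript per locus survives this selection, which is why alternatively spliced transcripts have to be added back in a later step.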
Thanks for the clear writeup. With point 2, mapping the long reads to the Trinity assembly, do you mean the long reads are mapped to the assembled transcripts, or mapped to the genome and checked for overlapping coordinates? Regarding point 5, I had difficulties running Trinity and Trimmomatic directly from the pipeline (Singularity image issues, I think), so I performed these two steps outside funannotate. I see now why it could help to also give the raw data. Nonetheless, since it's for training, it might not be such a big issue.
Long reads are mapped to the assembled Trinity transcripts first; unmapped reads (those not found in the Trinity assemblies) are then aligned to the genome and passed to PASA. If you have good Trinity assemblies and the long reads are from time points with similar growth conditions, then these reads may not map well to the genome.
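The two-stage mapping described above can be sketched roughly as follows. This is not the code funannotate runs: the function, the intermediate file names and the choice of the `map-pb` preset for the read-to-transcript step are assumptions. The commands are only built and printed, not executed.

```python
import shlex
from typing import List


def two_stage_commands(long_reads: str, transcripts: str, genome: str,
                       max_intron: int = 3000,
                       threads: int = 4) -> List[List[str]]:
    """Build (but do not run) commands for a two-stage long-read mapping.
    Output file names are hypothetical."""
    return [
        # Stage 1: align long reads to the assembled transcripts
        # (no spliced alignment needed against transcripts).
        ["minimap2", "-ax", "map-pb", "-t", str(threads),
         "-o", "reads_vs_transcripts.sam", transcripts, long_reads],
        # Keep only reads that did not align to any transcript
        # (SAM flag 0x4 = unmapped); single-end reads go to the -0 file.
        ["samtools", "fastq", "-f", "4",
         "-0", "unmapped_long_reads.fastq", "reads_vs_transcripts.sam"],
        # Stage 2: spliced alignment of the leftover reads to the genome,
        # with -G limiting the maximum intron length.
        ["minimap2", "-ax", "splice", "-G", str(max_intron),
         "-t", str(threads), "-o", "long_reads_vs_genome.sam",
         genome, "unmapped_long_reads.fastq"],
    ]


for cmd in two_stage_commands("isoseq.fastq.gz", "trinity.fasta",
                              "genome.fasta"):
    print(shlex.join(cmd))
```

With this flow, only reads that fail to match a transcript in stage 1 ever reach the genome alignment in stage 2.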
Are you using the latest release? funannotate v1.8.4 # singularity image
Describe the bug: my long-read evidence does not seem to be mapped to the reference properly
What command did you issue? funannotate_v184.simg funannotate train -i /genome.fasta -l 1.renamed.fastq.gz -r 2.renamed.fastq.gz --cpus 128 --out train_part8 --no_normalize_reads --no_trimmomatic --max_intronlen 3000 --aligners minimap2 --pacbio_isoseq isoseq.fastq.gz --out train_part8
Logfiles
OS/Install Information
I am wondering how the Iso-Seq/Nanopore cDNA data is used in the process. I see a BAM file, for example isoseq_coordSorted.bam, which I thought contains the Iso-Seq data. However, only very few reads are aligned in that BAM file. From your code I saw that you use minimap2 -x splice, which should be okay. Do you do any filtering on the BAM, or is something off in my case? If I map the Iso-Seq reads myself I see much more alignment data, but so far I only see them at locations where Illumina RNA-seq data is also present, so it seems that he
So, I am wondering whether the alignment of long reads is performed properly in my case. If it is okay, it would be good to have the 'untouched' Iso-Seq BAM file, so that I can verify that the Iso-Seq reads are used properly. Second, is there any way to give the full-length cDNA reads (the complete isoseq or nanopore_cDNA set) more weight than the Trinity transcripts? Or do you think this does not matter for the training? Third, not really related to this post: is Trinity using the paired-end information for the genome-guided assembly process? In the BAM files, my IGV does not seem to visualize the paired-end information properly, although the SAM flags are set to paired mapping (this could also be a HISAT2-related issue).
Image where there is overlap with Illumina RNA-seq (so isoseq.coordSorted.bam is empty):
Example where there is no overlap with Illumina data (so isoseq.coordSorted.bam is NOT empty). Minor: it seems the Iso-Seq read is used twice by accident; the fragment names are also identical.
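For checking the SAM flags mentioned above (paired-end flags, and a read name that appears twice), a small flag decoder can help. This is a minimal sketch: the bit values follow the SAM specification, while the function names are made up for illustration. One possible explanation for a read name appearing twice is a secondary or supplementary alignment, which the flags would reveal.

```python
# Bit values as defined in the SAM specification.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "read1",
    0x80: "read2",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag: int) -> list:
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def is_primary_mapped(flag: int) -> bool:
    """True for a mapped record that is neither secondary nor supplementary."""
    return not (flag & (0x4 | 0x100 | 0x800))


print(decode_flag(99))    # a typical properly paired first-in-pair read
print(decode_flag(2048))  # a supplementary alignment of a split read
```

Counting only records for which `is_primary_mapped` is true gives one count per aligned read, which is a fairer comparison between two BAM files than the raw record count.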