long reads transcriptome evidence

iwangtoknow commented 6 years ago

Dear Jon, I have both short-read and long-read transcriptome data, but funannotate doesn't have a long-read option. I noticed that you described a method in the paper "Extreme sensitivity to ultraviolet light in the fungal pathogen causing white-nose syndrome of bats":

Gene prediction for the 6 non-pathogenic Pseudogymnoascus species was
conducted similarly, with some changes to the pipeline because the RNA-sequencing data
obtained from the Ion Torrent PGM did not contain enough reads to adequately run the PASA
pipeline. Thus, RNA-seq data was converted to transcripts using genome guided Trinity,
AUGUSTUS was trained using BUSCO [12], and GeneMark-ES v4.32 was used for ab initio gene
prediction. Additionally, ESTs downloaded from JGI MycoCosm
(http://genome.jgi.doe.gov/programs/fungi/index.jsf) for closely related species (Leotiomycetes)
were used in addition to the Trinity transcripts for generation of transcript alignments using
GMAP. Finally, EVidenceModeler v1.1.1 [5] was used to combine data from protein alignments,
transcript alignments, and ab initio predictions to construct high quality evidence based gene
models.

So can funannotate support long-read transcriptome sequencing now? How can I incorporate Iso-seq data into the funannotate pipeline?

Thanks for your help.

nextgenusfs commented 6 years ago

I have not tested this specifically, as I have not seen any Iso-seq data, but it should work at least in a rudimentary way at this point. We can update it so that it is supported more broadly as well. Basically, there isn't a reason to "assemble" the Iso-seq data (at least I don't think there should be, as it is technically 1 read = 1 transcript, correct?), so the reads just need to be mapped to the genome. The best way to do this is to use minimap2. This isn't built into funannotate yet (but can be if it works well), but you should just need to run something like:

minimap2 -ax splice genome.fa iso-seq_reads.fq | samtools view -bS - | samtools sort -o iso_sort.bam -

This will generate a coordinate-sorted BAM of the alignments, which you could then combine with your short-read RNA-seq alignments (i.e. the hisat2 alignments from funannotate train) and pass to the funannotate predict --rna_bam option. You can also try passing the Iso-seq data (in FASTA format) to the funannotate predict --transcript_evidence option, which will map it to the genome and use those data for EVM consensus gene model prediction. Let me know if this works and I can add these steps directly to funannotate to make it easier to use these data.
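A minimal sketch of the two steps described above (merging the BAM files, then running predict with both kinds of evidence). All file names, the output folder, and the species string are placeholders, and the two function names are made up for illustration. Only the --rna_bam and --transcript_evidence options come from this thread; this has not been tested on Iso-seq data.

```shell
#!/usr/bin/env bash
# Sketch only: paths and names below are placeholders, not funannotate defaults.

# Merge coordinate-sorted BAM files (short-read + Iso-seq) into one BAM and index it.
# Usage: merge_rna_bams merged.bam short_reads.bam iso_sort.bam
merge_rna_bams() {
    local out_bam="$1"; shift
    # -f overwrites an existing output file; every input must already be coordinate-sorted
    samtools merge -f "$out_bam" "$@" && samtools index "$out_bam"
}

# Run funannotate predict with the merged BAM and the long-read transcripts (FASTA).
# Usage: predict_with_long_reads genome.fa out_dir "Genus species" merged.bam transcripts.fasta
predict_with_long_reads() {
    local genome="$1" out_dir="$2" species="$3" rna_bam="$4" transcripts_fa="$5"
    funannotate predict -i "$genome" -o "$out_dir" -s "$species" \
        --rna_bam "$rna_bam" --transcript_evidence "$transcripts_fa"
}

# Example calls (hypothetical file names):
#   merge_rna_bams merged.bam short_reads.bam iso_sort.bam
#   predict_with_long_reads genome.fa fun_out "Aspergillus nidulans" merged.bam flnc.fasta
```

Keeping the two steps as separate functions makes it easy to try either kind of evidence on its own first, which is how the two options are presented above.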

iwangtoknow commented 6 years ago

OK, I'll try both. Since I have a full-length non-chimeric FASTA file from PacBio Iso-seq, I would prefer the second option. I also have a reads-of-insert FASTQ file, so I'll try the first option as well.

iwangtoknow commented 6 years ago

Dear Jon, I passed the full-length non-chimeric FASTA file from Iso-seq to the funannotate predict --transcript_evidence option and it works. I'm now running annotate, and an old issue has appeared when dealing with local antiSMASH results.

nextgenusfs commented 6 years ago

Thanks, that's good to know. Would you be able to construct a test set with the data? Something like 5-6 scaffolds and then a subset of the Iso-seq reads that map to those scaffolds? Then I could use that data to do some more tests toward explicitly supporting long-read RNA-seq data. One way to subset the data would be to map the reads to your test scaffolds with minimap2 (as above) and then extract the mapped reads from the BAM file.
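One possible way to build such a test set, sketched with samtools and starting from the coordinate-sorted BAM produced by the minimap2 command earlier in the thread. The function name, scaffold names, and output names are hypothetical. Note that samtools fastq skips secondary and supplementary alignments by default, so each read is written once.

```shell
#!/usr/bin/env bash
# Sketch only: scaffold names and file names are placeholders.

# Write <prefix>.genome.fa, <prefix>.bam and <prefix>.reads.fq for the chosen scaffolds.
# Usage: make_test_subset genome.fa iso_sort.bam prefix scaffold_1 scaffold_2 ...
make_test_subset() {
    local genome="$1" bam="$2" prefix="$3"; shift 3
    # Region queries on a BAM need an index
    samtools index "$bam" || return 1
    # Extract the chosen scaffolds from the genome FASTA
    samtools faidx "$genome" "$@" > "${prefix}.genome.fa" || return 1
    # Keep only the alignments that fall on those scaffolds
    samtools view -b -o "${prefix}.bam" "$bam" "$@" || return 1
    # Convert the retained alignments back to reads
    samtools fastq "${prefix}.bam" > "${prefix}.reads.fq"
}

# Example call (hypothetical names):
#   make_test_subset genome.fa iso_sort.bam test scaffold_1 scaffold_2 scaffold_3
```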

iwangtoknow commented 6 years ago

Dear Jon, we performed Iso-seq but only got some data (the full-length non-chimeric FASTA totals 20.7M), so we are performing a second run. Once I have the second batch of data, I can send the full data set to you. Just tell me what you want. This is A. nidulans. Many thanks.

nextgenusfs commented 6 years ago

Okay great, that sounds perfect. A. nidulans is one of my favorite fungi...

iwangtoknow commented 6 years ago

[image: pacbio_gap]

Dear Jon, using the data I have now, I finished a first round of structural and functional annotation. Please take a look at the figure above: the Iso-seq raw reads are long enough to span full-length transcripts, but they contain a lot of gaps (in most cases 1-2 nucleotides). Maybe that will be fine after self-correction.

[image: GMAP bam]

The structural annotation also shows many introns within a single gene.

[image: too_many_intron]

iwangtoknow commented 6 years ago

[image: super_long_gene9k]

There is not a gene, right?

nextgenusfs commented 6 years ago

You can also load some of the preliminary GFF3 annotations to try to see what is happening. It looks like there might be a transcript there, but it is hard to tell for sure whether it makes a complete gene model. In the predict_misc folder you can load the gene_predictions.gff3 file, which contains all of the predictions that went into EVidenceModeler. You could also look at evm.round1.gff3, which is the output of EVidenceModeler prior to any filtering. Sometimes models get removed because they are repeat/transposon-like or in repeat-dense regions. EVM typically doesn’t call genes if there is only one type of evidence and no gene model prediction. So maybe we need to incorporate the Iso-seq data into training AUGUSTUS; I was working on this last night, actually. I just haven’t tested it, as I don’t have any reliable test data.
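Besides loading the files into a genome browser, a quick command-line check can show which predictors produced a model in a region of interest. The sketch below counts "gene" features per source (GFF3 column 2) that overlap a given interval. It assumes gene_predictions.gff3 is standard tab-delimited GFF3 with gene-level features; the function name, scaffold name, and coordinates are placeholders.

```shell
#!/usr/bin/env bash
# Sketch only: assumes tab-delimited GFF3 with "gene" features in column 3.

# Print "<source><TAB><count>" for gene features overlapping scaffold:start-end.
# Usage: count_models_in_region file.gff3 scaffold start end
count_models_in_region() {
    awk -F'\t' -v chr="$2" -v s="$3" -v e="$4" '
        # skip comment lines; keep gene features whose interval overlaps [s, e]
        !/^#/ && $1 == chr && $3 == "gene" && $4 <= e && $5 >= s { n[$2]++ }
        END { for (src in n) print src "\t" n[src] }
    ' "$1" | sort
}

# Example call (hypothetical scaffold and coordinates):
#   count_models_in_region predict_misc/gene_predictions.gff3 scaffold_3 120000 129000
```

An empty result for a region that clearly has read support would be consistent with the explanation above: no ab initio prediction there, so EVM has nothing to build a consensus model from.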

iwangtoknow commented 6 years ago

Dear Jon, I'll follow your advice fully. I don't have experience manually editing genes on the basis of read-mapping information, and I don't fully understand it yet. I'll contact you as soon as I am ready to send you the data. I checked the long-read mapping BAM in IGV, and my Iso-seq reads are indeed sparse. :(

[image: super_long_gene9k]

nextgenusfs commented 6 years ago

So in this example, there aren't any gene predictions in this region, thus EVM didn't predict any gene models. Running the data back through PASA (via funannotate update) might help catch some of these genes. I still need to integrate the long reads into the update command.

nextgenusfs / funannotate

long reads transcriptome evidence #175