nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License
314 stars 83 forks source link

Any way to improve the 3'UTR annotation #274

Closed MinjieHu closed 4 years ago

MinjieHu commented 5 years ago

Hi Jon,

Thanks for the excellent pipeline. I finished my genome annotation with it. It works pretty well when I applied to RNA-seq analysis. However, when I did the single cell RNA seq (based on 10x platform which mainly sequence the 3' end of genes), I found a lot reads will map to the 3' downstream of gene models predicted by the pipeline. This indicates the 3'UTR annotation is not very sensitive. I wonder is there any way to incorporated the single cell RNA-seq sequence information to update the UTR annotation.

Thanks ahead for any suggestions.

nextgenusfs commented 5 years ago

Currently funannotate is relying on PASA to do the UTR extensions. In theory this isn’t a difficult thing to do in the sense that you would align your data to the transcript and extend the UTR if some conditions were met, ie depth consistency, etc.

But rather than implement that de novo I’ve just ran PASA. There is flexibility in PASA and I have made it work with long reads from pacbio and nanopore. I’m not familiar with the 10X data, if you had some examples I can probably advise on how to incorporate it better. Is it transcripts, short read as, etc?

MinjieHu commented 5 years ago

Thanks for the quick responding. For the 10x single cell RNA seq data, it uses polyT to binding with the polyA of the mRNA, and sequencing the mRNA for each individual cell. It's also based on illumina short reads. However, the reads are not evenly distributed along the whole transcript. Based on my observation, there's always more reads in the 3' region than 5' region. That's why I can always observe some reads directly downstream the stop code. And when I looked into the annotation file, this region is not annotated as 3'UTR region. I don't know this uneven depth will cause problems for PASA annotation. I have all the reads mapped to the genome in a bam file. Most of the read length is 91bp. That would be great if I can incorporate this information to update the annotation. I don't have example on hand right now. If my explanation is still now clear, I will try to find some example tomorrow.

MinjieHu commented 5 years ago

I successed to elongate the 3' UTR with Stringtie transcript assembly and PASA and got the updated gene models (GFF file) from PASApipeline. However, PASApipeline seems convert some annotation characters into like %20 format (eg, ID=Xe_005162-T1;Parent=Xe_005162;Name=ATP-binding%20cassette%2C%20sub-family%20A%20%28ABC1%29%2C%20member%203). I wonder if there any way to integrate this PASA GFF file into the funannotate update pipeline and redo the gene annotation. I have tried to supply funannotate update with parameter --pasa_gff. But the program still start from the trinity assembly step and failed at that step. Here's the code I used funannotate update -f ../../xenia.contigs.FINAL.fasta -g ../xenia.gff3 --species xenia -o . --name Xe --sbt ../../funannotate_final/coral.sbt --pasa_gff ../pasa/xenia.gene_structures_post_PASA_updates.4225.gff3 Could you give me any suggestions? Thanks very much!

nextgenusfs commented 5 years ago

It is expecting to find a PASA SQL database, if it doesn't find one then it re-runs the alignment. So if these are just PE illumina reads, then the pipeline should work as is. The best results would be to start with funannotate train -- this will train the gene predictors using the RNA seq data (as well as any other data you give it as evidence. It aligns reads using Hisat2, aligns transcripts using minimap2, runs PASA genome guided, and then also runs Stringtie. Those combined data are then fed into funannotate predict - it will use the PASA gene models, stringtie, evidence, and then train Augustus with those data. You can then recapture UTRs from the RNA-seq data by running funannotate update. Note that if you follow the steps in their proper order, the scripts will re-use any data that is available, so funannotate update actually just requires the output funannotate folder as input and it will grab all the other data that it needs.

MinjieHu commented 5 years ago

Thanks for the explanation. For the single cell RNA-seq data, it's indeed PE illumina reads. However, only one ends containing the transcript sequence. The other is just used for barcoding. And also, the coverage is not evenly distributed along the whole transcript. Previously, I indeed did annotation based on the order you provided. The final gene model is quite comparable with other close related organisms. So I already did a lot research with the previously annotation version. I hope I can just update that annotation with these new single cell RNA seq data so that I don't need to make big change of my other analysis. Can I just feed funannotate previous annotation file and these single cell RNA-seq data to update the gene model?

nextgenusfs commented 5 years ago

Okay. Yes you can just run funannotate update. I guess then you would only give the script the reverse reads if the forward are non biological barcodes. You can pass these reads to the single option. It will run the same methods as in funannotate train and use your existing gene models as a starting point. I would try that first and see how those results compare. Note that funannotate update can be very slow when using SQLite in PASA because it can’t handle multiple threads, so it will take a long time in the PASA comparison steps.

MinjieHu commented 5 years ago

Thanks very much! I will give it a try. I will update when I get the results.