twlab / TEProf2Paper

TEProf2 Pipeline used to find promoters and predict protein sequences from RNA-sequencing data
Other
18 stars 6 forks source link

Applied to Long-read Sequencing Data #2

Closed Edison19991109 closed 1 year ago

Edison19991109 commented 1 year ago

Hi there,

I planned to use the pipeline for pacbio long-reads sequencing data. I have mapped the data using minimap2. Can I apply TEProf2 to TE finding? Thx in advance :)

Jianning

nakul2234 commented 1 year ago

Hello,

It is exciting to see that you would like to use this pipeline for long-read data. Unfortunately, the TEProf2 pipeline has been optimized for short-read data, so many of the steps would likely not work with long-read analysis pipelines.

However, I think you can at least use the pipeline to annotate and identify TE-gene transcripts that are in your data from long-read. To do this you could od the following steps:

  1. Use pipelines such as TALON or Isoquant to get a GTF of all transcripts from long-read supported from your data
  2. Use the reference-guided mode of the pipeline (https://github.com/twlab/TEProf2Paper#usage-reference-guided)
    • You can run steps 1, 2, and 3b.

WIth this, you will have a file, reference_merged_candidates.gff3_annotated_filtered_test_all, that has all the transcripts annotated. Also, the Step10.RData file will have the annotation in the annotatedcufftranscripts object that can be loaded and studied in R. Then you can incorporate the quantification done by TALON or Isoquant for each transcript to identify those that are enriched in a specific population in your data.

-Nakul

Edison19991109 commented 1 year ago

Hi Nakul,

Thx a lot for your kind reply! Indeed I want to dig out the TE chimeric reads from my long-read sequencing data - so it seems the pipleine can fit my need. In this case, if I don't get it wrong, I can run the whole pipeline until the section 15. Optional further analysis with ballgown and skip the rest of the pipeline Usage Reference Guided?

With thanks, Jianning

Edison19991109 commented 1 year ago

Hi Nakul,

When I run the step (2) Run annotation on each GTF file, I got an empty file for .gtf_annotated_filtered_test_all. It seems like the next step requires this file. Can I continue with .gtf_annotated_test_all?

Thanks in advance, Jianning

nakul2234 commented 1 year ago

Hello Jianning,

I should have been more clear in my initial response. First, you will need the requirements that include the reference files. If you are using the reference files provided based on gencode25, then we have all the files available.

Then, you will need to just run step 1. That is to set up the arguments.txt file. This will have to have the correct path for all the reference files created.

Then, under reference guided mode, you will run step 1 so then it convert the gtf to a gff3 with the gffread command. Then step 2 should work. If it is still giving you blank files, would you mind sending me or posting a few examples of en tries from the gff3 file to see what might be going wrong.

The empty file will not work for subsequent steps.

-Nakul

Edison19991109 commented 1 year ago

Hi Nakul,

Sorry I may have not described what I have done clearly. I first followed the section Requirements to install all the requirements, and then I followed the section Usage to this step (2) Run annotation on each GTF file and got the empty file I mentioned in the early reply. But I'm not so sure the difference between Usage and Usage Reference Guide. Do you mean I should run the Usage Reference Guide instead of Usage? Thx in advance!

Jianning

Edison19991109 commented 1 year ago

Hi Nakul,

I run the reference guided mode this afternoon and I still got an empty reference_merged_candidates.gff3_annotated_filtered_test_all. Here is how my gff3 file looks like:

image

I used Stringtie2 to get the GTF file (I check it supports the long-read sequencing in its newest version), and then I used the gffread to convert it to gff3. I also uploaded the gff3 file to Onedrive in case you need it:

https://zjuintl-my.sharepoint.com/:u:/g/personal/jianning_18_intl_zju_edu_cn/EWgl50dlISdAhWsjygyJaVwB_FFkLpS2T-Kh_ZEfe8QxiA?e=TaeBhs

Thanks, Jianning

nakul2234 commented 1 year ago

Ah, I think I see what the issue is. The normal mode in Usage is for short-read data. It has a step that uses read information to increase confidence in the transcripts that are assembled. However, for long-read data, the assembly is quite confident, and thus the steps that are normally run do not really improve the confidence in the transcripts.

Instead, the reference-guided mode is when you already have a confident assembly, and you just want to annotate the transcripts. I think in your case this would be the most useful. To be able to do that, I would run the steps after Usage Reference Guide. This would allow you to annotate the transcripts and generate the files I mentioned above. An important difference is that the script that is run, rmskhg38_annotate_gtf_update_test_tpm_cuff.py, is different than the one under Usage.

As mentioned in the first comment, this would get the annotations for the transcripts. Now, I am not sure if all the subsequent steps will run smoothly with Stringtie2, but I do not see why it would not. I think it can still create the ballgown files with long-read data.

Please let me know if running rmskhg38_annotate_gtf_update_test_tpm_cuff.py creates a file that is not empty.

-Nakul

Edison19991109 commented 1 year ago

Hi Nakul,

Thanks a lot for your help. I tried to run the rmskhg38_annotate_gtf_update_test_tpm_cuff.py with command rmskhg38_annotate_gtf_update_test_tpm_cuff.py SRR10267008_reference_merged_candidates.gff3. It still got an empty SRR10267008_reference_merged_candidates.gff3_annotated_filtered_test_all. I also tried to run the next step mergeAnnotationProcess_Ref.R with the unmpty file SRR10267008_reference_merged_candidates.gff3_annotated_test_all and it exited with an error. It seems like rmskhg38_annotate_gtf_update_test_tpm_cuff.py filters all the transcripts. I am not sure what happens here. I upload the unempty file SRR10267008_reference_merged_candidates.gff3_annotated_test_all here for your reference: https://zjuintl-my.sharepoint.com/:u:/g/personal/jianning_18_intl_zju_edu_cn/EdRAAm7TIAZAugS9H-TIbs4ByG7xHFMZuWIcmSOOyMVmWA?e=1cDGpO

With thanks, Jianning

nakul2234 commented 1 year ago

Hello Jianning,

I ran the pipeline on your file, and I think, I realized the problem. For some reason, most of the transcripts are single-exon transcripts. In the gff3 file, each transcript has one line for the transcript and then one exon line. There are no splice junctions for most of the transcripts. Thus, without splicing the transcripts are filtered out since the pipeline requires spliced transcripts.

Can you post the original gtf? I am not sure if this is an issue with the gtf file or with gffread creating only single-exon transcripts.

-Nakul

Edison19991109 commented 1 year ago

Hi Nakul,

Thank you so much for running the pipeline on my file on the weekend! Inspired by your reply, I think there may be two reasons that I got an empty file:

  1. I used minimap2 to map the long read data which I didn't used the option ax splice;
  2. I didn't used the option -L for stringtie2.

These may result in not outputing splice events in the gtf file. I uploaded my gtf file to Onedrive for your reference: https://zjuintl-my.sharepoint.com/:u:/g/personal/jianning_18_intl_zju_edu_cn/Ede4r2YX2r1FgpDOF8VXDzQB-WA5o6UsM0kMhd7BwHHxSA?e=0Firwp

And here are the commands I used: samtools view -h SRR10267008.bam | stringtie - -G /WORK/home/shexh/KJN/Rotation_2/Reference/Annotation/gencode.vM32.chr_patch_hapl_scaff.annotation.gtf -o SRR10267008.gtf -m 100 -c 1

gffread -E /WORK/home/shexh/KJN/Rotation_2/embryo_Pacbio/SRR10267008.gtf -o- > SRR10267008_reference_merged_candidates.gff3

rmskhg38_annotate_gtf_update_test_tpm_cuff.py SRR10267008_reference_merged_candidates.gff3

I am trying to rerun the minimap2 and stringtie2 to see how it works. Hopefully this can solve the bug :)

With thanks, Jianning

Edison19991109 commented 1 year ago

Hi Nakul,

Finally the SRR10267008_reference_merged_candidates.gff3_annotated_filtered_test_all file is not empty! Thank you so much for kindly help me with it. With our effort, your pipeline can be safely extended to the long read sequencing 😉

Before I close this issue, just a few more things I may need to check with you:

  1. When I run the subsequent script rmskhg38_annotate_gtf_update_test_tpm_cuff.py, it keeps returning error: image So I change the annotateintrons = True to False. Does it have any influence on the following analysis?
  2. I want to compare the TE chimeric transcripts across different samples. I check Step10.RData and it seems like the variable annotatedcufftranscripts have the information I need. Do I find the right data that fits my need?
  3. For annotatedcufftranscripts, it has gene 1, gene 2 and class TE. Does it represent the genes and TE that are in the same transcript? What does chromtrans and other trans information mean?

There are other minor issues like mergeAnnotationProcess_Ref.R requires Xmisc cannot be installed directly because it has been removed on the R CRAN, and option -g does not seem to work for mergeAnnotationProcess_Ref.R. But they does not seem to have a big impact. I can make another issue to discuss how to install Xmisc if you find it useful :)

With thanks, Jianning

nannabarnkob commented 1 year ago

Hi both I hope it's okay that I join this thread - amazing timing, I just started analyzing some PacBio data yesterday using your pipeline.

About Xmisc, it is true it has been removed from cran. I first downloaded the tar.gz from here: https://cran.r-project.org/src/contrib/Archive/Xmisc/. I just took the newest one (https://cran.r-project.org/src/contrib/Archive/Xmisc/Xmisc_0.2.1.tar.gz). Then, in R, I ran install.packages("Xmisc_0.2.1.tar.gz", repos = NULL, type="source") and it worked fine. Hope this helps.

Another thing, I also had to use the option -L for stringtie2 to get out a significant number of transcripts. I am currently in step (6) where I get the bug that filter_read_stats.txt only has four columns where R is expecting five. I wonder if it's an issue because I use single-end for step 5? (I used commandsmax_speed_se.py)

Best regards Nanna

Edison19991109 commented 1 year ago

Hi both I hope it's okay that I join this thread - amazing timing, I just started analyzing some PacBio data yesterday using your pipeline.

About Xmisc, it is true it has been removed from cran. I first downloaded the tar.gz from here: https://cran.r-project.org/src/contrib/Archive/Xmisc/. I just took the newest one (https://cran.r-project.org/src/contrib/Archive/Xmisc/Xmisc_0.2.1.tar.gz). Then, in R, I ran install.packages("Xmisc_0.2.1.tar.gz", repos = NULL, type="source") and it worked fine. Hope this helps.

Another thing, I also had to use the option -L for stringtie2 to get out a significant number of transcripts. I am currently in step (6) where I get the bug that filter_read_stats.txt only has four columns where R is expecting five. I wonder if it's an issue because I use single-end for step 5? (I used commandsmax_speed_se.py)

Best regards Nanna

Hi Nanna,

Yep I also installed the Xmisc by using install.packages("Xmisc_0.2.1.tar.gz", repos = NULL, type="source") . Currently I'm trying to retrieve quantification information of candidate chimeric reads. But I found that when I used the gffread as the READMe suggested, the transcript and gene ID was all in 'STRG XXX', and I cannot find the relevant quantification information. I use another format-convert program and I'm trying to modify the rmskhg38_annotate_gtf_update_test_tpm_cuff.py to read in the new gff3 file.

Hope it helps, Jianning

nakul2234 commented 1 year ago

Hello,

Thank you for commenting on your progress. I think we will have to move on from Xmisc, but as noted, there is a workaround using devtools and the archived version as seen in the README.

The annotateintrons option is not used for any of the quantification and identification. It is a feature to help with seeing if the first intron junction has been annotated previously in the reference file, but it is not used in subsequent steps. I think changing the default to false makes sense, thank you for catching that!

@Edison19991109 to answer your other two questions:

  1. The annotatedcufftranscripts R object has all the information you need in terms of the annotation of the transcript.
  2. Essentially, the pipeline works by annotating the start of the transcript first. So if it is intergenic, all of this will be "None". If it starts in an intron, then the gene 1 is the gene that the start is annotated in. Then, the pipeline finds when the transcript overlaps with a gene exon. Most of the time, gene 1 and gene 2 are the same since the transcript starts in the gene intron and then splices into an exon. The classTE is the class of the transposable element that is annotated as the start of the transcript.

The final output of the pipeline is an excel file with more meaningful variable names. The intermediate tables in the .RData files have various data used in different parts of the pipeline. The chromtrans variable contains transcript structure information used in annotating the transcript.

Essentially, to find the TE-gene chimeric transcripts, you should look for the ones where `exonintron1' is "None" or "intron." Then exonintron2 is "exon" and gene2 is a gene. That means that the transcript starts in an intergenic or intronic TE and splices into a gene.

I think a solution to this is to print the annotations of the TE-gene transcripts in an easier to read excel file earlier in the pipeline. You can use this to annotate the long-read transcripts and then use other software for quantification.

Hope that helps!

-Nakul

nakul2234 commented 1 year ago

@nannabarnkob thank you for your interest in the pipeline.

As mentioned earlier in the thread, this pipeline was made for short-read data, but many of the steps are not needed for long-read data. Thus, with long read data it will work better to use the reference-guided mode: https://github.com/twlab/TEProf2Paper#usage-reference-guided

You should align your reads and use a pipeline of your choice to generate a gtf file that can be annotated with that portion of the pipeline. Many pipelines exist for this such as Talon and Isoseq. Then you can use reference-guided mode to annotate it. Unfortunately, I am not sure if the quantification portion will work since this pipeline was made for short-read data.

-Nakul

Edison19991109 commented 1 year ago

Hello,

Thank you for commenting on your progress. I think we will have to move on from Xmisc, but as noted, there is a workaround using devtools and the archived version as seen in the README.

The annotateintrons option is not used for any of the quantification and identification. It is a feature to help with seeing if the first intron junction has been annotated previously in the reference file, but it is not used in subsequent steps. I think changing the default to false makes sense, thank you for catching that!

@Edison19991109 to answer your other two questions: 2. The annotatedcufftranscripts R object has all the information you need in terms of the annotation of the transcript. 3. Essentially, the pipeline works by annotating the start of the transcript first. So if it is intergenic, all of this will be "None". If it starts in an intron, then the gene 1 is the gene that the start is annotated in. Then, the pipeline finds when the transcript overlaps with a gene exon. Most of the time, gene 1 and gene 2 are the same since the transcript starts in the gene intron and then splices into an exon. The classTE is the class of the transposable element that is annotated as the start of the transcript.

The final output of the pipeline is an excel file with more meaningful variable names. The intermediate tables in the .RData files have various data used in different parts of the pipeline. The chromtrans variable contains transcript structure information used in annotating the transcript.

Essentially, to find the TE-gene chimeric transcripts, you should look for the ones where `exonintron1' is "None" or "intron." Then exonintron2 is "exon" and gene2 is a gene. That means that the transcript starts in an intergenic or intronic TE and splices into a gene.

I think a solution to this is to print the annotations of the TE-gene transcripts in an easier to read excel file earlier in the pipeline. You can use this to annotate the long-read transcripts and then use other software for quantification.

Hope that helps!

-Nakul

Hi Nakul,

Many thanks for your kind help! I have managed to solve the bugs and get the result 👍

Thanks a lot!

Jianning