Closed Edison19991109 closed 1 year ago
Hello,
It is exciting to see that you would like to use this pipeline for long-read data. Unfortunately, the TEProf2 pipeline has been optimized for short-read data, so many of the steps would likely not work with long-read analysis pipelines.
However, I think you can at least use the pipeline to annotate and identify TE-gene transcripts that are in your data from long-read. To do this you could od the following steps:
WIth this, you will have a file, reference_merged_candidates.gff3_annotated_filtered_test_all
, that has all the transcripts annotated. Also, the Step10.RData
file will have the annotation in the annotatedcufftranscripts
object that can be loaded and studied in R. Then you can incorporate the quantification done by TALON or Isoquant for each transcript to identify those that are enriched in a specific population in your data.
-Nakul
Hi Nakul,
Thx a lot for your kind reply! Indeed I want to dig out the TE chimeric reads from my long-read sequencing data - so it seems the pipleine can fit my need. In this case, if I don't get it wrong, I can run the whole pipeline until the section 15. Optional further analysis with ballgown and skip the rest of the pipeline Usage Reference Guided?
With thanks, Jianning
Hi Nakul,
When I run the step (2) Run annotation on each GTF file, I got an empty file for
Thanks in advance, Jianning
Hello Jianning,
I should have been more clear in my initial response. First, you will need the requirements that include the reference files. If you are using the reference files provided based on gencode25, then we have all the files available.
Then, you will need to just run step 1. That is to set up the arguments.txt
file. This will have to have the correct path for all the reference files created.
Then, under reference guided mode, you will run step 1 so then it convert the gtf to a gff3 with the gffread
command. Then step 2 should work. If it is still giving you blank files, would you mind sending me or posting a few examples of en tries from the gff3 file to see what might be going wrong.
The empty file will not work for subsequent steps.
-Nakul
Hi Nakul,
Sorry I may have not described what I have done clearly. I first followed the section Requirements to install all the requirements, and then I followed the section Usage to this step (2) Run annotation on each GTF file and got the empty file I mentioned in the early reply. But I'm not so sure the difference between Usage and Usage Reference Guide. Do you mean I should run the Usage Reference Guide instead of Usage? Thx in advance!
Jianning
Hi Nakul,
I run the reference guided mode this afternoon and I still got an empty reference_merged_candidates.gff3_annotated_filtered_test_all
. Here is how my gff3 file looks like:
I used Stringtie2 to get the GTF file (I check it supports the long-read sequencing in its newest version), and then I used the gffread to convert it to gff3. I also uploaded the gff3 file to Onedrive in case you need it:
Thanks, Jianning
Ah, I think I see what the issue is. The normal mode in Usage is for short-read data. It has a step that uses read information to increase confidence in the transcripts that are assembled. However, for long-read data, the assembly is quite confident, and thus the steps that are normally run do not really improve the confidence in the transcripts.
Instead, the reference-guided mode is when you already have a confident assembly, and you just want to annotate the transcripts. I think in your case this would be the most useful. To be able to do that, I would run the steps after Usage Reference Guide. This would allow you to annotate the transcripts and generate the files I mentioned above. An important difference is that the script that is run, rmskhg38_annotate_gtf_update_test_tpm_cuff.py
, is different than the one under Usage.
As mentioned in the first comment, this would get the annotations for the transcripts. Now, I am not sure if all the subsequent steps will run smoothly with Stringtie2, but I do not see why it would not. I think it can still create the ballgown files with long-read data.
Please let me know if running rmskhg38_annotate_gtf_update_test_tpm_cuff.py
creates a file that is not empty.
-Nakul
Hi Nakul,
Thanks a lot for your help. I tried to run the rmskhg38_annotate_gtf_update_test_tpm_cuff.py
with command rmskhg38_annotate_gtf_update_test_tpm_cuff.py SRR10267008_reference_merged_candidates.gff3
. It still got an empty SRR10267008_reference_merged_candidates.gff3_annotated_filtered_test_all
. I also tried to run the next step mergeAnnotationProcess_Ref.R
with the unmpty file SRR10267008_reference_merged_candidates.gff3_annotated_test_all
and it exited with an error. It seems like rmskhg38_annotate_gtf_update_test_tpm_cuff.py
filters all the transcripts. I am not sure what happens here. I upload the unempty file SRR10267008_reference_merged_candidates.gff3_annotated_test_all
here for your reference: https://zjuintl-my.sharepoint.com/:u:/g/personal/jianning_18_intl_zju_edu_cn/EdRAAm7TIAZAugS9H-TIbs4ByG7xHFMZuWIcmSOOyMVmWA?e=1cDGpO
With thanks, Jianning
Hello Jianning,
I ran the pipeline on your file, and I think, I realized the problem. For some reason, most of the transcripts are single-exon transcripts. In the gff3 file, each transcript has one line for the transcript and then one exon line. There are no splice junctions for most of the transcripts. Thus, without splicing the transcripts are filtered out since the pipeline requires spliced transcripts.
Can you post the original gtf? I am not sure if this is an issue with the gtf file or with gffread creating only single-exon transcripts.
-Nakul
Hi Nakul,
Thank you so much for running the pipeline on my file on the weekend! Inspired by your reply, I think there may be two reasons that I got an empty file:
ax splice
;-L
for stringtie2.These may result in not outputing splice events in the gtf file. I uploaded my gtf file to Onedrive for your reference: https://zjuintl-my.sharepoint.com/:u:/g/personal/jianning_18_intl_zju_edu_cn/Ede4r2YX2r1FgpDOF8VXDzQB-WA5o6UsM0kMhd7BwHHxSA?e=0Firwp
And here are the commands I used:
samtools view -h SRR10267008.bam | stringtie - -G /WORK/home/shexh/KJN/Rotation_2/Reference/Annotation/gencode.vM32.chr_patch_hapl_scaff.annotation.gtf -o SRR10267008.gtf -m 100 -c 1
gffread -E /WORK/home/shexh/KJN/Rotation_2/embryo_Pacbio/SRR10267008.gtf -o- > SRR10267008_reference_merged_candidates.gff3
rmskhg38_annotate_gtf_update_test_tpm_cuff.py SRR10267008_reference_merged_candidates.gff3
I am trying to rerun the minimap2 and stringtie2 to see how it works. Hopefully this can solve the bug :)
With thanks, Jianning
Hi Nakul,
Finally the SRR10267008_reference_merged_candidates.gff3_annotated_filtered_test_all
file is not empty! Thank you so much for kindly help me with it. With our effort, your pipeline can be safely extended to the long read sequencing 😉
Before I close this issue, just a few more things I may need to check with you:
rmskhg38_annotate_gtf_update_test_tpm_cuff.py
, it keeps returning error:
So I change the annotateintrons = True
to False
. Does it have any influence on the following analysis?Step10.RData
and it seems like the variable annotatedcufftranscripts
have the information I need. Do I find the right data that fits my need?annotatedcufftranscripts
, it has gene 1
, gene 2
and class TE
. Does it represent the genes and TE that are in the same transcript? What does chromtrans
and other trans information mean?There are other minor issues like mergeAnnotationProcess_Ref.R
requires Xmisc
cannot be installed directly because it has been removed on the R CRAN, and option -g
does not seem to work for mergeAnnotationProcess_Ref.R
. But they does not seem to have a big impact. I can make another issue to discuss how to install Xmisc
if you find it useful :)
With thanks, Jianning
Hi both I hope it's okay that I join this thread - amazing timing, I just started analyzing some PacBio data yesterday using your pipeline.
About Xmisc, it is true it has been removed from cran. I first downloaded the tar.gz from here: https://cran.r-project.org/src/contrib/Archive/Xmisc/. I just took the newest one (https://cran.r-project.org/src/contrib/Archive/Xmisc/Xmisc_0.2.1.tar.gz). Then, in R, I ran install.packages("Xmisc_0.2.1.tar.gz", repos = NULL, type="source")
and it worked fine. Hope this helps.
Another thing, I also had to use the option -L for stringtie2 to get out a significant number of transcripts.
I am currently in step (6) where I get the bug that filter_read_stats.txt
only has four columns where R is expecting five. I wonder if it's an issue because I use single-end for step 5? (I used commandsmax_speed_se.py
)
Best regards Nanna
Hi both I hope it's okay that I join this thread - amazing timing, I just started analyzing some PacBio data yesterday using your pipeline.
About Xmisc, it is true it has been removed from cran. I first downloaded the tar.gz from here: https://cran.r-project.org/src/contrib/Archive/Xmisc/. I just took the newest one (https://cran.r-project.org/src/contrib/Archive/Xmisc/Xmisc_0.2.1.tar.gz). Then, in R, I ran
install.packages("Xmisc_0.2.1.tar.gz", repos = NULL, type="source")
and it worked fine. Hope this helps.Another thing, I also had to use the option -L for stringtie2 to get out a significant number of transcripts. I am currently in step (6) where I get the bug that
filter_read_stats.txt
only has four columns where R is expecting five. I wonder if it's an issue because I use single-end for step 5? (I usedcommandsmax_speed_se.py
)Best regards Nanna
Hi Nanna,
Yep I also installed the Xmisc by using install.packages("Xmisc_0.2.1.tar.gz", repos = NULL, type="source")
. Currently I'm trying to retrieve quantification information of candidate chimeric reads. But I found that when I used the gffread
as the READMe suggested, the transcript and gene ID was all in 'STRG XXX', and I cannot find the relevant quantification information. I use another format-convert program and I'm trying to modify the rmskhg38_annotate_gtf_update_test_tpm_cuff.py
to read in the new gff3 file.
Hope it helps, Jianning
Hello,
Thank you for commenting on your progress. I think we will have to move on from Xmisc
, but as noted, there is a workaround using devtools and the archived version as seen in the README.
The annotateintrons
option is not used for any of the quantification and identification. It is a feature to help with seeing if the first intron junction has been annotated previously in the reference file, but it is not used in subsequent steps. I think changing the default to false makes sense, thank you for catching that!
@Edison19991109 to answer your other two questions:
annotatedcufftranscripts
R object has all the information you need in terms of the annotation of the transcript.gene 1
is the gene that the start is annotated in. Then, the pipeline finds when the transcript overlaps with a gene exon. Most of the time, gene 1
and gene 2
are the same since the transcript starts in the gene intron and then splices into an exon. The classTE
is the class of the transposable element that is annotated as the start of the transcript. The final output of the pipeline is an excel file with more meaningful variable names. The intermediate tables in the .RData files have various data used in different parts of the pipeline. The chromtrans
variable contains transcript structure information used in annotating the transcript.
Essentially, to find the TE-gene chimeric transcripts, you should look for the ones where `exonintron1' is "None" or "intron." Then exonintron2 is "exon" and gene2 is a gene. That means that the transcript starts in an intergenic or intronic TE and splices into a gene.
I think a solution to this is to print the annotations of the TE-gene transcripts in an easier to read excel file earlier in the pipeline. You can use this to annotate the long-read transcripts and then use other software for quantification.
Hope that helps!
-Nakul
@nannabarnkob thank you for your interest in the pipeline.
As mentioned earlier in the thread, this pipeline was made for short-read data, but many of the steps are not needed for long-read data. Thus, with long read data it will work better to use the reference-guided mode: https://github.com/twlab/TEProf2Paper#usage-reference-guided
You should align your reads and use a pipeline of your choice to generate a gtf file that can be annotated with that portion of the pipeline. Many pipelines exist for this such as Talon and Isoseq. Then you can use reference-guided mode to annotate it. Unfortunately, I am not sure if the quantification portion will work since this pipeline was made for short-read data.
-Nakul
Hello,
Thank you for commenting on your progress. I think we will have to move on from
Xmisc
, but as noted, there is a workaround using devtools and the archived version as seen in the README.The
annotateintrons
option is not used for any of the quantification and identification. It is a feature to help with seeing if the first intron junction has been annotated previously in the reference file, but it is not used in subsequent steps. I think changing the default to false makes sense, thank you for catching that!@Edison19991109 to answer your other two questions: 2. The
annotatedcufftranscripts
R object has all the information you need in terms of the annotation of the transcript. 3. Essentially, the pipeline works by annotating the start of the transcript first. So if it is intergenic, all of this will be "None". If it starts in an intron, then thegene 1
is the gene that the start is annotated in. Then, the pipeline finds when the transcript overlaps with a gene exon. Most of the time,gene 1
andgene 2
are the same since the transcript starts in the gene intron and then splices into an exon. TheclassTE
is the class of the transposable element that is annotated as the start of the transcript.The final output of the pipeline is an excel file with more meaningful variable names. The intermediate tables in the .RData files have various data used in different parts of the pipeline. The
chromtrans
variable contains transcript structure information used in annotating the transcript.Essentially, to find the TE-gene chimeric transcripts, you should look for the ones where `exonintron1' is "None" or "intron." Then exonintron2 is "exon" and gene2 is a gene. That means that the transcript starts in an intergenic or intronic TE and splices into a gene.
I think a solution to this is to print the annotations of the TE-gene transcripts in an easier to read excel file earlier in the pipeline. You can use this to annotate the long-read transcripts and then use other software for quantification.
Hope that helps!
-Nakul
Hi Nakul,
Many thanks for your kind help! I have managed to solve the bugs and get the result 👍
Thanks a lot!
Jianning
Hi there,
I planned to use the pipeline for pacbio long-reads sequencing data. I have mapped the data using minimap2. Can I apply TEProf2 to TE finding? Thx in advance :)
Jianning