Closed oronoc1210 closed 6 years ago
Hi,
To generate the TE GTF, it is easiest to make a tab-delimited file containing the following information: chromosome, start, stop, strand and TE name. If you have information regarding the TE family or class, that would be even better. You can then run the makeTEgtf perl script to generate the output.
From your GTF file, you could remove all the lines starting with #, and then cut columns 1,4,5,7. You can then cut out column 9 and clean it up so that just the TE name is there, and then paste it to your other columns. You could also remove simple repeats (e.g. (AT)n and G-rich) from the annotations. You can then run makeTEgtf.pl -c 1 -s 2 -e 3 -o 4 -n RepeatMasker -t 5 -1 on the file with the cut columns to generate the GTF file.
I generated a GTF file from your GFF3 using the method described above, and it is available here. Please let me know if that works for you.
Thanks.
Thanks so much for your help! It works perfectly for me!
Best, Conor
Dear Olivertam, could you share the method you used? I could not find the URL.
I also want to generate a GTF file from a GFF file.
Hi,
Thank you for your interest in the software. I have updated the link to the perl script, but you can also get it here. Please don't hesitate to contact me if you have any questions about the process.
Thanks.
Dear Oliver Tam,
Greetings!
Thank you so much for your reply. I generated TE.gtf using the following command: perl makeTEgtf.pl -c 1 -s 2 -e 3 -o 4 -n RepeatMasker -t 5 -1 TEinput.txt > TE.gtf. It works perfectly, thank you.
While running the program, TEtranscripts shows this error; please see the attached screenshots.
I need a count list for TEs, but I only got read counts for genes.
Please send me your valuable suggestions.
With regards
Ramakrishnan
Hi Ramakrishnan,
I cannot see the screenshot (you may have to attach it to your response on Github).
Regarding only getting gene counts but no TE counts, please ensure that the chromosome names of your TE GTF and gene GTF match (no additional chr in the name, for example).
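One way to check this from the command line is to compare the first column of the two files. The snippet below is only a sketch: the function name chrom_only_in_te and the demo file contents are made up for illustration, and it assumes both files are ordinary tab-delimited GTFs.

```shell
# Print sequence names (column 1) that occur in the TE GTF but not in the gene GTF.
# An empty result means every TE chromosome name also exists in the gene GTF.
chrom_only_in_te() {
    # $1 = gene GTF, $2 = TE GTF
    work=$(mktemp -d)
    grep -v '^#' "$1" | cut -f1 | sort -u > "$work/gene.chr"
    grep -v '^#' "$2" | cut -f1 | sort -u > "$work/te.chr"
    # comm -13 keeps only the lines unique to the second (TE) list.
    comm -13 "$work/gene.chr" "$work/te.chr"
    rm -rf "$work"
}

# Hypothetical demo data: the gene GTF uses "chr1", the TE GTF uses "1".
demo=$(mktemp -d)
printf 'chr1\tsrc\texon\t1\t100\t.\t+\t.\tgene_id "g1";\n' > "$demo/genes.gtf"
printf '1\tRepeatMasker\texon\t5\t50\t.\t+\t.\tgene_id "L1";\n' > "$demo/te.gtf"
chrom_only_in_te "$demo/genes.gtf" "$demo/te.gtf"
```

Any name printed here is a TE chromosome name that has no counterpart in the gene GTF, which is a likely reason for missing TE counts.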
Thanks.
Dear Professor Oliver Tam,
Thank you so much for your quick reply. I am so sorry.
I have attached a PPT file to this mail, in which I have pasted the screenshots.
I have also included screenshots of the TE GTF and gene GTF.
I think the chromosome names are correct.
Please send me your valuable suggestions.
With regards
Ramakrishnan
Hi Ramakrishnan,
I'm afraid that none of the attachments will work if you respond via email. You will need to respond on Github itself. Alternatively, you can send me an email at tam at cshl dot edu
Thanks.
Dear Oliver Tam,
Thank you for your quick reply. While running the program, TEtranscripts shows some errors; I have pasted the screenshots. I need a count list for TEs, but I only got read counts for genes. I have also included screenshots of the TE GTF and gene GTF. Please send me your valuable suggestions.
With regards
Ramakrishnan
Hi Ramakrishnan,
The "errors" are warning messages from the pysam module, and have no impact on TEtranscripts (see #82).
Looking at your TE GTF file, it appears the gene_id for all entries is RepeatMasker. They should instead correspond to the TE name. You might need to check the TEinput.txt file to ensure that the TE name is in the correct column.
It is also unusual that you had zero non-uniquely mapped reads in your BAM file. It is unclear what alignment parameters you are using, but given that many TEs are repetitive elements in the genome, it is typically recommended to allow non-uniquely mapped reads when aligning (we typically allow up to 100 genomic alignments).
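A quick way to see which gene_id values a TE GTF actually contains is to tally them. This is a rough sketch, assuming a tab-delimited GTF with the attributes in column 9; the function name gene_id_counts and the demo lines are invented for illustration.

```shell
# Tally the gene_id values found in column 9 of a GTF, most frequent first.
gene_id_counts() {
    # $1 = GTF file
    awk -F'\t' '!/^#/ {
        # Locate gene_id "..." and strip the key and the surrounding quotes.
        if (match($9, /gene_id "[^"]*"/)) {
            id = substr($9, RSTART + 9, RLENGTH - 10)
            n[id]++
        }
    }
    END { for (id in n) print n[id] "\t" id }' "$1" | sort -k1,1nr -k2,2
}

# Hypothetical demo: every entry carries the source name instead of a TE name.
demo=$(mktemp -d)
printf 'PH01000409\tRepeatMasker\texon\t1\t1130\t9321\t+\t.\tgene_id "RepeatMasker"; transcript_id "RepeatMasker";\n' > "$demo/TE.gtf"
printf 'PH01000409\tRepeatMasker\texon\t913\t1163\t850\t-\t.\tgene_id "RepeatMasker"; transcript_id "RepeatMasker";\n' >> "$demo/TE.gtf"
gene_id_counts "$demo/TE.gtf"
```

If the output is a single line naming the annotation source, the TE name column passed to makeTEgtf.pl was most likely the wrong one.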
Hope this is helpful. Please feel free to let me know if you have further questions.
Thanks.
Dear Oliver Tam,
Many thanks for your response. The screenshot image shows the TE input file for generating the TE GTF file.
I used the following script
perl makeTEgtf.pl -c 1 -s 2 -e 3 -o 4 -n RepeatMasker -t 5 -1 TEinput.txt >TE.gtf
I am not sure whether I am correct.
with regards
Ramky
Hi Ramky,
If you look at column 6 (attributes), you will find information such as repeat name, class, etc. You will need to parse out the information to have those as separate columns, and then run the makeTEgtf.pl script with the correct columns for the various pieces of information.
Hope that is helpful. If you still have issues, you can provide the GFF3 file, and we can convert it to the TE GTF.
Thanks.
Dear Oliver Tam,
Many thanks for your kind response. Could I send you the GFF3 file personally?
with regards
Hi,
Feel free to send the GFF3 to tam at cshl dot edu, or attach it on Github.
Thanks.
Hi,
I have sent you the download links for the files, but I wanted to also note down the steps that were used to generate it.
1) Convert GFF into tab-separated file containing relevant info in various columns:
$ head -n 4 TE.gff
##gff-version 3
PH01000409 RepeatMasker disperseRepeat 1 1130 9321 + . ID=repeat_TE0000001;Target=Repeat_8240 948 2073;Class=ClassII/TIR/CACTA;PercDiv=4.8;PercDel=0.3;PercIns=0.6;
PH01000409 RepeatMasker disperseRepeat 913 1163 850 - . ID=repeat_TE0000002;Target=Repeat_14470 4640 4882;Class=ClassI/LINE/L1;PercDiv=15.2;PercDel=7.2;PercIns=10.7;
PH01000409 RepeatMasker disperseRepeat 1130 1315 1235 + . ID=repeat_TE0000003;Target=Repeat_24486 6739 6927;Class=ClassI/LTR/Copia;PercDiv=13.4;PercDel=1.6;PercIns=0.0;
$ sed '/^#/d;s/;/ /g;s/[A-Za-z]*=//g;s/ / /g;s/\//:/;s/\// /;s/?/Unknown/g' TE.gff | awk -F " " -v OFS=" " 'NF==18;NF<18{print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$13,$14,$15,$16,$17}' > TE_input.txt
Note: All gaps in the code above are tabs
$ head -n 3 TE_input.txt
PH01000409 RepeatMasker disperseRepeat 1 1130 9321 + . repeat_TE0000001 Repeat_8240 948 2073 ClassII:TIR CACTA 4.8 0.3 0.6
PH01000409 RepeatMasker disperseRepeat 913 1163 850 - . repeat_TE0000002 Repeat_14470 4640 4882 ClassI:LINE L1 15.2 7.2 10.7
PH01000409 RepeatMasker disperseRepeat 1130 1315 1235 + . repeat_TE0000003 Repeat_24486 6739 6927 ClassI:LTR Copia 13.4 1.6 0.0
2) Generate TE GTF
$ perl makeTEgtf.pl -c 1 -s 4 -e 5 -o 7 -t 10 -n RepeatMasker -f 14 -C 13 -S 6 -1 TE_input.txt > TE.gtf
$ head -n 3 TE.gtf
PH01000409 RepeatMasker exon 1 1130 9321 + . gene_id "Repeat_8240"; transcript_id "Repeat_8240"; family_id "CACTA"; class_id "ClassII:TIR";
PH01000409 RepeatMasker exon 913 1163 850 - . gene_id "Repeat_14470"; transcript_id "Repeat_14470"; family_id "L1"; class_id "ClassI:LINE";
PH01000409 RepeatMasker exon 1130 1315 1235 + . gene_id "Repeat_24486"; transcript_id "Repeat_24486"; family_id "Copia"; class_id "ClassI:LTR";
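Because literal tabs are easily lost when a command is copied from a web page, the conversion in step 1 can also be written with every tab spelled out. The version below is a sketch of the same sed and awk steps; where the tabs fall is an assumption inferred from the note above and from the sample output, and the function name gff3_to_te_table is made up for illustration.

```shell
# Same sed/awk steps as step 1, with each tab written explicitly.
# ASSUMPTION: tab positions are inferred from the note and the sample output.
TAB=$(printf '\t')

gff3_to_te_table() {
    # $1 = RepeatMasker-style GFF3 laid out like the TE.gff sample
    sed "/^#/d;s/;/${TAB}/g;s/[A-Za-z]*=//g;s/ /${TAB}/g;s/\\//:/;s/\\//${TAB}/;s/?/Unknown/g" "$1" |
        awk -F'\t' -v OFS='\t' 'NF==18; NF<18 {print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$13,$14,$15,$16,$17}'
}

# Demo on the first record of the sample, rebuilt here with real tabs.
demo=$(mktemp -d)
printf '##gff-version 3\n' > "$demo/TE.gff"
printf 'PH01000409\tRepeatMasker\tdisperseRepeat\t1\t1130\t9321\t+\t.\tID=repeat_TE0000001;Target=Repeat_8240 948 2073;Class=ClassII/TIR/CACTA;PercDiv=4.8;PercDel=0.3;PercIns=0.6;\n' >> "$demo/TE.gff"
# Show the TE name, class and family columns (10, 13 and 14).
gff3_to_te_table "$demo/TE.gff" | cut -f10,13,14
```

Columns 10, 13 and 14 are the ones passed to makeTEgtf.pl as -t, -C and -f in step 2. When the Class attribute has only two levels, the awk step repeats column 13 so that the family column is still filled.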
Please let me know if there are any issues or further questions.
Thanks.
Dear Oliver Tam,
Thank you so much for spending your valuable time generating the TE GTF file.
It is very clear and very useful to my research group.
with regards
Ramky
Dear Oliver Tam,
Thank you very much for your kind response and your clear explanation.
This is really very clear and useful to me.
with regards
Ramky
Dear Oliver Tam,
As per your suggestion: "As part of the --sortByPos parameter, TEcount is resorting the file by name (using samtools). It then tries to load the file (which is a hidden file in the same folder) for processing. I'm assuming that you have sufficient space on your system, but that's worth checking. One option is to resort the BAM file by read name (samtools sort -n ...), and then run TEcount without the --sortByPos parameter"
I ran TEtranscripts, but I got warning messages.
As you mentioned: "The "errors" are warning messages from the pysam module, and has no impact on TEtranscripts (see #82)."
Shall I ignore these warning messages?
Thank you
with regards
Ramky
Hi Ramky,
Yes, you can safely ignore those warning messages that you highlighted.
Thanks.
Dear Oliver Tam,
Thank you very much for your kind response.
with regards
Ramky
Dear Oliver Tam, Greetings!
I want to treat each copy of the TE as a distinct “gene”. Using R, I merged the ID with the repeat name and generated TE.GTF, but the output is not in the correct format.
I also learned that although the input file is plain text, it should follow the GFF3 column layout rather than an arbitrary one.
In R, I kept only the required information: the chromosome, start, stop, strand, source, TE name, family and class columns. The “dot” column after the strand column is also important.
After many attempts, I generated a GTF in the correct format, but I want to add an underscore (_) between the ID and the repeat name.
I think it would be better to do the concatenation in Linux. Could you show me an example of how to concatenate the ID and the repeat name with an underscore (_) between them in Linux?
with regards
Ramky
Dear Oliver Tam, Greetings! What are the minimum system requirements to run TEtranscripts? I use WSL2 (Windows Subsystem for Linux), and my computer has 128 GB RAM. When I treat the repeat name as the gene, it takes 3 hours to process 2 BAM files of 10 GB each. However, when treating each copy of the TE as a distinct “gene”, TEtranscripts takes a long time: it has now been running for more than 24 hours, and I do not know how much longer it will take. Could you give me some suggestions?
With regards
Ramky
Hi Ramky,
I have created a new thread (#105) for your question regarding system requirements of TEtranscripts. Please refer to that for my response.
Regarding making a locus-level TE GTF, you can do the following:
1) Take your TE_input.txt
$ head -n 3 TE_input.txt
PH01000409 RepeatMasker disperseRepeat 1 1130 9321 + . repeat_TE0000001 Repeat_8240 948 2073 ClassII:TIR CACTA 4.8 0.3 0.6
PH01000409 RepeatMasker disperseRepeat 913 1163 850 - . repeat_TE0000002 Repeat_14470 4640 4882 ClassI:LINE L1 15.2 7.2 10.7
PH01000409 RepeatMasker disperseRepeat 1130 1315 1235 + . repeat_TE0000003 Repeat_24486 6739 6927 ClassI:LTR Copia 13.4 1.6 0.0
2) Use awk to combine columns 9 and 10 with an underscore (_)
$ awk -F " " -v OFS=" " '{print $1,$2,$3,$4,$5,$6,$7,$8,$9 "_" $10,$11,$12,$13,$14,$15,$16,$17}' TE_input.txt > TE_input2.txt
$ head -n 3 TE_input2.txt
PH01000409 RepeatMasker disperseRepeat 1 1130 9321 + . repeat_TE0000001_Repeat_8240 948 2073 ClassII:TIR CACTA 4.8 0.3 0.6
PH01000409 RepeatMasker disperseRepeat 913 1163 850 - . repeat_TE0000002_Repeat_14470 4640 4882 ClassI:LINE L1 15.2 7.2 10.7
PH01000409 RepeatMasker disperseRepeat 1130 1315 1235 + . repeat_TE0000003_Repeat_24486 6739 6927 ClassI:LTR Copia 13.4 1.6 0.0
Note that the gaps after -F and -v OFS are tabs
3) Run makeTEgtf.pl
$ makeTEgtf.pl -c 1 -s 4 -e 5 -o 7 -t 9 -n RepeatMasker -f 13 -C 12 -S 6 -1 TE_input2.txt > TE2.gtf
$ head -n 3 TE2.gtf
PH01000409 RepeatMasker exon 1 1130 9321 + . gene_id "repeat_TE0000001_Repeat_8240"; transcript_id "repeat_TE0000001_Repeat_8240"; family_id "CACTA"; class_id "ClassII:TIR";
PH01000409 RepeatMasker exon 913 1163 850 - . gene_id "repeat_TE0000002_Repeat_14470"; transcript_id "repeat_TE0000002_Repeat_14470"; family_id "L1"; class_id "ClassI:LINE";
PH01000409 RepeatMasker exon 1130 1315 1235 + . gene_id "repeat_TE0000003_Repeat_24486"; transcript_id "repeat_TE0000003_Repeat_24486"; family_id "Copia"; class_id "ClassI:LTR";
Let me know if you encounter any more issues.
Thanks.
Dear Oliver Tam,
Many thanks for your kind response and for sending me the example.
The program has also completed. It took more than 36 hours.
with regards
Ramky
Thank you for the update.
As you can see from the logs, the creation of the TE index took the longest amount of time. This is part of the rationale for using pre-built indices for TElocal.
All the best.
Dear Oliver Tam,
Many thanks for the information. I will try TElocal.
with regards
Ramky
Dear Oliver Tam,
How many read count files are generated by TEtranscripts? In my analysis, I got only one read count file.
The manual shows two output files, one for genes and one for TEs.
with regards
Ramky
Hi,
There is one output file; the two tables are combined together.
Thanks.
Thank you so much for your kind reply.
Dear Oliver Tam,
Thank you for providing very useful software. As I understand it, TEtranscripts and TElocal are similar, but in order to run TElocal we have to provide indexed GTF files.
Could you show me an example of how to generate this file? I do not have any idea how to do this.
I can also use TEcount instead of TEtranscripts and generate only the count table, which can then be used for differential analysis.
The only differences are that TEtranscripts will perform the differential analysis,
and TElocal uses an indexed GTF file.
with regards
Ramky
Dear Oliver Tam,
As per your suggestion, I ran the following command:
perl makeTEgtf.pl -c 1 -s 4 -e 5 -o 7 -t 9 -n RepeatMasker -f 13 -C 12 -S 6 -1 TE_input2.txt > TEunigueID.gtf
but the program is running like this.
Did you modify the Perl script? You asked me to run makeTEgtf_v2.pl, which I do not have.
Please send me your suggestions.
with regards
Ramky
Hi Ramky,
If you are running TElocal, you do not need to make a unique ID GTF file.
Please check that your input file is tab-delimited/separated, and not space separated.
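One way to check the delimiter is to count the tab-separated fields on each line. The following is only a sketch; the function name field_counts and the demo files are invented for illustration.

```shell
# Print "<fields>\t<lines>" for each distinct number of tab-separated fields.
field_counts() {
    # $1 = file to inspect
    awk -F'\t' '{ n[NF]++ } END { for (k in n) print k "\t" n[k] }' "$1" | sort -n
}

# Hypothetical demo: the same record, once space-separated and once tab-separated.
demo=$(mktemp -d)
printf 'PH01000409 RepeatMasker disperseRepeat 1 1130\n' > "$demo/spaces.txt"
printf 'PH01000409\tRepeatMasker\tdisperseRepeat\t1\t1130\n' > "$demo/tabs.txt"
field_counts "$demo/spaces.txt"   # a single field per line: no tabs present
field_counts "$demo/tabs.txt"     # five fields per line
```

A file that reports one field per line contains no tabs at all, so the column numbers given to makeTEgtf.pl cannot match anything.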
Thanks.
Dear Oliver Tam,
Many thanks. Yes, my files are not tab-delimited.
To run TElocal, can I use the same GTF that you showed me? I also got errors for the output file.
with regards
Ramky
Hi Ramky,
Please use the original GTF (TE.gtf) for building the TElocal index.
Regarding your error, please ensure that you are not copying and pasting the command line, but attempt to type it out. There could be formatting errors that are introduced.
If that does not resolve the issue, we can generate the file for you.
Thanks.
Dear Oliver Tam,
Many thanks for your support.
I do not know why the Tab key is not working in my WSL2, so I used the following script:
awk '{print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9 "_" $10"\t"$11"\t"$12"\t"$13"\t"$14"\t"$15"\t"$16"\t"$17"\t"}' TE_input.txt > TE_input2.txt
and run makeTEgtf.pl (Is it correct?)
According to my understanding, there are two kinds of analysis.
For example, in TE.gtf, Repeat_8240 is present 361 times (total copy number). We can treat all copies as a single gene, or we can treat each copy as a distinct gene.
Could you explain to me what locus-specific means? TElocal is a locus-specific analysis, but I am sorry, I could not understand this.
With regards
Ramky
Hi Ramky,
If you are trying to insert a tab on the Linux command line, you will need to do the following combination: ctrl-V, then tab. That should hopefully work.
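Another option that avoids typing a literal tab is to let awk interpret the \t escape for both the input and output separators. This is a sketch of the same column merge (columns 9 and 10 joined by an underscore); the function name merge_id_and_name is made up for illustration, and the input is assumed to be a tab-delimited table laid out like TE_input.txt.

```shell
# Join columns 9 and 10 with "_" while keeping the table tab-delimited.
merge_id_and_name() {
    # $1 = tab-delimited table laid out like TE_input.txt
    awk -F'\t' -v OFS='\t' '{print $1,$2,$3,$4,$5,$6,$7,$8,$9 "_" $10,$11,$12,$13,$14,$15,$16,$17}' "$1"
}

# Demo on the first sample record, rebuilt here with real tabs.
demo=$(mktemp -d)
printf 'PH01000409\tRepeatMasker\tdisperseRepeat\t1\t1130\t9321\t+\t.\trepeat_TE0000001\tRepeat_8240\t948\t2073\tClassII:TIR\tCACTA\t4.8\t0.3\t0.6\n' > "$demo/TE_input.txt"
# Show the merged name, class and family columns (9, 12 and 13).
merge_id_and_name "$demo/TE_input.txt" | cut -f9,12,13
```

Setting -F explicitly matters here: without it awk splits on any whitespace, which gives different column numbers if a field is empty or contains a space.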
To answer your question: the way that RepeatMasker works is that it takes a set of TE consensus sequences, and then identifies genomic regions that match the consensus. When we treat "all copies of the same TE as a single gene", it means that we aggregate counts for all identified copies of the TE consensus in the genome as belonging to that TE consensus (which we sometimes designate as 'sub-family').
For a locus-specific analysis, each copy will be given its own count (rather than aggregated to the identified consensus), and so you will get values for each TE copy (which we sometimes designate as 'instance'). So in your example, each of the 361 copies of Repeat_8240 will have its own count (rather than an aggregated count).
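The relationship between the two levels can be illustrated with a toy table. The numbers, copy names and the function name aggregate_by_subfamily below are all invented, and the snippet only sums columns; it is not how TEtranscripts or TElocal assign reads.

```shell
# Sum per-copy counts (column 3) into one total per sub-family (column 1).
aggregate_by_subfamily() {
    # $1 = tab-delimited table: sub-family, copy name, count
    awk -F'\t' -v OFS='\t' '{ sum[$1] += $3 } END { for (s in sum) print s, sum[s] }' "$1" | LC_ALL=C sort
}

# Toy locus-level table (invented values): two copies of one sub-family, one of another.
demo=$(mktemp -d)
printf 'Repeat_8240\tcopy_A\t5\n'  > "$demo/locus_counts.txt"
printf 'Repeat_8240\tcopy_B\t7\n'  >> "$demo/locus_counts.txt"
printf 'Repeat_14470\tcopy_C\t2\n' >> "$demo/locus_counts.txt"
aggregate_by_subfamily "$demo/locus_counts.txt"
```

The input rows correspond to the locus-level view (one value per copy), and the output rows to the aggregated view (one value per sub-family).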
Just a reminder, TElocal uses the original GTF (your TE.gtf) and not the modified one (TE2.gtf) to build the index.
Thanks.
Dear Oliver Tam,
Thank you for your kind explanation and guidance.
with regards
Ramky
Dear Oliver Tam,
I think TEtranscripts also provides locus-level TE RNA quantification and can be used for any species.
Am I correct? I think locus-level means each copy of a TE.
I do not know why they were compared like this.
with regards
Hi,
TEtranscripts was not designed for locus/copy level quantification, and is thus not typically used this way. TElocal is our locus-level implementation.
Thanks.
Dear Oliver Tam,
Thank you. I am again confused about my GTF files. I now have two kinds of GTF file.
The locus/copy level quantification is confusing me.
with regards
Ramky
One of your GTF files (TE.gtf) uses the repeat type/subfamily as the gene_id, while the other (TE2.gtf) uses the TE copy name as the gene_id. The former is what TEtranscripts typically expects, whereas the latter was a "hack" to make TEtranscripts do locus-level quantification. The "hack" has now been replaced by TElocal.
What TEtranscripts does is quantify (and aggregate) all reads from a particular TE (e.g. Repeat_14470, which is a type of L1) throughout the genome. TElocal (locus level) quantifies that particular copy of Repeat_14470 at PH01000409 at position 913-1163.
Thanks.
Dear Oliver Tam,
Thank you so much. These are the results from TE2.gtf.
Repeat_8240 is present more than 300 times. If I use TE.gtf, I will get only one count value for Repeat_8240, combining all 300 counts into a single count, because it uses the repeat type/subfamily as the gene_id.
Whereas if I use TE2.gtf, I will get a read count for each copy of Repeat_8240, because in TE2.gtf each copy of Repeat_8240 is treated as its own gene ID because of this "repeat_TE0004914".
Is that correct?
with regards
Ramky
Dear Oliver Tam,
If I use TElocal, I have to use TE.gtf, not TE2.gtf.
is it correct?
with regards
Ramky
Hi,
Yes, you are correct regarding the difference between TE.gtf and TE2.gtf.
TElocal is designed to work with TE.gtf to mimic TE2.gtf (which is a "hack" for TEtranscripts).
Thanks.
Dear Oliver Tam, Thank you. Ramky
Who can provide a BAM profile and GTF profile for gene and TE? Thank you very much.
Hi,
Are you asking where you can get the input files? You generate the BAM by aligning your reads against the genome build of choice. We typically use STAR, and allow up to 100 genomic alignments per read (though that's based on mammalian genomes). For the gene GTF, you can get them from UCSC, Ensembl or GENCODE, though that could possibly change the genome build that you're using (i.e. you want to use the FASTA with the same chromosome nomenclature as your GTF source). For the TE GTF, you can get them here, and for certain species, you would need to get the corresponding GTF based on the source of your FASTA and GTF (e.g. hg38 for UCSC, GRCh38_Ensembl for Ensembl, GRCh38_GENCODE for GENCODE).
Thanks.
Thank you very much. Can you provide a BAM profile for me? For now I just want to know how the software works. Later I may use it for research. Also, can you explain how TEtranscripts works? The software does not provide an example showing how it works, and I have difficulty understanding it.
Hi,
If you want some test data, they are available here. For usage information, feel free to read our README. For the description of our algorithm, you can read the corresponding publication. For a proposed workflow, you can read this.
Please let me know if you have other questions.
Thanks.
Hello,
I'm working with Sorghum bicolor, and both my gene and TE annotation files were obtained from Phytozome as gff3 files. I had no trouble converting the gene gff3 file to gtf format using gffread from the cufflinks suite, but I'm having a lot more trouble converting the TE gff3 file to gtf format -- gffread just returns an empty file.
My TE gff3 file looks like this: Sbicolor_313_v3.1.repeatmasked_assembly_v3.0.gff3.gz
What would you recommend I do?
Thank you for your help!