TE gff3 to gtf? - GitHub issues

oronoc1210 commented 6 years ago

Hello,

I'm working with Sorghum bicolor, and both my gene and TE annotation files were obtained from Phytozome as gff3 files. I had no trouble converting the gene gff3 file to gtf format using gffread from the cufflinks suite, but I'm having a lot more trouble converting the TE gff3 file to gtf format -- gffread just returns an empty file.

My TE gff3 file looks like this: Sbicolor_313_v3.1.repeatmasked_assembly_v3.0.gff3.gz. What would you recommend I do?

Thank you for your help!

olivertam commented 6 years ago

Hi,

To generate the TE GTF, it is easiest to make a tab-delimited file containing the following information: chromosome, start, stop, strand and TE name. If you have information regarding the TE family or class, that would be even better. You can then run the makeTEgtf perl script to generate the output.

From your GFF3 file, you could remove all the lines starting with #, and then cut columns 1, 4, 5 and 7. You can then cut out column 9 and clean it up so that just the TE name is there, and then paste it to your other columns. You could also remove simple repeats (e.g. (AT)n and G-rich) from the annotations. You can then run makeTEgtf.pl -c 1 -s 2 -e 3 -o 4 -n RepeatMasker -t 5 -1 on the file with the cut columns to generate the GTF file.

I generated a GTF file from your GFF3 using the method described above, and it is available here. Please let me know if that works for you.

Thanks.
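The column extraction described above could be sketched roughly as follows. This is only an illustration, not the exact commands used to produce the shared GTF: the function name `gff3_to_te_table` is made up, and the assumption that the TE name sits in column 9 as `Name=<TE name>` must be checked against your own GFF3.

```shell
# Sketch: reduce a RepeatMasker-style GFF3 to five tab-separated columns
# (chromosome, start, stop, strand, TE name).
# ASSUMPTION: the TE name is stored in column 9 as "Name=<TE name>";
# change the key below if your GFF3 uses a different attribute.
gff3_to_te_table() {
    awk -F'\t' -v OFS='\t' '
        /^#/ { next }                        # skip header/comment lines
        NF < 9 { next }                      # skip lines without an attribute column
        {
            name = ""
            n = split($9, attrs, ";")        # attributes are ";"-separated
            for (i = 1; i <= n; i++)
                if (attrs[i] ~ /^Name=/) name = substr(attrs[i], 6)
            if (name == "") next             # no TE name found on this line
            # drop simple repeats such as (AT)n and entries ending in "-rich"
            if (name ~ /^\(.*\)n$/ || name ~ /-rich$/) next
            print $1, $4, $5, $7, name
        }
    ' "$@"
}
```

With a gzipped annotation this might be used as `zcat annotation.gff3.gz | gff3_to_te_table > TEinput.txt`, after which the five columns line up with the `-c 1 -s 2 -e 3 -o 4 -t 5` options of the makeTEgtf.pl command quoted above.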

oronoc1210 commented 6 years ago

Thanks so much for your help! It works perfectly for me!

Best, Conor

Ramkynanjing commented 2 years ago

Dear Olivertam, could you share the method you used? I could not find the URL.

I also want to generate a GTF file from a GFF file.

olivertam commented 2 years ago

Hi,

Thank you for your interest in the software. I have updated the link to the perl script, but you can also get it here. Please don't hesitate to contact me if you have any questions about the process.

Thanks.

Ramkynanjing commented 2 years ago

Dear Oliver Tam,

Greetings!

Thank you so much for your reply. I generated TE.gtf using the following command: perl makeTEgtf.pl -c 1 -s 2 -e 3 -o 4 -n RepeatMasker -t 5 -1 TEinput.txt > TE.gtf. It works perfectly, thank you.

While running the program, TEtranscripts shows this error; please see the attached screenshot images.

I need a count list for TEs, but I got read counts only for genes.

Please send me your valuable suggestions.

With regards

Ramakrishnan

olivertam commented 2 years ago

Hi Ramakrishnan,

I cannot see the screenshot (you may have to attach it to your response on Github). Regarding only getting gene counts but no TE counts, please ensure that the chromosome names of your TE GTF and gene GTF match (no additional chr in the name, for example).
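One way to check the chromosome names is to list the sequence names that occur in the TE GTF but not in the gene GTF. The sketch below is illustrative only; the function name `te_chroms_missing_from_genes` is hypothetical and both files are assumed to be plain, tab-delimited GTFs.

```shell
# Sketch: print each sequence name (column 1) that appears in the TE GTF
# but never in the gene GTF.
# Usage: te_chroms_missing_from_genes TE.gtf genes.gtf
te_chroms_missing_from_genes() {
    # awk reads the gene GTF first ("$2"), remembers its sequence names,
    # then reports names from the TE GTF ("$1") that were never seen.
    awk -F'\t' '
        /^#/ { next }                                   # skip comment lines
        FILENAME == ARGV[1] { seen[$1] = 1; next }      # first file: gene GTF
        !($1 in seen) && !printed[$1]++ { print $1 }    # second file: TE GTF
    ' "$2" "$1"
}
```

Empty output means every sequence name in the TE GTF also occurs in the gene GTF. Names such as `chr1` showing up here while the gene GTF uses `1` would point to the mismatch described above.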

Thanks.

Ramkynanjing commented 2 years ago

Dear Professor Oliver Tam,

Thank you so much for your quick reply. I am so sorry.

In this mail, I have attached a PPT file, in which I have pasted the screenshot images.

I have also shown screenshot images of the TE GTF and gene GTF.

I think the chromosome names are correct.

Please send me your valuable suggestions.

With regards

Ramakrishnan

olivertam commented 2 years ago

Hi Ramakrishnan,

I'm afraid that none of the attachments will work if you respond via email. You will need to respond on Github itself. Alternatively, you can send me an email at tam at cshl dot edu

Thanks.

Ramkyeri commented 2 years ago

Dear Oliver Tam,

Thank you for your quick reply. While running the program, TEtranscripts shows some errors; I have pasted the screenshot images. I need a count list for TEs, but I got read counts only for genes. I have also shown the screenshot images of the TE GTF and gene GTF. Please send me your valuable suggestions. With regards, Ramakrishnan

olivertam commented 2 years ago

Hi Ramakrishnan,

The "errors" are warning messages from the pysam module, and have no impact on TEtranscripts (see #82).

Looking at your TE GTF file, it appears the gene_id for all entries is RepeatMasker. They should instead correspond to the TE name. You might need to check the TEinput.txt file to ensure that the TE name is in the correct column.
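A quick way to see which gene_id values a TE GTF actually contains is to tally them. The helper below is a minimal sketch (the name `gene_id_tally` is made up); it only assumes the usual `gene_id "..."` attribute in column 9.

```shell
# Sketch: count how many GTF lines carry each gene_id value.
# Output: "<number of lines><TAB><gene_id>", most frequent first.
gene_id_tally() {
    awk -F'\t' '
        /^#/ { next }                              # skip comment lines
        match($9, /gene_id "[^"]*"/) {
            # RSTART/RLENGTH delimit the match; drop the 9-character
            # prefix (gene_id + space + quote) and the closing quote
            id = substr($9, RSTART + 9, RLENGTH - 10)
            count[id]++
        }
        END { for (id in count) printf "%d\t%s\n", count[id], id }
    ' "$@" | sort -k1,1nr -k2,2
}
```

Running `gene_id_tally TE.gtf | head` should show many different TE names. A single line whose gene_id is `RepeatMasker` would indicate that the `-t` column passed to makeTEgtf.pl pointed at the source column rather than the TE name.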

It is also unusual that you had zero non-uniquely mapped reads in your BAM file. It is unclear what alignment parameters you are using, but given that many TEs are repetitive elements in the genome, it is typically recommended to allow non-uniquely mapped reads when aligning (we typically allow up to 100 genomic alignments).
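To get a rough idea of whether a BAM file contains multi-mapped alignments at all, the `NH` tag (number of reported alignments for the read, as defined in the SAM optional-fields specification) can be inspected. The sketch below is not part of TEtranscripts; `count_by_mapping` is a hypothetical helper that reads SAM text on standard input, and it assumes the aligner writes `NH` tags.

```shell
# Sketch: classify SAM alignment records by their NH tag.
# Counts alignment records (not reads): a read with NH:i:4 may contribute
# up to four records. Records without an NH tag are reported separately.
count_by_mapping() {
    awk -F'\t' '
        /^@/ { next }                            # skip SAM header lines
        {
            nh = 0
            for (i = 12; i <= NF; i++)           # optional tags start at column 12
                if ($i ~ /^NH:i:/) nh = substr($i, 6) + 0
            if (nh == 1) unique++
            else if (nh > 1) multi++
            else untagged++
        }
        END { printf "unique\t%d\nmulti\t%d\nno_NH_tag\t%d\n", unique, multi, untagged }
    '
}
```

A typical call would be `samtools view sample.bam | count_by_mapping` (the file name is a placeholder). A `multi` count of zero suggests the alignment settings kept only uniquely mapped reads.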

Hope this is helpful. Please feel free to let me know if you have further questions.

Thanks.

Ramkyeri commented 2 years ago

Dear Oliver Tam,

Many thanks for your response. The screenshot image shows the TE input file for generating the TE GTF file.

I used the following script

perl makeTEgtf.pl -c 1 -s 2 -e 3 -o 4 -n RepeatMasker -t 5 -1 TEinput.txt >TE.gtf

I am not sure whether I am correct.

with regards

Ramky

olivertam commented 2 years ago

Hi Ramky,

If you look at column 6 (attributes), you will find information such as repeat name, class, etc. You will need to parse out the information to have those as separate columns, and then run the makeTEgtf.pl script with the correct columns for the various pieces of information.

Hope that is helpful. If you still have issues, you can provide the GFF3 file, and we can convert it to the TE GTF.

Thanks.

Ramkyeri commented 2 years ago

Dear Oliver Tam,

Many thanks for your kind response. Could I send you the GFF3 file personally?

with regards

olivertam commented 2 years ago

Hi,

Feel free to send the GFF3 to tam at cshl dot edu, or attach it on Github.

Thanks.

olivertam commented 2 years ago

Hi,

I have sent you the download links for the files, but I wanted to also note down the steps that were used to generate them.

1) Convert GFF into tab-separated file containing relevant info in various columns:

$ head -n 4 TE.gff
##gff-version 3
PH01000409      RepeatMasker    disperseRepeat  1       1130    9321    +       .       ID=repeat_TE0000001;Target=Repeat_8240 948 2073;Class=ClassII/TIR/CACTA;PercDiv=4.8;PercDel=0.3;PercIns=0.6;
PH01000409      RepeatMasker    disperseRepeat  913     1163    850     -       .       ID=repeat_TE0000002;Target=Repeat_14470 4640 4882;Class=ClassI/LINE/L1;PercDiv=15.2;PercDel=7.2;PercIns=10.7;
PH01000409      RepeatMasker    disperseRepeat  1130    1315    1235    +       .       ID=repeat_TE0000003;Target=Repeat_24486 6739 6927;Class=ClassI/LTR/Copia;PercDiv=13.4;PercDel=1.6;PercIns=0.0;

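$ sed '/^#/d;s/;/\t/g;s/[A-Za-z]*=//g;s/ /\t/g;s/\//:/;s/\//\t/;s/?/Unknown/g' TE.gff | awk -F '\t' -v OFS='\t' 'NF==18;NF<18{print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$13,$14,$15,$16,$17}' > TE_input.txt

Note: Every \t in the code above stands for a tab (the \t escape in the sed replacement is a GNU sed extension; with other sed versions, type a literal tab with Ctrl-V, then Tab). For lines with fewer than 18 fields, column 13 (class) is repeated so that the family column is still filled.

$ head -n 3 TE_input.txt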
PH01000409      RepeatMasker    disperseRepeat  1       1130    9321    +       .       repeat_TE0000001        Repeat_8240     948     2073    ClassII:TIR     CACTA   4.8     0.3     0.6     
PH01000409      RepeatMasker    disperseRepeat  913     1163    850     -       .       repeat_TE0000002        Repeat_14470    4640    4882    ClassI:LINE     L1      15.2    7.2     10.7    
PH01000409      RepeatMasker    disperseRepeat  1130    1315    1235    +       .       repeat_TE0000003        Repeat_24486    6739    6927    ClassI:LTR      Copia   13.4    1.6     0.0

2) Generate TE GTF

$ perl makeTEgtf.pl -c 1 -s 4 -e 5 -o 7 -t 10 -n RepeatMasker -f 14 -C 13 -S 6 -1 TE_input.txt > TE.gtf
$ head -n 3 TE.gtf
PH01000409      RepeatMasker    exon    1       1130    9321    +       .       gene_id "Repeat_8240"; transcript_id "Repeat_8240"; family_id "CACTA"; class_id "ClassII:TIR";
PH01000409      RepeatMasker    exon    913     1163    850     -       .       gene_id "Repeat_14470"; transcript_id "Repeat_14470"; family_id "L1"; class_id "ClassI:LINE";
PH01000409      RepeatMasker    exon    1130    1315    1235    +       .       gene_id "Repeat_24486"; transcript_id "Repeat_24486"; family_id "Copia"; class_id "ClassI:LTR";
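As an alternative to the positional sed/awk pipeline in step 1, the attributes can be looked up by key, which is less sensitive to the order or number of attributes. This is only a sketch under the assumption that the GFF3 looks like the `head` output above (`Target=<name> <start> <end>` and `Class=<level1>/<level2>/<family>`); the function name `te_gff_to_table` and the seven-column layout are choices made here, not part of makeTEgtf.pl.

```shell
# Sketch: parse Target and Class from column 9 by key and print
# chrom, start, end, strand, repeat name, class, family (tab-separated).
# When Class has fewer than three levels, the class is reused as the
# family, mirroring the fallback in the pipeline above.
te_gff_to_table() {
    awk -F'\t' -v OFS='\t' '
        /^#/ || NF < 9 { next }                  # skip comments and short lines
        {
            target = ""; class = ""
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++) {
                eq = index(attrs[i], "=")
                if (eq == 0) continue            # e.g. empty piece after the last ";"
                key = substr(attrs[i], 1, eq - 1)
                val = substr(attrs[i], eq + 1)
                if (key == "Target") { split(val, t, " "); target = t[1] }
                if (key == "Class")  class = val
            }
            if (target == "") next               # no repeat name on this line
            gsub(/\?/, "Unknown", class)         # same substitution as the sed step
            m = split(class, lv, "/")
            cls = (m >= 2) ? (lv[1] ":" lv[2]) : class
            fam = (m >= 3) ? lv[3] : cls
            print $1, $4, $5, $7, target, cls, fam
        }
    ' "$@"
}
```

With `te_gff_to_table TE.gff > TE_table.txt` (a hypothetical file name), the column options for makeTEgtf.pl would become `-c 1 -s 2 -e 3 -o 4 -t 5 -C 6 -f 7`, by analogy with the command in step 2.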

Please let me know if there are any issues or further questions.

Thanks.

Ramkynanjing commented 2 years ago

Dear Oliver Tam,

Thank you so much for spending your valuable time generating the TE GTF file.

It is very clear and very useful to my research group.

with regards

Ramky

Ramkyeri commented 2 years ago

Dear Oliver Tam,

Thank you very much for your kind response and your clear explanation.

This is really very clear and useful to me.

with regards

Ramky

Ramkyeri commented 2 years ago

Dear Oliver Tam,

As per your suggestion: "As part of the --sortByPos parameter, TEcount is resorting the file by name (using samtools). It then tries to load the file (which is a hidden file in the same folder) for processing. I'm assuming that you have sufficient space on your system, but that's worth checking. One option is to resort the BAM file by read name (samtools sort -n ...), and then run TEcount without the --sortByPos parameter"

I ran TEtranscripts, but I got a warning message.

As you mentioned: "The "errors" are warning messages from the pysam module, and has no impact on TEtranscripts (see #82)."

Shall I ignore these warning messages?

Thank you

with regards

Ramky

olivertam commented 2 years ago

Hi Ramky,

Yes, you can safely ignore those warning messages that you highlighted.

Thanks.

Ramkyeri commented 2 years ago

Dear Oliver Tam,

Thank you very much for your kind response.

with regards

Ramky

Ramkyeri commented 2 years ago

Dear Oliver Tam, Greetings!

I want to treat each copy of the TE as a distinct “gene”. Using R, I merged the ID with the repeat name and generated TE.GTF, but the output TE.GTF is not in the correct format.

I also learned that although the input file is a text file, it should follow the GFF3 column layout rather than an arbitrary one.

In R, I subset only the required information: the chrom, start, stop, strand, source, TE name, family and class columns. The “dot” column after the strand column is also important.

After many attempts, I generated a GTF in the correct format, but I want to add an underscore (_) between the ID and the repeat name.

I feel it would be better to concatenate the columns in Linux. Could you show me an example of how to concatenate them and add an underscore (_) between the ID and the repeat name in Linux?

with regards

Ramky

Ramkyeri commented 2 years ago

Dear Oliver Tam, Greetings! What is the minimum system requirement to run TEtranscripts? I use WSL2 (Windows Subsystem for Linux), and my computer has 128 GB RAM. When I treat the repeat name as the gene, it takes 3 hours to run 2 BAM files of 10 GB each. However, when treating each copy of the TE as a distinct “gene”, TEtranscripts takes a long time: more than 24 hours have passed and it is still running, and I do not know how long it will take. Could you give me some suggestions?

With regards

Ramky

olivertam commented 2 years ago

Hi Ramky,

I have created a new thread (#105) for your question regarding system requirements of TEtranscripts. Please refer to that for my response.

Regarding making a locus-level TE GTF, you can do the following:

1) Take your TE_input.txt

$ head -n 3 TE_input.txt
PH01000409      RepeatMasker    disperseRepeat  1       1130    9321    +      .    repeat_TE0000001        Repeat_8240     948     2073    ClassII:TIR     CACTA  4.8      0.3     0.6
PH01000409      RepeatMasker    disperseRepeat  913     1163    850     -      .    repeat_TE0000002        Repeat_14470    4640    4882    ClassI:LINE     L1     15.2     7.2     10.7
PH01000409      RepeatMasker    disperseRepeat  1130    1315    1235    +      .    repeat_TE0000003        Repeat_24486    6739    6927    ClassI:LTR      Copia  13.4     1.6     0.0

2) Use awk to combine columns 9 and 10 with an underscore (_)

$ awk -F '\t' -v OFS='\t' '{print $1,$2,$3,$4,$5,$6,$7,$8,$9 "_" $10,$11,$12,$13,$14,$15,$16,$17}' TE_input.txt > TE_input2.txt
$ head -n 3 TE_input2.txt
PH01000409      RepeatMasker    disperseRepeat  1       1130    9321    +      .    repeat_TE0000001_Repeat_8240    948     2073    ClassII:TIR     CACTA   4.8    0.3      0.6
PH01000409      RepeatMasker    disperseRepeat  913     1163    850     -      .    repeat_TE0000002_Repeat_14470   4640    4882    ClassI:LINE     L1      15.2   7.2      10.7
PH01000409      RepeatMasker    disperseRepeat  1130    1315    1235    +      .    repeat_TE0000003_Repeat_24486   6739    6927    ClassI:LTR      Copia   13.4   1.6      0.0

Note that the \t after -F and -v OFS stands for a tab

3) Run makeTEgtf.pl

$ makeTEgtf.pl -c 1 -s 4 -e 5 -o 7 -t 9 -n RepeatMasker -f 13 -C 12 -S 6 -1 TE_input2.txt > TE2.gtf
$ head -n 3 TE2.gtf
PH01000409      RepeatMasker    exon    1       1130    9321    +       .      gene_id "repeat_TE0000001_Repeat_8240"; transcript_id "repeat_TE0000001_Repeat_8240"; family_id "CACTA"; class_id "ClassII:TIR";
PH01000409      RepeatMasker    exon    913     1163    850     -       .      gene_id "repeat_TE0000002_Repeat_14470"; transcript_id "repeat_TE0000002_Repeat_14470"; family_id "L1"; class_id "ClassI:LINE";
PH01000409      RepeatMasker    exon    1130    1315    1235    +       .      gene_id "repeat_TE0000003_Repeat_24486"; transcript_id "repeat_TE0000003_Repeat_24486"; family_id "Copia"; class_id "ClassI:LTR";

Let me know if you encounter any more issues.

Thanks.

Ramkyeri commented 2 years ago

Dear Oliver Tam,

Millions of thanks to you for your kind response and sending me the example.

The program has also completed; it took more than 36 hours.

with regards

Ramky

olivertam commented 2 years ago

Thank you for the update. As you can see from the logs, the creation of the TE index took the longest amount of time. This is part of the rationale for using pre-built indices for TElocal.

All the best.

Ramkyeri commented 2 years ago

Dear Oliver Tam,

Many thanks for the information. I will try TElocal.

with regards

Ramky

Ramkyeri commented 2 years ago

Dear Oliver Tam,

How many read count files are generated by TEtranscripts? In my analysis, I got only one read count file.

The manual shows two output tables, one for genes and one for TEs.

with regards

Ramky

olivertam commented 2 years ago

Hi,

There is one output file, the two tables are combined together.

Thanks.

Ramkyeri commented 2 years ago

Thank you so much for your kind reply.

Ramkyeri commented 2 years ago

Dear Oliver Tam,

Thank you for providing very useful software. TEtranscripts and TElocal are similar, but in order to run TElocal, we have to provide indexed GTF files.

Could you show me an example of how to generate this file? I do not have any idea how to do this.

I can also use TEcount instead of TEtranscripts and generate only the count table, which can then be used for differential analysis.

As I understand it, the only differences are that TEtranscripts also performs the differential analysis,

and TElocal uses an indexed GTF file.

with regards

Ramky

olivertam commented 2 years ago

Hi,

I transferred your question to our TElocal repository. You can find it here.

Thanks

Ramkyeri commented 2 years ago

Dear Oliver Tam,

As per your suggestion, I ran the following command:

perl makeTEgtf.pl -c 1 -s 4 -e 5 -o 7 -t 9 -n RepeatMasker -f 13 -C 12 -S 6 -1 TE_input2.txt > TEunigueID.gtf

but the program is running like this.

Did you modify the Perl script? You asked me to run makeTEgtf_v2.pl, but I do not have it.

Please send me your suggestions.

with regards

Ramky

olivertam commented 2 years ago

Hi Ramky,

If you are running TElocal, you do not need to make a unique ID GTF file. Please check that your input file is tab-delimited/separated, and not space separated.
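Whether a file is really tab-delimited can be checked by counting tab-separated fields per line. The helper name `field_count_summary` below is hypothetical; it is a small sketch rather than anything shipped with TEtranscripts.

```shell
# Sketch: summarise how many tab-separated fields each line has.
# Output: "<fields per line><TAB><number of lines>", sorted by field count.
field_count_summary() {
    awk -F'\t' '
        { n[NF]++ }                              # NF = number of tab-separated fields
        END { for (k in n) printf "%d\t%d\n", k, n[k] }
    ' "$@" | sort -n
}
```

For example, `field_count_summary TE_input2.txt` should report a single field count for all lines. A field count of 1 means the lines contain no tabs at all, which is the usual sign of a space-separated file.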

Thanks.

Ramkyeri commented 2 years ago

Dear Oliver Tam,

Many thanks. Yes, my files are not tab-delimited.

To run TElocal, can I use the same GTF that you showed me? I also got errors for the output file.

with regards

Ramky

olivertam commented 2 years ago

Hi Ramky,

Please use the original GTF (TE.gtf) for building the TElocal index. Regarding your error, please ensure that you are not copying and pasting the command line, but attempt to type it out. There could be formatting errors that are introduced. If that does not resolve the issue, we can generate the file for you.

Thanks.

Ramkyeri commented 2 years ago

Dear Oliver Tam,

Many thanks for your support.

I do not know why the Tab key is not working in my WSL2, so I used the following command:

awk '{print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9 "_" $10"\t"$11"\t"$12"\t"$13"\t"$14"\t"$15"\t"$16"\t"$17"\t"}' TE_input.txt > TE_input2.txt

and ran makeTEgtf.pl (is that correct?).

According to my understanding, there are two kinds of analysis:

1. We can treat each copy as a distinct gene, or
2. we can treat all copies of the same TE as a single gene.

For example, in TE.gtf, Repeat_8240 is present 361 times (total copy number). We can treat all copies as a single gene, or we can treat each copy as a distinct gene.

Could you explain to me what locus-specific means? TElocal is a locus-specific analysis, but I am sorry, I could not understand this.

With regards

Ramky

olivertam commented 2 years ago

Hi Ramky,

If you are trying to insert a tab on the Linux command line, you will need to use the following key combination: Ctrl-V, then Tab. That should hopefully work.

To answer your question: the way that RepeatMasker works is that it takes a set of TE consensus sequences, and then identifies genomic regions that match the consensus. When we treat "all copies of the same TE as a single gene", it means that we aggregate counts for all identified copies of the TE consensus in the genome as belonging to that TE consensus (which we sometimes designate as 'sub-family'). For a locus-specific analysis, each copy will be given its own count (rather than aggregated to the identified consensus), and so you will get values for each TE copy (which we sometimes designate as 'instance'). So in your example, each of the 361 copies of Repeat_8240 will have its own count (rather than an aggregated count).

Just a reminder, TElocal uses the original GTF (your TE.gtf) and not the modified one (TE2.gtf) to build the index.

Thanks.

Ramkyeri commented 2 years ago

Dear Oliver Tam,

Thank you for your kind explanation and guidance.

with regards

Ramky

Ramkyeri commented 2 years ago

Dear Oliver Tam,

I think TEtranscripts also provides locus-level TE RNA quantification and can be used for any species.

Am I correct? I think locus-level means each copy of a TE.

I do not know why they are compared like this.

with regards

olivertam commented 2 years ago

Hi,

TEtranscripts was not designed for locus/copy level quantification, and thus not typically utilized this way. TElocal is our locus-level implementation.

Thanks.

Ramkyeri commented 2 years ago

Dear Oliver Tam,

Thank you. I am again confused about my GTF files. I now have two kinds of GTF file.

Locus/copy-level quantification is confusing me.

with regards

Ramky

olivertam commented 2 years ago

One of your GTF files (TE.gtf) uses the repeat type/subfamily as the gene_id, while the other (TE2.gtf) uses the TE copy name as the gene_id. The former is what TEtranscripts typically expects, whereas the latter was a "hack" to make TEtranscripts do locus-level quantification. The "hack" has now been replaced by TElocal.

What TEtranscripts does is quantify (and aggregate) all reads from a particular TE (e.g. Repeat_14470, which is a type of L1) throughout the genome. TElocal (locus level) quantifies that particular copy of Repeat_14470 at PH01000409 at position 913-1163.
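The relationship between the two levels can be illustrated by collapsing per-copy values back to the repeat name. The sketch below is purely illustrative: it assumes a hypothetical two-column, tab-separated table (ID, count) whose IDs follow the TE2.gtf pattern `repeat_TE<number>_<repeat name>`, and the function name `sum_copies_by_subfamily` is made up. The sums need not equal what TEtranscripts reports with TE.gtf, since this sketch does not model how reads mapping to several copies are handled.

```shell
# Sketch: sum per-copy counts by repeat name.
# Input:  "<repeat_TE<number>_<repeat name>><TAB><count>"
# Output: "<repeat name><TAB><summed count>", sorted by repeat name.
sum_copies_by_subfamily() {
    awk -F'\t' -v OFS='\t' '
        {
            id = $1
            sub(/^repeat_TE[0-9]+_/, "", id)     # drop the per-copy prefix
            total[id] += $2
        }
        END { for (id in total) print id, total[id] }
    ' "$@" | sort
}
```

In this picture, TE2.gtf yields one row per copy (the input of the sketch), while TE.gtf yields one row per repeat name (the shape of its output).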

Thanks.

Ramkyeri commented 2 years ago

Dear Oliver Tam,

Thank you so much. These are the results for TE2.gtf.

Repeat_8240 is present more than 300 times. If I use TE.gtf, I will get only one count value for Repeat_8240, combining all 300 counts into a single count, because it uses the repeat type/subfamily as the gene_id.

Whereas if I use TE2.gtf, I will get a read count for each copy of Repeat_8240, because in TE2.gtf each copy of Repeat_8240 has its own gene ID thanks to the copy prefix (for example "repeat_TE0004914").

Is this correct?

with regards

Ramky

Ramkyeri commented 2 years ago

Dear Oliver Tam,

If I use TElocal, I have to use TE.gtf, not TE2.gtf.

Is this correct?

with regards

Ramky

olivertam commented 2 years ago

Hi,

Yes, you are correct regarding the difference between TE.gtf and TE2.gtf. TElocal is designed to work with TE.gtf to mimic TE2.gtf (which is a "hack" for TEtranscripts).

Thanks.

Ramkyeri commented 2 years ago

Dear Oliver Tam, Thank you. Ramky

666lixiaona commented 9 months ago

Who can provide BAM and GTF files for genes and TEs? Thank you very much.

olivertam commented 9 months ago

Hi,

Are you asking where you can get the input files? You generate the BAM by aligning your reads against the genome build of choice. We typically use STAR, and allow up to 100 alignments per read (though that's based on mammalian genomes). For the gene GTF, you can get it from UCSC, Ensembl or GENCODE, though that could possibly change the genome build that you're using (i.e. you want to use the FASTA with the same chromosome nomenclature as your GTF source). For the TE GTF, you can get it here, and for certain species, you would need to get the corresponding GTF based on the source of your FASTA and GTF (e.g. hg38 for UCSC, GRCh38_Ensembl for Ensembl, GRCh38_GENCODE for GENCODE).

Thanks.

666lixiaona commented 9 months ago

Thank you very much. Can you provide a BAM file for me? I just want to see how the software works for now; later I may use it for research. Also, can you explain how TEtranscripts works? The software doesn't provide an example showing how it works, and I have difficulty understanding it.

olivertam commented 9 months ago

Hi,

If you want some test data, they are available here. For usage information, feel free to read our README. For the description of our algorithm, you can read the corresponding publication. For a proposed workflow, you can read this.

Please let me know if you have other questions.

Thanks.

mhammell-laboratory / TEtranscripts

TE gff3 to gtf? #33