mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0

details about making GTF file for TEtranscripts #63

Closed: AIBio closed this issue 4 years ago

AIBio commented 4 years ago

Hi! I have read the content in the issue. Three questions about using the Perl script "makeTEgtf.pl" puzzle me.

1) [INFILE]: which format of TE annotation file is needed, GTF or SAF? The GTF file downloaded from the UCSC rmsk track does not seem to include the TE name/family/class. Here is the first row of my GTF file:

chr1 hg38_rmsk exon 67108754 67109046 1892.000000 + . gene_id "L1P5"; transcript_id "L1P5";

But the SAF file includes this information:

bin swScore milliDiv milliDel milliIns genoName genoStart genoEnd genoLeft strand repName repClass repFamily repStart repEnd repLeft id

0 1892 83 59 14 chr1 67108753 67109046 -181847376 + L1P5 LINE L1 5301 5607 -544 1

2) Should I give the column names (genoName, genoStart, genoEnd, ...) or the column indices (1, 2, 3, ...)?

3) Is the swScore in the SAF file the score you mention in the Perl script?

Just now, I tried using the SAF file as the input file and giving the columns as numeric indices. The result is shown below. Could you help me check whether it is right? The first two rows:

chr1 hg38_rmsk exon 67108754 67109046 . + . gene_id "L1P5"; transcript_id "L1P5"; family_id "L1"; class_id "LINE";

chr1 hg38_rmsk exon 8388316 8388618 . - . gene_id "AluY"; transcript_id "AluY"; family_id "Alu"; class_id "SINE";

For AluY, which occurs at multiple locations, the transcript_id of the first location is "AluY" without a "_dup" suffix. Is that right?

chr1 hg38_rmsk exon 8388316 8388618 . - . gene_id "AluY"; transcript_id "AluY"; family_id "Alu"; class_id "SINE";

chr1 hg38_rmsk exon 41942895 41943205 . - . gene_id "AluY"; transcript_id "AluY_dup1"; family_id "Alu"; class_id "SINE";

chr1 hg38_rmsk exon 218103554 218103843 . + . gene_id "AluY"; transcript_id "AluY_dup2"; family_id "Alu"; class_id "SINE";

chr1 hg38_rmsk exon 30408564 30408860 . - . gene_id "AluY";

Best wishes! Hanwen Yu 30th March, 2020

olivertam commented 4 years ago

Hi,

  1. The file that I've been using is the SQL table output from UCSC, but the SAF file would work if it contains the right columns. I have to confess that the script doesn't really support GTF as an input (since GTF doesn't put the TE name in a separate column), but other column-based files (e.g. BED or tab-delimited annotations) should work.
  2. Please use the column index (1, 2, 3 etc). Sorry if this wasn't clear.
  3. I use the swScore column, but that's optional (it is not utilized by TEtranscripts itself).
  4. Your output file looks reasonable.
  5. Your assumption about the "_dup" suffix is correct. The first instance would not have "_dup" in the transcript_id.
  6. The best way to test if the file is correct would be to try it with TEtranscripts. If it's parsed correctly, then you should be able to run it without too many issues. We are thinking about a "GTF checker", and might integrate it with another tool that we're developing.
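For anyone reading along, here is a rough Python sketch of the conversion discussed in this thread. It is not the code in makeTEgtf.pl, only an illustration of the points above. The function name `rmsk_to_gtf` and its parameters are hypothetical. The column positions assume the 17-column UCSC rmsk layout quoted in the question, and the +1 shift of the start coordinate is inferred from the example rows (genoStart 67108753 in the table, 67108754 in the GTF).

```python
from collections import defaultdict

# 0-based positions in the UCSC rmsk table layout quoted in this thread
# (an assumption: other inputs such as BED need different positions).
SCORE, CHROM, START, END, STRAND = 1, 5, 6, 7, 9
NAME, CLASS, FAMILY = 10, 11, 12


def rmsk_to_gtf(lines, source="hg38_rmsk", use_score=False):
    """Yield one tab-delimited GTF line per rmsk row (hypothetical helper)."""
    seen = defaultdict(int)  # number of copies already emitted per TE name
    for line in lines:
        fields = line.split()
        # Skip blank lines, comments, and the header row.
        if not fields or fields[0].startswith("#") or fields[0] == "bin":
            continue
        name = fields[NAME]
        copy_number = seen[name]
        seen[name] += 1
        # The first copy keeps the bare name; later copies get _dup1, _dup2, ...
        transcript_id = name if copy_number == 0 else f"{name}_dup{copy_number}"
        # The table start is 0-based; GTF start is 1-based.
        start = str(int(fields[START]) + 1)
        # The score is optional, so "." is written unless requested.
        score = fields[SCORE] if use_score else "."
        attributes = (
            f'gene_id "{name}"; transcript_id "{transcript_id}"; '
            f'family_id "{fields[FAMILY]}"; class_id "{fields[CLASS]}";'
        )
        yield "\t".join(
            [fields[CHROM], source, "exon", start, fields[END],
             score, fields[STRAND], ".", attributes]
        )


# The single table row quoted in the question.
example_row = (
    "0 1892 83 59 14 chr1 67108753 67109046 -181847376 + "
    "L1P5 LINE L1 5301 5607 -544 1"
)
for gtf_line in rmsk_to_gtf([example_row]):
    print(gtf_line)
```

With the row from the question, this prints the same fields as the first output row shown above, separated by tabs. The sketch numbers copies in file order, so which AluY copy stays without "_dup" depends on the order of the rows in the input.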

Thanks.

AIBio commented 4 years ago

Great! Your reply is really helpful. I will close this issue.