mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0
206 stars 29 forks source link

Difference between provided TE GTF and TE GTF made from UCSC tracks by makeTEgtf.pl #75

Closed XinyuXiang closed 3 years ago

XinyuXiang commented 3 years ago

Hi, I used the makeTEgtf.pl you provided here to make hg38 TE GTF file from UCSC RepeatMasker track as you suggested here. But the numbers of TE between GTF file you provided (hg38_rmsk_TE.gtf, downloaded from here) and GTF file made from UCSC hg38 RepeatMasker tracks by makeTEgtf.pl are different, which are 4777125 and 4693511 separately. Could you please let me know where's the problem?

I downloaded the UCSC RepeatMasker track from here (genome: human, assembly: human, track: RepeatMasker, table: rmsk, output format: all fields from selected table).

This is part of the UCSC RepeatMasker track downloaded (hg38.ucsc.tracks): #bin swScore milliDiv milliDel milliIns genoName genoStart genoEnd genoLeft strand repName repClass repFamily repStart repEnd repLeft id 0 1892 83 59 14 chr1 67108753 67109046 -181847376 + L1P5 LINE L1 5301 5607 -544 1 1 2582 27 0 23 chr1 8388315 8388618 -240567804 - AluY SINE Alu -15 296 1 1 1 4085 171 77 36 chr1 25165803 25166380 -223790042 + L1MB5 LINE L1 5567 6174 0 4

This is my code used to make hg38 TE GTF: perl makeTEgtf.pl -c 6 -s 7 -e 8 -o 10 -n rmsk -t 11 -f 12 -C 13 hg38.ucsc.tracks > hg38.ucsc.tracks.gtf

This is part of the TE GTF file I made (hg38.ucsc.tracks.gtf): chr1 rmsk exon 67108754 67109046 . + . gene_id "L1P5"; transcript_id "L1P5"; family_id "LINE"; class_id "L1"; chr1 rmsk exon 8388316 8388618 . - . gene_id "AluY"; transcript_id "AluY"; family_id "SINE"; class_id "Alu"; chr1 rmsk exon 25165804 25166380 . + . gene_id "L1MB5"; transcript_id "L1MB5"; family_id "LINE"; class_id "L1";

Looking forward to your reply! Thank you!

Xinyu

olivertam commented 3 years ago

Hi Xinyu,

Thank you for your interest in our software. The reason why there are more lines in the GTF generated from the current RMSK output is because UCSC have added new chromosome patches (containing _fix in their name). If you are using these chromosomes for alignment and annotation purposes, it might have some impact. There are no clear consensus on how best to incorporate patches into most analyses. For variant calling, GATK recommends using the calls from patches over the canonical chromosomes, which makes sense for that application. However, for RNA-seq and ChIP-seq, where misaligments due to single (or small number) nucleotide changes are better tolerated, it is less clear if the patches add to the analysis. We feel the same way about alternative haplotype chromosomes (those with _alt in their name), and thus, we haven't been using either in our analyses. Our advice is that chromosome patches are not critical for TEtranscript, but should not cause dramatic problems even if present.

Just a quick note on your command line: the repFamily and repClass columns are 13 and 12 respectively, and so you might want to use -f 13 -C 12 when running the perl code. Thanks.

XinyuXiang commented 3 years ago

Hi Oliver:

Thank you so much for quick and detailed reply! I've learned a lot from your clear explanation. It is so nice of you to point out my mistake on the code. :)

Best, Xinyu

emattei commented 1 year ago

Hi I would like to reopen this issue because the explanation above does not explain the discrepancy. I am looking at the mouse file. UCSC rmsk: 5333739 UCSC rmsk (no alt, no fix, no random, no chrUn): 5138231 mm10_rmsk_TE.gtf.gz: 3725827

Is it because the provided GTF is very old? Thank you

olivertam commented 1 year ago

Hi,

Thank you for your question. In the RMSK output, there are also entries for Low complexity repeats (Low_complexity), small non-coding RNA (rRNA, scRNA, snRNA, srpRNA, tRNA) and simple repetitive sequences (Simple_repeat). These are not included in the TE GTF, as we don't think that they should be considered for the purposes of quantifying transposable elements.

If you eliminate those, you should get close to the number that you see in the TE GTF file.

We do keep satellite sequences, and it could be argued that they should not be considered TE. However, we haven't determined whether their regulation is linked to mechanisms that govern TE expression, and thus have kept them for now.

Thanks.