mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0

GTF files for rice, also external index #96

Closed zhangaicen closed 2 years ago

zhangaicen commented 2 years ago

Hello Oliver, I'd like to use TEtranscripts to analyze TEs in rice (Oryza sativa japonica), but I don't know how to get the TE GTF and index files from the given website, since most of the files there are for animals. My reference genome is the Nipponbare reference genome (MSU v7.0, http://rice.plantbiology.msu.edu/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_7.0/all.dir/).

So could you tell me how I can get the TE GTF file?

Thanks very much for your time.

olivertam commented 2 years ago

Hi,

Thank you for your interest in the software. If the MSU v7.0 database includes a file that contains the location and identity of transposable elements (or repeats in general), you can either use a perl script (available here) to convert it into a GTF that is compatible with TEtranscripts, or we can work with you to generate the GTF. Unfortunately, at the time of this reply, I am unable to access the URL that you provided.

Thanks.

This is the usage information for the perl script:

 Usage: makeTEgtf.pl -c [chrom column] -s [start column] -e [stop/end column] 
                     -o [strand column] -n [source] -t [TE name column] 
                     (-f [TE family column] -C [TE class column] -1)
                     [INFILE]

 Output is printed to STDOUT

 Required parameters:
  -c [chrom column]     -    Column containing chromosome name
  -s [start column]     -    Column containing feature start position
  -e [stop/end column]  -    Column containing feature stop/end position
  -o [strand column]    -    Column containing strand information (+ or -/C)
  -t [TE name column]   -    Column containing TE name
  [INFILE]              -    File name to be processed into GTF

 Optional parameters:
  -n [source]           -    Source of the TE information 
                             (e.g. mm9_rmsk for RepeatMasker track from
                              mm9 mouse genome)
                             Defaults to "user-provided" if not specified
  -f [TE family column] -    Column containing TE family name. 
                             Defaults to TE name if not specified
  -C [TE class column]  -    Column containing TE class name. 
                             Defaults to TE family name if not specified
  -S [score column]     -    Column containing the score of the TE prediction
                             (e.g. score from RepeatMasker)
  -1                    -    Input coordinates uses 1-based indexing
                             This should be used if the input file uses
                             1-based coordinates. This should be invoked
                             if the genomic coordinates are obtained from
                             a GFF3, GTF, SAM or VCF file
                             Default: off if using BED, BAM or UCSC rmsk
                                      input files
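
The `-1` option above concerns the difference between 0-based (BED) and 1-based (GTF) start coordinates. The following is a minimal Python sketch of that conversion for a single line, not a reimplementation of makeTEgtf.pl. The function name `bed_to_gtf_line`, the six-column BED layout, the `exon` feature type and the attribute layout (`gene_id`, `transcript_id`, `family_id`, `class_id`) are assumptions made for illustration. Family and class fall back to the TE name, mirroring the defaults described in the usage text.

```python
def bed_to_gtf_line(bed_line, source="user-provided"):
    """Convert one six-column BED line into a GTF-style line (illustrative only)."""
    # Assumed BED columns: chrom, start, end, name, score, strand
    chrom, start, end, name, score, strand = bed_line.rstrip("\n").split("\t")[:6]
    # BED starts are 0-based and half-open; GTF starts are 1-based and inclusive,
    # so only the start coordinate shifts by one.
    gtf_start = int(start) + 1
    gtf_end = int(end)
    # Assumed attribute layout; family and class default to the TE name here.
    attributes = (
        f'gene_id "{name}"; transcript_id "{name}"; '
        f'family_id "{name}"; class_id "{name}";'
    )
    fields = [chrom, source, "exon", str(gtf_start), str(gtf_end),
              score, strand, ".", attributes]
    return "\t".join(fields)


if __name__ == "__main__":
    print(bed_to_gtf_line("Chr1\t999\t1500\tDNA/Helitron\t0\t+"))
```

If the input already uses 1-based coordinates (GFF3, GTF, SAM or VCF), the `+ 1` shift would be wrong, which is the situation the `-1` flag is meant for.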

zhangaicen commented 2 years ago

Hi Oliver, thanks for your quick response. I have a TE.bed file as shown here: (image). Using makeTEgtf.pl, I converted it into a GTF file: (image). Is this a suitable GTF file for further analysis?

best, Thanks.

olivertam commented 2 years ago

Hi,

It looks quite reasonable to me. The only thing that might be annoying is that your annotations will read as follows: DNA/Helitron:DNA/Helitron:DNA/Helitron. This is because the program concatenates the gene, family and class IDs together. This will not break the program; the names are just long. Otherwise, I think it should be good to use. Let me know if you encounter any more issues.
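
The long names described in this reply come from joining three identical fields with colons. Here is a small Python sketch of that effect, plus one optional way to shorten the labels by splitting a `class/family` name before building the GTF. Both helper names (`te_label`, `split_class_family`) are hypothetical, and the splitting step is only a suggestion that assumes the TE name column has the form `Class/Family`.

```python
def te_label(gene_id, family_id, class_id):
    """Join the three IDs with colons, as described in the reply above."""
    return ":".join([gene_id, family_id, class_id])


def split_class_family(name):
    """Split 'Class/Family' into (class, family); fall back to the class if no '/'."""
    te_class, _, te_family = name.partition("/")
    return te_class, te_family or te_class


if __name__ == "__main__":
    name = "DNA/Helitron"
    # Same value in all three fields gives the long label.
    print(te_label(name, name, name))
    # Optional: use separate class and family values for a shorter label.
    te_class, te_family = split_class_family(name)
    print(te_label(te_family, te_family, te_class))
```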

Thanks.

zhangaicen commented 2 years ago

OK, thanks Oliver. I may bother you again if other problems come up in the future.

Thanks and best wishes.

zhangaicen commented 2 years ago

Hi Oliver, I'm trying to quantify rice TE expression using BAM files from HISAT2 run with --no-mixed. As a newcomer to TE analysis, I still have some points of confusion:

  1. To better understand the software, I tried both TEtranscripts and TElocal. The software can build the index itself, but TElocal reports the error "TE annotation file needs to be a TElocal index, which will end in .locInd". Also, to save time on every run, is there a way to prepare the .ind and .locInd index files in advance?

  2. In the paper "DNA hypomethylation in tetraploid rice potentiates stress-responsive gene expression for salt tolerance", the authors use a boxplot to display TE expression in two samples processed by TEtranscripts (image). Which TEtranscripts output can represent TE expression in the way gene expression is usually reported (FPKM, CPM, ...)?

  3. In my TEtranscripts log, I saw that very few reads were annotated as TE reads. Is this reasonable, or is something wrong with my workflow? (image)

Thanks for your help.

olivertam commented 2 years ago

Hi,

  1. TElocal utilizes a different index format (.locInd) than TEtranscripts (.ind). We can try to build the indices for you, but will need the TE GTF that you generated.
  2. I am not sure which paper you are referring to, but the raw counts from TEtranscripts (.cntTable) can be put through differential analysis pipelines (e.g. DESeq2, which is called by TEtranscripts). You can then generate normalized counts and/or variance stabilized counts, one of which I'm assuming was used for the plot. However, I would refer to their methods for the precise details.
  3. In our analyses, we see 15-25% of our counts attributable to TE (for human). I'm not sure what the expected proportion is in rice, but it would be heavily dependent on the quality of the TE annotation.
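
To compare a dataset against the 15-25% figure in point 3, the share of counts assigned to TEs can be estimated from the count table. The sketch below is a rough Python illustration, not part of TEtranscripts. It assumes a tab-delimited table with one header row, feature IDs in the first column and one count column per sample, and it assumes that TE rows can be recognised by a colon in the ID (the concatenated gene:family:class name) while gene IDs contain none. The function name `te_fraction` and the example rows are made up.

```python
import csv
import io


def te_fraction(handle):
    """Return the fraction of all counts (summed over samples) on TE rows."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header row
    te_total = 0.0
    all_total = 0.0
    for row in reader:
        if not row:
            continue
        row_sum = sum(float(value) for value in row[1:])
        all_total += row_sum
        # Assumption: only TE IDs contain ':' (gene:family:class).
        if ":" in row[0]:
            te_total += row_sum
    return te_total / all_total if all_total else 0.0


if __name__ == "__main__":
    # Made-up example table; a real file would be opened with open(path).
    example = io.StringIO(
        "gene/TE\tsampleA\tsampleB\n"
        "LOC_Os01g01010\t80\t70\n"
        "DNA/Helitron:DNA/Helitron:DNA/Helitron\t20\t30\n"
    )
    print(te_fraction(example))
```

If the gene IDs in an annotation do contain colons, the row test would need to be replaced, for example by checking IDs against the set of names in the TE GTF.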

Thanks.

zhangaicen commented 2 years ago

Hi,

  1. Attached is my TE GTF; please help me get the index files.
  2. In their legend, they mention that the Y-axis indicates the values of standard transcript abundance normalization by DESeq (the default method in TEtranscripts). What does this refer to?
  3. There are 55801 genes in the rice GTF and 364081 TEs in the TE GTF. In the TEtranscripts output (.cntTable), all 55801 genes are present, but there are only 160 TEs, many with 0 counts. I think this is abnormal, but I cannot find the reason. (image) Attachment: MSU.TE.withoutcentro_satelite.gtf.gz

olivertam commented 2 years ago

Hi,

  1. I have started making the index files, and will let you know once they are ready.
  2. The approach that DESeq2 uses for normalization is known as "median of ratios", which you can read more about here. We are completely dependent on DESeq2 for normalization and differential analysis, and thus would refer you to their documentation if you want more details.
  3. Although you have 364081 lines in your TE GTF, they correspond to the individual copies of the TEs. In TEtranscripts (unlike TElocal), the TE counts are aggregated by subfamily (in essence, the gene_id), and so you end up with 159 distinct TE "subfamilies". It is not unusual to have zero counts for some subfamilies if they are not active, but this is also dependent on the completeness of the TE annotation, and whether you allowed for sufficient multimappers during alignment (we have typically found up to 100 multimappers to be useful for Drosophila and mammals).
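
The collapse from individual copies to subfamilies in point 3 can be checked directly on a TE GTF by counting how many lines share each `gene_id`. The following Python sketch illustrates that check. It assumes a standard nine-column GTF with a `gene_id "..."` attribute; the function name `copies_per_subfamily` and the example lines are hypothetical.

```python
import re
from collections import Counter

GENE_ID = re.compile(r'gene_id "([^"]+)"')


def copies_per_subfamily(gtf_lines):
    """Count GTF lines (TE copies) per gene_id (subfamily)."""
    counts = Counter()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        match = GENE_ID.search(fields[8])
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Made-up records: three copies belonging to two subfamilies.
    example = [
        'Chr1\tuser-provided\texon\t1000\t1500\t0\t+\t.\tgene_id "DNA/Helitron"; transcript_id "DNA/Helitron";',
        'Chr2\tuser-provided\texon\t200\t900\t0\t-\t.\tgene_id "DNA/Helitron"; transcript_id "DNA/Helitron";',
        'Chr3\tuser-provided\texon\t50\t400\t0\t+\t.\tgene_id "LTR/Gypsy"; transcript_id "LTR/Gypsy";',
    ]
    counts = copies_per_subfamily(example)
    print(len(counts), "subfamilies from", sum(counts.values()), "copies")
```

Run on a real file (for example with `open(path)` passed as `gtf_lines`), the number of keys would correspond to the number of TE rows expected in the TEtranscripts count table, under the assumptions above.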

Please let me know if you have other questions. Thanks.
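
For readers who want a feel for the "median of ratios" idea mentioned in point 2, here is a simplified Python sketch. It is not DESeq2's code and omits details of the real implementation. The function names and the toy count matrix are made up. Each sample's size factor is taken as the median, across features, of that sample's count divided by the feature's geometric mean over all samples. Features with a zero count in any sample are skipped because their geometric mean is zero.

```python
import math
from statistics import median


def size_factors(counts):
    """counts: list of rows (one per feature), each a list of per-sample counts."""
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if any(value <= 0 for value in row):
            continue  # geometric mean would be zero; skip this feature
        log_geo_mean = sum(math.log(value) for value in row) / n_samples
        for j, value in enumerate(row):
            log_ratios[j].append(math.log(value) - log_geo_mean)
    # Median of log ratios per sample, converted back to a ratio.
    return [math.exp(median(ratios)) for ratios in log_ratios]


def normalize(counts, factors):
    """Divide each count by the size factor of its sample."""
    return [[value / factors[j] for j, value in enumerate(row)] for row in counts]


if __name__ == "__main__":
    # Toy matrix: sample 2 is sequenced at twice the depth of sample 1.
    toy = [[10, 20], [100, 200], [5, 10], [0, 3]]
    factors = size_factors(toy)
    print(factors)
    print(normalize(toy, factors))
```

In this toy case the two samples differ only in depth, so the normalized counts of the non-zero features come out equal between samples. If every feature has a zero in some sample, the sketch raises an error because no ratios remain.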

olivertam commented 2 years ago

Hi,

The pre-built index for TEtranscripts is here, while the pre-built index for TElocal is here. They will need to be decompressed before use.

Thanks.