mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0

Prepare the GTF file for wheat (IWGSC RefSeq v1.1) #182

wj0922 closed this issue 2 months ago

wj0922 commented 3 months ago

Hi team, I'm currently analyzing TE expression in wheat using TEtranscripts, but the wheat TE GTF file is not available in your web storage. The wheat version is IWGSC RefSeq v1.1. Could you please guide me on how to prepare the GTF file for wheat? Thank you very much.

olivertam commented 3 months ago

Hi,

Thank you for your interest in the software.

You would need to download the transposable element annotations for your genome build. I did not find one for RefSeq v1.1, but I do see one for RefSeq v2.1. I'm not sure how easy it is to convert coordinates between wheat genome builds.

Once you have the TE annotations, you will need to generate a GTF file (see here for basic format) in which column 9 contains the following fields: gene_id, transcript_id, family_id and class_id. If your TE annotation is a tabular file with one TE copy per row, you could try using our Perl script, makeTEgtf.pl, to generate the TE GTF.

Usage: makeTEgtf.pl -c [chrom column] -s [start column] -e [stop/end column] 
                     -o [strand column] -n [source] -t [TE name column] 
                     (-f [TE family column] -C [TE class column] -1)
                     [INFILE]
 Required parameters:
  -c [chrom column]     -    Column containing chromosome name
  -s [start column]     -    Column containing feature start position
  -e [stop/end column]  -    Column containing feature stop/end position
  -o [strand column]    -    Column containing strand information (+ or -)
  -t [TE name column]   -    Column containing TE name
  [INFILE]              -    File name to be processed into GTF

 Optional parameters:
  -n [source]           -    Source of the TE information 
                             (e.g. mm9_rmsk for RepeatMasker track from
                              mm9 mouse genome)
                             Defaults to "user-provided" if not specified
  -f [TE family column] -    Column containing TE family name. 
                             Defaults to TE name if not specified
  -C [TE class column]  -    Column containing TE class name. 
                             Defaults to TE family name if not specified
  -S [score column]     -    Column containing the score of the TE prediction
                             (e.g. score from RepeatMasker)
  -1                    -    Input coordinates uses 1-based indexing
                             This should be used if the input file uses
                             1-based coordinates. This should be invoked
                             if the genomic coordinates are obtained from
                             a GFF3, GTF, SAM or VCF file
                             Default: off if using BED, BAM or UCSC rmsk
                                      input files
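To make the target format concrete, here is a minimal Python sketch of the same kind of conversion. It is not the makeTEgtf.pl implementation, only an illustration of the column 9 fields and the fallbacks described in the usage text above. The function name `te_rows_to_gtf`, the `exon` feature type, the `_dup<N>` suffix used to keep transcript_id unique per copy, and the example TE names are all assumptions made for illustration.

```python
from collections import defaultdict


def te_rows_to_gtf(rows, source="user-provided", one_based=False):
    """Yield GTF lines from (chrom, start, end, strand, name, family, class) rows.

    Illustrative sketch only; it mirrors the documented fallbacks:
    family falls back to the TE name, class falls back to the family.
    """
    copies = defaultdict(int)  # number of copies seen so far for each TE name
    for chrom, start, end, strand, name, family, te_class in rows:
        family = family or name
        te_class = te_class or family

        # GTF is 1-based and inclusive. BED-style input has a 0-based start,
        # so shift the start by one unless the input is already 1-based.
        gtf_start = int(start) if one_based else int(start) + 1

        # Assumption: make transcript_id unique per genomic copy with a suffix.
        copies[name] += 1
        transcript_id = f"{name}_dup{copies[name]}"

        attributes = (
            f'gene_id "{name}"; transcript_id "{transcript_id}"; '
            f'family_id "{family}"; class_id "{te_class}";'
        )
        # Assumption: "exon" as the feature type, "." for score and frame.
        yield "\t".join(
            [chrom, source, "exon", str(gtf_start), str(int(end)), ".",
             strand, ".", attributes]
        )


if __name__ == "__main__":
    # Hypothetical rows; the TE names and coordinates are made up.
    example_rows = [
        ("chr1A", "999", "2000", "+", "TE_example_1", "Gypsy", "LTR"),
        ("chr1A", "3000", "3500", "-", "TE_example_1", "", ""),
    ]
    for line in te_rows_to_gtf(example_rows, source="wheat_TE"):
        print(line)
```

Passing `one_based=True` corresponds to the situation the `-1` flag covers: coordinates taken from a GFF3, GTF, SAM or VCF file are already 1-based and should not be shifted.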

If you encounter any issues, we'll be happy to try and troubleshoot.

Thanks.

github-actions[bot] commented 2 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days