mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0
generate custom GTF for new species #113

Closed MontseTor closed 2 years ago

MontseTor commented 2 years ago

Hi!! I would like to run TEtranscripts but my species of interest doesn't have a UCSC model yet. I have a low quality genome assembly and a very high quality transcriptome assembly for this species. Could you maybe indicate what would be your recommended protocol to generate the required gene GTF and TE GTF files? Thank you very much in advance!

olivertam commented 2 years ago

Hi,

We typically recommend gene GTF from curated sources (e.g. UCSC, Ensembl, GENCODE). However, if you have a good gene transcriptome assembly that can be mapped onto your genome assembly, you should be able to generate the relevant GTF file (see here for GTF specification).
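
As a rough illustration of what such a file contains, here is a minimal Python sketch that formats exon records as GTF lines (nine tab-separated fields, 1-based inclusive coordinates). It assumes the exon coordinates have already been obtained by aligning the transcriptome to the genome with a splice-aware aligner. The scaffold name, source label, IDs and coordinates are hypothetical placeholders, and the choice of `exon` records carrying `gene_id` and `transcript_id` follows the general GTF convention rather than anything confirmed in this thread.

```python
def gtf_line(chrom, source, feature, start, end, strand, gene_id, transcript_id):
    """Format one GTF record: 9 tab-separated fields, 1-based inclusive coordinates."""
    # Attribute column: key "value"; pairs, as in the GTF specification
    attributes = f'gene_id "{gene_id}"; transcript_id "{transcript_id}";'
    # Score and frame are left as "." because they are unknown here
    fields = [chrom, source, feature, str(start), str(end), ".", strand, ".", attributes]
    return "\t".join(fields)


# Hypothetical two-exon transcript placed on a scaffold (made-up coordinates)
exons = [(1201, 1350), (1800, 2100)]
for start, end in exons:
    print(gtf_line("scaffold_12", "my_assembly", "exon", start, end,
                   "+", "GENE0001", "GENE0001.t1"))
```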

For TEs, you will need the copies of the transposable elements annotated on your genome build. This can be done with a variety of software (e.g. RepeatMasker), which will give you the genomic coordinates of predicted TE copies. Once you have that, you can then use the following script to convert it into a TE GTF that works with TEtranscripts.

Usage: makeTEgtf.pl -c [chrom column] -s [start column] -e [stop/end column] 
                     -o [strand column] -n [source] -t [TE name column] 
                     (-f [TE family column] -C [TE class column] -1)
                     [INFILE]

Output is printed to STDOUT

 Required parameters:
  -c [chrom column]     -    Column containing chromosome name
  -s [start column]     -    Column containing feature start position
  -e [stop/end column]  -    Column containing feature stop/end position
  -o [strand column]    -    Column containing strand information (+ or -)
  -t [TE name column]   -    Column containing TE name
  [INFILE]              -    File name to be processed into GTF

 Optional parameters:
  -n [source]           -    Source of the TE information 
                             (e.g. mm9_rmsk for RepeatMasker track from
                              mm9 mouse genome)
                             Defaults to "user-provided" if not specified
  -f [TE family column] -    Column containing TE family name. 
                             Defaults to TE name if not specified
  -C [TE class column]  -    Column containing TE class name. 
                             Defaults to TE family name if not specified
  -S [score column]     -    Column containing the score of the TE prediction
                             (e.g. score from RepeatMasker)
  -1                    -    Input coordinates uses 1-based indexing
                             This should be used if the input file uses
                             1-based coordinates. This should be invoked
                             if the genomic coordinates are obtained from
                             a GFF3, GTF, SAM or VCF file
                             Default: off if using BED, BAM or UCSC rmsk
                                      input files
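
If the TE annotation comes from a RepeatMasker `.out` file, it first has to be turned into a plain column table that the script can index. The following is a minimal Python sketch of that step, not part of TEtranscripts. It assumes the standard whitespace-separated `.out` layout (score, divergence, deletion and insertion percentages, query sequence, begin, end, left, strand, repeat name, class/family, ...). The function name, the sample lines and the repeat names in them are made up for illustration.

```python
def rmsk_out_to_table(lines):
    """Yield tab-delimited rows: chrom, start, end, strand, name, family, class, score."""
    for line in lines:
        fields = line.split()
        # Skip blank lines and header lines (data rows start with an integer score)
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        score, chrom, start, end = fields[0], fields[4], fields[5], fields[6]
        # RepeatMasker marks the complementary strand with "C"
        strand = "-" if fields[8] == "C" else "+"
        name = fields[9]
        # Class/family is written as e.g. "LINE/L1"; without "/" the class is reused
        te_class, _, te_family = fields[10].partition("/")
        if not te_family:
            te_family = te_class
        yield "\t".join([chrom, start, end, strand, name, te_family, te_class, score])


# Hypothetical excerpt of a RepeatMasker .out file (values are illustrative only)
sample = """\
   SW   perc perc perc  query     position in query    matching  repeat       position in repeat
score   div. del. ins.  sequence  begin end   (left)   repeat    class/family begin  end    (left)  ID

  463   1.3  0.6  1.7  scaffold_1  1001  1468 (98532) +  L1_EX     LINE/L1       1   468  (5632)  1
  239  29.4  1.9  1.0  scaffold_1  2010  2301 (97699) C  MIR_EX    SINE/MIR   (143)  119     2    2
"""

for row in rmsk_out_to_table(sample.splitlines()):
    print(row)
```

With the rows saved to a file (here the hypothetical `te_table.tsv`), the conversion would then look something like `makeTEgtf.pl -c 1 -s 2 -e 3 -o 4 -t 5 -f 6 -C 7 -S 8 -1 te_table.tsv > species_TE.gtf`. This assumes column numbers are counted from 1, which is worth checking against the script itself. The `-1` flag is included because RepeatMasker `.out` coordinates are 1-based.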

Thanks.

MontseTor commented 2 years ago

Thank you so much for your fast response. This is also what I had in mind, just wanted to double check there wasn't something more specific. Great!
