mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0
206 stars 29 forks source link

Making TE.gtf using Genome.fa.out file made by RepeatMasker? #83

Closed wangshun1121 closed 1 year ago

wangshun1121 commented 3 years ago

I called TE elements using RepeatModeler and RepeatMasker from genome of Phaeodactylum tricornutum and here is the out file by RepeatMasker:

Phaeodactylum_tricornutum.ASM15095v2.dna_sm.toplevel.fa.out.gz

Can we make TE.gtf file using such out files? Are there any demo?

olivertam commented 3 years ago

Hi,

Thanks for your interest in the software. Your file is actually quite amenable to conversion into a TE GTF. We use this script to do the conversion:

 Usage: makeTEgtf.pl -c [chrom column] -s [start column] -e [stop/end column] 
                     -o [strand column] -n [source] -t [TE name column] 
                     (-f [TE family column] -C [TE class column] -1)
                     [INFILE]

 Output is printed to STDOUT

 Required parameters:
  -c [chrom column]     -    Column containing chromosome name
  -s [start column]     -    Column containing feature start position
  -e [stop/end column]  -    Column containing feature stop/end position
  -o [strand column]    -    Column containing strand information (+ or -)
  -t [TE name column]   -    Column containing TE name
  [INFILE]              -    File name to be processed into GTF

 Optional parameters:
  -n [source]           -    Source of the TE information 
                             (e.g. mm9_rmsk for RepeatMasker track from
                              mm9 mouse genome)
                             Defaults to "user-provided" if not specified
  -f [TE family column] -    Column containing TE family name. 
                             Defaults to TE name if not specified
  -C [TE class column]  -    Column containing TE class name. 
                             Defaults to TE family name if not specified
  -S [score column]     -    Column containing the score of the TE prediction
                             (e.g. score from RepeatMasker)
  -1                    -    Input coordinates uses 1-based indexing
                             This should be used if the input file uses
                             1-based coordinates. This should be invoked
                             if the genomic coordinates are obtained from
                             a GFF3, GTF, SAM or VCF file
                             Default: off if using BED, BAM or UCSC rmsk
                                      input files

There are a couple of things you might need to consider/change in your file before running the script:

  1. Convert all the C in the orientation column to -, since that information is utilized.
  2. Make the columns tab-delimted.
  3. Add a # on the header lines, or remove them for the script.
  4. In some of your hits (e.g. element id 2 in your file), the program is predicting that your element was genomically split. How do you want to deal with those? You could treat them as independent "exons", but it's unclear if they are actually being spliced (and thus you would not assign reads originating from the region in between them). Alternatively, you would take the entire region as one "insertion", though you would need to then merge the two entries into one line (and correspondingly change the start and stop of that prediction).
  5. Currently you have the class and family information in the same column. That would work fine, but you could also split them up (and in cases where there isn't a corresponding family ID, just use the class ID.

If you make changes 1 to 3 as listed above, then I might try running the script with the following parameters:

perl makeTEGTF.pl -c 5 -s 6 -e 7 -o 9 -t 10 -f 11 -S 1 [your altered output] > [TE GTF]

I've attached a version of your file that should work with the above command line.

Note that the script will remove most simple repetitive sequences and short non-coding RNA (e.g. tRNA) Please let us know if you encounter any issues.

Thanks.

wangshun1121 commented 3 years ago

Well, I modified your makeTEGTF.pl slightly, and mine is here: makeTEGTF.pl.gz .

I correct your codes, firstly using multi-space rather than tab as splitter, than if($tmp[$strand] eq 'C'){$tmp[$strand]='-';}. Using same command line as your instruction and ignore warnings, and gtf file will be exported from stdout:

perl makeTEGTF.pl -c 5 -s 6 -e 7 -o 9 -t 10 -f 11 -S 1 Genome.RepeatMasker.out 1>RM.gtf 2>stderr
olivertam commented 3 years ago

Sounds good.

Let us know if you encounter any issues.

Thanks.

annsophiegironne commented 1 year ago

Hi @olivertam !

I'm trying to run the makeTEgtf.pl script but I keep getting this error message on every line: "Strand information on line 11606 "+" is not recognized. Only "+", "-" or "C" are accepted". However, this is what my fa.out file looks like:

score   chr     begin   end     strand  repeat  class   family
35      GL000008.2      107     157     +       AluYk3  SINE    Alu
24      GL000008.2      444     507     +       AluY    SINE    Alu
47      GL000008.2      597     663     +       Alu     SINE    Alu
25      GL000008.2      857     914     +       Alu     SINE    Alu
44      GL000008.2      1039    1102    +       AluJb   SINE    Alu
22      GL000008.2      1347    1406    +       AluYk2  SINE    Alu
47      GL000008.2      1807    1874    +       Alu     SINE    Alu
32      GL000008.2      2070    2122    +       AluSg   SINE    Alu
24      GL000008.2      2490    2552    +       AluYh9  SINE    Alu

Any idea why it doesn't work? Thanks a lot!

olivertam commented 1 year ago

Hi,

Thank you for your interest in the software. What is the command line that you're using with the makeTEgtf.pl script? Also, could you share your fa.out file with me, or show me what line 11606 looks like? You can use the following command:

sed -n '11605,11607p' fa.out

Thanks.

annsophiegironne commented 1 year ago

Hi,

I used this command line: perl gff_gtf/makeTEgtf.pl -c 2 -s 3 -e 4 -o 5 -n RepeatMasker -t 6 -f 8 -c 7 -S 1 -1 gff_gtf/RepeatMasker.clean.fa.out > gff_gtf/TE.gtf

I used the command provided and the output looks like this:

35      KI270745.1      39672   39745   +       MLT2B5  LTR     ERVL
33      KI270745.1      39765   39845   +       MLT2B5  LTR     ERVL
29      KI270745.1      39858   39946   +       MLT2B5  LTR     ERVL

\\ I was going to send the reply but then I tried re-running it again just in case and the error changed to 'Strand information on line 13187 () is not recognized. Only "+", "-" or "C" are accepted. Line 13187 is skipped' and when I look up this line, I get this: * NA I tried deleting all lines containing 'NA' and the script worked perfectly. So I guess the problem solved itself! Thanks for the help and for the script!

olivertam commented 1 year ago

Glad to hear that the error has been resolved.

All the best. Thanks.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days