mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0

S. cerevisiae TE annotation file #27

Closed. lfra closed this issue 6 years ago

lfra commented 6 years ago

Hello,

I would like to perform a TEtoolkit analysis on RNA-seq data from S. cerevisiae, but could not find a TE annotation file for this organism among those that you provide. Do you have any suggestions on how I could build one myself, and from which database? As far as I am aware, there is no RepeatMasker resource for yeast as neat as the ones for the human or mouse genome.

The only "comprehensive" resource that I found is the following: https://www.yeastgenome.org/reference/S000072465 but I am unsure how to extract this into an annotation file.

I would really appreciate your help. Thank you in advance.

olivertam commented 6 years ago

Hi,

I took a quick look at the website, and found the other features folder. If you download the other_features_genomic.fasta.gz file, it has the genomic location of the repetitive elements in the sequence name (based on the R64-2-1 release). They appear to match the naming used in the publication that you provided.
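As a minimal sketch for inspecting those sequence names, the following Python lists the header lines of the gzipped FASTA file. The function name `read_fasta_headers` is hypothetical, and the sketch makes no assumption about how the fields inside each header are laid out; it only prints them so you can check where the location and TE name appear.

```python
import gzip


def read_fasta_headers(path):
    """Yield each FASTA header line (without the leading '>') from a gzipped file."""
    # "rt" opens the compressed file in text mode so lines are str, not bytes.
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                yield line[1:].rstrip("\n")


if __name__ == "__main__":
    import os

    # File name taken from the thread; adjust the path to wherever you saved it.
    fasta_path = "other_features_genomic.fasta.gz"
    if os.path.exists(fasta_path):
        for header in read_fasta_headers(fasta_path):
            print(header)
```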

What you can do is extract the genomic location (e.g. I:707-776) and the TE name (e.g. ARS102), determine which family or class of transposable element it belongs to (e.g. ARS; you can use the TE name for this), and generate a GTF file with this information. The only thing that I cannot easily determine is the genomic strand. I'm assuming that the file shows the "+" strand, but that doesn't mean that the TE is on that strand. I would recommend finding the sequence of one of the TEs and comparing it with the sequence in the FASTA file to see whether it is the same or the reverse complement.
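The strand check suggested above could be sketched in Python as follows. The function names are hypothetical, and the sketch rests on the same assumption as the comment: that the FASTA file shows the "+" strand, so a known TE sequence that matches it directly is on "+", and one that matches only after reverse complementing is on "-".

```python
# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def infer_strand(fasta_seq, known_te_seq):
    """Compare a FASTA record with an independently obtained TE sequence.

    Returns "+" if they are identical, "-" if the FASTA record equals the
    reverse complement of the known sequence, and None if neither matches.
    Assumes the FASTA record shows the "+" genomic strand.
    """
    fasta_upper = fasta_seq.upper()
    known_upper = known_te_seq.upper()
    if fasta_upper == known_upper:
        return "+"
    if fasta_upper == reverse_complement(known_upper):
        return "-"
    return None
```

A sequence that equals its own reverse complement is reported as "+" here, and an exact match is required, so a `None` result may only mean that the two sequences differ slightly or cover different coordinates.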

To generate the GTF file, follow the guidelines [here](http://genome.ucsc.edu/FAQ/FAQformat.html#format4). The feature (column 3) should be "exon". The essential attributes in the info field (column 9) are "gene_id", "transcript_id", "family_id" and "class_id". Ensure that the "transcript_id" is unique if possible. Also make sure that the chromosome name matches the reference sequences that you are aligning to. Here is an example line (the columns are tab-separated in the actual file):

I SGB exon 707 776 . + . gene_id "ARS102"; transcript_id "ARS102"; family_id "ARS"; class_id "ARS"
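As a rough sketch of that step, the Python below turns a location string such as I:707-776 plus the TE name, family and class into one tab-separated GTF line. The helper names `parse_location` and `gtf_line` are hypothetical, the source column defaults to the value used in the example line, and the strand defaults to "+" only as a placeholder until it has been verified. The trailing semicolon after the last attribute follows the GTF guidelines linked above.

```python
def parse_location(location):
    """Split a location such as 'I:707-776' into ('I', 707, 776)."""
    chrom, _, span = location.partition(":")
    start_text, _, end_text = span.partition("-")
    start, end = int(start_text), int(end_text)
    # GTF requires start <= end, so swap if the coordinates are listed reversed.
    if start > end:
        start, end = end, start
    return chrom, start, end


def gtf_line(location, te_name, family_id, class_id,
             strand="+", source="SGB", transcript_id=None):
    """Build one GTF line with the four attributes named in the thread."""
    chrom, start, end = parse_location(location)
    # Fall back to the TE name; pass a distinct transcript_id if names repeat.
    if transcript_id is None:
        transcript_id = te_name
    attributes = (
        f'gene_id "{te_name}"; transcript_id "{transcript_id}"; '
        f'family_id "{family_id}"; class_id "{class_id}";'
    )
    fields = [chrom, source, "exon", str(start), str(end),
              ".", strand, ".", attributes]
    return "\t".join(fields)


if __name__ == "__main__":
    print(gtf_line("I:707-776", "ARS102", "ARS", "ARS"))
```

The chromosome name is passed through unchanged, so if your alignment reference uses different names (for example with a "chr" prefix), rename it before writing the file.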

Please let me know if there are any questions. Thanks.

Cheers, Oliver

lfra commented 6 years ago

Thank you very much! I will give it a try.