Using a GFF3 for alignment

mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.

http://hammelllab.labsites.cshl.edu/software/#TEtranscripts

GNU General Public License v3.0

206 stars 29 forks source link

Using a GFF3 for alignment #68

Closed r-mashoodh closed 4 years ago

r-mashoodh commented 4 years ago

I had 2 questions about TEtranscripts ...

Is passing a genome GTF during alignment necessary? I have made my reference using a GFF3 and passed the GFF3 (with the the --sjdbGTFtagExonParentTranscript Parent flag) during alignment fails.
For the TE GTF is it sufficient to have a single field in the last column: eg. transcript_id "someUniqueID" or does it need class/family etc?
For --sjdbOverhang 100 is this determined by read length?

Thank you!

olivertam commented 4 years ago

Hi,

TEtranscripts expects the gene and TE annotations are in the GTF format. However, it is unclear from your question whether you are using the GFF3 for STAR alignment (which should not be required if you had built your genome with it), or using it for TEtranscripts.
The TE GTF expects the following fields: gene_id (which is what is used for quantification), transcript_id (which just needs to be a unique ID), class_id and family_id. For the latter two, you can use the gene_id value if you don't have that additional information. Their values are appended to the annotation in the output (e.g. SVA_A:SVA:Retroposons)
I would refer you to this to answer your --sjdbOverhang question.

Thanks.

r-mashoodh commented 4 years ago

Thanks so much for your response. I have another question about the GTF -- how does it cope with having potentially overlapping coordinates? The reason I ask is that I'm trying to merge RepeatModeler and RepeatMasker outputs (for a non-model organism).

What would be your advice for curating a final GTF?

Thank you again.

olivertam commented 4 years ago

Hi,

TEtranscripts can handle overlapping coordinates in the GTF file, as this is common in gene annotation files (due to the numerous alternate transcripts). However, this will have an impact on your quantification. If you have two independent annotations (e.g. one from RepeatModeler and one from RepeatMasker) that are pointing to the same genomic location, but have different identifier (gene_id), then any reads that are aligned to that region will have its counts distributed to the two annotations. This means that you might be splitting your counts to two different entries/lines in your output, when they should be combined. My suggestion for the merge would be to eliminate entries that exactly overlap each other, and are derived from the same or similar consensus sequence used for modelling. You can certainly "relax" the overlap to 95% or 90% (this is completely up to you), with the goal of minimizing unnecessary overlap of annotations (although the definition of "unnecessary" would be dependent on the researcher).

Please let me know if you have other questions. Thanks.

r-mashoodh commented 4 years ago

Sorry for all the questions ... but I'm wondering what prompts this error:

TE inconsistency! 26253..26521#MITE/DTA:NW_017095469.1:26253..26521#MITE/DTA:NW_017095469.1:26253..26521#MITE/DTA:MITE_DTA
Error: 1
[Exception type: SystemExit, raised in TEindex.py:384]

I have a rather strange looking GTF based on merging from various LTR finders (file here).

olivertam commented 4 years ago

Hi, Thanks for uploading a portion of the file. I'll take a look at it.

olivertam commented 4 years ago

Hi,

I took a look at the file, and found the following line (line 61) that probably led to your issue:

NW_017095469.1  nves_edta       exon    26253   26521   .       -       .       
gene_id "NW_017095469.1:26253..26521#MITE/DTA"; transcript_id "NW_017095469.1:26
253..26521#MITE/DTA"; family_id "NW_017095469.1:26253..26521#MITE/DTA"; class_id
 "MITE_DTA";

The gene_id is very long, and has a # in it. I feel like this would cause many issues down the line with the annotation. I would almost suggest trimming the name in some way, since the current id has the chromosome name, start and stop, which is probably duplicating the information on the rest of the line. One suggestion is to have the gene_id as MITE_DTA, transcript_id as MITE_DTA_dup1, the family_id as MITE_DTA and the class_id as DNA_transposon or ClassII_transposon. There are also other lines that have this nomenclature style, and thus might also contribute to the error. Please let me know if you have other questions. Thanks.

r-mashoodh commented 4 years ago

Great. Thanks. I got it working!