Closed r-mashoodh closed 4 years ago
Hi,
TEtranscripts
expects the gene and TE annotations are in the GTF format. However, it is unclear from your question whether you are using the GFF3 for STAR
alignment (which should not be required if you had built your genome with it), or using it for TEtranscripts
.gene_id
(which is what is used for quantification), transcript_id
(which just needs to be a unique ID), class_id
and family_id
. For the latter two, you can use the gene_id
value if you don't have that additional information. Their values are appended to the annotation in the output (e.g. SVA_A:SVA:Retroposons
)--sjdbOverhang
question.Thanks.
Thanks so much for your response. I have another question about the GTF -- how does it cope with having potentially overlapping coordinates? The reason I ask is that I'm trying to merge RepeatModeler and RepeatMasker outputs (for a non-model organism).
What would be your advice for curating a final GTF?
Thank you again.
Hi,
TEtranscripts can handle overlapping coordinates in the GTF file, as this is common in gene annotation files (due to the numerous alternate transcripts).
However, this will have an impact on your quantification. If you have two independent annotations (e.g. one from RepeatModeler and one from RepeatMasker) that are pointing to the same genomic location, but have different identifier (gene_id
), then any reads that are aligned to that region will have its counts distributed to the two annotations. This means that you might be splitting your counts to two different entries/lines in your output, when they should be combined.
My suggestion for the merge would be to eliminate entries that exactly overlap each other, and are derived from the same or similar consensus sequence used for modelling. You can certainly "relax" the overlap to 95% or 90% (this is completely up to you), with the goal of minimizing unnecessary overlap of annotations (although the definition of "unnecessary" would be dependent on the researcher).
Please let me know if you have other questions. Thanks.
Sorry for all the questions ... but I'm wondering what prompts this error:
TE inconsistency! 26253..26521#MITE/DTA:NW_017095469.1:26253..26521#MITE/DTA:NW_017095469.1:26253..26521#MITE/DTA:MITE_DTA
Error: 1
[Exception type: SystemExit, raised in TEindex.py:384]
I have a rather strange looking GTF based on merging from various LTR finders (file here).
Hi, Thanks for uploading a portion of the file. I'll take a look at it.
Hi,
I took a look at the file, and found the following line (line 61) that probably led to your issue:
NW_017095469.1 nves_edta exon 26253 26521 . - .
gene_id "NW_017095469.1:26253..26521#MITE/DTA"; transcript_id "NW_017095469.1:26
253..26521#MITE/DTA"; family_id "NW_017095469.1:26253..26521#MITE/DTA"; class_id
"MITE_DTA";
The gene_id is very long, and has a #
in it. I feel like this would cause many issues down the line with the annotation. I would almost suggest trimming the name in some way, since the current id has the chromosome name, start and stop, which is probably duplicating the information on the rest of the line.
One suggestion is to have the gene_id as MITE_DTA
, transcript_id as MITE_DTA_dup1
, the family_id as MITE_DTA
and the class_id as DNA_transposon
or ClassII_transposon
.
There are also other lines that have this nomenclature style, and thus might also contribute to the error.
Please let me know if you have other questions.
Thanks.
Great. Thanks. I got it working!
I had 2 questions about TEtranscripts ...
Is passing a genome GTF during alignment necessary? I have made my reference using a GFF3 and passed the GFF3 (with the the
--sjdbGTFtagExonParentTranscript Parent
flag) during alignment fails.For the TE GTF is it sufficient to have a single field in the last column: eg. transcript_id "someUniqueID" or does it need class/family etc?
For
--sjdbOverhang 100
is this determined by read length?Thank you!