TE_rmsk gtf file has multiple family_id or class_id for gene_id items

mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.

http://hammelllab.labsites.cshl.edu/software/#TEtranscripts

GNU General Public License v3.0

217 stars 29 forks source link

TE_rmsk gtf file has multiple family_id or class_id for gene_id items #59

Closed SolKatzman closed 4 years ago

SolKatzman commented 4 years ago

This is a nitpick.

The TE_rmsk.gtf files for the few assemblies that I have looked at (hg38, panTro6, ponAbe3) have cases where a single gene_id has more than one family_id. I assume that this is not expected (or desirable).

I don't know how these GTFs are built or where the problem originates -- quite possibly in the UCSC browser tracks, in which case we should complain to them. The net result is that TEcount has multiple lines in the cntTable output file for the same item. These could be post-processed to collapse them, but it is probably better to handle this as far upstream as possible.

Here are the 6 cases that I found in hg38_TE_rmsk, which have between about 1% and 10% of the "minority" family:

11 MER105 DNA DNA 715 MER105 hAT-Charlie DNA

10 MER121 TcMar DNA 910 MER121 hAT DNA

16 MER96 hAT DNA 1272 MER96 hAT-Tip100 DNA

121 MER96B hAT DNA 2112 MER96B hAT-Tip100 DNA

206 MamRep137 TcMar DNA 2154 MamRep137 TcMar-Tigger DNA

144 MamRep4096 hAT DNA 1088 MamRep4096 hAT-Tip100 DNA

/Sol.

olivertam commented 4 years ago

Hi Sol,

Thanks for your observation. I can confirm that this is introduced from the UCSC RepeatMasker track for hg38. What is really interesting is that this issue does not exist for hg19, and in fact, the "minority" annotations are the ones used in hg19. Furthermore, checking the latest Repbase that I have access to (r24.03), the "majority" annotations for hg38 matches that, and not the hg19 annotations. My (purely speculative) suspicion is that they ran RepeatMasker with newer RepBase on the hg38 genome, but also did a liftover from hg19 to hg38 and probably crosschecking that entries from hg19 were also present in hg38. Thus, my theory is that the "minority" annotations were entries that, after liftover, were not present in the new hg38 RepeatMasker run. Since they probably didn't expect the Repbase TE family annotation to change between releases, they might not have confirmed that the family ID stayed the same before combining the two files. I can't say that this is also the case for the non-human primates, as I haven't looked into them as closely. I'll probably have to parse through each GTF to see how widespread this is, and could potentially change them all to reflect the "majority" annotation (or maybe the RepBase annotation).

Thanks again for letting me know.

SolKatzman commented 4 years ago

Dear Oliver,

Sounds like a reasonable hypothesis. Attached are the stats that I extracted for the panTro6 and ponAbe3 TE_rmsk files. They are quite similar to each other, have more such elements than hg38, but several of those have only a tiny handful of minority family IDs.

Sol.

panTro6_rmsk.gtf.gene_id.multi.family.class.stats.txt

ponAbe3_rmsk.gtf.gene_id.multi.family.class.stats.txt

olivertam commented 4 years ago

Hi Sol,

Thanks again for your help. This matches with what I found as well. This might be a recurring issue, as I found similar occurrences in older builds of other genomes (e.g. mm9 and panTro4). Interestingly, these are different to the ones found in the later builds (e.g. mm10 and panTro6), which suggests that the discrepant TE family annotations are not simply perpetuate to newer genome build. I wonder if these are genomic regions that were difficult to liftOver. I noticed at least 5 of minority MER121 in the hg38 genome to be potentially originating from a single MER121 in hg19 that was split in two entries after the liftOver (with a 100 bp gap in between). Anyway, I'm currently in the process of going through the affected GTF and renaming the family ID to the "majority" one. I will let you know once they are done. Thanks.

olivertam commented 4 years ago

Hi So,

The renaming is done, and the GTF (and indices) should no longer have the "minority" TE family annotations. Please let me know if you have any issues with the new files.

Thanks.

SolKatzman commented 4 years ago

Dear Oliver,

Thanks for the quick response.

I have contacted some relevant UCSC folks (you are cc'd), pointing them to this github issue.

So it may get corrected eventually at the browser database(s).

/Sol.

SolKatzman commented 4 years ago

Dear Oliver,

The hg38 TE_rmsk.gtf files now look fine.

Unfortunately, in the panTro6 and ponAbe3 files, there are 5 cases each where there are 2 class_id values for the same gene_id ("DNA" and "DNA?"). Sorry to bug you again. I modified the title of this issue to include class_id.

Attached are the 2 files with these cases.

panTro6_rmsk.gtf.gene_id.multi.family.class.stats.txt

ponAbe3_rmsk.gtf.gene_id.multi.family.class.stats.txt

olivertam commented 4 years ago

Hi Sol,

Thanks for finding that. This is the first time I've seen them use the "?" notation in the class ID (they use it frequently in the family ID, and the script that generates the TE GTF file removes them from the family ID as part of the process). I'll remake those GTF files now.

Thanks.

olivertam commented 4 years ago

Hi Sol,

The two GTF (and their indices) should now be fixed.

Thanks.

SolKatzman commented 4 years ago

Dear Oliver,

Looks good. Thanks for all the effort.

/Sol.