Closed SolKatzman closed 4 years ago
Hi Sol,
Thanks for your observation. I can confirm that this is introduced from the UCSC RepeatMasker track for hg38. What is really interesting is that this issue does not exist for hg19, and in fact, the "minority" annotations are the ones used in hg19. Furthermore, checking the latest Repbase that I have access to (r24.03), the "majority" annotations for hg38 matches that, and not the hg19 annotations. My (purely speculative) suspicion is that they ran RepeatMasker with newer RepBase on the hg38 genome, but also did a liftover from hg19 to hg38 and probably crosschecking that entries from hg19 were also present in hg38. Thus, my theory is that the "minority" annotations were entries that, after liftover, were not present in the new hg38 RepeatMasker run. Since they probably didn't expect the Repbase TE family annotation to change between releases, they might not have confirmed that the family ID stayed the same before combining the two files. I can't say that this is also the case for the non-human primates, as I haven't looked into them as closely. I'll probably have to parse through each GTF to see how widespread this is, and could potentially change them all to reflect the "majority" annotation (or maybe the RepBase annotation).
Thanks again for letting me know.
Dear Oliver,
Sounds like a reasonable hypothesis. Attached are the stats that I extracted for the panTro6 and ponAbe3 TE_rmsk files. They are quite similar to each other, have more such elements than hg38, but several of those have only a tiny handful of minority family IDs.
Sol.
Hi Sol,
Thanks again for your help. This matches with what I found as well. This might be a recurring issue, as I found similar occurrences in older builds of other genomes (e.g. mm9 and panTro4). Interestingly, these are different to the ones found in the later builds (e.g. mm10 and panTro6), which suggests that the discrepant TE family annotations are not simply perpetuate to newer genome build. I wonder if these are genomic regions that were difficult to liftOver. I noticed at least 5 of minority MER121 in the hg38 genome to be potentially originating from a single MER121 in hg19 that was split in two entries after the liftOver (with a 100 bp gap in between). Anyway, I'm currently in the process of going through the affected GTF and renaming the family ID to the "majority" one. I will let you know once they are done. Thanks.
Hi So,
The renaming is done, and the GTF (and indices) should no longer have the "minority" TE family annotations. Please let me know if you have any issues with the new files.
Thanks.
Dear Oliver,
Thanks for the quick response.
I have contacted some relevant UCSC folks (you are cc'd), pointing them to this github issue.
So it may get corrected eventually at the browser database(s).
/Sol.
Dear Oliver,
The hg38 TE_rmsk.gtf files now look fine.
Unfortunately, in the panTro6 and ponAbe3 files, there are 5 cases each where there are 2 class_id values for the same gene_id ("DNA" and "DNA?"). Sorry to bug you again. I modified the title of this issue to include class_id.
Attached are the 2 files with these cases.
Hi Sol,
Thanks for finding that. This is the first time I've seen them use the "?" notation in the class ID (they use it frequently in the family ID, and the script that generates the TE GTF file removes them from the family ID as part of the process). I'll remake those GTF files now.
Thanks.
Hi Sol,
The two GTF (and their indices) should now be fixed.
Thanks.
Dear Oliver,
Looks good. Thanks for all the effort.
/Sol.
This is a nitpick.
The TE_rmsk.gtf files for the few assemblies that I have looked at (hg38, panTro6, ponAbe3) have cases where a single gene_id has more than one family_id. I assume that this is not expected (or desirable).
I don't know how these GTFs are built or where the problem originates -- quite possibly in the UCSC browser tracks, in which case we should complain to them. The net result is that TEcount has multiple lines in the cntTable output file for the same item. These could be post-processed to collapse them, but it is probably better to handle this as far upstream as possible.
Here are the 6 cases that I found in hg38_TE_rmsk, which have between about 1% and 10% of the "minority" family:
11 MER105 DNA DNA
715 MER105 hAT-Charlie DNA
10 MER121 TcMar DNA
910 MER121 hAT DNA
16 MER96 hAT DNA
1272 MER96 hAT-Tip100 DNA
121 MER96B hAT DNA
2112 MER96B hAT-Tip100 DNA
206 MamRep137 TcMar DNA
2154 MamRep137 TcMar-Tigger DNA
144 MamRep4096 hAT DNA
1088 MamRep4096 hAT-Tip100 DNA
/Sol.