marbl / CHM13

The complete sequence of a human genome

99 new genes? #62

Closed renjieshang closed 1 year ago

renjieshang commented 2 years ago

Dear T2T team,

I need help finding the 99 "novel" protein-coding genes reported in the Science paper (Vol 376, Issue 6588, pp. 44-53). Quote:

"1956 of the genes exclusive to CHM13 (99 protein coding) are in regions with no primary alignment to GRCh38 (table S11)."

However, in table S11, it appears that those 99 genes (biotype: protein coding; novel region: 1) already existed in GRCh38 and were annotated. For instance, the NPIPB15 annotated in CHM13 (Gene ID: LOFF_G0001213) did exist in GRCh38 and was under the same gene name.

Can you please help clarify if my search or understanding of the paper is right? And most importantly, can you provide the exact coding sequences for those 99 novel genes? Thank you. My email is renjieshang@uga.edu

skoren commented 2 years ago

The novel protein-coding genes are paralogs of genes on GRCh38. It is not the genes that have no alignment, but the regions where they are placed on CHM13. You can see the identity to the nearest GENCODE gene in table S13. Some genes annotated in CHM13 from IsoSeq don't have a biotype, so we don't know whether they are protein coding; they may be, but they wouldn't be in the list of 99.

The protein-coding transcripts are available, in AA space, at https://hgdownload.soe.ucsc.edu/hubs/GCA/009/914/755/GCA_009914755.4/genes/catLiftOffGenesV1.protein.fa.gz (note that this file is organized by transcript, so you have to search using the IDs in Table S12). There is also the GFF3 file, which can be used to get the exact transcript coordinates and to translate gene IDs to transcript IDs. For example, your gene ID (LOFF_G0001213) translates to LOFF_T0001548, which is in the catLiftOffGenesV1.protein.fa.gz file mentioned above.
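The lookup described above (gene ID to transcript ID via the GFF3, then transcript ID to protein sequence via the FASTA) could be sketched roughly as follows. This is a minimal illustration, not the T2T team's own tooling. It assumes that transcript rows in the GFF3 carry `ID=<transcript>` and `Parent=<gene>` attributes and that FASTA headers begin with the transcript ID; neither has been checked against the real files, so the attribute names may need adjusting. The function names, the `chrN` seqid, the coordinates, and the protein sequence in the toy data are all placeholders; only the two LOFF IDs come from this thread.

```python
import gzip
from collections import defaultdict


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def parse_attributes(field):
    """Parse a GFF3 column-9 string such as 'ID=x;Parent=y' into a dict."""
    attrs = {}
    for pair in field.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def gene_to_transcripts(gff_lines, feature_types=("transcript", "mRNA")):
    """Map gene ID -> list of transcript IDs.

    Assumption: transcript rows carry ID=<transcript> and Parent=<gene>.
    """
    mapping = defaultdict(list)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and GFF3 comments/directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] not in feature_types:
            continue
        attrs = parse_attributes(cols[8])
        if "ID" in attrs and "Parent" in attrs:
            mapping[attrs["Parent"]].append(attrs["ID"])
    return dict(mapping)


def read_fasta(lines):
    """Read FASTA records into {first word of header: sequence}."""
    records, name, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            name, chunks = line[1:].split()[0], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


# Toy stand-ins for the real files; seqid, coordinates and sequence are made up.
toy_gff = [
    "##gff-version 3\n",
    "chrN\tCAT\tgene\t1\t900\t.\t+\t.\tID=LOFF_G0001213\n",
    "chrN\tCAT\ttranscript\t1\t900\t.\t+\t.\tID=LOFF_T0001548;Parent=LOFF_G0001213\n",
]
toy_fasta = [">LOFF_T0001548 toy record\n", "MAAA\n", "KKK\n"]

transcripts = gene_to_transcripts(toy_gff)
proteins = read_fasta(toy_fasta)
for tx in transcripts.get("LOFF_G0001213", []):
    print(tx, proteins.get(tx, "<not in FASTA>"))
```

With the real downloads, the toy lists would be replaced by `open_text(...)` handles on the GFF3 and on catLiftOffGenesV1.protein.fa.gz, and the loop would run over the gene IDs taken from the supplementary table.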