higlass / higlass-transcripts

Gene transcripts track for HiGlass
MIT License

Codon blocks are not always three bases long #5

alexpreynolds closed this issue 3 years ago

alexpreynolds commented 3 years ago

For this issue, I am using hg38 sequence data (https://aveit.s3.amazonaws.com/higlass/data/sequence/hg38.fa), working from the higlass-transcripts example viewconf object.

Around the region chr3:71015547-71015554, the amino acid blocks do not appear to fall on three-base increments, as would be expected for codons:

[Screenshot: screencapture-epilogos-altius-org-3001-1602004032838]

As seen in the snapshot above, the lysine residue appears to span two bases instead of three.

At the other end of this exon block, around chr3:71015649-71015655, the first codon (serine) is one base long:

[Screenshot: screencapture-epilogos-altius-org-3001-1602004843978]

It seems as if the start and end coordinates of the codon blocks might not be calculated correctly, or might not be derived correctly from the underlying sequence.

This also appears to affect forward-strand exons (e.g., around chr3:85961639-85961649). I ran a samtools faidx query on the sequence data at the position of the single-base codon block:

$ samtools faidx hg38.fa chr3:85961647-85961647
>chr3:85961647-85961647
G

It's not clear how this is being mapped to aspartic acid (D/Asp).

alexander-veit commented 3 years ago

Exons don't always start and end with a full codon. A codon can be split across an exon boundary, with an intron in between. Let's consider CADM2-203 in your last example. We have chr3:85961647-85961647 -> G. The next exon has chr3:85979164-85979165 -> AT. Spliced together, this is GAT, which codes for aspartic acid. Does that make sense?
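
To make the split-codon case concrete, here is a minimal sketch in Python. It is not the track's actual implementation. The function name `codon_blocks` and the one-entry `CODON_TABLE` are hypothetical, and the sketch assumes the exon sequences are given in transcript order, on the coding strand, with the first base in frame. Only the two sequences quoted in this thread are used as input.

```python
from typing import Dict, List, Tuple

# Partial table for illustration only; a full codon table has 64 entries.
CODON_TABLE: Dict[str, str] = {"GAT": "D"}  # GAT -> aspartic acid (Asp)


def codon_blocks(exon_seqs: List[str]) -> List[Tuple[str, List[Tuple[int, int]]]]:
    """Split the spliced sequence into codons and record, for each codon,
    how many of its bases come from which exon.

    Assumes exon_seqs are in transcript order, on the coding strand,
    and that the first base is in frame.
    """
    blocks = []
    codon = ""
    pieces: List[Tuple[int, int]] = []  # (exon index, bases contributed)
    for exon_index, seq in enumerate(exon_seqs):
        pos = 0
        while pos < len(seq):
            # Take only as many bases as the current codon still needs.
            take = min(3 - len(codon), len(seq) - pos)
            codon += seq[pos:pos + take]
            pieces.append((exon_index, take))
            pos += take
            if len(codon) == 3:
                blocks.append((codon, pieces))
                codon, pieces = "", []
    return blocks  # a trailing incomplete codon, if any, is dropped


# Bases quoted in this thread for CADM2-203 (hg38):
exon_a_tail = "G"   # chr3:85961647-85961647
exon_b_head = "AT"  # chr3:85979164-85979165

blocks = codon_blocks([exon_a_tail, exon_b_head])
for codon, pieces in blocks:
    print(codon, CODON_TABLE.get(codon, "?"), pieces)
```

The single codon GAT is assembled from one base of the first exon and two bases of the second, so a renderer that draws codon blocks per exon would show a one-base block and a two-base block for the same residue.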

alexpreynolds commented 3 years ago

Yes, sorry — of course!