Open ifokkema opened 4 years ago
Are these the variants that don't give GRCh38?
NC_000001.10:g.144811842C>G NC_000001.10:g.148004568C>A NC_000001.10:g.148756540C>G
Not necessarily a "failing" in UTA, it is more likely that certain transcripts have been annotated against GRCh37 but not GRCh38 or even vice-versa. If these are the variants, I can look at trying to perform a genome_to_genome direct lift????
Are these the variants that don't give GRCh38?
These are some, yes. All four that I mentioned don't have an hg38 liftover. There's a whole range in chr1 (and most likely, other chromosomes, too) that often don't provide hg38 values. I had to skip a few million bp if I remember correctly, to get my validation script to continue.
Not necessarily a "failing" in UTA, it is more likely that certain transcripts have been annotated against GRCh37 but not GRCh38 or even vice-versa.
Very likely, indeed. Although I thought the reverse (no mapping on hg19) is much more common. I thought earlier you mentioned VV would fall back to a genome-to-genome liftover? Could that be done automatically?
If these are the variants, I can look at trying to perform a genome_to_genome direct lift????
If it's manual work, better not; these are just random examples. If you want a full list, we can wait until my script finishes; it's going by about 1.5 chromosomes per week :stuck_out_tongue_closed_eyes:
Very likely, indeed. Although I thought the reverse (no mapping on hg19) is much more common. I thought earlier you mentioned VV would fall back to a genome-to-genome liftover? Could that be done automatically?
Currently only for intergenic but the code exists. I'll try have a go now because I'm bored :)
Very likely, indeed. Although I thought the reverse (no mapping on hg19) is much more common.
People still use legacy transcripts so I see a fair amount of both. Never really looked at the numbers but you could well be right
If it's manual work, better not; these are just random examples. If you want a full list, we can wait until my script finishes; it's going by about 1.5 chromosomes per week
It shouldn't be much work. Will be slower though. I'll start anyway. Really am bored! :)
Very likely, indeed. Although I thought the reverse (no mapping on hg19) is much more common. I thought earlier you mentioned VV would fall back to a genome-to-genome liftover? Could that be done automatically?
Currently only for intergenic but the code exists. I'll try have a go now because I'm bored :)
Cool :smile:
Very likely, indeed. Although I thought the reverse (no mapping on hg19) is much more common.
People still use legacy transcripts so I see a fair amount of both. Never really looked at the numbers but you could well be right
Hmm... I decided to check using your gene2transcripts service;
Looks like I'm going to need to spend time updating transcripts all over LOVD.
If it's manual work, better not; these are just random examples. If you want a full list, we can wait until my script finishes; it's going by about 1.5 chromosomes per week
It shouldn't be much work. Will be slower though. I'll start anyway. Really am bored! :)
That makes one of us! :laughing:
Hey, give me a break man. I've been on zoom meetings all day!!! :)
So, mostly coded it up. The liftover from GRCh37 to 38 for NC_000001.10:g.144811842C>G works, and it lifts back using PyLiftover. Problem is, the lifted back coordinate is out. Might be because the variant is in a gene that is inverted in GRCh38 compared to 37. I'll look tomorrow.
NM_001037675.2 indeed has no hg38 mapping, NM_001037675.3 does.
Not uncommon
NM_015383.1 indeed has no hg38 mapping, NM_015383.2 does.
Not uncommon
NM_001102663.1 is even more interesting; gene2transcripts knows it's a NBPF15 transcript, uses it in the LOVD endpoint, but gene2transcripts doesn't actually list it at all!
Bit weird. I'll take a look
I suspect the gene symbol in UTA is wrong. Will look. We are gonna start using HGNC number mapping instead in the next next build. Make this type of issue go for good!
Hey, give me a break man. I've been on zoom meetings all day!!! :)
Haha, that must be terrible :stuck_out_tongue_closed_eyes: I'm lucky to not have many meetings, so I can mostly focus on work that matters.
So, mostly coded it up. The liftover from GRCh37 to 38 for NC_000001.10:g.144811842C>G works, and it lifts back using PyLiftover.
Cool!!
Problem is, the lifted back coordinate is out. Might be because the variant is in a gene that is inverted in GRCh38 compared to 37. I'll look tomorrow.
You mean, a sense gene became antisense? I really wonder about the quality of the genome at that location, if they have missed something like that.
NM_001102663.1 is even more interesting; gene2transcripts knows it's a NBPF15 transcript, uses it in the LOVD endpoint, but gene2transcripts doesn't actually list it at all!
I suspect the gene symbol in UTA is wrong. Will look. We are gonna start using HGNC number mapping instead in the next build. Make this type of issue go for good!
That would still be awkward because I queried for the transcript itself, so it was mapped to a gene, and then not reverse-mapped anymore. Either way, it shows I also still have a lot of work to do on my end, getting our transcripts updated! I can use HGNC ID as a query in gene2transcripts if needed, so that should be OK.
You mean, a sense gene became antisense? I really wonder about the quality of the genome at that location, if they have missed something like that.
Yep, here are the PyLiftover outputs. 37 > 38 [('chr1', 149075768, '-', 51897113)] 38 > 37 [('chr1', 148340351, '+', 20849626768)] Input is NC_000001.10:g.144811842C>G
The + or - is the strand I believe.
I suspect that in this region, the difference between 37 and 38 is significant. Annoying the tool doesn't lift back to the same position though. I will contact the authors to see if they can shed light There is a nice blog here http://fouryears.eu/2013/02/25/the-curse-of-genomic-coordinates/
That would still be awkward because I queried for the transcript itself, so it was mapped to a gene, and then not reverse-mapped anymore. Either way, it shows I also still have a lot of work to do on my end, getting our transcripts updated! I can use HGNC ID as a query in gene2transcripts if needed, so that should be OK.
No need, we will use the HGNC ID internally because most users wont use it, but it will be an option. We will also make the new version of UTA available to you. I think you will hit the wall along the way because I know there are genes which do not have a common transcript (at the version level at least) mapped to both GRCh37 and GRCh38. Although, we are trying to put as much legacy data and up-to-date data into out new UTA, so hopefully we can close this gap a bit
Ok, now testing NC_000001.10:g.148004568C>A
Again it misses and again is in a + / - region [('chr1', 146066338, '+', 90706)] liftback [('chr1', 145368662, '-', 39323927)] NC_000001.10:g.148004568C>A
Finally NC_000001.10:g.148756540C>G Same Method=PyLiftover [('chr1', 146066460, '-', 400720)] liftback [('chr1', 146215217, '+', 20849626768)] NC_000001.10:g.148756540C>G
Yep, here are the PyLiftover outputs. 37 > 38 [('chr1', 149075768, '-', 51897113)] 38 > 37 [('chr1', 148340351, '+', 20849626768)] Input is NC_000001.10:g.144811842C>G
The + or - is the strand I believe.
Yeah, so lifting over 37 > 38 > 37 ends up in a different place, which is not good. I could contact the UCSC guys; I was in touch with somebody who was building new chain files (but got sidetracked with COVID-19). Did you compensate for PyLiftOver using 0-based coordinates? Might not matter much here, but is still important if we don't want to be off by one all the time.
We will also make the new version of UTA available to you. I think you will hit the wall along the way because I know there are genes which do not have a common transcript (at the version level at least) mapped to both GRCh37 and GRCh38. Although, we are trying to put as much legacy data and up-to-date data into out new UTA, so hopefully we can close this gap a bit
I'm a bit worried about that; keeping the legacy data in is my best chance for a smooth transition to newer versions of currently used transcripts, or mapping to different transcripts entirely.
Ok, now testing NC_000001.10:g.148004568C>A
Again it misses and again is in a + / - region [('chr1', 146066338, '+', 90706)] liftback [('chr1', 145368662, '-', 39323927)] NC_000001.10:g.148004568C>A
I'm not sure how this score is calculated (last value in the tuple), but 90706 seems rather low compared to other values, suggesting lifting over hg19/GRCh37 to hg38/GRCh38 doesn't work well there.
Finally NC_000001.10:g.148756540C>G Same Method=PyLiftover [('chr1', 146066460, '-', 400720)] liftback [('chr1', 146215217, '+', 20849626768)] NC_000001.10:g.148756540C>G
I guess this makes sense; all my examples were from the same area on chr1. Looks like this whole area is flipped from hg19/GRCh37 to hg38/GRCh38.
Either way, the last two variants are mapped pretty closeby each other on 38, which is strange. They're 750K bases apart on hg19/GRCh37.
Yeah, so lifting over 37 > 38 > 37 ends up in a different place, which is not good. I could contact the UCSC guys; I was in touch with somebody who was building new chain files (but got sidetracked with COVID-19). Did you compensate for PyLiftOver using 0-based coordinates? Might not matter much here, but is still important if we don't want to be off by one all the time.
Definitely, please do contact them. Yes, all coordinates compensated for. The first version of the tool checked both the reference sequence and the position there and back so was very accurate. I dropped the reference sequence check for one of the other issues we fixed, but don't really want to drop the coordinate check as well.
I wonder if they can explain the tuple better too. The docs on PyLiftover are sparse!
I guess this makes sense; all my examples were from the same area on chr1. Looks like this whole area is flipped from hg19/GRCh37 to hg38/GRCh38.Either way, the last two variants are mapped pretty closeby each other on 38, which is strange. They're 750K bases apart on hg19/GRCh37.
Yep, agreed. Something is screwing over the lift. Again, the tool used is UCSC so hopefully they can shed light. Can you report back please mate?
p.s. I bet you are loving this as much as I am.
I wonder if they can explain the tuple better too. The docs on PyLiftover are sparse!
Yeah, I noticed that, too. I had to go into the source code just to find out what the last value (the liftover score) was. Now I just know it's a score and they sort on it, but I don't know any reference values.
Yep, agreed. Something is screwing over the lift. Again, the tool used is UCSC so hopefully they can shed light. Can you report back please mate?
Sure, I will!
Just to verify, I used the online LiftOver to match your PyLiftOver results, and they are nearly the same.
hg19 input:
chr1:144811841-144811842
chr1:148004567-148004568
chr1:148756539-148756540
hg19 -> hg38
chr1:149075770-149075771
chr1:146066337-146066338
chr1:146066462-146066463
hg38 -> hg19
chr1:148340353-148340354
chr1:145368664-145368665
chr1:146215219-146215220
Minor differences may be due to updated chain files? Anyway, I'll check with my contacts!
p.s. I bet you are loving this as much as I am.
Haha, about half as much, I'm afraid! Even if we find a way to fix this, I foresee an immense amount of work that I'll still have to do after all of this. I am excited to wrap this all up in some writing though :wink:
Stay in the present mate and celebrate small victories!
Could well be up-to-date files. Might need to see what PyLiftover drags in
Running my tests. It looks like that for less difficult genes my updated code may have fixed this issue. I'll take a look anf feed back some results
Yep, it looks like the new method has produced what I consider to be accurate enough mappings to GRCh38 in tests where they were not available before. 78 in total. I have work to do updating the tests AAAAAAAARGH. Ivo, I'm slowly falling out with you HAHAHAHAHAH! :)
Cool! I'll try next week...!
AAAAAAAARGH. Ivo, I'm slowly falling out with you HAHAHAHAHAH! :)
Hahahaha! It's because it's Friday! I've been rebuilding tests for weeks now... my head hurts. I'm done with this week, and this week is done with me... Time for me to sign off... have a great weekend! :partying_face:
Hey Pete, just checking; all issues remain open so far;
NM_001102663
is recognized by gene2transcripts but then not mentioned in the output.It's not a problem whatsoever (likely we're waiting for Angie/PyLiftOver to respond), just in case I'm missing something since you seemed to mention you fixed something? :thinking: I don't mind at all waiting for Angie/PyLiftOver!
Hi Ivo, Yes, all still open. I have built the liftover code in, but now optimising it because it has made testing slow then fix the tests. Bit slow going because of teaching and figuring our how best to optimise
So we have effectively closed for GRCh38 mapping. Gonna look at the orientation issue after fixing https://github.com/openvar/variantValidator/issues/203
Ok, Referring to
Remap and new liftOver chains to hg19 and hg38. See https://genome-test.gi.ucsc.edu. The track is called "hg38 mapping" or "hg19 mapping" (but I'm considering renaming it to "liftOver alignments", wouldn't that be better?)
Will leave this until discussion with @ifokkema
Do these new files generate any different results?
Hi @ifokkema I have been slammed with grants etc. Are the new files available? If so where can I get them. I can test locally.
Thanks
p.s. @ifokkema I updated the dev API this morning with the latest VariantValidator version
Hi @ifokkema I have been slammed with grants etc. Are the new files available? If so where can I get them. I can test locally.
Getting them out of the table browser isn't as easy as I thought, so I have asked for downloads :)
p.s. @ifokkema I updated the dev API this morning with the latest VariantValidator version
Thanks!
@ifokkema .Wonder if these were ever d=deployed???
Seems like the mappings are still missing. I again updated the links.
Hi Pete!
Today I'll be going through a few variants that my validation script ran into, so you might get a few issues reported by me today.
I'm not sure if these are useful, but I have a few examples of variants that provide no hg38 mapping.
Sometimes, with a processing error, for instance:
NC_000001.10:g.38077350_38077420del
Note: it seems the processing errors are related to #173, but even for the transcript it does support, there is no hg38 prediction.
There are also many variants that have no errors whatsoever, but just don't provide a hg38 mapping. If this is a limitation of UTA and it can't be fixed there, how can we still get hg38 predictions? Can we somehow activate a genome-to-genome liftover?
NC_000001.10:g.144811842C>G
NC_000001.10:g.148004568C>A
NC_000001.10:g.148756540C>G
Thank you, as always!