openvar / variantValidator

Public repository for VariantValidator project
GNU Affero General Public License v3.0
70 stars 21 forks source link

Variants that don't return a hg38 mapping. #201

Open ifokkema opened 4 years ago

ifokkema commented 4 years ago

Hi Pete!

Today I'll be going through a few variants that my validation script ran into, so you might get a few issues reported by me today.

I'm not sure if these are useful, but I have a few examples of variants that provide no hg38 mapping.

Sometimes, with a processing error, for instance: NC_000001.10:g.38077350_38077420del

Note: it seems the processing errors are related to #173, but even for the transcript it does support, there is no hg38 prediction.

There are also many variants that have no errors whatsoever, but just don't provide a hg38 mapping. If this is a limitation of UTA and it can't be fixed there, how can we still get hg38 predictions? Can we somehow activate a genome-to-genome liftover?

NC_000001.10:g.144811842C>G NC_000001.10:g.148004568C>A NC_000001.10:g.148756540C>G

Thank you, as always!

Peter-J-Freeman commented 4 years ago

Are these the variants that don't give GRCh38?

NC_000001.10:g.144811842C>G NC_000001.10:g.148004568C>A NC_000001.10:g.148756540C>G

Not necessarily a "failing" in UTA, it is more likely that certain transcripts have been annotated against GRCh37 but not GRCh38 or even vice-versa. If these are the variants, I can look at trying to perform a genome_to_genome direct lift????

ifokkema commented 4 years ago

Are these the variants that don't give GRCh38?

These are some, yes. All four that I mentioned don't have an hg38 liftover. There's a whole range in chr1 (and most likely, other chromosomes, too) that often don't provide hg38 values. I had to skip a few million bp if I remember correctly, to get my validation script to continue.

Not necessarily a "failing" in UTA, it is more likely that certain transcripts have been annotated against GRCh37 but not GRCh38 or even vice-versa.

Very likely, indeed. Although I thought the reverse (no mapping on hg19) is much more common. I thought earlier you mentioned VV would fall back to a genome-to-genome liftover? Could that be done automatically?

If these are the variants, I can look at trying to perform a genome_to_genome direct lift????

If it's manual work, better not; these are just random examples. If you want a full list, we can wait until my script finishes; it's going by about 1.5 chromosomes per week :stuck_out_tongue_closed_eyes:

Peter-J-Freeman commented 4 years ago

Very likely, indeed. Although I thought the reverse (no mapping on hg19) is much more common. I thought earlier you mentioned VV would fall back to a genome-to-genome liftover? Could that be done automatically?

Currently only for intergenic but the code exists. I'll try have a go now because I'm bored :)

Very likely, indeed. Although I thought the reverse (no mapping on hg19) is much more common.

People still use legacy transcripts so I see a fair amount of both. Never really looked at the numbers but you could well be right

If it's manual work, better not; these are just random examples. If you want a full list, we can wait until my script finishes; it's going by about 1.5 chromosomes per week

It shouldn't be much work. Will be slower though. I'll start anyway. Really am bored! :)

ifokkema commented 4 years ago

Very likely, indeed. Although I thought the reverse (no mapping on hg19) is much more common. I thought earlier you mentioned VV would fall back to a genome-to-genome liftover? Could that be done automatically?

Currently only for intergenic but the code exists. I'll try have a go now because I'm bored :)

Cool :smile:

Very likely, indeed. Although I thought the reverse (no mapping on hg19) is much more common.

People still use legacy transcripts so I see a fair amount of both. Never really looked at the numbers but you could well be right

Hmm... I decided to check using your gene2transcripts service;

Looks like I'm going to need to spend time updating transcripts all over LOVD.

If it's manual work, better not; these are just random examples. If you want a full list, we can wait until my script finishes; it's going by about 1.5 chromosomes per week

It shouldn't be much work. Will be slower though. I'll start anyway. Really am bored! :)

That makes one of us! :laughing:

Peter-J-Freeman commented 4 years ago

Hey, give me a break man. I've been on zoom meetings all day!!! :)

So, mostly coded it up. The liftover from GRCh37 to 38 for NC_000001.10:g.144811842C>G works, and it lifts back using PyLiftover. Problem is, the lifted back coordinate is out. Might be because the variant is in a gene that is inverted in GRCh38 compared to 37. I'll look tomorrow.

Peter-J-Freeman commented 4 years ago

NM_001037675.2 indeed has no hg38 mapping, NM_001037675.3 does.

Not uncommon

NM_015383.1 indeed has no hg38 mapping, NM_015383.2 does.

Not uncommon

NM_001102663.1 is even more interesting; gene2transcripts knows it's a NBPF15 transcript, uses it in the LOVD endpoint, but gene2transcripts doesn't actually list it at all!

Bit weird. I'll take a look

Peter-J-Freeman commented 4 years ago

I suspect the gene symbol in UTA is wrong. Will look. We are gonna start using HGNC number mapping instead in the next next build. Make this type of issue go for good!

ifokkema commented 4 years ago

Hey, give me a break man. I've been on zoom meetings all day!!! :)

Haha, that must be terrible :stuck_out_tongue_closed_eyes: I'm lucky to not have many meetings, so I can mostly focus on work that matters.

So, mostly coded it up. The liftover from GRCh37 to 38 for NC_000001.10:g.144811842C>G works, and it lifts back using PyLiftover.

Cool!!

Problem is, the lifted back coordinate is out. Might be because the variant is in a gene that is inverted in GRCh38 compared to 37. I'll look tomorrow.

You mean, a sense gene became antisense? I really wonder about the quality of the genome at that location, if they have missed something like that.

NM_001102663.1 is even more interesting; gene2transcripts knows it's a NBPF15 transcript, uses it in the LOVD endpoint, but gene2transcripts doesn't actually list it at all!

I suspect the gene symbol in UTA is wrong. Will look. We are gonna start using HGNC number mapping instead in the next build. Make this type of issue go for good!

That would still be awkward because I queried for the transcript itself, so it was mapped to a gene, and then not reverse-mapped anymore. Either way, it shows I also still have a lot of work to do on my end, getting our transcripts updated! I can use HGNC ID as a query in gene2transcripts if needed, so that should be OK.

Peter-J-Freeman commented 4 years ago

You mean, a sense gene became antisense? I really wonder about the quality of the genome at that location, if they have missed something like that.

Yep, here are the PyLiftover outputs. 37 > 38 [('chr1', 149075768, '-', 51897113)] 38 > 37 [('chr1', 148340351, '+', 20849626768)] Input is NC_000001.10:g.144811842C>G

The + or - is the strand I believe.

I suspect that in this region, the difference between 37 and 38 is significant. Annoying the tool doesn't lift back to the same position though. I will contact the authors to see if they can shed light There is a nice blog here http://fouryears.eu/2013/02/25/the-curse-of-genomic-coordinates/

That would still be awkward because I queried for the transcript itself, so it was mapped to a gene, and then not reverse-mapped anymore. Either way, it shows I also still have a lot of work to do on my end, getting our transcripts updated! I can use HGNC ID as a query in gene2transcripts if needed, so that should be OK.

No need, we will use the HGNC ID internally because most users wont use it, but it will be an option. We will also make the new version of UTA available to you. I think you will hit the wall along the way because I know there are genes which do not have a common transcript (at the version level at least) mapped to both GRCh37 and GRCh38. Although, we are trying to put as much legacy data and up-to-date data into out new UTA, so hopefully we can close this gap a bit

Peter-J-Freeman commented 4 years ago

Ok, now testing NC_000001.10:g.148004568C>A

Again it misses and again is in a + / - region [('chr1', 146066338, '+', 90706)] liftback [('chr1', 145368662, '-', 39323927)] NC_000001.10:g.148004568C>A

Peter-J-Freeman commented 4 years ago

Finally NC_000001.10:g.148756540C>G Same Method=PyLiftover [('chr1', 146066460, '-', 400720)] liftback [('chr1', 146215217, '+', 20849626768)] NC_000001.10:g.148756540C>G

Peter-J-Freeman commented 4 years ago

https://github.com/konstantint/pyliftover/issues/13

ifokkema commented 4 years ago

Yep, here are the PyLiftover outputs. 37 > 38 [('chr1', 149075768, '-', 51897113)] 38 > 37 [('chr1', 148340351, '+', 20849626768)] Input is NC_000001.10:g.144811842C>G

The + or - is the strand I believe.

Yeah, so lifting over 37 > 38 > 37 ends up in a different place, which is not good. I could contact the UCSC guys; I was in touch with somebody who was building new chain files (but got sidetracked with COVID-19). Did you compensate for PyLiftOver using 0-based coordinates? Might not matter much here, but is still important if we don't want to be off by one all the time.

We will also make the new version of UTA available to you. I think you will hit the wall along the way because I know there are genes which do not have a common transcript (at the version level at least) mapped to both GRCh37 and GRCh38. Although, we are trying to put as much legacy data and up-to-date data into out new UTA, so hopefully we can close this gap a bit

I'm a bit worried about that; keeping the legacy data in is my best chance for a smooth transition to newer versions of currently used transcripts, or mapping to different transcripts entirely.

Ok, now testing NC_000001.10:g.148004568C>A

Again it misses and again is in a + / - region [('chr1', 146066338, '+', 90706)] liftback [('chr1', 145368662, '-', 39323927)] NC_000001.10:g.148004568C>A

I'm not sure how this score is calculated (last value in the tuple), but 90706 seems rather low compared to other values, suggesting lifting over hg19/GRCh37 to hg38/GRCh38 doesn't work well there.

Finally NC_000001.10:g.148756540C>G Same Method=PyLiftover [('chr1', 146066460, '-', 400720)] liftback [('chr1', 146215217, '+', 20849626768)] NC_000001.10:g.148756540C>G

I guess this makes sense; all my examples were from the same area on chr1. Looks like this whole area is flipped from hg19/GRCh37 to hg38/GRCh38.

Either way, the last two variants are mapped pretty closeby each other on 38, which is strange. They're 750K bases apart on hg19/GRCh37.

Peter-J-Freeman commented 4 years ago

Yeah, so lifting over 37 > 38 > 37 ends up in a different place, which is not good. I could contact the UCSC guys; I was in touch with somebody who was building new chain files (but got sidetracked with COVID-19). Did you compensate for PyLiftOver using 0-based coordinates? Might not matter much here, but is still important if we don't want to be off by one all the time.

Definitely, please do contact them. Yes, all coordinates compensated for. The first version of the tool checked both the reference sequence and the position there and back so was very accurate. I dropped the reference sequence check for one of the other issues we fixed, but don't really want to drop the coordinate check as well.

I wonder if they can explain the tuple better too. The docs on PyLiftover are sparse!

I guess this makes sense; all my examples were from the same area on chr1. Looks like this whole area is flipped from hg19/GRCh37 to hg38/GRCh38.Either way, the last two variants are mapped pretty closeby each other on 38, which is strange. They're 750K bases apart on hg19/GRCh37.

Yep, agreed. Something is screwing over the lift. Again, the tool used is UCSC so hopefully they can shed light. Can you report back please mate?

Peter-J-Freeman commented 4 years ago

p.s. I bet you are loving this as much as I am.

ifokkema commented 4 years ago

I wonder if they can explain the tuple better too. The docs on PyLiftover are sparse!

Yeah, I noticed that, too. I had to go into the source code just to find out what the last value (the liftover score) was. Now I just know it's a score and they sort on it, but I don't know any reference values.

Yep, agreed. Something is screwing over the lift. Again, the tool used is UCSC so hopefully they can shed light. Can you report back please mate?

Sure, I will!

Just to verify, I used the online LiftOver to match your PyLiftOver results, and they are nearly the same.

hg19 input:
chr1:144811841-144811842
chr1:148004567-148004568
chr1:148756539-148756540

hg19 -> hg38
chr1:149075770-149075771
chr1:146066337-146066338
chr1:146066462-146066463

hg38 -> hg19
chr1:148340353-148340354
chr1:145368664-145368665
chr1:146215219-146215220

Minor differences may be due to updated chain files? Anyway, I'll check with my contacts!

p.s. I bet you are loving this as much as I am.

Haha, about half as much, I'm afraid! Even if we find a way to fix this, I foresee an immense amount of work that I'll still have to do after all of this. I am excited to wrap this all up in some writing though :wink:

Peter-J-Freeman commented 4 years ago

Stay in the present mate and celebrate small victories!

Could well be up-to-date files. Might need to see what PyLiftover drags in

Peter-J-Freeman commented 4 years ago

Running my tests. It looks like that for less difficult genes my updated code may have fixed this issue. I'll take a look anf feed back some results

Peter-J-Freeman commented 4 years ago

Yep, it looks like the new method has produced what I consider to be accurate enough mappings to GRCh38 in tests where they were not available before. 78 in total. I have work to do updating the tests AAAAAAAARGH. Ivo, I'm slowly falling out with you HAHAHAHAHAH! :)

ifokkema commented 4 years ago

Cool! I'll try next week...!

AAAAAAAARGH. Ivo, I'm slowly falling out with you HAHAHAHAHAH! :)

Hahahaha! It's because it's Friday! I've been rebuilding tests for weeks now... my head hurts. I'm done with this week, and this week is done with me... Time for me to sign off... have a great weekend! :partying_face:

ifokkema commented 4 years ago

Hey Pete, just checking; all issues remain open so far;

It's not a problem whatsoever (likely we're waiting for Angie/PyLiftOver to respond), just in case I'm missing something since you seemed to mention you fixed something? :thinking: I don't mind at all waiting for Angie/PyLiftOver!

Peter-J-Freeman commented 4 years ago

Hi Ivo, Yes, all still open. I have built the liftover code in, but now optimising it because it has made testing slow then fix the tests. Bit slow going because of teaching and figuring our how best to optimise

Peter-J-Freeman commented 4 years ago

https://github.com/openvar/variantValidator/commit/48690e92c22e2d255b44f52d7c9181c57db72748

Peter-J-Freeman commented 4 years ago

So we have effectively closed for GRCh38 mapping. Gonna look at the orientation issue after fixing https://github.com/openvar/variantValidator/issues/203

Peter-J-Freeman commented 4 years ago

Ok, Referring to

Remap and new liftOver chains to hg19 and hg38. See https://genome-test.gi.ucsc.edu. The track is called "hg38 mapping" or "hg19 mapping" (but I'm considering renaming it to "liftOver alignments", wouldn't that be better?)

Will leave this until discussion with @ifokkema

ifokkema commented 4 years ago

Do these new files generate any different results?

Peter-J-Freeman commented 4 years ago

Hi @ifokkema I have been slammed with grants etc. Are the new files available? If so where can I get them. I can test locally.

Thanks

Peter-J-Freeman commented 4 years ago

p.s. @ifokkema I updated the dev API this morning with the latest VariantValidator version

ifokkema commented 4 years ago

Hi @ifokkema I have been slammed with grants etc. Are the new files available? If so where can I get them. I can test locally.

Getting them out of the table browser isn't as easy as I thought, so I have asked for downloads :)

p.s. @ifokkema I updated the dev API this morning with the latest VariantValidator version

Thanks!

Peter-J-Freeman commented 2 years ago

@ifokkema .Wonder if these were ever d=deployed???

ifokkema commented 2 years ago

Seems like the mappings are still missing. I again updated the links.