Add colocalisation to API

opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal

https://platform.opentargets.org https://genetics.opentargets.org

Apache License 2.0


Add colocalisation to API #3441

Open jdhayhurst opened 1 week ago

jdhayhurst commented 1 week ago

As a user I want to be able to access colocalisation data for credible sets in the API.

Background

Here is the colocalisation schema from gentropy. Each object contains the product of two study locus ids, which are represented as "left" and "right", although this assignment is arbitrary and carries no semantic meaning.

Tasks

[x] update POS to create a new colocalisation index in OpenSearch.
[x] add colocalisation list type object to credibleSet, resolved where studyLocusId is either "left" or "right"
[ ] the colocalisation object should consistently have one field for the fixed studyLocusId that it was resolved on and another for the other. This is opposed to allowing the fixed studyLocusId to be represented as either the "left" or the "right" id.
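
The last task can be sketched as follows. This is a minimal illustration, not the API implementation: it assumes each colocalisation record is a plain dict with `leftStudyLocusId` and `rightStudyLocusId`, and the output field names `studyLocusId` and `otherStudyLocusId` are hypothetical placeholders for whatever the API ends up calling them.

```python
from typing import Any, Dict

SIDE_FIELDS = ("leftStudyLocusId", "rightStudyLocusId")


def orient_colocalisation(record: Dict[str, Any], fixed_id: str) -> Dict[str, Any]:
    """Return a copy of `record` expressed relative to `fixed_id`.

    The output always carries the id the record was resolved on in
    `studyLocusId` and the partner id in `otherStudyLocusId`
    (both names are hypothetical), regardless of which side it was on.
    """
    left = str(record["leftStudyLocusId"])
    right = str(record["rightStudyLocusId"])
    fixed = str(fixed_id)

    if fixed == left:
        other = right
    elif fixed == right:
        other = left
    else:
        raise ValueError(f"{fixed} is on neither side of this colocalisation")

    # Copy every non-id field through unchanged.
    # Caveat: a side-specific field such as rightStudyType still describes
    # the original "right" locus, so it only describes the "other" locus
    # when the fixed id was on the left.
    oriented = {k: v for k, v in record.items() if k not in SIDE_FIELDS}
    oriented["studyLocusId"] = fixed
    oriented["otherStudyLocusId"] = other
    return oriented
```

With this shape, a credibleSet resolver can call the function with its own studyLocusId and never needs to know which side it matched on.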

Acceptance tests

How do we know the task is complete?

When I query the API for a credibleSet, I can also retrieve colocalisation objects within.

jdhayhurst commented 1 week ago

@opentargets/data-team when you have a colocalisation dataset that I can use for developing against, please could you send me a link? Thanks!

d0choa commented 1 week ago

@DSuveges confirmed we will patch the studyType for now before the actual work happens in #3444

jdhayhurst commented 6 days ago

@opentargets/data-team what do you think about replacing leftStudyLocusId and rightStudyLocusId with a single tuple containing the pair, e.g. studyLocusPair? First, it would be cleaner in the sense that "left" and "right" have no semantic meaning here and second, it would be far simpler to create views where the studyLocusId is fixed, because we don't need to know if it is the left or the right.
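
A rough sketch of the pair idea, assuming records arrive as dicts with the current left/right fields. `studyLocusPair` is the name proposed above; the helper names are hypothetical. The pair is stored sorted so that the same two ids always produce the same tuple, and a lookup by either member needs no left/right branching.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List

SIDE_FIELDS = ("leftStudyLocusId", "rightStudyLocusId")


def to_pair_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Replace the left/right ids with a single order-independent pair."""
    pair = tuple(sorted(str(record[field]) for field in SIDE_FIELDS))
    # Side-specific fields (e.g. rightStudyType) are passed through as-is;
    # a real schema would need a per-member representation for them.
    rest = {k: v for k, v in record.items() if k not in SIDE_FIELDS}
    return {"studyLocusPair": pair, **rest}


def index_by_study_locus(records: Iterable[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """Map each studyLocusId to every pair record it takes part in."""
    index: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for record in records:
        # set() avoids adding a record twice if both members are equal.
        for study_locus_id in set(record["studyLocusPair"]):
            index[study_locus_id].append(record)
    return dict(index)
```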

DSuveges commented 3 days ago

As a one-off data prep, I created a patched coloc dataset with the right columns:

root
 |-- leftStudyLocusId: long (nullable = true)
 |-- rightStudyLocusId: long (nullable = true)
 |-- rightStudyType: string (nullable = true)
 |-- chromosome: string (nullable = true)
 |-- colocalisationMethod: string (nullable = true)
 |-- numberColocalisingVariants: long (nullable = true)
 |-- h3: double (nullable = true)
 |-- h4: double (nullable = true)
 |-- clpp: double (nullable = true)

As discussed at the 11.09 meeting, we are not propagating h0, h1, h2 and log2h3h4; however, rightStudyType is added.

The data in json is here: gs://ot-team/dsuveges/colocalisation/coloc.json (24GB)

@jdhayhurst - please give it a try and let me know if there's anything off or if there's any question. @addramir - please let me know if we need other columns in this dataset. (credible-set, study info can be resolved)

jdhayhurst commented 2 days ago

Hi @DSuveges, just checking if these fields are nullable?

 |-- leftStudyLocusId: long (nullable = true)
 |-- rightStudyLocusId: long (nullable = true)
 |-- rightStudyType: string (nullable = true)
 |-- chromosome: string (nullable = true)
 |-- colocalisationMethod: string (nullable = true)
 |-- numberColocalisingVariants: long (nullable = true)

In the previous schema they were not nullable.

jdhayhurst commented 2 days ago

Also, is it possible to have all ids as String types? We could handle String and Long IDs in the backend, but it basically involves reading as a number and then casting everything into a String type. If the value doesn't need to be numerical and its sole purpose is as an identifier, I think it should be String in the data itself.
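
For illustration, the backend workaround described above would look roughly like this in Python. It is only a sketch: it assumes the data is newline-delimited JSON with the two id fields from the schema, and the function name is hypothetical.

```python
import json
from typing import Any, Dict

ID_FIELDS = ("leftStudyLocusId", "rightStudyLocusId")


def read_with_string_ids(line: str) -> Dict[str, Any]:
    """Parse one JSON line and cast the id fields to strings.

    Ids that are already strings are left unchanged, so the same reader
    works before and after the dataset itself switches to string ids.
    """
    record = json.loads(line)
    for field in ID_FIELDS:
        if record.get(field) is not None:
            record[field] = str(record[field])
    return record
```

If the ids are strings in the data itself, this casting step disappears entirely.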

DSuveges commented 2 days ago

Regarding nullability: these datasets are read from parquet files, and the parquet format doesn't preserve nullability information. Yes, those columns are complete in the sense that all fields are expected to be non-nullable.
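
Because the printed schema cannot express this expectation, a consumer could check it directly. A minimal sketch, assuming records are dicts and taking the six columns asked about above as the required set; the function name is hypothetical.

```python
from typing import Any, Dict, List

# Columns expected to be complete, per the question above.
REQUIRED_FIELDS = (
    "leftStudyLocusId",
    "rightStudyLocusId",
    "rightStudyType",
    "chromosome",
    "colocalisationMethod",
    "numberColocalisingVariants",
)


def missing_required_fields(record: Dict[str, Any]) -> List[str]:
    """Return the required fields that are absent or null in `record`."""
    return [field for field in REQUIRED_FIELDS if record.get(field) is None]
```

An empty list means the record satisfies the non-null expectation; anything else names the offending columns.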

Yes, you are right, the leftStudyLocusId and rightStudyLocusId are both going to be of string type. Any time we were generating credible set or credible set derived datasets for you, these were all manufactured. As soon as the orchestration is in place, these fields will be in the right type. Please give me a sec, I'll update these columns.