Open jdhayhurst opened 1 week ago
@opentargets/data-team when you have a colocalisation dataset that I can develop against, please could you send me a link? Thanks!
@DSuveges confirmed we will patch the studyType for now before the actual work happens in #3444
@opentargets/data-team what do you think about replacing leftStudyLocusId and rightStudyLocusId with a single tuple containing the pair, e.g. studyLocusPair? First, it would be cleaner in the sense that "left" and "right" have no semantic meaning here, and second, it would be far simpler to create views where the studyLocusId is fixed, because we don't need to know whether it is the left or the right.
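To make the pair idea concrete, here is a minimal sketch in plain Python rather than the real pipeline code. The `Coloc` row shape, the `to_pair` and `view_for` helpers, and all ids and `h4` values are made up for illustration; only the `studyLocusPair` name comes from the suggestion above.

```python
from typing import NamedTuple


class Coloc(NamedTuple):
    # Hypothetical row shape: the pair replaces leftStudyLocusId / rightStudyLocusId.
    studyLocusPair: tuple
    h4: float


def to_pair(left: str, right: str) -> tuple:
    """Build an order-independent pair by sorting the two ids."""
    return tuple(sorted((left, right)))


def view_for(rows: list, study_locus_id: str) -> list:
    """Keep rows whose pair contains the fixed id, whichever side it was on."""
    return [row for row in rows if study_locus_id in row.studyLocusPair]


# Illustrative rows only; ids and h4 values are invented.
rows = [
    Coloc(to_pair("b", "a"), 0.91),
    Coloc(to_pair("a", "c"), 0.85),
    Coloc(to_pair("c", "d"), 0.40),
]
matches = view_for(rows, "a")
```

With the pair stored in a canonical order, a view for a fixed studyLocusId is a single membership test instead of two separate left/right filters.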
As a one-off data prep, I created a patched coloc dataset with the right columns:
root
|-- leftStudyLocusId: long (nullable = true)
|-- rightStudyLocusId: long (nullable = true)
|-- rightStudyType: string (nullable = true)
|-- chromosome: string (nullable = true)
|-- colocalisationMethod: string (nullable = true)
|-- numberColocalisingVariants: long (nullable = true)
|-- h3: double (nullable = true)
|-- h4: double (nullable = true)
|-- clpp: double (nullable = true)
As discussed at the 11.09 meeting, we are not propagating h0, h1, h2 and log2h3h4; however, rightStudyType is added.
The data in json is here: gs://ot-team/dsuveges/colocalisation/coloc.json (24GB)
@jdhayhurst - please give it a try and let me know if there's anything off or if there's any question. @addramir - please let me know if we need other columns in this dataset. (credible-set, study info can be resolved)
Hi @DSuveges, just checking if these fields are nullable?
|-- leftStudyLocusId: long (nullable = true)
|-- rightStudyLocusId: long (nullable = true)
|-- rightStudyType: string (nullable = true)
|-- chromosome: string (nullable = true)
|-- colocalisationMethod: string (nullable = true)
|-- numberColocalisingVariants: long (nullable = true)
In the previous schema they were not nullable.
Also, is it possible to have all IDs as String types? We could handle String and Long IDs in the backend, but it basically involves reading as a number and then casting everything into a String type. If the value doesn't need to be numerical and its sole purpose is as an identifier, I think it should be a String in the data itself.
Regarding nullability: these datasets are read from parquet files, and Spark marks every column as nullable when it reads parquet, so the original nullability information is not preserved. Yes, those columns are complete, in the sense that all fields are expected to be non-nullable.
Yes, you are right, the leftStudyLocusId and rightStudyLocusId are both going to be of string type. Any time we were generating credible set or credible set derived datasets for you, these were all manufactured. As soon as the orchestration is in place, these fields will be in the right type. Please give me a sec, I'll update these columns.
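Until the updated dataset lands, a consumer could guard against both points at read time. The following is only a sketch under assumptions: it treats the JSON export as one record per line, the `normalise` helper is hypothetical, and the example record values (including the `rightStudyType` and `colocalisationMethod` strings) are invented.

```python
import json

# Columns that hold identifiers and should be strings.
ID_FIELDS = ("leftStudyLocusId", "rightStudyLocusId")

# Columns reported as nullable in the schema but expected to be complete.
REQUIRED_FIELDS = ID_FIELDS + (
    "rightStudyType",
    "chromosome",
    "colocalisationMethod",
    "numberColocalisingVariants",
)


def normalise(line: str) -> dict:
    """Parse one JSON record, reject nulls in required fields, cast ids to str."""
    record = json.loads(line)
    missing = [f for f in REQUIRED_FIELDS if record.get(f) is None]
    if missing:
        raise ValueError(f"null or missing required fields: {missing}")
    for field in ID_FIELDS:
        record[field] = str(record[field])
    return record


# Invented example record, not taken from the real dataset.
example = json.dumps({
    "leftStudyLocusId": 123,
    "rightStudyLocusId": -456,
    "rightStudyType": "gwas",
    "chromosome": "1",
    "colocalisationMethod": "COLOC",
    "numberColocalisingVariants": 7,
    "h3": 0.1,
    "h4": 0.8,
})
row = normalise(example)
```

This is a stopgap; once the ids are strings in the data itself, the cast becomes a no-op and only the null check remains useful.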
As a user I want to be able to access colocalisation data for credible sets in the API.
Background
Here is the colocalisation schema from gentropy. Each object contains the product of two study locus IDs, which are represented as "left" and "right", although this assignment is arbitrary and carries no semantic meaning.
Tasks
colocalisation index in OpenSearch.
colocalisation list type object to credibleSet, resolved where studyLocusId is either "left" or "right".
colocalisation object should consistently have one field for the fixed studyLocusId that it was resolved on and another for the other. This is opposed to allowing the fixed studyLocusId to be represented as either the "left" or the "right" id.
Acceptance tests
How do we know the task is complete?
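One possible check for the third task, sketched in plain Python rather than the actual API code: every colocalisation object resolved for a credible set carries the fixed id in the same field, regardless of which side it had in the source row. The `resolve_colocalisations` helper and the `otherStudyLocusId` field name are assumptions, and the rows are invented.

```python
def resolve_colocalisations(rows: list, study_locus_id: str) -> list:
    """Return colocalisation objects for one credible set with a fixed field layout."""
    resolved = []
    for row in rows:
        left, right = row["leftStudyLocusId"], row["rightStudyLocusId"]
        if study_locus_id == left:
            other = right
        elif study_locus_id == right:
            other = left
        else:
            continue  # row does not involve this credible set
        resolved.append({
            "studyLocusId": study_locus_id,   # always the id we resolved on
            "otherStudyLocusId": other,       # hypothetical field name
            "colocalisationMethod": row["colocalisationMethod"],
            "h4": row.get("h4"),
        })
    return resolved


# Invented rows: "a" appears once on the left and once on the right.
rows = [
    {"leftStudyLocusId": "a", "rightStudyLocusId": "b",
     "colocalisationMethod": "COLOC", "h4": 0.9},
    {"leftStudyLocusId": "c", "rightStudyLocusId": "a",
     "colocalisationMethod": "COLOC", "h4": 0.7},
    {"leftStudyLocusId": "c", "rightStudyLocusId": "d",
     "colocalisationMethod": "COLOC", "h4": 0.2},
]
result = resolve_colocalisations(rows, "a")

# The acceptance condition: the fixed id never shows up as the "other" id.
assert all(obj["studyLocusId"] == "a" for obj in result)
assert all(obj["otherStudyLocusId"] != "a" for obj in result)
```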