opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0
12 stars 2 forks source link

EVA: Duplicate association fields — source trait names #1013

Closed tskir closed 3 years ago

tskir commented 4 years ago

This ticket is part of the effort to get rid of duplicates in unique_association_fields in the evidence strings submitted by EVA.

Sometimes, different ClinVar trait names are mapped to the same ontology term. For example, record RCV000763237.1 associates one variant, rs151045328, with three conditions:

  1. Usher syndrome, type 1C
  2. Deafness, autosomal recessive 18
  3. Usher syndrome, type 1

We currently have the 1st and the 3rd one mapped to the same ontology term, Orphanet_886. Hence, we have two almost identical evidence strings where only the source name of the trait is different.

To bring some context about the scale, currently we have 249 clusters of evidence strings with the same association fields because of this issue.

The solution which I suggest to implement for this is to add phenotype_source_name to the list of unique_association_fields. This means we would still have the same number of evidence strings, but they will have different association fields. This can be tracked & discussed in this ticket.

AsierGonzalez commented 4 years ago

Your suggested solution would fix "the issue" but I am not sure that I would add much value. As you say, in this case two ClinVar conditions ("Usher syndrome, type 1C" and "Usher syndrome, type 1") are mapped to the same ontology term ("Usher syndrome") which produces duplicates. However, the duplication does not mean that the association is lost, only that one of the two evidence strings is used. You can see this in the association page for USH1C and Usher syndrome (the evidence for RCV000763237 is in the last row in the screenshot below):

Screenshot 2020-05-12 at 16 27 58

If we deduplicate the evidence strings for the two conditions we would get two rows for RCV000763237 and I don't think that is that useful. Ideally, we would want those two phenotypes mapped to two independent ontology terms if they are to be considered two separate subtypes of the disease.

iandunham commented 4 years ago

If the disease are genuinely different then we should ask for additional ontology terms. It needs someone to understand the relationships of Type 1 and Type 1C, and edit the ontology. Also Deafness, autosomal recessive 18 may not be a disease but an OMIM term for an association which is equivalent to Usher Syndrome Type 1 or Type 1C. Unfortunately there are historical OMIM terms like this

tskir commented 4 years ago

@AsierGonzalez @iandunham Right, so I think we have two separate issues here:

When two trait names currently point to the same ontology term, but should actually point to two different ones, then I agree this is a curation issue and should be fixed at this level. Currently we have no way of doing this because of the archaic curation approach, but the new web-based system should allow us to keep track of all traits in fine detail. So we can gradually go through all such cases and fix them.

However, what do we do in cases where two names actually should point to the same ontology and do represent the same condition? I don't like the idea of our evidence strings having non-unique unique_association_fields. Should we, then, issue only one evidence string for a set of traits pointing to the same ontology? If yes, then how do we decide what to put into disease.source_name field, which currently contains the original trait name from ClinVar?

AsierGonzalez commented 4 years ago

If two ClinVar phenotypes point to the same ontology term in the same ClinVar accession then it is right to consider them duplicates. With the current approach (no phenotype source name in the unique association fields) our pipeline would discard one and there would not be any issues. As I said, I don't see the value of adding the source name to the unique association fields just to distinguish two different ways of calling the same phenotype. Do you see it differently?

tskir commented 4 years ago

In light of the clarifications you provided, I agree that while simply adding source name to the unique association fields will technically solve the problem, it will not add any value. So let's not do that.

I still want to avoid having duplicates in the unique association fields in the evidence strings which we submit to you. Partly because it's just not right to have duplicates in a field which is supposed to be unique, and partly because this actually complicates computing differences and bulk statistics.

From what you say, it seems that the current approach for dealing with this on your side is to choose one (arbitrary?) evidence string from a cluster of strings having the same unique_association_fields. In this case, what we could do is implement a similar thing on our side as well, so that the strings which you receive are already duplicate-free. This also has an added benefit that the deduplication on our side can be controlled and fine-tuned.

If you can confirm you're happy with this solution, then we can close this issue, and I'll add an internal issue to our backlog.

tskir commented 4 years ago

Also, I have created a relevant issue in the repository for the new trait curation web app we are developing: https://github.com/EBIvariation/trait-curation/issues/12. It will provide an opportunity to view all ontology terms which have mappings from multiple trait names.

tskir commented 4 years ago

Also also, I just want to share a quote from Gautier Koscielny from Open Targets, who in 2018 commented:

One possible explanation is because the disease mapping process in EVA has mapped different ClinVar diseases to the same EFO/ORDO term.

If you put the source name of the disease in the unique association field, they should be unique.

That’s why I always asked the data providers to provide the original label of the disease in the submission as a way to understand how the mapping was performed.

This is from our 2018 ticket dealing with unique_association_fields duplicates. This quote provides some background as to why I originally proposed to include the source name to the association fields. But apparently the approach to deduplication has changed since then (which is of course fine).

AsierGonzalez commented 4 years ago

Summary of discussion in EVA-OT meeting held on 9-06-20

In the future, when multiple ClinVar conditions map to the same EFO id all the ClinVar disease names will be listed in an array in ".disease.source_name":

AsierGonzalez commented 4 years ago

This commit to branch ag_eva_changes adds support for arrays of strings. The two alternatives below are now valid:

"disease": {
    "id": "http://www.orpha.net/ORDO/Orphanet_886",
    "source_name": "usher syndrome, type 1c",
    "name": "Usher syndrome"
  },

And

"disease": {
    "id": "http://www.orpha.net/ORDO/Orphanet_886",
    "source_name": ["usher syndrome, type 1c", "Usher syndrome, type 1"],
    "name": "Usher syndrome"
  },
andrewhercules commented 4 years ago

Re-assigning to the next sprint so that PR can be opened and approved

AsierGonzalez commented 4 years ago

I leave the ticket open until the next time we receive an EVA evidence file

AsierGonzalez commented 4 years ago

This has not been implemented and it isn't prioritised for 20.11 - moving it to 21.02

AsierGonzalez commented 3 years ago

@tskir we have revisited this issue and the solution discussed above does not apply anymore. We are working on a new ETL pipeline and as a consequence we are making important changes to the JSON schema (more info in #1249). We have reviewed the solution we discussed here, i.e. using an array in disease.source_name but that causes issues downstream. Therefore we have decided to use a different approach that will also apply to other data sources.

When multiple phenotypes/ conditions map to the same EFO, they will be captured in a single evidence string but the source_name will only contain one of them, randomly chosen. There will be another field where the full list of original disease names will be included in an array, not only the ones that map to the given EFO id.

I hope this does not cause any issues to you. We will get in touch with you soon to explain the changes.

tskir commented 3 years ago

@AsierGonzalez Thank you for the update! I have not yet done any substantial work on this issue, so no time is wasted.

The only part which bugs me a bit is choosing one source_name on random—sounds like it might somewhat undermine the stability/reproducibility of the results. But of course we'll be able to implement this in whichever way your pipeline require

AsierGonzalez commented 3 years ago

You are right in that using a random source_name may make it difficult to compare the output of different runs. However, any differences should not cause internal issues as the only use we have in mind for it is to display it in the webapp and we assume that all the conditions that map to the same EFO id will be very similar. The only case I can think of that could cause difficulties is if a subtype of a disease (e.g. Usher syndrome, type 1C that started this discussion) is mapped to a more general term in one release (causing a duplicate in the current implementation) and in the next release it gets mapped to another disease because it has been added to the ontology. Can you think of any other problematic cases?

One solution to ameliorate the issue could be not to choose randomly the source_name to be reported but to pick the first one alphabetically or the shortest or longest. At least this will avoid changes due to different ordering in the input file when nothing else changes.

tskir commented 3 years ago

Yes, choosing the first name in lexicographical order does sound better to me than just being random.

I can't think of other problematic cases right now, although I'm still not sure I understand the value behind choosing a designated source name to display. Do you suppose you could share a screenshot of how this data would look on the web page?

AsierGonzalez commented 3 years ago

Good, let's use the first one in lexicographical order then.

To be honest, it's still not decided whether this value will be displayed in the front-end but we want all the data sources to be as consistent as possible in the new schema, which is why having an array of strings was an issue.

We will keep our minds open when we start using the new schema and pipeline in case we identify any issues.

ireneisdoomed commented 3 years ago

This has not been implemented yet and it is not scheduled to be done for 21.02 - moving it to 21.04.

ireneisdoomed commented 3 years ago

The above-mentioned solution has been changed in the light of the migration of ClinVar to the v2 of the JSON Schema (more details here). The current approach to report the conditions in ClinVar from 21.04 on is the following:

An interim solution has been implemented so far where cohortPhenotypes lists the preferred name of all the different traits annotated for a given RCV since providing a complete list is not straightforward as @tskir has commented:

The cohortPhenotypes field will not yet contain all disease names from ClinVar. This is because trait representation in ClinVar is complex and will require additional investigations, which might not be complete in time for the release. In short, each RCV record contains a TraitSet, which in general has multiple Trait records, each of which in general has multiple Name entries (synonyms). We're currently only using one Name (marked as “preferred”) per Trait. So if there are three Traits in a TraitSet, the cohortPhenotypes will list only the three preferred names, one for each, regardless of how many synonym names each of the Traits had.