vanallenlab / moalmanac-browser

Web portal interface for browsing the Molecular Oncology Almanac
https://moalmanac.org
GNU General Public License v2.0

clarification on seemingly duplicate entries in MOAlmanac #36

Closed ahwagner closed 2 years ago

ahwagner commented 2 years ago

I noticed that there are several entries that appear as duplicates on the web interface, such as Assertions 6 and 7.

Looking into the full-detail records from the API, it appears that the only difference between these two records is the context field and the feature ID:

2,3c2,3
<   "assertion_id": 7,
<   "context": "Resistance or intolerance to prior therapy",
---
>   "assertion_id": 6,
>   "context": "Chronic phase",
19c19
<       "feature_id": 7,
---
>       "feature_id": 6,

However, features 6 and 7 (and a few others, e.g. 1, 8, 9, 10) are otherwise identical, each with the following attributes:

{
  "attributes": [
    {
      "feature_type": "rearrangement",
      "gene1": "BCR",
      "gene2": "ABL1",
      "locus": null,
      "rearrangement_type": "Fusion"
    }
  ],
  "feature_type": "rearrangement"
}
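The record comparison shown earlier in this comment can be reproduced with a small recursive diff. This is only a sketch: the two records are abbreviated to the fields quoted in this issue, and the `features` key that holds the nested feature object is an assumed name, since the diff output shows only that `feature_id` is nested.

```python
from typing import Any


def diff_records(a: Any, b: Any, path: str = "") -> dict[str, tuple[Any, Any]]:
    """Return {path: (value_in_a, value_in_b)} for every leaf that differs."""
    if isinstance(a, dict) and isinstance(b, dict):
        out: dict[str, tuple[Any, Any]] = {}
        for key in sorted(set(a) | set(b)):
            sub = f"{path}.{key}" if path else key
            out.update(diff_records(a.get(key), b.get(key), sub))
        return out
    if isinstance(a, list) and isinstance(b, list) and len(a) == len(b):
        out = {}
        for i, (x, y) in enumerate(zip(a, b)):
            out.update(diff_records(x, y, f"{path}[{i}]"))
        return out
    # Leaves (or containers of mismatched shape) are compared directly.
    return {} if a == b else {path: (a, b)}


# Attributes shared by the features discussed in this issue.
BCR_ABL1 = {
    "attributes": [
        {
            "feature_type": "rearrangement",
            "gene1": "BCR",
            "gene2": "ABL1",
            "locus": None,
            "rearrangement_type": "Fusion",
        }
    ],
    "feature_type": "rearrangement",
}

# Abbreviated records; "features" is an assumed key name.
assertion_7 = {
    "assertion_id": 7,
    "context": "Resistance or intolerance to prior therapy",
    "features": [{"feature_id": 7, **BCR_ABL1}],
}
assertion_6 = {
    "assertion_id": 6,
    "context": "Chronic phase",
    "features": [{"feature_id": 6, **BCR_ABL1}],
}

differences = diff_records(assertion_7, assertion_6)
for field, (left, right) in differences.items():
    print(f"{field}: {left!r} != {right!r}")
```

Only `assertion_id`, `context`, and the nested `feature_id` are reported, which matches the diff above.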

This raises two questions.

First, regarding context field, what is the intent of this field? How should the content be interpreted, or inform interpretation of assertions?

Second, why are there otherwise identical features present with different feature_id values? Using the features endpoint returns only unique entries, but feature_id: 7 and feature_id: 8 are not present in that set, only the otherwise identical feature_id: 1. Why are these ids used in assertions but not represented in the features endpoint?

Thanks in advance!

brendanreardon commented 2 years ago

Hi Alex,

Thank you for checking out our resource. I'll address each of these questions individually, but the broader answer is that a full rework of the API and website is on our to-do list. The current setup carries a fair amount of technical debt, which makes for a less-than-ideal browsing experience.

Looking into the full-detail records from the API, it appears that the only difference between these two records is the context field and the feature ID

Correct, these two assertions differ by the context field so they are currently stored separately.

First, regarding context field, what is the intent of this field? How should the content be interpreted, or inform interpretation of assertions?

The intent of this field is to store the clinical context of an assertion, e.g. the phase of disease or whether a given relationship is only approved after prior therapies, as is the case with these two examples. The description in our S.O.P. is rather brief: "clinical context of the assertion as written in the associated source". This information is usually captured in the description of the assertion from the source, but we've extracted it where necessary in case it is something that we want to highlight further in the future.

Thus, this would inform the interpretation of an assertion by providing additional clinical context to the reader. In the case of an FDA approval, for example, the cancer type and the genomics may match while the tumor corresponding to the molecular profile has not yet been exposed to the prior therapy named in the context.

Second, why are there otherwise identical features present with different feature_id values? Using the features endpoint returns only unique entries, but feature_id: 7 and feature_id: 8 are not present in that set, only the otherwise identical feature_id: 1. Why are these ids used in assertions but not represented in the features endpoint?

This is entirely due to technical debt. The website and API (this repository) use a SQLite database, but the database contents are primarily stored in a flat format to make them easier to revise. Duplicate feature IDs can still be queried individually (e.g. /api/features/7 and /api/features/8). We decided to have the /features endpoint return only unique values in the interim, until we redo the backend.
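For downstream consumers, one way to reconcile the duplicate IDs is to group features by their content and treat the lowest ID in each group as canonical. The sketch below is an illustration under stated assumptions, not how the browser itself deduplicates: the input list is built by hand from the IDs named in this issue (1, 6, 7, 8, 9, 10), and the feature with ID 99 and genes `GENE_A`/`GENE_B` is a made-up placeholder added only to show a non-duplicate.

```python
import json
from collections import defaultdict


def group_duplicate_features(features):
    """Map each canonical (lowest) feature_id to all ids sharing its content."""
    by_content = defaultdict(list)
    for feature in features:
        # Everything except the id defines the feature's content.
        content = {k: v for k, v in feature.items() if k != "feature_id"}
        # Serialising with sorted keys gives an order-independent, hashable key.
        key = json.dumps(content, sort_keys=True)
        by_content[key].append(feature["feature_id"])
    return {min(ids): sorted(ids) for ids in by_content.values()}


def rearrangement(gene1, gene2):
    """Build a feature body shaped like the record quoted in this issue."""
    return {
        "attributes": [
            {
                "feature_type": "rearrangement",
                "gene1": gene1,
                "gene2": gene2,
                "locus": None,
                "rearrangement_type": "Fusion",
            }
        ],
        "feature_type": "rearrangement",
    }


# IDs 1 and 6-10 are the duplicates named in the issue; 99 is hypothetical.
features = [
    {"feature_id": fid, **rearrangement("BCR", "ABL1")}
    for fid in (1, 6, 7, 8, 9, 10)
]
features.append({"feature_id": 99, **rearrangement("GENE_A", "GENE_B")})

groups = group_duplicate_features(features)

# Reverse lookup: any feature_id seen in an assertion -> its canonical id.
canonical_id = {fid: canon for canon, ids in groups.items() for fid in ids}

print(groups)
print(canonical_id[7])
```

With this mapping, an assertion that references `feature_id` 7 or 8 resolves to feature 1, the entry the issue reports as present in the unique /features listing.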

Was this helpful? Technical debt is never a satisfying answer, but I hope that these responses answer your questions.

ahwagner commented 2 years ago

Thanks @brendanreardon. I just noticed #35 as well; I should have read it first, as it would have answered my questions about features and feature IDs. Understanding the technical debt is sufficient to help us capture the intent behind the provided identifiers and how we might represent them downstream.

Similarly, the additional information around context is sufficient for our data modeling purposes.

Closing this issue as my questions have been addressed; feel free to reopen it if that would be helpful to you.