vanallenlab / moalmanac-browser

Web portal interface for browsing the Molecular Oncology Almanac
https://moalmanac.org
GNU General Public License v2.0

Assertion ID changing over time #46

Open cgrisdale opened 1 month ago

cgrisdale commented 1 month ago

Hi, I noticed that some IDs from MOA assertions accessed in 2022 have changed. For example, PAK1 amplification association with reduced benefit from tamoxifen is currently listed under assertion_id 756, but data from 2022 had an ID of 768. ID 768 is now associated with a completely different assertion for pembrolizumab in MSI-high solid tumours. I've seen a few other examples of this but am not sure how prevalent it is across the database. Is it possible this is a technical issue, or are IDs ever reused if assertions are deleted and new ones created? Do you have any backups of assertions that we could use to assess where the changes have occurred? Any insight you have into this would be much appreciated. Thanks!

brendanreardon commented 1 month ago

Hi Cameron!

Ah, yes - a collaborator noticed this bug back in March of this year and so we added a "_deprecated" field in the April 2024 release of the database content.

Do you know what your date of access was? We have all releases of the database content on GitHub, going back as far as November 2020.

cgrisdale commented 1 month ago

Thanks for the quick reply Brendan.

I accessed the data in early 2022, so the release page has content from that time. However, I don't see the assertion_id field in the archived content files. Can this be found anywhere?

mathieulemieux commented 1 month ago

Hi @brendanreardon !

Same question as my colleague @cgrisdale here!

Here is a record in the 2024-10-03 archive: archive

Compared to a record from the REST API (e.g. https://moalmanac.org/api/assertions/1), the archived record has no "assertion_id" field. Are these ids intended to track a given record from one release to the next, or should we not rely on them?

brendanreardon commented 1 month ago

Ah. In short - since this year's April release you can rely on the ids being consistent across releases. The ids are generated when moving the database content over to the web server but, you're right -- we should add them into the JSONs too.

The longer answer is that we did not intend for people to rely on the ids, and our current data structure is a bit of an artifact of how we initially set up the web browser. We're hoping to soft launch our "2.0" API by the end of the year which will have reliable ids from the get-go and roughly follow the structure from GA4GH's GK pilot.

At the moment, the "assertion_id" and "feature_id" both just refer to the index of the given record in the JSON, 1-indexed. So, we'll aim to minimally add the assertion_id into the JSON for the November release.
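As a rough illustration of the indexing described above, here is a minimal Python sketch. It assumes a release is a JSON list of records and that an id is simply the record's 1-indexed position. The function names and the `gene` and `therapy` fields are hypothetical placeholders, not the actual MOAlmanac schema. It also shows how an id can shift between two releases when a record is removed or reordered.

```python
def with_assertion_ids(records):
    """Return copies of the records with a 1-indexed assertion_id attached.

    Assumption: the id is just the record's position in the release list.
    """
    return [{**record, "assertion_id": i} for i, record in enumerate(records, start=1)]


def id_shifts(old_records, new_records, key_fields):
    """Map old id -> new id for records present in both releases whose position changed.

    Records are matched on the values of key_fields (hypothetical field names).
    """
    def content_key(record):
        return tuple(record.get(field) for field in key_fields)

    new_ids = {content_key(r): i for i, r in enumerate(new_records, start=1)}
    shifts = {}
    for old_id, record in enumerate(old_records, start=1):
        new_id = new_ids.get(content_key(record))
        if new_id is not None and new_id != old_id:
            shifts[old_id] = new_id
    return shifts


# Toy data only; these are not real almanac records.
old_release = [
    {"gene": "A", "therapy": "x"},
    {"gene": "B", "therapy": "y"},
    {"gene": "C", "therapy": "z"},
]
new_release = [
    {"gene": "A", "therapy": "x"},
    {"gene": "C", "therapy": "z"},  # moved up after "B" was removed
    {"gene": "D", "therapy": "w"},
]

labelled = with_assertion_ids(new_release)
shifts = id_shifts(old_release, new_release, key_fields=("gene", "therapy"))
print(shifts)
```

With the toy data, the record that sat at position 3 in the old list sits at position 2 in the new one, which is the kind of shift reported in this issue. Matching on record content rather than on position is one way to compare archived releases that lack an id field.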

Does that help at all? In shorter - technical debt 😅

mathieulemieux commented 1 month ago

Yes it helps a lot, thank you @brendanreardon ! Can't wait to use the new API!

cgrisdale commented 1 month ago

Thank you @brendanreardon for that explanation.