openml / server-api

Python-based server
https://openml.github.io/server-api/
BSD 3-Clause "New" or "Revised" License

Avoid storing duplicate information in the database #87

Open PGijsbers opened 11 months ago

PGijsbers commented 11 months ago

Information may be stored multiple times in the database; this came to light in https://github.com/openml/openml-python/issues/1289#issuecomment-1792250138. We should avoid storing duplicate information in the database, because it can easily lead to multiple conflicting truths. This issue can be used to keep track of all duplicated data, with the intention of refactoring our database in the future to avoid these pitfalls:

amueller commented 10 months ago

I assume this was done for efficiency, and we should be using automatic view materialization instead? Or was it by accident?
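
To make the trade-off concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table, view, and column names (`dataset`, `data_feature`, `dataset_feature_count`, `feature_count`) are hypothetical and are not taken from the OpenML schema. It shows a stored duplicate count going stale while a view that derives the same value stays correct. SQLite only offers plain views; whether and how a view can be materialized depends on the database engine, so that part is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical schema: feature_count duplicates what data_feature already encodes.
    CREATE TABLE dataset (id INTEGER PRIMARY KEY, name TEXT, feature_count INTEGER);
    CREATE TABLE data_feature (dataset_id INTEGER, name TEXT);

    -- A view derives the count from the single source of truth instead of storing it.
    CREATE VIEW dataset_feature_count AS
        SELECT d.id AS dataset_id, COUNT(f.name) AS feature_count
        FROM dataset d
        LEFT JOIN data_feature f ON f.dataset_id = d.id
        GROUP BY d.id;
    """
)

conn.execute("INSERT INTO dataset VALUES (1, 'iris', 2)")
conn.executemany(
    "INSERT INTO data_feature VALUES (1, ?)",
    [("sepal_length",), ("sepal_width",)],
)

# A later write adds a feature but forgets to update the stored count.
conn.execute("INSERT INTO data_feature VALUES (1, 'petal_length')")

stored = conn.execute(
    "SELECT feature_count FROM dataset WHERE id = 1"
).fetchone()[0]
derived = conn.execute(
    "SELECT feature_count FROM dataset_feature_count WHERE dataset_id = 1"
).fetchone()[0]

print(stored, derived)  # 2 3
```

The stored column and the derived value now disagree, which is the "multiple truths" problem; the view cannot drift, at the cost of recomputing the count on each read.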

PGijsbers commented 10 months ago

I wasn't involved with the database design, so I can't comment on why the duplication exists. I hope to discuss this with Jan later, but changes to the database likely won't happen in the next few months, as we are focusing on a (mostly) faithful reimplementation of the PHP REST API first. While this issue doesn't specifically mention it, potential changes to the database will be benchmarked and put into context with usage statistics, which will help us evaluate the alternatives. But in principle, the change outlined is something that should be looked at.