noi-techpark / bdp-core

Open Data Hub / Timeseries Core
https://opendatahub.com

As an AI expert, I would need to overcome the constraint of the Open Data Hub to also write data in the past, not just in the future. #277

Closed. @rcavaliere closed this issue 5 months ago.

clezag commented 7 months ago

@rcavaliere from what @dulvui told me, there are two use cases:

  1. Adding missing historical data that we only received at a later date.
    This should be fairly easy to do: on the writer API side there is a check that prevents you from pushing data older than the most recent record, but we could implement a flag that disables that check for a given push (see the sketch after this list). The data collector would then be responsible for ensuring data consistency (e.g. avoiding duplicates and near-duplicates). We would push directly into history, and not update the current measurement tables if newer records already exist. If you already have a use case, we can do it in tandem with the necessary back-end changes.

  2. In the case of predictions (like weather or parking) that write into the future, it should still be possible to push records on the timeline preceding those predictions, or to add newer predictions for the same date.
    Here we agree that this is a much more complex issue, and we'd like to discuss the specific use case with you. We already have a similar problem with parking forecasts. Our current data model does not support these kinds of predictions well; we will either have to extend it or find an acceptable workaround. In any case, we would like to find a general solution for how we handle any kind of prediction data going forward.
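
To make the first point concrete, here is a minimal, self-contained sketch of what such an opt-out flag could look like on the writer side. All names (`Measurement`, `allowHistoricalWrites`, `insertIntoHistory`) are hypothetical and this is not the actual bdp-core writer API; it only illustrates the existing check against the newest record per [station id + data type + period] and how a flag could bypass it.

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the actual bdp-core writer API.
class WriterCheckSketch {

    // One pushed or stored measurement; the triple [station id + data type + period]
    // is the key the existing check compares against.
    record Measurement(String stationId, String dataType, int period, Instant timestamp, double value) {}

    // Newest stored timestamp per [station id + data type + period]
    private final Map<String, Instant> latestByKey = new HashMap<>();

    private String key(Measurement m) {
        return m.stationId() + "|" + m.dataType() + "|" + m.period();
    }

    void push(List<Measurement> records, boolean allowHistoricalWrites) {
        for (Measurement m : records) {
            Instant latest = latestByKey.get(key(m));
            boolean olderThanLatest = latest != null && !m.timestamp().isAfter(latest);
            if (olderThanLatest && !allowHistoricalWrites) {
                continue; // current behaviour: records older than the newest one are rejected
            }
            insertIntoHistory(m); // always goes into the history table
            if (!olderThanLatest) {
                // only advance the "current measurement" view when the record is newer
                latestByKey.put(key(m), m.timestamp());
            }
        }
    }

    private void insertIntoHistory(Measurement m) {
        // stand-in for the actual history-table insert
    }
}
```

With the flag set, a record older than what is already stored only lands in the history table; the current-measurement view is advanced only when a record is genuinely newer, which matches the behaviour described in point 1.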

rcavaliere commented 7 months ago

@clezag thanks for your feedback. Just an additional question in the meantime: is this rule defined at station level or at data type level? If it's at the data type level, then we won't have any issues at all.

clezag commented 7 months ago

@rcavaliere It first loads the most recent record matching [ station id + data type + period ] from the DB, and then compares that timestamp against each record being pushed.

(While researching your question, I found a potential bug where the period is not always correctly considered: https://github.com/noi-techpark/bdp-core/issues/278)

I think we will still have issues if we want to keep outdated prediction data and not overwrite it. E.g. if I do a weather prediction for the next 5 days, every day will have 5 predictions, one from each of the 5 days before it. Somehow I have to distinguish these 5, in a way that is clear to the user.

Handling it with data types (e.g. weather-forecast-1-day, weather-forecast-2-day) is one solution we've discussed, but it's not very extensible or dynamic if your forecasting periods are anything other than a few days or hours.
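
As a sketch of the two directions mentioned above, the snippet below contrasts encoding the horizon in the data type name with a hypothetical model where each forecast carries the time it was issued alongside its target time. None of these types or fields exist in bdp-core; they are illustrative only.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch only; these types and fields do not exist in bdp-core.
class ForecastModelSketch {

    // Each forecast carries both the target time and the time it was issued,
    // which is what distinguishes the 5 overlapping predictions for the same day.
    record Forecast(String stationId, String dataType, Instant target, Instant issuedAt, double value) {}

    // Option discussed above: encode the horizon in the data type name,
    // e.g. "weather-forecast-1-day" ... "weather-forecast-5-day".
    static String horizonDataType(Forecast f) {
        long days = Duration.between(f.issuedAt(), f.target()).toDays();
        return f.dataType() + "-" + days + "-day";
    }

    // Alternative: keep a single data type and let the consumer pick,
    // e.g. the most recently issued forecast for a given target time.
    static Optional<Forecast> latestIssuedFor(List<Forecast> all, Instant target) {
        return all.stream()
                .filter(f -> f.target().equals(target))
                .max(Comparator.comparing(Forecast::issuedAt));
    }
}
```

The second variant avoids multiplying data types, but it pushes the choice of which prediction to show onto the consumer, which is exactly the "clear to the user" problem mentioned above.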

rcavaliere commented 7 months ago

@clezag this afternoon I briefly discussed the topic with Simon. I get the point; however, for simplicity we could also consider keeping only the first set of predictions we get for a certain time interval, without foreseeing any overwrite possibility. Or we could import only a subset of all the future predictions available (e.g. just the next day out of the next 5 days), since we want to consider the latest (and presumably most precise) prediction available at the time.
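
A minimal sketch of the second, simpler option: assuming the collector receives the full 5-day forecast, it would keep only the predictions targeting the next 24 hours and discard the rest before pushing. The names here are hypothetical, not part of any existing collector.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.List;

// Hypothetical sketch only: drop everything beyond the next 24 hours before pushing.
class NextDayOnlySketch {

    record Forecast(Instant target, double value) {}

    static List<Forecast> keepNextDayOnly(List<Forecast> all, Instant now) {
        Instant cutoff = now.plus(1, ChronoUnit.DAYS);
        return all.stream()
                .filter(f -> !f.target().isBefore(now) && !f.target().isAfter(cutoff))
                .toList();
    }
}
```

This keeps only the latest (and presumably most precise) prediction for each point in time, at the cost of discarding the longer horizons entirely.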

clezag commented 7 months ago

@rcavaliere I agree. If there are no overlapping predictions, it should be fine as it is already.

I do think we will have to tackle this topic eventually, though; whether now is the right time depends on your project constraints.

ohnewein commented 5 months ago

@rcavaliere should this issue be closed? You found another way to handle this together with @clezag, right?

rcavaliere commented 5 months ago

@ohnewein yes we can close it!