opendatadiscovery / odd-platform

First open-source data discovery and observability platform. We make life easy for data practitioners so you can focus on your business.
https://opendatadiscovery.org
Apache License 2.0
1.16k stars 96 forks

Last Modified Option Within API Calls #1680

Closed clintjb closed 1 month ago

clintjb commented 1 month ago

Is your proposal related to a problem?

Today we're pulling enormous amounts of data from ODD via the API (works brilliantly!). One of the issues we face, however, is that the data we require means we usually pull the detailed calls for the dictionary terms as well as the dataset views.

As the number of datasets and terms increases, this becomes enormously inefficient: we are basically calling everything, even if nothing has been changed or modified.

Describe the solution you'd like

Ideally the API calls would offer a last modified date, so we could pull a full list of the datasets / terms, identify which ones have been modified, and then make the detailed calls only for those.

As an example, if we use the call /api/terms (list of terms), I would get something like the following: "id": 5, "name": "XXX_VTTK_BOX_TEXT1", "definition": "Blah Blah Blah", "modified_date": "2024-05-28T11:40:16.919Z", "namespace": { "id": 2, "name": "Warehouse"

Ideally there would be an additional field in that response holding a modification date (for both terms and datasets).
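To illustrate the workflow requested above, here is a minimal Python sketch of how a client might use such a field. It assumes the list response carries a timestamp per item; the field name `modified_date` is only the placeholder used in this request, and `ids_needing_refresh` is a hypothetical helper, not part of ODD.

```python
from datetime import datetime, timezone
from typing import Dict, List


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as '2024-05-28T11:40:16.919Z'."""
    # Older Python versions do not accept a trailing 'Z' in fromisoformat().
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def ids_needing_refresh(
    items: List[Dict],
    last_sync: datetime,
    field: str = "modified_date",  # assumed field name, not confirmed by the API
) -> List[int]:
    """Return the ids whose detail call should be repeated.

    Items without the timestamp field are included as well, because
    we cannot tell whether they changed.
    """
    stale = []
    for item in items:
        if field not in item or parse_ts(item[field]) > last_sync:
            stale.append(item["id"])
    return stale


if __name__ == "__main__":
    listing = [
        {"id": 5, "modified_date": "2024-05-28T11:40:16.919Z"},
        {"id": 6, "modified_date": "2024-05-01T00:00:00.000Z"},
    ]
    last_sync = datetime(2024, 5, 15, tzinfo=timezone.utc)
    print(ids_needing_refresh(listing, last_sync))
```

Only the ids returned here would need the expensive detailed calls; everything else can be served from the previous pull.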

RamanDamayeu commented 1 month ago

Hi,

  1. For terms, we will add an "update_at" field (following the internal naming) to each item returned by the "/api/terms" endpoint. We will also add two optional parameters to the endpoint: "update_at_range_start_date_time", the inclusive start of the interval, and "update_at_range_end_date_time", the exclusive end of the interval. The returned items would then satisfy "update_at" >= update_at_range_start_date_time and "update_at" < update_at_range_end_date_time.
  2. For data entities (datasets in particular), we suggest using /activity/getActivity to track some (not all) modifications; the tracked activity types are listed at https://github.com/opendatadiscovery/odd-platform/blob/5ca558bfa79082b15cf9ed11847eb75de4e04e2f/odd-platform-api/src/main/java/org/opendatadiscovery/oddplatform/dto/activity/ActivityEventTypeDto.java#L3. Unfortunately, when we ingest data we do not track the modification date for most attributes (including metadata, description, etc.). The only part that is tracked is the dataset structure: the endpoint /api/datasets/{data_entity_id}/structure (Get latest version's DataSet structure information) returns a data_set_version.created_at attribute that shows when that version of the structure was created. I agree that in that case you would still need to iterate over every {data_entity_id} to find out which structures have changed since some date in the past. So if getActivity doesn't cover the requirements, adding an attribute with the last modification date for datasets will be tough; we'll need to put more effort into designing the feature.
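The half-open interval proposed for terms in item 1 can be sketched in Python as follows. The parameter names are taken from the comment above; the ISO 8601 value format, the base URL, and the helper names `terms_url` and `in_range` are assumptions made for illustration, not the confirmed API.

```python
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import urlencode


def terms_url(
    base_url: str,
    start: Optional[datetime] = None,
    end: Optional[datetime] = None,
) -> str:
    """Build a /api/terms URL with the optional range parameters.

    Assumption: the endpoint accepts ISO 8601 date-time strings.
    """
    params = {}
    if start is not None:
        params["update_at_range_start_date_time"] = start.isoformat()
    if end is not None:
        params["update_at_range_end_date_time"] = end.isoformat()
    query = urlencode(params)
    return f"{base_url}/api/terms" + (f"?{query}" if query else "")


def in_range(
    update_at: datetime,
    start: Optional[datetime] = None,
    end: Optional[datetime] = None,
) -> bool:
    """Mirror the described filter: start <= update_at < end."""
    if start is not None and update_at < start:
        return False
    if end is not None and update_at >= end:
        return False
    return True


if __name__ == "__main__":
    start = datetime(2024, 5, 1, tzinfo=timezone.utc)
    end = datetime(2024, 6, 1, tzinfo=timezone.utc)
    # "https://odd.example.com" is a placeholder host.
    print(terms_url("https://odd.example.com", start, end))
    print(in_range(start, start, end), in_range(end, start, end))
```

Because the start is inclusive and the end is exclusive, a term whose "update_at" equals the boundary value belongs to exactly one of two adjacent intervals.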
clintjb commented 1 month ago

Understood. For the terms alone this would already be enormously helpful; I can appreciate this is a very tricky request with significant implications.

Vladysl commented 1 month ago

Implemented in https://github.com/opendatadiscovery/odd-platform/pull/1685

RamanDamayeu commented 1 month ago

As discussed at https://github.com/opendatadiscovery/odd-platform/issues/1680#issuecomment-2137353813, in release 0.27.0 we're going to implement only the part for terms, in particular:

  1. An "update_at" property on the items in "/api/terms" responses.
  2. update_at_range_start_date_time and update_at_range_end_date_time input parameters to filter responses by the update_at property of terms.

The part for data entities needs much more redesign and effort, so it has been moved out of scope for now.
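One way a client could build an incremental pull on top of these parameters is to keep a checkpoint and request consecutive half-open windows. The Python sketch below is only an illustration: `sync_once` and `make_fake_fetch` are hypothetical helpers, and the fetch function is an in-memory stand-in for the real HTTP call to "/api/terms".

```python
from datetime import datetime, timezone
from typing import Callable, Dict, List, Optional, Tuple

Term = Dict[str, object]
# A fetch function takes (start, end) and returns the matching terms.
FetchTerms = Callable[[Optional[datetime], datetime], List[Term]]


def sync_once(
    fetch_terms: FetchTerms,
    checkpoint: Optional[datetime],
    now: datetime,
) -> Tuple[List[Term], datetime]:
    """Fetch terms updated in [checkpoint, now) and return the new checkpoint.

    Reusing the previous end as the next start means adjacent windows
    neither overlap nor leave a gap.
    """
    changed = fetch_terms(checkpoint, now)
    return changed, now


def make_fake_fetch(terms: List[Term]) -> FetchTerms:
    """In-memory stand-in that applies start <= update_at < end."""
    def fetch(start: Optional[datetime], end: datetime) -> List[Term]:
        return [
            t for t in terms
            if (start is None or t["update_at"] >= start) and t["update_at"] < end
        ]
    return fetch


if __name__ == "__main__":
    utc = timezone.utc
    fetch = make_fake_fetch([
        {"id": 1, "update_at": datetime(2024, 5, 1, tzinfo=utc)},
        {"id": 2, "update_at": datetime(2024, 5, 28, tzinfo=utc)},
    ])
    first, checkpoint = sync_once(fetch, None, datetime(2024, 5, 28, tzinfo=utc))
    second, checkpoint = sync_once(fetch, checkpoint, datetime(2024, 6, 1, tzinfo=utc))
    print([t["id"] for t in first], [t["id"] for t in second])
```

In a real client the checkpoint would be persisted between runs, and only the returned terms would need their detailed calls repeated.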