kedro-org / kedro-plugins

First-party plugins maintained by the Kedro team.
Apache License 2.0
94 stars 90 forks source link

Offer public API to get dataset info? #926

Open SajidAlamQB opened 2 weeks ago

SajidAlamQB commented 2 weeks ago

Description

Related to: https://github.com/kedro-org/kedro-viz/issues/1893#issuecomment-2456833233

When trying to retrieve metadata (like file size) from datasets in Kedro Viz we are running into issues, specifically within our DatasetStatsHook. Currently, to obtain this metadata, we need to access private attributes (e.g., _filepath, _fs, _protocol), which is causing inconsistency and compatibility problems with datasets that do not expose these attributes.

Context

By having a standardised, public way to retrieve dataset metadata we can improve best practices. This benefits users who want to collect dataset metadata or build plugins, as it provides a consistent and maintainable approach.

Possible Implementation

Perhaps adding an optional, public method (e.g., get_info) to datasets that can provide metadata like file size, etc.

MinuraPunchihewa commented 1 week ago

Hey @SajidAlamQB, I am happy to work on this. Can you provide some information on what kind of metadata would need to be retrieved exactly? Is there a list?

SajidAlamQB commented 1 week ago

Hi @MinuraPunchihewa,

Thank you for your response and for offering to work on this!

I think some of the metadata we're Interested in is:

@astrojuanlu did you have any thoughts on what else?

astrojuanlu commented 1 week ago

paging @bielstela since you opened https://github.com/kedro-org/kedro-viz/issues/1714 :)

before proceeding with get_info though (which was a strawman proposal), given that there's a AbstractDataset._describe, should we actually have a describe for this purpose, and reuse what's already being returned? Looking at a random dataset, we have a bunch of properties there already

https://github.com/kedro-org/kedro-plugins/blob/57f6279c403b642f2a9b224403c84685df6d3473/kedro-datasets/kedro_datasets/svmlight/svmlight_dataset.py#L168-L175

MinuraPunchihewa commented 1 week ago

@astrojuanlu This makes sense to me. I can reuse this and add any extra metadata that would be required. Is there anything outside of the list that @SajidAlamQB has given above?

astrojuanlu commented 6 days ago

Would love to know @merelcht and @rashidakanchwala opinion before proceeding

astrojuanlu commented 6 days ago

(Specifically, whether we should add a dedicated public get_info or if it should be describe and wrap _describe)

rashidakanchwala commented 3 days ago

I missed this but I love this! It would really help Kedro-viz; though we need to decide what it means for each dataset. For instance, for SQLDatasets, could it get_info provide SQL Query etc.

astrojuanlu commented 3 days ago

We discussed this in backlog grooming and we acknowledge that the original problem (that Viz needs private data from datasets https://github.com/kedro-org/kedro-viz/issues/1893#issuecomment-2432719034) exists and we want to address it.

There was past discussion about _describe when we implemented preview and it was discarded https://github.com/kedro-org/kedro-plugins/pull/504#issuecomment-1924190370

We didn't feel like adding more Viz-specific public methods, although an exception was made for said preview method since it wasn't part of AbstractDataset https://github.com/kedro-org/kedro-plugins/pull/504#issuecomment-1929357128

This problem was already anticipated when implementing the preview functionality https://github.com/kedro-org/kedro-viz/issues/662#issuecomment-1646003634 and re-discovered by a user some months later https://github.com/kedro-org/kedro-viz/issues/1714

Renaming this issue so it's more clear that this is an open problem.