kedro-org / kedro-viz

Visualise your Kedro data and machine-learning pipelines and track your experiments.
https://demo.kedro.org
Apache License 2.0
646 stars 106 forks source link

"Dataset could not be previewed" #1847

Open astrojuanlu opened 3 months ago

astrojuanlu commented 3 months ago

Description

I tried adding preview support to a custom dataset like this:

diff --git a/conf/base/catalog.yml b/conf/base/catalog.yml
index 5a9adb8..aef45cf 100644
--- a/conf/base/catalog.yml
+++ b/conf/base/catalog.yml
@@ -2,6 +2,10 @@ submissions-raw:
   type: reddit_summarizer.datasets.DeltaPolarsDataset
   filepath: s3://reddit-submissions/submissions-raw
   credentials: minio_object_store
+  metadata:
+    kedro-viz:
+      preview_args:
+        nrows: 5

 submissions-summaries:
   type: reddit_summarizer.datasets.DeltaPolarsDataset
diff --git a/src/reddit_summarizer/datasets/delta_polars_dataset.py b/src/reddit_summarizer/datasets/delta_polars_dataset.py
index e6f0fff..a9a2724 100644
--- a/src/reddit_summarizer/datasets/delta_polars_dataset.py
+++ b/src/reddit_summarizer/datasets/delta_polars_dataset.py
@@ -3,6 +3,8 @@ import typing as t
 import polars as pl
 from kedro.io import AbstractDataset

+from kedro_datasets._typing import TablePreview
+

 class DeltaPolarsDataset(AbstractDataset[pl.DataFrame, pl.DataFrame]):
     """``DeltaDataset`` loads/saves data from/to a Delta Table using an underlying
@@ -48,3 +50,10 @@ class DeltaPolarsDataset(AbstractDataset[pl.DataFrame, pl.DataFrame]):
             storage_options=self._storage_options,
             metadata=self._metadata,
         )
+
+    def preview(self, nrows: int) -> TablePreview:
+        subset = self._load().head(nrows)
+        df_dict = {}
+        for column in subset.columns:
+            df_dict[column] = subset[column]
+        return df_dict

And yet I got an error:

[04/08/24 09:52:40] WARNING  'submissions-raw' could not be previewed. Full exception: TypeError: DeltaPolarsDataset.preview() missing 1 required positional argument:      flowchart.py:814
                             'nrows'

What am I doing wrong?

Your Environment

Include as many relevant details as possible about the environment you experienced the bug in:

Checklist

rashidakanchwala commented 3 months ago

The code looks right to me. @SajidAlamQB , since u added 4 dataset previews recently, did you encounter this error? Is there anything u think looks incorrect?

SajidAlamQB commented 3 months ago

Maybe you could try setting a default value for nrows in the method definition nrows: int = 5.

astrojuanlu commented 3 months ago

Setting a default value avoids that error, although there must be something fishy happening here:

https://github.com/kedro-org/kedro-viz/blob/3eeb5657fccb579187b8156670da6720c21bc2c5/package/kedro_viz/models/flowchart.py#L806-L811

On the other hand, now I get another error, but that's unrelated to this issue:

``` [04/08/24 16:31:28] ERROR Exception in ASGI application httptools_impl.py:424 ╭────────────────────────────── Traceback (most recent call last) ───────────────────────────────╮ │ /Users/juan_cano/Projects/QuantumBlackLabs/workshop-kedro-huggingface/.venv/lib/python3.11/sit │ │ e-packages/uvicorn/protocols/http/httptools_impl.py:419 in run_asgi │ │ │ │ 416 │ # ASGI exception wrapper │ │ 417 │ async def run_asgi(self, app: "ASGI3Application") -> None: │ │ 418 │ │ try: │ │ ❱ 419 │ │ │ result = await app( # type: ignore[func-returns-value] │ │ 420 │ │ │ │ self.scope, self.receive, self.send │ │ 421 │ │ │ ) │ │ 422 │ │ except BaseException as exc: │ │ │ │ /Users/juan_cano/Projects/QuantumBlackLabs/workshop-kedro-huggingface/.venv/lib/python3.11/sit │ │ e-packages/uvicorn/middleware/proxy_headers.py:84 in __call__ │ │ │ │ 81 │ │ │ │ │ port = 0 │ │ 82 │ │ │ │ │ scope["client"] = (host, port) # type: ignore[arg-type] │ │ 83 │ │ │ │ ❱ 84 │ │ return await self.app(scope, receive, send) │ │ 85 │ │ │ │ /Users/juan_cano/Projects/QuantumBlackLabs/workshop-kedro-huggingface/.venv/lib/python3.11/sit │ │ e-packages/fastapi/applications.py:1054 in __call__ │ │ │ │ 1051 │ async def __call__(self, scope: Scope, receive: Receive, send: Send) -> None: │ │ 1052 │ │ if self.root_path: │ │ 1053 │ │ │ scope["root_path"] = self.root_path │ │ ❱ 1054 │ │ await super().__call__(scope, receive, send) │ │ 1055 │ │ │ 1056 │ def add_api_route( │ │ 1057 │ │ self, │ │ │ │ /Users/juan_cano/Projects/QuantumBlackLabs/workshop-kedro-huggingface/.venv/lib/python3.11/sit │ │ e-packages/starlette/applications.py:123 in __call__ │ │ │ │ 120 │ │ scope["app"] = self │ │ 121 │ │ if self.middleware_stack is None: │ │ 122 │ │ │ self.middleware_stack = self.build_middleware_stack() │ │ ❱ 123 │ │ await self.middleware_stack(scope, receive, send) │ │ 124 │ │ │ 125 │ def on_event(self, event_type: str) -> typing.Callable: # type: ignore[type-arg] │ │ 126 │ │ return self.router.on_event(event_type) # pragma: nocover │ │ │ │ /Users/juan_cano/Projects/QuantumBlackLabs/workshop-kedro-huggingface/.venv/lib/python3.11/sit │ │ e-packages/starlette/middleware/errors.py:186 in __call__ │ │ │ │ 183 │ │ │ # We always continue to raise the exception. │ │ 184 │ │ │ # This allows servers to log the error, or allows test clients │ │ 185 │ │ │ # to optionally raise the error within the test case. │ │ ❱ 186 │ │ │ raise exc │ │ 187 │ │ │ 188 │ def format_line( │ │ 189 │ │ self, index: int, line: str, frame_lineno: int, frame_index: int │ │ │ │ /Users/juan_cano/Projects/QuantumBlackLabs/workshop-kedro-huggingface/.venv/lib/python3.11/sit │ │ e-packages/starlette/middleware/errors.py:164 in __call__ │ │ │ │ 161 │ │ │ await send(message) │ │ 162 │ │ │ │ 163 │ │ try: │ │ ❱ 164 │ │ │ await self.app(scope, receive, _send) │ │ 165 │ │ except Exception as exc: │ │ 166 │ │ │ request = Request(scope) │ │ 167 │ │ │ if self.debug: │ │ │ │ /Users/juan_cano/Projects/QuantumBlackLabs/workshop-kedro-huggingface/.venv/lib/python3.11/sit │ │ e-packages/starlette/middleware/exceptions.py:62 in __call__ │ │ │ │ 59 │ │ else: │ │ 60 │ │ │ conn = WebSocket(scope, receive, send) │ │ 61 │ │ │ │ ❱ 62 │ │ await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send) │ │ 63 │ │ │ 64 │ def http_exception(self, request: Request, exc: Exception) -> Response: │ │ 65 │ │ assert isinstance(exc, HTTPException) │ │ │ │ /Users/juan_cano/Projects/QuantumBlackLabs/workshop-kedro-huggingface/.venv/lib/python3.11/sit │ │ e-packages/starlette/_exception_handler.py:64 in wrapped_app │ │ │ │ 61 │ │ │ │ handler = _lookup_exception_handler(exception_handlers, exc) │ │ 62 │ │ │ │ │ 63 │ │ │ if handler is None: │ │ ❱ 64 │ │ │ │ raise exc │ │ 65 │ │ │ │ │ 66 │ │ │ if response_started: │ │ 67 │ │ │ │ msg = "Caught handled exception, but response already started." │ │ │ │ /Users/juan_cano/Projects/QuantumBlackLabs/workshop-kedro-huggingface/.venv/lib/python3.11/sit │ │ e-packages/starlette/_exception_handler.py:53 in wrapped_app │ │ │ │ 50 │ │ │ await send(message) │ │ 51 │ │ │ │ 52 │ │ try: │ │ ❱ 53 │ │ │ await app(scope, receive, sender) │ │ 54 │ │ except Exception as exc: │ │ 55 │ │ │ handler = None │ │ 56 │ │ │ │ /Users/juan_cano/Projects/QuantumBlackLabs/workshop-kedro-huggingface/.venv/lib/python3.11/sit │ │ e-packages/starlette/routing.py:758 in __call__ │ │ │ │ 755 │ │ """ │ │ 756 │ │ The main entry point to the Router class. │ │ 757 │ │ """ │ │ ❱ 758 │ │ await self.middleware_stack(scope, receive, send) │ │ 759 │ │ │ 760 │ async def app(self, scope: Scope, receive: Receive, send: Send) -> None: │ │ 761 │ │ assert scope["type"] in ("http", "websocket", "lifespan") │ │ │ │ /Users/juan_cano/Projects/QuantumBlackLabs/workshop-kedro-huggingface/.venv/lib/python3.11/sit │ │ e-packages/starlette/routing.py:778 in app │ │ │ │ 775 │ │ │ match, child_scope = route.matches(scope) │ │ 776 │ │ │ if match == Match.FULL: │ │ 777 │ │ │ │ scope.update(child_scope) │ │ ❱ 778 │ │ │ │ await route.handle(scope, receive, send) │ │ 779 │ │ │ │ return │ │ 780 │ │ │ elif match == Match.PARTIAL and partial is None: │ │ 781 │ │ │ │ partial = route │ │ │ │ /Users/juan_cano/Projects/QuantumBlackLabs/workshop-kedro-huggingface/.venv/lib/python3.11/sit │ │ e-packages/starlette/routing.py:299 in handle │ │ │ │ 296 │ │ │ │ ) │ │ 297 │ │ │ await response(scope, receive, send) │ │ 298 │ │ else: │ │ ❱ 299 │ │ │ await self.app(scope, receive, send) │ │ 300 │ │ │ 301 │ def __eq__(self, other: typing.Any) -> bool: │ │ 302 │ │ return ( │ │ │ │ /Users/juan_cano/Projects/QuantumBlackLabs/workshop-kedro-huggingface/.venv/lib/python3.11/sit │ │ e-packages/starlette/routing.py:79 in app │ │ │ │ 76 │ │ │ │ response = await run_in_threadpool(func, request) │ │ 77 │ │ │ await response(scope, receive, send) │ │ 78 │ │ │ │ ❱ 79 │ │ await wrap_app_handling_exceptions(app, request)(scope, receive, send) │ │ 80 │ │ │ 81 │ return app │ │ 82 │ │ │ │ /Users/juan_cano/Projects/QuantumBlackLabs/workshop-kedro-huggingface/.venv/lib/python3.11/sit │ │ e-packages/starlette/_exception_handler.py:64 in wrapped_app │ │ │ │ 61 │ │ │ │ handler = _lookup_exception_handler(exception_handlers, exc) │ │ 62 │ │ │ │ │ 63 │ │ │ if handler is None: │ │ ❱ 64 │ │ │ │ raise exc │ │ 65 │ │ │ │ │ 66 │ │ │ if response_started: │ │ 67 │ │ │ │ msg = "Caught handled exception, but response already started." │ │ │ │ /Users/juan_cano/Projects/QuantumBlackLabs/workshop-kedro-huggingface/.venv/lib/python3.11/sit │ │ e-packages/starlette/_exception_handler.py:53 in wrapped_app │ │ │ │ 50 │ │ │ await send(message) │ │ 51 │ │ │ │ 52 │ │ try: │ │ ❱ 53 │ │ │ await app(scope, receive, sender) │ │ 54 │ │ except Exception as exc: │ │ 55 │ │ │ handler = None │ │ 56 │ │ │ │ /Users/juan_cano/Projects/QuantumBlackLabs/workshop-kedro-huggingface/.venv/lib/python3.11/sit │ │ e-packages/starlette/routing.py:74 in app │ │ │ │ 71 │ │ │ │ 72 │ │ async def app(scope: Scope, receive: Receive, send: Send) -> None: │ │ 73 │ │ │ if is_async_callable(func): │ │ ❱ 74 │ │ │ │ response = await func(request) │ │ 75 │ │ │ else: │ │ 76 │ │ │ │ response = await run_in_threadpool(func, request) │ │ 77 │ │ │ await response(scope, receive, send) │ │ │ │ /Users/juan_cano/Projects/QuantumBlackLabs/workshop-kedro-huggingface/.venv/lib/python3.11/sit │ │ e-packages/fastapi/routing.py:296 in app │ │ │ │ 293 │ │ │ │ │ │ │ response_args["status_code"] = current_status_code │ │ 294 │ │ │ │ │ │ if sub_response.status_code: │ │ 295 │ │ │ │ │ │ │ response_args["status_code"] = sub_response.status_code │ │ ❱ 296 │ │ │ │ │ │ content = await serialize_response( │ │ 297 │ │ │ │ │ │ │ field=response_field, │ │ 298 │ │ │ │ │ │ │ response_content=raw_response, │ │ 299 │ │ │ │ │ │ │ include=response_model_include, │ │ │ │ /Users/juan_cano/Projects/QuantumBlackLabs/workshop-kedro-huggingface/.venv/lib/python3.11/sit │ │ e-packages/fastapi/routing.py:160 in serialize_response │ │ │ │ 157 │ │ │ ) │ │ 158 │ │ │ │ 159 │ │ if hasattr(field, "serialize"): │ │ ❱ 160 │ │ │ return field.serialize( │ │ 161 │ │ │ │ value, │ │ 162 │ │ │ │ include=include, │ │ 163 │ │ │ │ exclude=exclude, │ │ │ │ /Users/juan_cano/Projects/QuantumBlackLabs/workshop-kedro-huggingface/.venv/lib/python3.11/sit │ │ e-packages/fastapi/_compat.py:147 in serialize │ │ │ │ 144 │ │ ) -> Any: │ │ 145 │ │ │ # What calls this code passes a value that already called │ │ 146 │ │ │ # self._type_adapter.validate_python(value) │ │ ❱ 147 │ │ │ return self._type_adapter.dump_python( │ │ 148 │ │ │ │ value, │ │ 149 │ │ │ │ mode=mode, │ │ 150 │ │ │ │ include=include, │ │ │ │ /Users/juan_cano/Projects/QuantumBlackLabs/workshop-kedro-huggingface/.venv/lib/python3.11/sit │ │ e-packages/pydantic/type_adapter.py:333 in dump_python │ │ │ │ 330 │ │ Returns: │ │ 331 │ │ │ The serialized object. │ │ 332 │ │ """ │ │ ❱ 333 │ │ return self.serializer.to_python( │ │ 334 │ │ │ __instance, │ │ 335 │ │ │ mode=mode, │ │ 336 │ │ │ by_alias=by_alias, │ ╰────────────────────────────────────────────────────────────────────────────────────────────────╯ PydanticSerializationError: Unable to serialize unknown type: ```
ravi-kumar-pilla commented 2 months ago

Hi @astrojuanlu ,

Thank you for raising the issue. I am able to produce the bug on my side. The documentation for creating a custom dataset seems to create a dictionary but the value inside it is of type pandas.series or polars.series (in your case).

I created a CustomDataset mimicking the excel_dataset implementation and modified the preview method as below -

    def preview(self, nrows, ncolumns, filters) -> TablePreview:
      dataset_copy = self._copy()
      dataset_copy._load_args["nrows"] = nrows
      filtered_data = dataset_copy.load()

      for column, value in filters.items():
          filtered_data = filtered_data[filtered_data[column] == value]
      subset = filtered_data.iloc[:nrows, :ncolumns]
      df_dict = {}
      for column in subset.columns:
          df_dict[column] = subset[column]

      return df_dict

catalog -

shuttles:
  type: demo_project.extras.datasets.custom_dataset.CustomDataset
  filepath: ${_base_location}/01_raw/shuttles.xlsx
  metadata:
    kedro-viz:
      layer: raw
      preview_args:
        nrows: 5
        ncolumns: 2 
        filters: {
          engine_type: Quantum 
        } 

This gives me the below preview data -

preview={'id': 0    63561
1    36260
2    57015
Name: id, dtype: int64, 'shuttle_location': 0                  Niue
1              Anguilla
2    Russian Federation
Name: shuttle_location, dtype: object},

So, when FAST API tries to searialize this response it throws the error PydanticSerializationError rightly. I tried to correct the preview method as below - (change - df_dict[column] = subset[column].to_dict())

  def preview(self, nrows, ncolumns, filters) -> TablePreview:
      dataset_copy = self._copy()
      dataset_copy._load_args["nrows"] = nrows
      filtered_data = dataset_copy.load()

      for column, value in filters.items():
          filtered_data = filtered_data[filtered_data[column] == value]
      subset = filtered_data.iloc[:nrows, :ncolumns]
      df_dict = {}
      for column in subset.columns:
          df_dict[column] = subset[column].to_dict()

      return df_dict

Now the response is a valid dict like -

 preview={'id': {0: 63561, 1: 36260, 2: 57015}, 'shuttle_location': {0: 'Niue', 1: 'Anguilla', 2: 'Russian Federation'}}

Though the error is resolved, there needs to be further discussion with the team to check if the dictionary needs to be in a certain format, so that it will be previewed as I am seeing a blank box in the UI with the above preview response.

image

An example preview response (TablePreview) which works -

preview={'index': [0, 1, 2, 3, 4], 'columns': ['id', 'company_rating', 'company_location', 'total_fleet_count', 'iata_approved'], 'data': [[35029, '100%', 'Niue', 4.0, 'f'], [30292, '67%', 'Anguilla', 6.0, 'f'], [19032, '67%', 'Russian Federation', 4.0, 'f'], [8238, '91%', 'Barbados', 15.0, 't'], [30342, nan, 'Sao Tome and Principe', 2.0, 't']]}

Next Steps:

  1. Confirm with the team (@jitu5 , @SajidAlamQB ) if there needs to be any specific format for the preview dictionary to be viewed in the UI ?
  2. Should the user be responsible to return a valid dictionary key-value pairs, from the preview method ? (cc: @rashidakanchwala )
  3. In case the user returns an invalid dictionary key-value pairs fast api raises the exception as it did. Should we show a custom error message for the user or leave it like this ? I think we should catch (PydanticSerializationError) for now and show the user some custom message which would allow them to check the preview data being returned.

Thank you

jitu5 commented 2 months ago

Confirm with the team (@jitu5 , @SajidAlamQB ) if there needs to be any specific format for the preview dictionary to be viewed in the UI ?

@ravi-kumar-pilla for the TablePreview, UI expect in below format

preview={
    'index': number[], 
    'columns': string[], 
    'data': any[][]  // List[List[Any]] 
}
ravi-kumar-pilla commented 2 months ago

Thanks @jitu5 , I think we should document the expected {key:value} pairs for all the supported preview types -

TablePreview = NewType("TablePreview", dict)
ImagePreview = NewType("ImagePreview", bytes)
PlotlyPreview = NewType("PlotlyPreview", dict)
JSONPreview = NewType("JSONPreview", dict)

I will create a ticket to document the expected schema and also see if we can enforce that in the backend so that, FE always gets what is required.

For this ticket, I would like to add an except block to catch PydanticSerializationError and inform the users to return a valid dictionary {key:value} pair. Thank you

rashidakanchwala commented 2 months ago

Thanks, Ravi. Great debugging. I now realize this is happening because while our TablePreview accepts a generic dict, our frontend only supports Pandas DataFrames for displaying tabular data and not any other formats such as Polars.

We could add validation in the backend, but I do feel our frontend needs to be more versatile to handle different tabular formats.

Update - we could check with Vizro, how they do this in the BE.

rashidakanchwala commented 2 months ago

The FE logic to distinguish between Polars and Pandas dataframe is quite straightforward.

Ideally, we could natively support both formats. My question is, do we anticipate more formats? If so, maybe we could use Plotly to preview data tables as well, instead of our own custom table preview?

@astrojuanlu?

astrojuanlu commented 2 months ago

I guess the main problem I see is that, as a user, it's not clear what the preview() method should return. I initially did the df_dict thing because I copy-pasted it from somewhere else without really understanding how it works.

rashidakanchwala commented 2 months ago

Yes, I think TablePreview is very generic when it is actually only meant for Pandas. So I am proposing we rename TablePreview to PandasTablePreview.

and create another NewType called PolarsTablePreview which also returns a dict.

The only difference is now in the FE, we have a different way of displaying both dicts as tables. What do you all think? @astrojuanlu , @ravi-kumar-pilla , @jitu5

jitu5 commented 2 months ago

@rashidakanchwala If the logic to distinguish between Polars and Pandas dataframe is straightforward at FE then bases on the type, we can create a separate components for each type to display.

astrojuanlu commented 2 months ago

Yes, I think TablePreview is very generic when it is actually only meant for Pandas. So I am proposing we rename TablePreview to PandasTablePreview.

and create another NewType called PolarsTablePreview which also returns a dict.

To clarify: if I change my preview function to return the PolarsTablePreview you're proposing and I don't change the body, would it work?

    def preview(self, nrows: int) -> PolarsTablePreview:
        subset = self._load().head(nrows)
        df_dict = {}
        for column in subset.columns:
            df_dict[column] = subset[column]
        return df_dict

My understanding is that someone needs to make sure the output of the function can be serialized.

  1. If this is the user responsibility, then the object returned from preview should have some constraints. Where should these be enforced?
  2. If the user can return whatever, then somebody else should take care of serializing that. But my understanding is that there are limits to how smart we can be about the object returned by preview, which takes me back to 1.

I think ideally the user would be helped by the IDE to return a proper object. How can we achieve this, so the journey is more clear?

rashidakanchwala commented 2 months ago

I agree, @ravi-kumar-pilla's ticket here https://github.com/kedro-org/kedro-viz/issues/1884 proposes what you mentioned.

At the time we decided to use NewType, we did not anticipate running into this; if we need to enforce validation - we might have to change the NewTypes into classes that we did consider before.