Open astrojuanlu opened 3 months ago
The code looks right to me. @SajidAlamQB , since u added 4 dataset previews recently, did you encounter this error? Is there anything u think looks incorrect?
Maybe you could try setting a default value for nrows
in the method definition nrows: int = 5
.
Setting a default value avoids that error, although there must be something fishy happening here:
On the other hand, now I get another error, but that's unrelated to this issue:
Hi @astrojuanlu ,
Thank you for raising the issue. I am able to produce the bug on my side. The documentation for creating a custom dataset seems to create a dictionary but the value inside it is of type pandas.series or polars.series (in your case).
I created a CustomDataset mimicking the excel_dataset implementation and modified the preview method as below -
def preview(self, nrows, ncolumns, filters) -> TablePreview:
dataset_copy = self._copy()
dataset_copy._load_args["nrows"] = nrows
filtered_data = dataset_copy.load()
for column, value in filters.items():
filtered_data = filtered_data[filtered_data[column] == value]
subset = filtered_data.iloc[:nrows, :ncolumns]
df_dict = {}
for column in subset.columns:
df_dict[column] = subset[column]
return df_dict
catalog -
shuttles:
type: demo_project.extras.datasets.custom_dataset.CustomDataset
filepath: ${_base_location}/01_raw/shuttles.xlsx
metadata:
kedro-viz:
layer: raw
preview_args:
nrows: 5
ncolumns: 2
filters: {
engine_type: Quantum
}
This gives me the below preview data -
preview={'id': 0 63561
1 36260
2 57015
Name: id, dtype: int64, 'shuttle_location': 0 Niue
1 Anguilla
2 Russian Federation
Name: shuttle_location, dtype: object},
So, when FAST API tries to searialize this response it throws the error PydanticSerializationError rightly. I tried to correct the preview method as below - (change - df_dict[column] = subset[column].to_dict())
def preview(self, nrows, ncolumns, filters) -> TablePreview:
dataset_copy = self._copy()
dataset_copy._load_args["nrows"] = nrows
filtered_data = dataset_copy.load()
for column, value in filters.items():
filtered_data = filtered_data[filtered_data[column] == value]
subset = filtered_data.iloc[:nrows, :ncolumns]
df_dict = {}
for column in subset.columns:
df_dict[column] = subset[column].to_dict()
return df_dict
Now the response is a valid dict like -
preview={'id': {0: 63561, 1: 36260, 2: 57015}, 'shuttle_location': {0: 'Niue', 1: 'Anguilla', 2: 'Russian Federation'}}
Though the error is resolved, there needs to be further discussion with the team to check if the dictionary needs to be in a certain format, so that it will be previewed as I am seeing a blank box in the UI with the above preview response.
An example preview response (TablePreview) which works -
preview={'index': [0, 1, 2, 3, 4], 'columns': ['id', 'company_rating', 'company_location', 'total_fleet_count', 'iata_approved'], 'data': [[35029, '100%', 'Niue', 4.0, 'f'], [30292, '67%', 'Anguilla', 6.0, 'f'], [19032, '67%', 'Russian Federation', 4.0, 'f'], [8238, '91%', 'Barbados', 15.0, 't'], [30342, nan, 'Sao Tome and Principe', 2.0, 't']]}
Next Steps:
Thank you
Confirm with the team (@jitu5 , @SajidAlamQB ) if there needs to be any specific format for the preview dictionary to be viewed in the UI ?
@ravi-kumar-pilla for the TablePreview, UI expect in below format
preview={
'index': number[],
'columns': string[],
'data': any[][] // List[List[Any]]
}
Thanks @jitu5 , I think we should document the expected {key:value} pairs for all the supported preview types -
TablePreview = NewType("TablePreview", dict)
ImagePreview = NewType("ImagePreview", bytes)
PlotlyPreview = NewType("PlotlyPreview", dict)
JSONPreview = NewType("JSONPreview", dict)
I will create a ticket to document the expected schema and also see if we can enforce that in the backend so that, FE always gets what is required.
For this ticket, I would like to add an except block to catch PydanticSerializationError and inform the users to return a valid dictionary {key:value} pair. Thank you
Thanks, Ravi. Great debugging. I now realize this is happening because while our TablePreview accepts a generic dict, our frontend only supports Pandas DataFrames for displaying tabular data and not any other formats such as Polars.
We could add validation in the backend, but I do feel our frontend needs to be more versatile to handle different tabular formats.
Update - we could check with Vizro, how they do this in the BE.
The FE logic to distinguish between Polars and Pandas dataframe is quite straightforward.
Ideally, we could natively support both formats. My question is, do we anticipate more formats? If so, maybe we could use Plotly to preview data tables as well, instead of our own custom table preview?
@astrojuanlu?
I guess the main problem I see is that, as a user, it's not clear what the preview()
method should return. I initially did the df_dict
thing because I copy-pasted it from somewhere else without really understanding how it works.
Yes, I think TablePreview is very generic when it is actually only meant for Pandas. So I am proposing we rename TablePreview to PandasTablePreview.
and create another NewType called PolarsTablePreview which also returns a dict.
The only difference is now in the FE, we have a different way of displaying both dicts as tables. What do you all think? @astrojuanlu , @ravi-kumar-pilla , @jitu5
@rashidakanchwala If the logic to distinguish between Polars and Pandas dataframe is straightforward at FE then bases on the type, we can create a separate components for each type to display.
Yes, I think TablePreview is very generic when it is actually only meant for Pandas. So I am proposing we rename TablePreview to PandasTablePreview.
and create another NewType called PolarsTablePreview which also returns a dict.
To clarify: if I change my preview
function to return the PolarsTablePreview
you're proposing and I don't change the body, would it work?
def preview(self, nrows: int) -> PolarsTablePreview:
subset = self._load().head(nrows)
df_dict = {}
for column in subset.columns:
df_dict[column] = subset[column]
return df_dict
My understanding is that someone needs to make sure the output of the function can be serialized.
preview
should have some constraints. Where should these be enforced?preview
, which takes me back to 1.I think ideally the user would be helped by the IDE to return a proper object. How can we achieve this, so the journey is more clear?
I agree, @ravi-kumar-pilla's ticket here https://github.com/kedro-org/kedro-viz/issues/1884 proposes what you mentioned.
At the time we decided to use NewType, we did not anticipate running into this; if we need to enforce validation - we might have to change the NewTypes into classes that we did consider before.
Description
I tried adding preview support to a custom dataset like this:
And yet I got an error:
What am I doing wrong?
Your Environment
Include as many relevant details as possible about the environment you experienced the bug in:
Checklist