Closed lukaszdz closed 1 year ago
This is more of a kedro-viz issue so I've moved it 🙂
This is a great suggestion @lukaszdz and is also something I've pondered before, so let me add some thoughts here...
Related:
https://github.com/quantumblacklabs/kedro/issues/1076
https://github.com/quantumblacklabs/private-kedro/issues/1148
For an immediate solution, outside viz there are actually a couple of different ways you might be able to achieve what you're looking for already:
And one which will show you something, though not exactly what you want, in kedro viz:
tracking.JSONDataSet or tracking.MetricsDataSet datasets. As part of the new experiment tracking functionality you would then be able to visualise this in a graph in kedro viz, including seeing how the number changes over time between different runs.
I love this idea and would actually like to make it more general. As a user, I might want to keep track of lots of different things about a dataset: number of rows/columns, number of unique entries in a particular column, number of N/As, etc. Enabling something that visualises the number of rows in a dataset of type pandas.* is just one particular example of this - in reality I might like to track any sort of thing for any sort of dataset. Let me call this a "trackable".
In the future I think there should be two possible methods for this:
tracking dataset. Crucially this will give you a sense of how the trackable changes between one kedro run and the next, since I should be able to go back in time and visualise the pipeline and datasets of historic runs.
shuttles:
  type: pandas.CSVDataSet
  filepath: ...
  viz_widgets:
    number_of_rows
    number_of_na: column1, column2, column3
    my_custom_widget
Where we supply with kedro viz a few common widgets like number_of_rows, but a user can also define their own my_custom_widget, so it's very flexible. The natural place for this information to be shown on kedro viz would be the side panel on the right hand side that appears when you click on a dataset. But it would be super cool if somehow we could make the pipeline visualisation customisable with user-pluggable widgets too.
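A minimal sketch of how such pluggable widgets might be wired up. Everything here is hypothetical: `WIDGETS`, `register_widget` and `resolve_widgets` are illustrative names rather than kedro-viz APIs, the `viz_widgets` entries are read as a list where `number_of_na` carries a comma-separated column string, and a plain list of dicts stands in for a loaded dataset.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Widget = Callable[..., Any]

# Hypothetical registry mapping widget names to functions (not a kedro-viz API).
WIDGETS: Dict[str, Widget] = {}


def register_widget(name: str) -> Callable[[Widget], Widget]:
    """Register a function under the name used in the viz_widgets config."""
    def decorator(func: Widget) -> Widget:
        WIDGETS[name] = func
        return func
    return decorator


@register_widget("number_of_rows")
def number_of_rows(rows: List[Row]) -> int:
    return len(rows)


@register_widget("number_of_na")
def number_of_na(rows: List[Row], columns: List[str]) -> Dict[str, int]:
    # Count missing (None) values per requested column.
    return {col: sum(1 for row in rows if row.get(col) is None) for col in columns}


@register_widget("my_custom_widget")
def my_custom_widget(rows: List[Row]) -> str:
    # A user-defined widget can return anything the side panel can render.
    return f"{len(rows)} shuttles loaded"


def resolve_widgets(rows: List[Row], config: List[Any]) -> Dict[str, Any]:
    """Evaluate each configured widget; dict entries carry column arguments."""
    results: Dict[str, Any] = {}
    for entry in config:
        if isinstance(entry, dict):
            for name, arg in entry.items():
                columns = [c.strip() for c in arg.split(",")]
                results[name] = WIDGETS[name](rows, columns)
        else:
            results[entry] = WIDGETS[entry](rows)
    return results


shuttles = [
    {"column1": 1, "column2": None, "column3": "a"},
    {"column1": None, "column2": None, "column3": "b"},
]
viz_widgets = [
    "number_of_rows",
    {"number_of_na": "column1, column2, column3"},
    "my_custom_widget",
]
widget_values = resolve_widgets(shuttles, viz_widgets)
```

A registry keyed by name keeps the catalog entry declarative while letting users add their own functions without touching the built-in ones.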
This would also be great, and actually I don't think we're too far off being able to do it. We already hacked together something which gets halfway there during a hackathon. Again I'd actually go further here: ideally kedro viz would live update while you're doing a run and show which is the currently running node, and I'd also be able to trigger runs from kedro viz.
FYI @MerelTheisenQB @tynandebold @studioswong very relevant to what we did during the hackathon and the general question of people tracking things through kedro-viz that aren't metrics in the traditional sense (i.e. not model performance).
We have a design for a possible solution here, which looks like this:
This feature becomes unlocked by this change as well as an addition we'd have to make in Kedro datasets.
Copying a user's comment and request for this feature on the slack channel here:
"I want to log the number of rows for the datasets at each step of my pipeline. It's for debugging. The goal is to notice big drop of rows during one data transformation step. For example, after one node, I may see that my number of lines drops by 30% when it’s supposed to stay the same."
Hey everyone - I was chatting to Nero seeing this go into progress and I have some thoughts on the feature because there is a lot of potential value here.
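from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog


class VizMetricHooks:
    """Attach row/column counts to pandas datasets as kedro-viz metadata."""

    @hook_impl
    def after_catalog_created(self, catalog: DataCatalog) -> None:
        def _add_shape_metadata(dataset):
            # Note: this loads the full dataset just to read its shape.
            rows, columns = dataset.load().shape
            metadata = {
                "kedro_viz": {"side_bar": {"num_rows": rows, "num_columns": columns}}
            }
            dataset.metadata = metadata
            return dataset

        # Keep only pandas datasets that already exist; skip parameter entries.
        pandas_datasets = {
            name: _add_shape_metadata(dataset_instance)
            for name, dataset_instance in catalog.datasets.__dict__.items()
            if not name.startswith("param")
            and "pandas" in str(type(dataset_instance))
            and dataset_instance.exists()
        }

        # Re-register the datasets so the catalog holds the annotated versions.
        for name, dataset_instance in pandas_datasets.items():
            catalog.add(name, dataset_instance, replace=True)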
I agree with @datajoely on this. Getting the statistics displayed in the metadata panel would be helpful, but it will be really hard for users to compare and get a bird's eye view. If we do not want to clutter the flowchart with the stats view, we need to have some sort of comparison view (like a table, maybe). We can expand on this once we have new designs for the comparison view. Thank you !
Hi Team,
@merelcht, @noklam, @rashidakanchwala I am working on this story and I need some suggestions.
reviews:
  type: pandas.CSVDataSet
  filepath: ${base_location}/01_raw/reviews.csv
  metadata:
    kedro-viz:
      layer: raw
      preview_args:
        nrows: 10
      profiler_args:
        show: true
Based on the profiler_args show key, we will get the stats (rows, columns, file size) without loading the entire file into memory. For local files, this can be achieved using csv and openpyxl, like - https://github.com/kedro-org/kedro-plugins/compare/feature/profiler-csv-excel (any suggestions would help).
Thank you !
Considered approach to support pandas.CSVDataSet and pandas.ExcelDataSet -
- In the catalog files, users can mention profiler_args as below -
reviews:
  type: pandas.CSVDataSet
  filepath: ${base_location}/01_raw/reviews.csv
  metadata:
    kedro-viz:
      layer: raw
      preview_args:
        nrows: 10
      profiler_args:
        show: true
- Based on the profiler_args show key, we will get the stats (rows, columns, file size) without loading the entire file into memory.
I think the above is actually a bit inconsistent. If you call the key profiler_args, I'd expect to be able to provide the arguments of what's going to be displayed, whereas "show" doesn't specify at all what's going to be shown. So in this case maybe it could be a list like:
profiler_args:
- rows
- columns
- file_size
That also allows for flexibility where for some datasets you can show all these things and others maybe only the file size.
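As a rough illustration of the list form, the sketch below computes only the stats named in `profiler_args` for a local CSV file, streaming rows with the standard `csv` module so the file is never fully held in memory. `profile_csv` is a hypothetical helper, not the implementation in the linked branch, and it assumes the first row is a header.

```python
import csv
import os
import tempfile
from typing import Dict, List


def profile_csv(filepath: str, profiler_args: List[str]) -> Dict[str, int]:
    """Compute only the requested stats, streaming the file row by row."""
    stats: Dict[str, int] = {}
    if "file_size" in profiler_args:
        # Size in bytes comes from the filesystem; the file is not read.
        stats["file_size"] = os.path.getsize(filepath)
    if "rows" in profiler_args or "columns" in profiler_args:
        with open(filepath, newline="") as handle:
            reader = csv.reader(handle)
            header = next(reader, [])
            if "columns" in profiler_args:
                stats["columns"] = len(header)
            if "rows" in profiler_args:
                # Counts data rows (header excluded) one at a time.
                stats["rows"] = sum(1 for _ in reader)
    return stats


# Usage with a small temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    csv.writer(tmp).writerows([["id", "score"], [1, 5], [2, 4], [3, 3]])
    path = tmp.name

full_stats = profile_csv(path, ["rows", "columns", "file_size"])
size_only = profile_csv(path, ["file_size"])
os.remove(path)
```

With only `file_size` requested the file is never opened, which matches the idea that some datasets may show just the cheap stat.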
Questions -
- For local files, this can be achieved using csv and openpyxl, like - https://github.com/kedro-org/kedro-plugins/compare/feature/profiler-csv-excel (any suggestions would help).
- I would like to know how we can do profiling without loading the entire file into memory when the files are stored in remote locations (S3, Azure, GCS, HTTPS)?
- Should we support profiling for remote locations or just local ?
I think this depends on what "metrics" we exactly want to show. I think it should be possible to get file size without downloading the data, but maybe some of the other things are not possible to provide without downloading.
Thanks @datajoely for the comments, I agree with this.
The goal of this ticket is to help a user debug their dataset, by enabling them to easily compare (preset) attributes that may have changed during data transformation of a run. Yes having the information in the sidebar limits data comparison, which is the user’s objective.
As mentioned by @antonymilne above:
As a user, I might want to keep track of lots of different things about a dataset: number of rows/columns, number of unique entries in a particular column, number of N/As, etc. Enabling something that visualises the number of rows in a dataset of type pandas.* is just one particular example of this - in reality I might like to track any sort of thing for any sort of dataset.
Based on all of this and a conversation with @studioswong, here are some potential next steps.
CC @amandakys @stephkaiser @ravi-kumar-pilla
Had a really productive chat with @ravi-kumar-pilla today about the dataset statistics in the metadata panel
Some key takeaways: Dataset Statistics in the Metadata panel
Dataset statistics comparison
I had a discussion with @rashidakanchwala about what statistics we can display for quick debugging. Retrieving the total number of rows/columns seems to be an expensive operation for some dataset types like Excel.
Also, there might not be rows/columns for a few datasets like PlotlyDataSet or JSON. So we thought this ticket needs some technical discussion regarding which stats can be globally available for all datasets and will be useful for debugging.
One such stat we thought of was the file size. Getting the file size is less expensive and can give some details to debug if something is drastically wrong. As far as the implementation goes, we are not sure if extracting the file size should be part of each kedro-dataset plugin or part of the Kedro Framework AbstractDataSet implementation. It would be great to have this in a technical discussion across the team.
@merelcht @astrojuanlu @noklam please suggest
Thank you !
Imo it shouldn't be implemented in kedro or kedro-dataset. The preview method was viz only, so why can't it be implemented on the viz side instead? This should be true for any other plugins.
In terms of implementation of the feature, file size is cheap to get via the filesystem. For columns and rows, maybe we can just trim it if it exceeds a certain number of rows and say "more than 1000000 rows".
A crazier idea: can viz use hooks to record the statistics during a kedro run? This way there is no cost to read the stats.
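The row-trimming suggestion could look roughly like this. `capped_row_count` is a hypothetical helper that works on any iterable of rows (for example a file handle or a CSV reader) and stops reading as soon as the cap is exceeded.

```python
import io
from itertools import islice
from typing import Iterable


def capped_row_count(rows: Iterable[object], cap: int = 1_000_000) -> str:
    """Count rows but stop reading once `cap` is exceeded."""
    # Consume at most cap + 1 items so "exactly cap" and "more than cap" differ.
    seen = sum(1 for _ in islice(rows, cap + 1))
    return f"more than {cap} rows" if seen > cap else f"{seen} rows"


small = capped_row_count(io.StringIO("a\nb\nc\n"), cap=5)
large = capped_row_count(iter(range(10)), cap=5)
```

This bounds the cost at `cap` rows rather than removing it, so it still takes time on large inputs.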
Thank you @noklam. I see what you are saying, it makes sense to have it on the viz side.
I don't completely agree on trimming the rows info, as this still takes time and we might not have rows for all datasets.
I think for the first pass, we can get the file size stat across all datasets. I am not familiar with the hook implementation you suggested here. If the crazy idea is efficient, we should do that :D
@tynandebold any suggestions here ?
Thank you
Does "file size" make sense for, say, APIDataSet?
I essentially agree with @tynandebold above, this should probably be focused on arbitrary key-value pairs, and datasets can expose that dataset_info somehow.
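One way to picture this, purely as a sketch: each dataset exposes whatever key-value pairs make sense for it. The `dataset_info()` method, the `ExposesDatasetInfo` protocol and both dataset classes below are hypothetical and do not exist in Kedro.

```python
import os
from typing import Any, Dict, Protocol, runtime_checkable


@runtime_checkable
class ExposesDatasetInfo(Protocol):
    """Hypothetical protocol: a dataset returns arbitrary key-value pairs."""

    def dataset_info(self) -> Dict[str, Any]: ...


class LocalFileDataset:
    def __init__(self, filepath: str) -> None:
        self.filepath = filepath

    def dataset_info(self) -> Dict[str, Any]:
        # File size is meaningful for a file-backed dataset.
        exists = os.path.exists(self.filepath)
        return {"file_size": os.path.getsize(self.filepath) if exists else None}


class ApiDataset:
    def __init__(self, url: str) -> None:
        self.url = url

    def dataset_info(self) -> Dict[str, Any]:
        # No file on disk, so expose something else instead of a file size.
        return {"url": self.url}


def collect_info(dataset: object) -> Dict[str, Any]:
    """Return the dataset's info, or an empty dict if it exposes none."""
    return dataset.dataset_info() if isinstance(dataset, ExposesDatasetInfo) else {}


api_info = collect_info(ApiDataset("https://example.com/reviews"))
file_info = collect_info(LocalFileDataset("does-not-exist.csv"))
plain_info = collect_info(object())
```

Because the viewer only renders the returned pairs, a dataset with no notion of file size simply omits that key.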
A lot of good points being raised. Let me synthesize some of it and make some suggestions:
At this stage, one main constraint is UI/UX design. The completed designs have this feature living in the Metadata panel, which, as many of you have raised, is suboptimal and doesn't add much value. Nevertheless, if we can't get a new design done that moves some of this information into the flowchart by the time the engineering work is ready, my suggestion is to first release the work in the Metadata panel and then move it elsewhere once the design is ready.
On the subject of some type of comparison table, I think that's outside of the scope of this implementation work. I'd rather see us work towards some sort of "dev mode" or "debug mode" toggle, which shows more detailed information on the flowchart when it's enabled, as written by @amandakys above.
Lastly, on this point:
the loading icon displayed while dataset statistics are being fetched should be moved to be inside the metadata panel rather than displayed above the main flowchart. @amandakys to provide visuals for this
Are you saying we replace the main loading indicator we have over the flowchart and move it in the metadata panel? If yes, I don't think we should do that, as the flowchart sometimes needs an indicator to show when it's loading for larger pipelines. We can add a loading indicator into the Metadata panel, and it should probably match with the skeleton loader we have in experiment tracking, since it's inline data.
A big question is around what we should allow the user to show. I agree with @datajoely here, in that we should allow them to configure key/values arbitrarily in the new metadata YAML, and even better if we can do that dynamically with something like the VizMetricHooks he used as an example.
This is a great summary 🚀
On the loading indicator, when Ravi showed me a demo of the feature, the loading icon was displayed over the flowchart. It did not block interaction with the flowchart and was there to indicate that metadata was loading. This felt misleading as it was not indicating that the flowchart was loading.
I was not suggesting we move the global loading icon to the metadata panel, just that metadata loading should be indicated in the metadata panel. For this the skeleton loader sounds like the best solution.
From my side, the things that would be relevant to this ticket's implementation work are:
Based on @tynandebold's comment here I've opened another ticket to explore the concept of a dev/debug mode. The need, the use cases and the opportunities. #1464
On the subject of some type of comparison table, I think that's outside of the scope of this implementation work. I'd rather see us work towards some sort of "dev mode" or "debug mode" toggle, which shows more detailed information on the flowchart when it's enabled
@ravi-kumar-pilla https://github.com/kedro-org/kedro-viz/pull/1465 a quick PoC to demonstrate what I mean.
Hi @noklam , Thank you for the quick POC. I am not familiar with python hooks or Kedro Framework hook used in the POC.
I think we should collect stats during a kedro run and then kedro viz can read the stats file to display the metadata. This would be the most optimal way to retrieve the stats as they are pre-calculated.
As @datajoely pointed out, we need to look at a way to let users configure this dynamically. It would be nice if this metadata could be collected for every run, like experiment tracking, in a database and then viz can read it (we can have a history of metadata changes). I clearly have a huge knowledge gap in this area, so let me understand hooks first before I comment further on this ticket. Thank you !!
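For readers unfamiliar with hooks, here is a minimal sketch of collecting stats during a run and writing them to a file that a viewer could read later. It is not the code from the linked PoC: `DatasetStatsHooks` and the `stats.json` path are assumptions, and the fallback decorator only exists so the sketch runs without Kedro installed. `after_dataset_saved` and `after_pipeline_run` are existing Kedro hook specifications; an implementation may accept a subset of their arguments.

```python
import json
import os
import tempfile
from typing import Any, Dict

try:
    from kedro.framework.hooks import hook_impl
except ImportError:  # lets the sketch run without kedro installed
    def hook_impl(func):
        return func


class DatasetStatsHooks:
    """Hypothetical hook: record cheap stats while data is already in memory."""

    def __init__(self, stats_path: str = "stats.json") -> None:
        self.stats_path = stats_path
        self.stats: Dict[str, Dict[str, int]] = {}

    @hook_impl
    def after_dataset_saved(self, dataset_name: str, data: Any) -> None:
        shape = getattr(data, "shape", None)
        if shape is not None and len(shape) == 2:  # e.g. a pandas DataFrame
            self.stats[dataset_name] = {"rows": int(shape[0]), "columns": int(shape[1])}

    @hook_impl
    def after_pipeline_run(self) -> None:
        # Persist once per run so a viewer can read the file instead of the data.
        with open(self.stats_path, "w") as handle:
            json.dump(self.stats, handle, indent=2)


class FakeFrame:
    """Stand-in for a DataFrame; only the shape attribute is used."""
    shape = (77_000, 12)


stats_file = os.path.join(tempfile.mkdtemp(), "stats.json")
hooks = DatasetStatsHooks(stats_file)
hooks.after_dataset_saved("companies", FakeFrame())
hooks.after_dataset_saved("parameters", {"a": 1})  # no shape, so it is skipped
hooks.after_pipeline_run()
with open(stats_file) as handle:
    recorded = json.load(handle)
```

In a Kedro project a hook class like this would typically be registered through `HOOKS` in `settings.py`. Since the data is already in memory when it is saved, reading its shape adds almost no cost to the run.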
Happy to walk you through that, maybe can combine it with a few new joiners. It's covered in kedro intermediate training or we can revive the Kedro University.
@noklam I don't think Viz can use hooks and I think that would need to be done in Kedro, right?
@tynandebold viz can use hooks.
I think I am missing context here. I can advise on the implementation and design but I need to understand the scope of this ticket better.
@NeroOkwa Maybe a quick catch up?
What's the goal?
There are lots of optimisation we can do, the solution can also be just hooks, plugins,
@lukaszdz this feature has been implemented in the latest Kedro-Viz release. Can you confirm whether this solves your pain point and provide feedback? Thanks.
@NeroOkwa This is almost there. Ideally, we would want to see the dataset sizes in the graph view so we can view any issues with the pipeline without having to click through each node in the graph. Even better if we had some way to set up some rules to color the nodes (if N=0, then color the node red)
can be viewed directly on the node in the graph view:
can use abbreviations with up to 3 digits to show the rough size/number of rows.
If empty - then can be red:
The goal is to be able to quickly visually know whether some steps in the pipeline failed to run.
In the future, you could imagine also having rules to color the node as red if a node deviates from its normal values. for example, say the companies node size is 77,000 rows on Monday, 77,100 on Tuesday, 78,000 Wed, then drops to 10,000 on Thursday. Then you could see at a glance that something failed with the node, visually. This would greatly accelerate debugging pipelines.
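The two display ideas above (abbreviated sizes and red nodes) can be sketched as follows. `abbreviate` and `node_color` are hypothetical helpers, and the 50% tolerance is an arbitrary assumption; the history values are the ones from the example.

```python
from statistics import mean
from typing import List


def abbreviate(n: int) -> str:
    """Shorten a count to at most 3 significant digits plus a K/M/B suffix."""
    value = float(n)
    suffix = ""
    for next_suffix in ("K", "M", "B"):
        if value < 999.5:  # already fits in 3 digits after rounding
            break
        value /= 1000
        suffix = next_suffix
    return f"{value:.3g}{suffix}"


def node_color(history: List[int], current: int, tolerance: float = 0.5) -> str:
    """Return 'red' if the dataset is empty or far from its historic mean."""
    if current == 0:
        return "red"
    if history:
        baseline = mean(history)
        if baseline and abs(current - baseline) / baseline > tolerance:
            return "red"
    return "default"


companies_history = [77_000, 77_100, 78_000]  # Monday to Wednesday
thursday_label = abbreviate(10_000)
thursday_color = node_color(companies_history, 10_000)
wednesday_color = node_color(companies_history[:2], 78_000)
```

Comparing against the mean of previous runs is only one possible rule; a median or a user-configured threshold per dataset would fit the same shape.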
@lukaszdz thanks for the feedback.
The goal is to be able to quickly visually know whether some steps in the pipeline failed to run.
I have 3 follow up questions:
@lukaszdz thank you for the feedback. For 3. could you give a little more detail on what kind of data pipeline you are trying to build and where kedro fails you?
@lukaszdz pls share an email address with which I can book the user interview session. Thanks.
lukasz.apps@gmail.com
@lukaszdz, the session has been booked for today 18/09/23.
Description
I'm always frustrated when I'm running daily or weekly sets of modular pipelines and my final output does not make complete sense. This indicates that there was an issue when running the pipeline but I'm not sure, at a glance, what step didn't provide output.
One example problem: one initial dataset had the mapping of market IDs. One day, the market ID for our second biggest market was omitted from the first step, causing all subsequent downstream analysis to be off by a nontrivial amount.
Context
This change is important to me because it would help me, at a glance, identify changes across runs through visual cues, so I know where to begin.
Possible Implementation
Visualize the total size of each dataset that has been processed via kedro viz.
The day that things ran correctly:
The day that things failed:
Would be nice to also visualize the nodes that had been attempted to run, but failed.
In this example, by visualizing the size of each step that had been run, you would immediately see that the data set with the biggest difference was the companies set. Even though the pipeline strictly failed a step later, you would immediately know where to start debugging.
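The comparison described here amounts to ranking datasets by how much their size changed between a good run and a bad one. A minimal sketch, with hypothetical row counts and a hypothetical `rank_changes` helper:

```python
from typing import Dict, List, Tuple


def rank_changes(good_run: Dict[str, int], bad_run: Dict[str, int]) -> List[Tuple[str, float]]:
    """Rank datasets by relative size change between two runs, largest first."""
    changes = []
    for name, before in good_run.items():
        after = bad_run.get(name, 0)  # a dataset missing from the run counts as empty
        relative = (after - before) / before if before else 0.0
        changes.append((name, relative))
    return sorted(changes, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical dataset sizes (rows) from two runs of the same pipeline.
good_day = {"companies": 77_000, "shuttles": 30_000, "reviews": 50_000}
bad_day = {"companies": 10_000, "shuttles": 29_800, "reviews": 49_500}
ranked = rank_changes(good_day, bad_day)
```

The first entry of `ranked` is the dataset to inspect first, even if the pipeline only failed a step later.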