Visualize size of processed datasets

lukaszdz commented 2 years ago

Description

I'm always frustrated when I'm running daily or weekly sets of modular pipelines and my final output does not make complete sense. This indicates that there was an issue when running the pipeline but I'm not sure, at a glance, what step didn't provide output.

One example problem: one initial dataset had the mapping of market IDs. One day, the market ID for our second biggest market was omitted from the first step, causing all subsequent downstream analysis to be off by a nontrivial amount.

Context

This change is important to me because it would help me, at a glance, identify changes across runs through visual cues, so I know where to begin.

Possible Implementation

Visualize the total size of each dataset that has been processed via kedro viz.

The day that things ran correctly:

The day that things failed:

It would be nice to also visualize the nodes that were attempted but failed.

In this example, by visualizing the size of each step that had been run, you would immediately see that the data set with the biggest difference was the companies set. Even though the pipeline strictly failed a step later, you would immediately know where to start debugging.

antonymilne commented 2 years ago

This is more of a kedro-viz issue so I've moved it 🙂 This is a great suggestion @lukaszdz and is also something I've pondered before, so let me add some thoughts here...

Related: https://github.com/quantumblacklabs/kedro/issues/1076 and https://github.com/quantumblacklabs/private-kedro/issues/1148

Current methods for tracking dataset size

For an immediate solution, outside viz there are actually a couple of different ways you might be able to achieve what you're looking for already:

great expectations via the kedro-great plugin. I'm not at all familiar with this myself but I imagine you should be able to write some rule that validates the number of rows in a dataframe
use a hook that emits a log message (to the console or a file) giving the number of rows in the dataframe like this
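
The hook option could look roughly like the sketch below. It assumes a Kedro version that provides the `after_dataset_saved` hook specification; `RowCountLoggingHooks` and `describe_size` are hypothetical names, and the fallback `hook_impl` exists only so the snippet runs without Kedro installed.

```python
import logging
from typing import Any

try:
    from kedro.framework.hooks import hook_impl
except ImportError:  # fallback so the sketch runs without Kedro installed
    def hook_impl(func):
        return func

logger = logging.getLogger(__name__)


def describe_size(data: Any) -> str:
    """Return a short, human-readable size for common in-memory data."""
    shape = getattr(data, "shape", None)
    if shape is not None and len(shape) == 2:
        # Tabular data such as a pandas DataFrame exposes (rows, columns).
        return f"{shape[0]} rows x {shape[1]} columns"
    try:
        return f"{len(data)} items"
    except TypeError:
        return "size unknown"


class RowCountLoggingHooks:
    """Hypothetical hook class that logs the size of every saved dataset."""

    @hook_impl
    def after_dataset_saved(self, dataset_name: str, data: Any) -> None:
        logger.info("Saved %s: %s", dataset_name, describe_size(data))
```

A class like this would typically be registered through the `HOOKS` tuple in the project's `settings.py`.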

And one which will show you something, though not exactly what you want, in kedro viz:

make a node that takes in all the datasets you want to check the size of and save the information to one of the new tracking.JSONDataSet or tracking.MetricsDataSet datasets. As part of the new experiment tracking functionality you would then be able to visualise this in a graph in kedro viz, including seeing how the number changes over time between different runs
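
As a rough illustration of that node, the function below takes any number of datasets as keyword arguments and returns a flat mapping of numeric values, which is the kind of data a metrics-style tracking dataset is meant to store. The name `dataset_sizes` and the `_rows` suffix are assumptions made for this sketch.

```python
from typing import Any, Dict


def dataset_sizes(**datasets: Any) -> Dict[str, float]:
    """Return the number of rows of each dataset, keyed by dataset name.

    Works for anything that supports len(), e.g. a pandas DataFrame or a list.
    """
    return {f"{name}_rows": float(len(data)) for name, data in datasets.items()}
```

In a pipeline this could be wired up with something like `node(dataset_sizes, inputs={"companies": "companies"}, outputs="dataset_sizes")`, where `dataset_sizes` is a catalog entry of type `tracking.MetricsDataSet`.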

More generally

I love this idea and would actually like to make it more general. As a user, I might want to keep track of lots of different things about a dataset: number of rows/columns, number of unique entries in a particular column, number of N/As, etc. Enabling something that visualises the number of rows in a dataset of type pandas.* is just one particular example of this - in reality I might like to track any sort of thing for any sort of dataset. Let me call this a "trackable".

In the future I think there should be two possible methods for this:

via experiment tracking - this is already work in progress. You can write code to calculate whatever trackable you like in a node and then save it to a tracking dataset. Crucially this will give you a sense of how the trackable changes between one kedro run and the next, since I should be able to go back in time and visualise the pipeline and datasets of historic runs.
some kind of customisable "widget" which allows me to give, in the catalog, as many trackables as I like, e.g. (completely made up example syntax)
```yaml
shuttles:
  type: pandas.CSVDataSet
  filepath: ...
  viz_widgets:
    - number_of_rows
    - number_of_na: column1, column2, column3
    - my_custom_widget
```

Where we supply with kedro viz a few common widgets like number_of_rows, but a user can define their own my_custom_widget also so it's very flexible. The natural place for this information to be shown on kedro viz would be the side panel on the right hand side that appears when you click on a dataset. But it would be super cool if somehow we could make the pipeline visualisation customisable with user-pluggable widgets too.
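
To make the widget idea a bit more concrete, here is a minimal sketch of a registry that maps widget names to functions. Everything in it is hypothetical (the `widget` decorator, `WIDGETS`, `evaluate_widgets`), and it assumes the `viz_widgets` entry has been parsed into a list of names or single-key mappings. Rows are plain dictionaries so the example has no dependencies.

```python
from typing import Any, Callable, Dict, List, Union

Rows = List[Dict[str, Any]]
WIDGETS: Dict[str, Callable[..., Any]] = {}


def widget(name: str) -> Callable:
    """Register a function as a named widget (hypothetical API)."""
    def register(func: Callable) -> Callable:
        WIDGETS[name] = func
        return func
    return register


@widget("number_of_rows")
def number_of_rows(rows: Rows) -> int:
    return len(rows)


@widget("number_of_na")
def number_of_na(rows: Rows, columns: List[str]) -> Dict[str, int]:
    # Count missing (None) values for each requested column.
    return {col: sum(1 for row in rows if row.get(col) is None) for col in columns}


def evaluate_widgets(rows: Rows, config: List[Union[str, Dict[str, str]]]) -> Dict[str, Any]:
    """Run every configured widget and collect the results by widget name."""
    results: Dict[str, Any] = {}
    for entry in config:
        if isinstance(entry, dict):
            # e.g. {"number_of_na": "column1, column2"}
            for name, arg in entry.items():
                columns = [col.strip() for col in arg.split(",")]
                results[name] = WIDGETS[name](rows, columns)
        else:
            results[entry] = WIDGETS[entry](rows)
    return results
```

A user-defined `my_custom_widget` would then just be another function decorated with `@widget("my_custom_widget")`.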

Visualising failed nodes

This would also be great, and actually I don't think we're too far off being able to do it. We already hacked together something which gets halfway there during a hackathon. Again I'd actually go further here: ideally kedro viz would live update while you're doing a run and show which is the currently running node, and I'd also be able to trigger runs from kedro viz.

antonymilne commented 2 years ago

FYI @MerelTheisenQB @tynandebold @studioswong very relevant to what we did during the hackathon and the general question of people tracking things through kedro-viz that aren't metrics in the traditional sense (i.e. not model performance).

tynandebold commented 1 year ago

We have a design for a possible solution here, which looks like this:

This feature becomes unlocked by this change as well as an addition we'd have to make in Kedro datasets.

NeroOkwa commented 1 year ago

Copying a user's comment and request for this feature on the slack channel here:

"I want to log the number of rows for the datasets at each step of my pipeline. It's for debugging. The goal is to notice big drop of rows during one data transformation step. For example, after one node, I may see that my number of lines drops by 30% when it’s supposed to stay the same."

datajoely commented 1 year ago

Hey everyone - I was chatting to Nero seeing this go into progress and I have some thoughts on the feature because there is a lot of potential value here.

Evaluating the original user request against the sidebar solution

The user wanted to show custom metadata directly on the flowchart
The point of this was to enable a direct comparison between nodes and a bird's eye view
Pushing this information to the sidebar doesn't allow the user to compare this data in any meaningful way. They can't compare two or more datasets to make any decisions like the empty file issue the user reports.
Providing a comparison workflow is important here, a low effort way of doing so would be providing some sort of table view.

Challenging the decision to index tightly on dataset statistics, we should provide a mechanism to provide key/values arbitrarily

I would challenge the idea that we should opine on what statistics are important for our users
The new metadata YAML allows us to provide any arbitrary field, why limit this to just dataset statistics? Users will immediately start asking for other attributes.

We need to provide a way of letting users configure this dynamically

If we stick to the dataset statistics point, no user is going to update these data points manually.
We need to provide an interface for doing so dynamically. Hooks are the right solution for this.
This is a super naive solution to the actual problem, but we should be building this in a way that empowers users to add their own data:


```python
from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog


class VizMetricHooks:
    @hook_impl
    def after_catalog_created(self, catalog: DataCatalog) -> None:
        def _add_shape_metadata(dataset):
            # Naive: loads the whole dataset just to read its shape.
            rows, columns = dataset.load().shape
            metadata = {
                "kedro_viz": {"side_bar": {"num_rows": rows, "num_columns": columns}}
            }
            dataset.metadata = metadata
            return dataset

        # Keep only pandas datasets whose data already exists, skipping parameters.
        pandas_datasets = {
            name: _add_shape_metadata(dataset_instance)
            for name, dataset_instance in catalog.datasets.__dict__.items()
            if not name.startswith("param")
            and "pandas" in str(type(dataset_instance))
            and dataset_instance.exists()
        }

        # Re-register each dataset so the catalog holds the copy with metadata attached.
        for name, dataset_instance in pandas_datasets.items():
            catalog.add(name, dataset_instance, replace=True)
```

ravi-kumar-pilla commented 1 year ago

I agree with @datajoely on this. Getting the statistics displayed in the metadata panel would be helpful, but it will be really hard for users to compare and get a bird's eye view. If we do not want to clutter the flowchart with the stats view, we need to have some sort of comparison view (like a table, maybe). We can expand on this once we have new designs for the comparison view. Thank you!

ravi-kumar-pilla commented 1 year ago

Hi Team,

@merelcht, @noklam, @rashidakanchwala I am working on this story and I need some suggestions.

Considered approach to support pandas.CSVDataSet and pandas.ExcelDataSet -

In the catalog files, users can mention profiler_args as below -

```yaml
reviews:
  type: pandas.CSVDataSet
  filepath: ${base_location}/01_raw/reviews.csv
  metadata:
    kedro-viz:
      layer: raw
      preview_args:
        nrows: 10
      profiler_args:
        show: true
```

Based on profiler_args show key, we will get the stats (rows, columns, file size) without loading the entire file into memory.

Questions -

For local files, this can be achieved using the csv and openpyxl libraries, like - https://github.com/kedro-org/kedro-plugins/compare/feature/profiler-csv-excel (any suggestions would help).
I would like to know how we can do profiling without loading the entire file into memory when the files are stored in remote locations (S3, Azure, GCS, HTTPS).
Should we support profiling for remote locations or just local ?
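
For the local CSV case, one way to get rows, columns and file size without holding the file in memory is to stream it with the standard `csv` module. This is only a sketch under the assumption that the first line is a header; `csv_stats` is a hypothetical helper and not the code in the linked branch.

```python
import csv
import os
from typing import Dict


def csv_stats(filepath: str) -> Dict[str, int]:
    """Count rows and columns by streaming the file one line at a time."""
    with open(filepath, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)  # assumes the first line is a header
        columns = len(header) if header else 0
        rows = sum(1 for _ in reader)  # data rows, excluding the header
    return {
        "rows": rows,
        "columns": columns,
        "file_size": os.path.getsize(filepath),  # size in bytes
    }
```

The file is still read end to end, so the cost grows with file size, but memory use stays constant.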

Thank you !

merelcht commented 1 year ago

Considered approach to support pandas.CSVDataSet and pandas.ExcelDataSet -

In the catalog files, users can mention profiler_args as below -
```yaml
reviews:
  type: pandas.CSVDataSet
  filepath: ${base_location}/01_raw/reviews.csv
  metadata:
    kedro-viz:
      layer: raw
      preview_args:
        nrows: 10
      profiler_args:
        show: true
```
Based on profiler_args show key, we will get the stats (rows, columns, file size) without loading the entire file into memory.

I think the above is actually a bit inconsistent. If you call the key profiler_args I'd expect to be able to provide the arguments of what's going to be displayed. Whereas "show" doesn't specify at all what's going to be shown. So in this case maybe it could be a list like:

```yaml
profiler_args:
  - rows
  - columns
  - file_size
```

That also allows for flexibility where for some datasets you can show all these things and others maybe only the file size.
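
A list like this also makes it easy to compute only what was asked for. The sketch below assumes each statistic is backed by a zero-argument callable, so an expensive one (such as a row count) is never evaluated unless it is requested; `collect_stats` and `providers` are hypothetical names.

```python
from typing import Any, Callable, Dict, List


def collect_stats(
    requested: List[str], providers: Dict[str, Callable[[], Any]]
) -> Dict[str, Any]:
    """Evaluate only the statistics named in profiler_args."""
    unknown = [name for name in requested if name not in providers]
    if unknown:
        raise ValueError(f"Unknown profiler_args: {unknown}")
    # Providers are called lazily, one per requested statistic.
    return {name: providers[name]() for name in requested}
```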

Questions -

For local files, this can be achieved using the csv and openpyxl libraries, like - https://github.com/kedro-org/kedro-plugins/compare/feature/profiler-csv-excel (any suggestions would help).

I would like to know how we can do profiling without loading the entire file into memory when the files are stored in remote locations (S3, Azure, GCS, HTTPS).

Should we support profiling for remote locations or just local ?

I think this depends on what "metrics" we exactly want to show. I think it should be possible to get file size without downloading the data, but maybe some of the other things are not possible to provide without downloading.
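
For file size specifically, a sketch along these lines might work for both local and remote paths. It assumes `fsspec` is available and that the matching filesystem package and credentials are set up for remote URLs; some backends (plain HTTPS, for instance) may not report a size at all. `file_size` is a hypothetical helper, and the `os.path.getsize` branch is only a fallback for environments without `fsspec`.

```python
import os

try:
    import fsspec
except ImportError:  # fall back to local-only behaviour
    fsspec = None


def file_size(path: str) -> int:
    """Return the size in bytes, asking the filesystem rather than reading the data."""
    if fsspec is not None:
        # url_to_fs picks the filesystem from the URL scheme (s3://, gcs://, ...).
        fs, fs_path = fsspec.core.url_to_fs(path)
        return fs.size(fs_path)
    return os.path.getsize(path)
```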

NeroOkwa commented 1 year ago

Thanks @datajoely for the comments, I agree with this.

Summary

The goal of this ticket is to help a user debug their dataset, by enabling them to easily compare (preset) attributes that may have changed during data transformation of a run. Yes having the information in the sidebar limits data comparison, which is the user’s objective.

As mentioned by @antonymilne above:

As a user, I might want to keep track of lots of different things about a dataset: number of rows/columns, number of unique entries in a particular column, number of N/As, etc. Enabling something that visualises the number of rows in a dataset of type pandas.* is just one particular example of this - in reality I might like to track any sort of thing for any sort of dataset.

The first step would be focusing on dataset statistics, e.g. number of rows/columns etc., and later other attributes (based on feedback and metrics).
Another opportunity as highlighted by @datajoely would be to provide an interface in Kedro-Viz for users to dynamically configure these attributes (via hooks) vs the current manual approach.
The next step and opportunity would be for users to be able to debug nodes, by ‘visualising failed nodes’, but that’s beyond the scope of this ticket.

Based on all of this and a conversation with @studioswong, here are some potential next steps.

Potential next steps

Similarly to how when you click on a node and the side panel opens with an option to ‘Show Code’, we can have the same implementation when you click on a dataset but the show code would open up a canvas with the comparison table. This is a better MVP solution than using the side bar only, and we don’t have to change the existing flowchart.
We can design the comparison table using the ‘Compare runs’ feature in experiment tracking as inspiration.

CC @amandakys @stephkaiser @ravi-kumar-pilla

amandakys commented 1 year ago

Had a really productive chat with @ravi-kumar-pilla today about the dataset statistics in the metadata panel

Some key takeaways: Dataset Statistics in the Metadata panel

the loading icon displayed while dataset statistics are being fetched should be moved inside the metadata panel rather than displayed above the main flowchart. @amandakys to provide visuals for this
it might be worth displaying the dataset statistics label in the metadata panel even when they aren't enabled for that dataset just for visibility. This will also make it clearer what is being loaded. If profiler args aren't enabled, it can display something like "not configured" so users know that they can take steps to do that if they want dataset statistics.

Dataset statistics comparison

Using a Show Code style toggle to open up a panel with a comparison table is a valid option for enabling comparison, but it is a multi-dataset feature that will be accessible only by first selecting a dataset.
We discussed options to visualise these statistics on the flowchart itself like shown by the user. An alternative could be designing "profiler mode" similar to the "show/hide labels" which changes the flowchart's display to show relevant dataset statistics and enable comparison of statistics in conjunction with the display of dataset relationships that is available with the flowchart. This is still a very rough idea and will need further investigation.
I like the idea of taking inspiration from the compare runs feature, as that will increase consistency and I'll be looking into this next.

ravi-kumar-pilla commented 1 year ago

I had a discussion with @rashidakanchwala about what statistics we can display for quick debugging. Retrieving the total number of rows/columns seems to be an expensive operation for some dataset types, like Excel.

Also, there might not be rows/columns for a few datasets, like PlotlyDataSet or JSON. So we thought this ticket needs some technical discussion regarding which stats can be globally available for all datasets and will be useful for debugging.

One such stat we thought of was the file size. Getting the file size can be less expensive and can give some details to debug if something is drastically wrong. As far as the implementation goes, we are not sure if extracting the file size should be part of each kedro-dataset plugin or part of the Kedro Framework AbstractDataSet implementation. It would be great to have this in a technical discussion across the team.

@merelcht @astrojuanlu @noklam please suggest

Thank you !

noklam commented 1 year ago

Imo it shouldn't be implemented in kedro or kedro-dataset. The preview method was viz only, why can't it be implemented on viz side instead? This should be true for any other plugins.

In terms of implementation of the feature, filesize is cheap to get via the filesystem. For columns and rows, maybe we can just trim it if it exceeds a certain number of rows and say "more than 1000000 rows".
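
The trimming idea could be sketched like this: stop counting as soon as the limit is passed, so the cost is bounded no matter how large the dataset is. `capped_row_count` is a hypothetical helper that works on any iterable of rows or lines.

```python
from itertools import islice
from typing import Iterable


def capped_row_count(rows: Iterable, limit: int = 1_000_000) -> str:
    """Count rows, but read at most limit + 1 of them."""
    counted = sum(1 for _ in islice(rows, limit + 1))
    if counted > limit:
        return f"more than {limit} rows"
    return f"{counted} rows"
```

For a text file, passing the open file handle as `rows` counts lines lazily.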

More crazy idea, can viz use hooks to record the statistic during a kedro run? This way there is no cost to read the stats.

ravi-kumar-pilla commented 1 year ago

Thank you @noklam. I see what you are saying; it makes sense to have it on the viz side.

I would not completely agree on trimming the rows info as this still takes time and also we might not have rows for all datasets.

I think for the first pass, we can get the file size stat across all datasets. I am not well aware of the hook implementation you suggested here. If the crazy idea is efficient, we should do that :D

@tynandebold any suggestions here ?

Thank you

astrojuanlu commented 1 year ago

Does "file size" make sense for, say, APIDataSet?

I essentially agree with @tynandebold above, this should probably be focused on arbitrary key-value pairs and datasets can expose that dataset_info somehow.

tynandebold commented 1 year ago

A lot of good points being raised. Let me synthesize some of it and make some suggestions:

Design opportunities

At this stage, one main constraint is UI/UX design. The completed designs have this feature living in the Metadata panel, which, as many of you have raised, is suboptimal and doesn't add much value. Nevertheless, if we can't get a new design done that moves some of this information into the flowchart by the time the engineering work is ready, my suggestion is to first release the work in the Metadata panel and then move it elsewhere once the design is ready.

On the subject of some type of comparison table, I think that's outside of the scope of this implementation work. I'd rather see us work towards some sort of "dev mode" or "debug mode" toggle, which shows more detailed information on the flowchart when it's enabled, as written by @amandakys above.

Lastly, on this point:

the loading icon displayed while dataset statistics are being fetched should be moved inside the metadata panel rather than displayed above the main flowchart. @amandakys to provide visuals for this

Are you saying we replace the main loading indicator we have over the flowchart and move it in the metadata panel? If yes, I don't think we should do that, as the flowchart sometimes needs an indicator to show when it's loading for larger pipelines. We can add a loading indicator into the Metadata panel, and it should probably match with the skeleton loader we have in experiment tracking, since it's inline data.

Engineering opportunities

A big question is around what should we allow the user to show. I agree with @datajoely here, in that we should allow them to configure key/values arbitrarily in the new metadata YAML, and even better if we can do that dynamically with something like VizMetricHooks as he used as an example.

If we could define some defaults here, like "file size"/"rows"/"columns" that may be useful, and as @amandakys wrote above, display them even if there's no value for that particular dataset to promote discoverability.
@noklam I don't think Viz can use hooks and I think that would need to be done in Kedro, right?
My suggestion here is to try and get "file size"/"rows"/"columns" to show up for every dataset, and for the ones where it doesn't make sense, don't show a value. Once we have that wired up, it's trivial enough to move the data from the Metadata panel to another part of the app.

amandakys commented 1 year ago

This is a great summary 🚀

On the loading indicator, when Ravi showed me a demo of the feature, the loading icon was displayed over the flowchart. It did not block interaction with the flowchart and was there to indicate that metadata was loading. This felt misleading as it was not indicating that the flowchart was loading.

I was not suggesting we move the global loading icon to the metadata panel, just that metadata loading should be indicated in the metadata panel. For this the skeleton loader sounds like the best solution.

From my side, the things that would be relevant to this ticket's implementation work are:

using the skeleton loader instead of the flowchart loading icon when metadata panel is getting data
adding default values or the "dataset statistics" label to the metadata panel even when no value is available for discoverability

Based on @tynandebold's comment here I've opened another ticket to explore the concept of a dev/debug mode. The need, the use cases and the opportunities. #1464

On the subject of some type of comparison table, I think that's outside of the scope of this implementation work. I'd rather see us work towards some sort of "dev mode" or "debug mode" toggle, which shows more detailed information on the flowchart when it's enabled

noklam commented 1 year ago

@ravi-kumar-pilla https://github.com/kedro-org/kedro-viz/pull/1465 a quick PoC to demonstrate what I mean.

ravi-kumar-pilla commented 1 year ago

Hi @noklam , Thank you for the quick POC. I am not familiar with python hooks or Kedro Framework hook used in the POC.

I think we should collect stats during a kedro run and then kedro viz can read the stats file to display the metadata. This would be the most optimal way to retrieve the stats as they are pre-calculated.
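
A minimal sketch of that flow, not the code in the linked PoC: one hook records the shape of each dataset as it is saved, another writes everything to a JSON file when the run finishes, and a reader function stands in for what viz would do. The class name, `read_stats` and the default `dataset_stats.json` path are all assumptions, and the fallback `hook_impl` exists only so the snippet runs without Kedro installed.

```python
import json
from pathlib import Path
from typing import Any, Dict

try:
    from kedro.framework.hooks import hook_impl
except ImportError:  # fallback so the sketch runs without Kedro installed
    def hook_impl(func):
        return func


class DatasetStatsHooks:
    """Hypothetical hooks that record dataset stats during a kedro run."""

    def __init__(self, stats_path: str = "dataset_stats.json") -> None:
        self._stats_path = Path(stats_path)
        self._stats: Dict[str, Dict[str, int]] = {}

    @hook_impl
    def after_dataset_saved(self, dataset_name: str, data: Any) -> None:
        # The data is already in memory here, so reading its shape is free.
        shape = getattr(data, "shape", None)
        if shape is not None and len(shape) == 2:
            self._stats[dataset_name] = {"rows": int(shape[0]), "columns": int(shape[1])}

    @hook_impl
    def after_pipeline_run(self) -> None:
        # Persist the pre-calculated stats once the run has finished.
        self._stats_path.write_text(json.dumps(self._stats, indent=2))


def read_stats(stats_path: str) -> Dict[str, Dict[str, int]]:
    """What a viewer would do: read the stats file instead of loading datasets."""
    return json.loads(Path(stats_path).read_text())
```

Datasets without a two-dimensional shape are simply skipped, which matches the earlier point that not every dataset has rows and columns.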

As @datajoely pointed we need to look at a way to let users configure this dynamically. It would be nice if this metadata can be collected for every run like experiment tracking in a database and then viz can read it ( we can have a history of metadata change ). I clearly have a huge knowledge gap in this area and let me understand hooks first before I can comment further on this ticket. Thank you !!

noklam commented 1 year ago

Happy to walk you through that, maybe can combine it with a few new joiners. It's covered in kedro intermediate training or we can revive the Kedro University.

noklam commented 1 year ago

@noklam I don't think Viz can use hooks and I think that would need to be done in Kedro, right?

@tynandebold viz can use hooks.

noklam commented 1 year ago

I think I am missing context here. I can advise on the implementation and design but I need to understand the scope of this ticket better.

@NeroOkwa Maybe a quick catch up?

What's the goal?

Is there an MVP we are aiming for?
Is filesize/row/column enough?
Any performance concerns?
Do we need to cover versions or do we only show the latest?

There are lots of optimisations we can do; the solution can also be just hooks or plugins.

NeroOkwa commented 1 year ago

@lukaszdz this feature has been implemented in the latest Kedro-Viz release. Can you confirm whether this solves your pain point and provide feedback? Thanks.

lukaszdz commented 1 year ago

@NeroOkwa This is almost there. Ideally, we would want to see the dataset sizes in the graph view so we can view any issues with the pipeline without having to click through each node in the graph. Even better if we had some way to set up some rules to color the nodes (if N=0, then color the node red)

lukaszdz commented 1 year ago

can be viewed directly on the node in the graph view:

can use abbreviations with up to 3 digits to show the rough size/number of rows.

If empty - then can be red:

The goal is to be able to quickly visually know whether some steps in the pipeline failed to run.

In the future, you could imagine also having rules to color the node as red if a node deviates from its normal values. for example, say the companies node size is 77,000 rows on Monday, 77,100 on Tuesday, 78,000 Wed, then drops to 10,000 on Thursday. Then you could see at a glance that something failed with the node, visually. This would greatly accelerate debugging pipelines.
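
A rule like that could be sketched as follows. The function name, the colour labels and the 30% threshold are assumptions for illustration; the baseline is the median of previous runs so that a single odd day does not skew it.

```python
from statistics import median
from typing import Sequence


def node_colour(history: Sequence[int], current: int, max_drop: float = 0.3) -> str:
    """Return "red" for an empty dataset or a large drop against previous runs."""
    if current == 0:
        return "red"
    if not history:
        return "default"  # nothing to compare against yet
    baseline = median(history)
    if baseline > 0 and (baseline - current) / baseline > max_drop:
        return "red"
    return "default"
```

With the numbers above, `node_colour([77_000, 77_100, 78_000], 10_000)` flags Thursday's run, while the small day-to-day growth earlier in the week does not.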

NeroOkwa commented 1 year ago

@lukaszdz thanks for the feedback.

The goal is to be able to quickly visually know whether some steps in the pipeline failed to run.

I have 3 follow up questions:

Previously, what steps have you observed failed in the pipeline run?
Isn't the information required to debug the 'failed node' already shown in the CLI?
Would you be up for a future user interview about your experience with this feature and Kedro-Viz ?

lukaszdz commented 1 year ago

I don't understand the question
I'm not sure what that screen is, but seeing something in the CLI is not as useful as seeing it visually
I'm down for a 15min user interview, but I do not use kedro at all, because either a) onboarding is too complicated or b) I can't easily build the data pipelines I want, if at all.

noklam commented 1 year ago

@lukaszdz thank you for the feedback. For 3, could you give a few more details about what kind of data pipeline you are trying to build and where Kedro fails you?

lukaszdz commented 1 year ago

I would like to be able to easily create a kedro data pipeline and call that from within a function that already exists in my code base. I installed kedro and tried to figure out how to do this from the documentation; couldn't figure it out. Then asked in slack and got a couple answers, which I haven't tried yet. I feel like doing something this simple should take me less than 10 minutes to do, and it should be very very easy/self-evident from the documentation.
I would like to be able to create a data pipeline where I can run jobs across partitions of data. Some examples of this existing in other frameworks are called window functions. For example, I have 500M rows, split by city name, and would like to create and run a node: one node for each market, as part of a pipeline. I wanted to do this a couple years ago, so not sure if this has gotten easier.

NeroOkwa commented 1 year ago

@lukaszdz pls share an email address with which I can book the user interview session. Thanks.

lukaszdz commented 1 year ago

lukasz.apps@gmail.com

NeroOkwa commented 1 year ago

@lukaszdz, the session has been booked for today 18/09/23.

kedro-org / kedro-viz