gradio-app / gradio

Build and share delightful machine learning apps, all in Python. 🌟 Star to support our work!
http://www.gradio.app
Apache License 2.0

Dataframe Improvements #962

Closed pngwn closed 2 years ago

pngwn commented 2 years ago

There are a number of issues with the Dataframe component as it exists today, and we need to do some work to fix the outstanding issues but also improve the usability for humans.

We can use this issue to keep track of the issues that have been reported and come up with a design that addresses the usability issues. I'll start with a simple proposal and we can discuss from there.

python API Changes

Today the dataframe API looks like this:

Dataframe(
  headers=None, 
  row_count=3, 
  col_count=3, 
  default_value=None, 
  datatype="str", 
  label=None, 
  col_width=None, 
  type="pandas", 
  optional=False
)

modifying column width

Proposal: _col_width should be removed._

I think allowing users to set the column width is not a good idea. The whole purpose of gradio is to generate high quality web apps to share and showcase models. This should work across device sizes and screen widths. Tables automatically adapt column widths to accommodate their content; providing an API that will almost definitely break the UI is not 'pit of success' stuff.

fixed column and row count

Proposal: _col_count and row_count should take either a number or a tuple of `(number, "fixed"|"dynamic")`._

in #868 @osanseviero wrote:

Allow making the number of columns or rows fixed, since for some use cases you don't want users creating new rows.

We do not currently have a mechanism to prevent end-users from creating new columns and rows. I think we have two options here:

Note: We could rename these kwargs to col, and row

I propose the second option (tuple) but I do not feel strongly about it.
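A minimal sketch of how the tuple form could be normalised on the Python side. The helper name `normalize_count` and the choice to treat a bare number as `"dynamic"` are assumptions made for illustration, not gradio's actual implementation.

```python
from typing import Tuple, Union

CountArg = Union[int, Tuple[int, str]]


def normalize_count(value: CountArg, name: str = "col_count") -> Tuple[int, str]:
    """Return (count, mode), where mode is "fixed" or "dynamic"."""
    if isinstance(value, int) and not isinstance(value, bool):
        # Assumption: a bare number keeps today's behaviour (users may add more).
        count, mode = value, "dynamic"
    elif isinstance(value, tuple) and len(value) == 2:
        count, mode = value
    else:
        raise TypeError(f"`{name}` must be a number or a (number, mode) tuple")

    if not isinstance(count, int) or isinstance(count, bool) or count < 0:
        raise ValueError(f"`{name}` count must be a non-negative integer")
    if mode not in ("fixed", "dynamic"):
        raise ValueError(f'`{name}` mode must be "fixed" or "dynamic", got {mode!r}')
    return count, mode
```

With this shape, `normalize_count(3)` and `normalize_count((3, "dynamic"))` describe the same thing, so existing code that passes a plain number would keep working.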

conflicts and confusements with kwargs relating to col and row quantity

Proposal: _headers, col_count, row_count, default_value should be validated to ensure there are no conflicts_.

More specifically: Any combination of kwargs that can set the column count must always equal the same number of columns. Any kwargs that can set the row count must not result in provided data being hidden.

headers and col_count can conflict; default_value and headers can conflict (kinda); default_value, col_count and row_count can conflict.

This is easiest to explain with examples.

This is confusing but not necessarily an issue:

Dataframe(
  col_count=3,
  headers=["One", "Two"]
)

However this is just wrong and will lead to unexpected behaviour:

Dataframe(
  col_count=2,
  headers=["One", "Two", "Three"]
)

What should happen here:

Dataframe(
  headers=["One", "Two", "Three"],
  default_value=[[1, 2], [3, 4]]
)

And here:

Dataframe(
  default_value=[[1, 2, 3, 4], [5, 6, 7, 8]],
  col_count=2
)

We need to figure out simple rules to validate dataframe inputs that affect the columns + widths, or decide how to normalise.

Some possible rules aimed at removing ambiguity:

The obvious counter to this is that we could add additional values to default_value or headers to 'fill in the gaps' but I think the API will be far easier to reason about for users if we have clear rules. It will allow us to easily detect errors and provide helpful messages to users. Trying to guess what users want without being explicit is how Perl happened.

This validation would happen at python time, and we could provide error messages like:

`col_count` is 3 but you passed 2 headers. Set `col_count` to 2 or add 1 header, even if it is an empty string.
`col_count` is 2 but you passed 4 headers. Set `col_count` to 4 or remove 2 headers.
default_value contains data for 4 columns but col_count is set to 2. Set col_count to 4 or remove some data from `default_value`
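The checks above could be sketched roughly as follows. This assumes the strictest possible rule (every kwarg that implies a column count must agree exactly), which is only one of the options under discussion; `check_column_kwargs` is a hypothetical helper, not an existing gradio function.

```python
from typing import Optional, Sequence


def check_column_kwargs(
    col_count: Optional[int] = None,
    headers: Optional[Sequence[str]] = None,
    default_value: Optional[Sequence[Sequence[object]]] = None,
) -> None:
    """Raise ValueError when the kwargs imply different column counts."""
    if col_count is not None and headers is not None and len(headers) != col_count:
        raise ValueError(
            f"`col_count` is {col_count} but you passed {len(headers)} headers. "
            f"Set `col_count` to {len(headers)} or pass exactly {col_count} headers."
        )
    if default_value:
        # Use the widest row so that no provided data would end up hidden.
        data_cols = max(len(row) for row in default_value)
        if col_count is not None and data_cols != col_count:
            raise ValueError(
                f"`default_value` contains data for {data_cols} columns but "
                f"`col_count` is set to {col_count}. Set `col_count` to "
                f"{data_cols} or change the data in `default_value`."
            )
        if headers is not None and data_cols != len(headers):
            raise ValueError(
                f"`default_value` contains data for {data_cols} columns but "
                f"you passed {len(headers)} headers."
            )
```

Running this at construction time means a conflicting call fails immediately with a message that names both values, rather than rendering a table that silently drops headers or data.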

proposed python API

Dataframe(
  headers=None,             # validated
  row_count=(3, "dynamic"), # validated
  col_count=(3, "dynamic"), # validated
  default_value=None,       # validated
  datatype="str", 
  label=None, 
  type="pandas", 
  optional=False
)

UX improvements

make cells easier to interact with

in #868 @osanseviero said:

Modifying the input of a cell requires double-clicking on it. I would love to be able to just click and add my input. There were also a couple of dev experience improvements I would love to see

I'm not certain about this.

The current behaviour mimics how most spreadsheets work, but users of spreadsheets frequently move around the spreadsheet before editing. Our dataframe is not a power tool but a quick user entry tool, so perhaps ease of data entry is more important than ease of cell navigation.

If we change click behaviour, we also need to change keyboard behaviour for parity of usability. Essentially this feature request is to remove the different 'states' from the dataframe, so that it is 'edit only' rather than having view/edit modes, since without a click triggering the edit state it would be impossible to reach. Static or output dataframes would still have this behaviour.

@omarespejel Could you add some more details about how you would like to interact with the dataframe? Not just click, but how would you like to change to a different cell, and how would that work for keyboard users who do not or cannot use a mouse?

better inputs when the datatype is given

in #868 @osanseviero said:

With using datatype="number", I would love if users can only write numbers and not strings. I know this modifies what is passed to the interface, but modifying the user input type as well would be great.

Currently everything is treated as a string by the frontend, even when we know the datatype. I think we can improve this significantly.

But I think we can go further if we expanded the datatype kwarg.

Be good to get your thoughts @gary149

populate dataframe from file

@merveenoyan created issue #945 discussing uploading csv/tsv files into the dataframe.

I think this makes perfect sense. We are already doing this for the timeseries; adapting it for the dataframe should be straightforward.

@merveenoyan could you clarify the first part of that issue? Are you saying it would be good for the dataframe to accept different values in the python library (i.e. the default_value kwarg)? Or that it would be good to be able to modify what is displayed after a user uploads the file via the UI (i.e. only showing the first/last 5 rows, etc.)?

row + col creation and deletion

#631

Row and column creation and deletion needs some work. Deletion isn't currently possible.

Would love to get people's thoughts on what good creation and deletion might look like. Are there other datatables you have seen in the wild that do this well while remaining very compact?

Another for @gary149

Redesign

Just putting this here for posterity. Things are being redesigned.

bugs

We have bugs:

Issues relating to features for tracking purposes:


Let me know if I have missed anything; it would be good to get people's thoughts on this.

cc @abidlabs @aliabid94 @dawoodkhan82 @aliabd @FarukOzderim

merveenoyan commented 2 years ago

populate dataframe from file @merveenoyan created issue https://github.com/gradio-app/gradio/issues/945 discussing uploading csv/tsv files into the dataframe.

I think this makes perfect sense. We are already doing this for the timeseries; adapting it for the dataframe should be straightforward.

@merveenoyan could you clarify the first part of that issue? Are you saying it would be good for the dataframe to accept different values in the python library (i.e. the default_value kwarg)? Or that it would be good to be able to modify what is displayed after a user uploads the file via the UI (i.e. only showing the first/last 5 rows, etc.)?

so normally data scientists read CSV/TSV/XLSX files and turn them into dataframe using pandas and see the header of the dataframe afterwards, it's quite typical.

import pandas as pd
df = pd.read_csv(file_path) # this directly reads file into a dataframe
df.head(number_of_rows) # shows the first number_of_rows rows
df.tail(number_of_rows) # shows the last number_of_rows rows

people do this in colab and use it to demonstrate, meanwhile gradio doesn't have it. It has a dataframe which you can't read from a file. It would be much better for tabular data workflows if we had a component that reads from file into a dataframe. (like have a drag and drop interface that turns into dataframe directly after the file is uploaded) and then if there's a model running in the background it could do inference and put the results on outputs side. When you read a file into a dataframe no modification is needed, I feel like no one does that.
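As a rough sketch of the "upload a file, see the head" flow described here, the snippet below reads a CSV/TSV stream and returns the headers plus the first few rows as lists, similar in shape to what `headers` and `default_value` take. It uses only the standard library `csv` module so it runs anywhere; `pd.read_csv(...).head(n)` would be the more typical choice in practice. The function name `preview_csv` and the assumption that the first line is a header row are illustrative only.

```python
import csv
import io
from itertools import islice
from typing import List, TextIO, Tuple


def preview_csv(
    file: TextIO, rows: int = 5, delimiter: str = ","
) -> Tuple[List[str], List[List[str]]]:
    """Return (headers, first `rows` data rows) from an open CSV/TSV stream."""
    reader = csv.reader(file, delimiter=delimiter)
    headers = next(reader, [])          # assumption: the first line holds the headers
    data = list(islice(reader, rows))   # read at most `rows` further lines
    return headers, data


# Usage with an in-memory file standing in for an uploaded one.
sample = io.StringIO("a,b\n1,2\n3,4\n5,6\n")
headers, data = preview_csv(sample, rows=2)
```

All values come back as strings here; converting them according to the `datatype` kwarg would be a separate step.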

abidlabs commented 2 years ago

Beautiful! I think you got everything @pngwn

gary149 commented 2 years ago

Thanks for wrapping up @pngwn! here are my thoughts on some UX points:

I think allowing users to set the column width is not a good idea. The whole purpose of gradio is to generate high quality web apps to share and showcase models. This should work across device sizes and screen widths. Tables automatically adapt column widths to accommodate their content; providing an API that will almost definitely break the UI is not 'pit of success' stuff.

I agree 100% 👍

Modifying the input of a cell requires double-clicking on it. I would love to be able to just click and add my input. There were also a couple of dev experience improvements I would love to see

I agree with @osanseviero on this but to be clear a single click will not put your cursor in the cell, it will focus the cell exactly like today then only if there is keyboard interaction while a cell is focused it will edit it. We should just do it like Airtable does it: single click -> cell focus -> keyboard action -> input or double click -> toggle the cursor in the cell. It's a bit hard to explain so I think it's worth trying Airtable if you want to understand this behaviour.

image
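The Airtable-style flow described above can be written down as a small state machine. This is only a sketch of the logic in Python (the real component lives in the frontend); the state names, the event names, and the `blur` transition are assumptions added for illustration.

```python
from enum import Enum


class CellState(Enum):
    IDLE = "idle"        # not selected
    FOCUSED = "focused"  # selected, cursor not in the cell
    EDITING = "editing"  # cursor in the cell, typing changes the value


def next_state(state: CellState, event: str) -> CellState:
    """Return the cell's next state for a (hypothetical) UI event name."""
    if event == "double_click":
        # Double click always puts the cursor in the cell.
        return CellState.EDITING
    if event == "click":
        # A single click only focuses; it never starts editing by itself.
        return state if state is CellState.EDITING else CellState.FOCUSED
    if event == "key_press":
        # Typing while focused starts editing; typing while idle does nothing.
        return CellState.EDITING if state is not CellState.IDLE else state
    if event == "blur":
        # Assumption: leaving the cell returns it to the idle state.
        return CellState.IDLE
    return state
```

Because `key_press` moves a focused cell straight into editing, keyboard-only users get the same path to data entry as mouse users, which addresses the parity concern raised earlier in the thread.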

But I think we can go further if we expanded the datatype kwarg.

enums/ unions would render a dropdown or autocompleting dropdown thing. we could support prefix + suffixes for currencies + measurements. `datatype=[()] it might be possible to support custom validators in the future as long as they are regex based.

I agree, maybe it could be shipped in a 2nd iteration. Airtable has done a great job with that too; it can be a good inspiration to start with:

image

Would love to get people's thoughts on what good creation and deletion might look like. Are there other datatables you have seen in the wild that do this well while remaining very compact?

I think I can figure out something for this.

Last thing to note is that Dataframes are used in AutoTrain, Dataset viewer, and Gradio.