pyapp-kit / superqt

Missing widgets and components for Qt-python
https://pyapp-kit.github.io/superqt/
BSD 3-Clause "New" or "Revised" License
210 stars 38 forks source link

WIP: Dataframe Editor #57

Closed ppwadhwa closed 2 years ago

ppwadhwa commented 2 years ago

Adds a dataframe editor widget that uses code from Spyder's dataframe editor. I tried to pull out just the code that I thought we needed here and tweaked things to make it work. Take a look and let me know what you think!

This PR addresses to https://github.com/napari/napari/issues/3512. I'm not entirely sure if this is the location we want to keep it, but I wanted to get it somewhere for viewing.

Here is a short script to test it out yourself:

from PyQt5.QtWidgets import QApplication
import pandas as pd
from dataframeeditor import DataFrameEditor
import sys

app = QApplication(sys.argv)

window = DataFrameEditor()
df = pd.DataFrame({'a': [1,2,3], 'b': ['a', 'b', 'c'], 'c':[40, 20, 30], 'd': [.21312, 343, 0.343]})
window.setup_and_check(df, title='Test 123')
window.show()  
app.exec()

https://user-images.githubusercontent.com/54282105/150072182-c01f42e6-671e-440b-ac1f-19d75ba8d3d8.mov

andy-sweet commented 2 years ago

I still need to review the code, but I pulled this branch down, tested this a little, and it mostly looks great! Some observations.

Screen Shot 2022-01-19 at 11 22 18 AM

I'll hopefully take a look at the code later today, or tomorrow. As this is vendored code, I think it would be useful to include a link to the original source and to try to clarify where we have made significant changes.

goanpeca commented 2 years ago

Hi @ppwadhwa thanks for working on this, seems to be working as expected :)

I have some comments/suggestions on updates:

tlambert03 commented 2 years ago

Looking good. I agree with all of @goanpeca's comments. I do think we should think carefully about the API, giving each public method some scrutiny, possibly renaming.

Also, would want to see complete tests, and probably an example or two, before this gets merged. We've generally been very careful and thorough with testing in superqt, so it's possible that this vendored code might not work in one or more of our test setups

tlambert03 commented 2 years ago

another general comment here. I think if we're going to put this in superqt, we should "own" it so to speak, and not put it in _vendor (though do include the license and attribution at the top of the file). But feel free to change the API as desired, with no aim of being mergeable with upstream changes.

Also just want to mention: @jni and I had a rather lengthy discussion about Table/Dataframe widget APIs over in https://github.com/napari/magicgui/pull/61 (which we were imagining at the time would accomplish something like this). So it might be worth a read through some of what was discussed over there and – if not terribly difficult – try to have some degree of parity between that one and this one.

ppwadhwa commented 2 years ago

@tlambert03 will take a look!

tlambert03 commented 2 years ago

looks like tests need pandas to run. Chatted with @goanpeca about this. Let's add an extra to setup.cfg called table or dataframe or something, and add the pandas dependency to it. Then add pandas to the testing requirements as well

codecov[bot] commented 2 years ago

Codecov Report

Merging #57 (07218e0) into main (329eaaa) will decrease coverage by 3.16%. The diff coverage is 73.08%.

:exclamation: Current head 07218e0 differs from pull request most recent head fc8d744. Consider uploading reports for the commit fc8d744 to get more accurate results

Impacted file tree graph

@@            Coverage Diff             @@
##             main      #57      +/-   ##
==========================================
- Coverage   84.58%   81.42%   -3.17%     
==========================================
  Files          24       28       +4     
  Lines        2108     3009     +901     
==========================================
+ Hits         1783     2450     +667     
- Misses        325      559     +234     
Impacted Files Coverage Δ
src/superqt/dataframeeditor/_dataframeeditor.py 72.98% <72.98%> (ø)
src/superqt/__init__.py 100.00% <100.00%> (ø)
src/superqt/dataframeeditor/__init__.py 100.00% <100.00%> (ø)
src/superqt/sliders/_labeled.py 79.79% <0.00%> (-2.87%) :arrow_down:
src/superqt/fonticon/_qfont_icon.py 92.70% <0.00%> (-0.73%) :arrow_down:
src/superqt/utils/_qthreading.py 77.05% <0.00%> (-0.68%) :arrow_down:
src/superqt/utils/__init__.py 100.00% <0.00%> (ø)
src/superqt/utils/_throttler.py 90.00% <0.00%> (ø)
src/superqt/utils/_misc.py 100.00% <0.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 329eaaa...fc8d744. Read the comment docs.

ppwadhwa commented 2 years ago

@tlambert03 I know I need to check into the failing tests--I will do that. Did I update the setup.cfg file properly?

ppwadhwa commented 2 years ago

@tlambert03 @sofroniewn @jni @goanpeca

Hi! I wanted to get some feedback regarding the API for the dataframe editor. Below are some public methods that are either already implemented or I thought would be good to implement.

Could you let me know what you think of these and if there is some functionality you think is missing?

class QDataFrameEditor(QWidget):

   # Signal emitted when value of dataframe changes.
   valueChanged = Signal(pd.DataFrame)

   def __init__(self, data: pd.DataFrame, parent: Optional[QWidget] = None, **kwargs) -> None:
        """Creates a widget with Pandas dataframe."""

   def getValue(self) -> pd.DataFrame:
        """Return current dataframe object."""

   def setData(self, row, col, value):
       """Set a value in the dataframe viewer at a certain row and column location."""

   def addRow(self) -> None:
        """Add row to the dataframe viewer."""

   def addColumn(self) -> None:
        """Add column to the dataframe viewer."""  (not implemented yet)

   def deleteRow(self) -> None:
       """Delete row from the dataframe viewer."""  (not implemented yet)

   def deleteColumn(self) -> None:
        """Delete column to the dataframe viewer."""  (not implemented yet)

   def enableBackgroundColor(self, value: bool) -> None:
        """Colors background in columns based on scale of values in each column"""
tlambert03 commented 2 years ago

thanks @ppwadhwa ... couple questions:

  1. valueChanged = Signal(pd.DataFrame) - will this collect and emit the entire data frame every time any single cell changes? If so, are there any performance concerns there? I wonder if we might want to have a more granular signal, something like a signal with the row, column, and new value. (see magicgui table change event)
  2. Will there be a way to edit a column or row, or will one have to loop through their data manually, calling setData(row, column, value) for every item in the row/column?
  3. Along those lines, will there be an api for adding a new non empty column/row? Something like addColumn(header, data)
  4. deleteRow/deleteColumn don't take inputs... was that just an oversight? or would the always delete a specific/laster column/row?
  5. insertColumn(index) / insertRow(index)?

one of the bits of the magicgui table API that I was most happy with and that we discussed for a while is the __getitem__, __setitem__, __delitem__ methods here. This allows for numpy-like indexing and editing onto the 2D data array, For example table.data[0, 2] gets the data in the cell of the first row, 3rd column. table.data[:10, :2] gets the first 10 rows in the first 2 columns, etc.... Also, these lines here let us get/set/append/delete columns with table['header'] = [...].

(I'm sorry to keep coming back to that table PR in magicgui, but I would like to have a solid understanding of what is being gained here in comparison to that existing implementation, so we can best take advantage of it, as well as what is being lost here compared to that existing table editor, hopefully so we can maintain what already works well).

thanks!

ppwadhwa commented 2 years ago

thanks @tlambert03 for your comments/questions.

I hope others will still weigh in on this, but I thought I'd share what I think/know...

  1. I didn't think about this too hard. Thinking about it, emitting the whole dataframe is not a good idea, especially if its large. I originally thought that it could get complicated depending on the change that was made --- whether you change a value, add a value, or delete a value. so, just emitting the dataframe seemed like the easiest option, but that is not going to work. My next thought is to create a dictionary with the row, col, value, and a flag that indicates whether it was added, deleted or changed. If we have a row/column add/delete that may mean returning lists of values along with lists of rows, cols. What do you think?
  2. I like the idea of being able to edit a whole row, col. I think that should be there.
  3. I also like being able to set the data with non-empty values. I think that should be there too.
  4. That was just an oversight. This needs input values.
  5. Yes, need to be able to add a row/column at a certain index. I guess I'd like to do that within with the add row/column functions, but should that be separate?

As far as what is different about this widget compared to the magic gui widget---My understanding is that this one has multi-indexing support and lazy loading for larger dataframes. For display, it has color coding based on scale and easy formatting of float values.

andy-sweet commented 2 years ago

Rather than dropping low-level/line comments on this WIP PR (which I can also do if desired), maybe I can give more context around how I expect napari to use the widget here or alternatively magicgui.Table. I appreciate that's not the only goal here, but I see it as the primary one.

In summary, one advantage of the contribution here is that it assumes the tabular data is a pandas.DataFrame. That's a little risky for Layer.features (see below), but it should have some performance benefits - particularly with regards to copying data. Add that to the performance benefits of the promised lazy loading here, and this PR becomes enticing when Layer.features is big.

The widget here also has some extra functionality compared to magicgui.Table and some missing functionality that could be implemented in this PR or in follow-ups.

My gut instinct is that we should initially use magicgui.Table to implement https://github.com/napari/napari/issues/3512, as long that doesn't introduce any behavior that we couldn't easily implement with the widget here. I think that task is relatively simple, though we have to think about the graphical design of how to open the features table associated with the layer. Then when we see how that behaves and how it is received, we could more easily motivate the widget here if it's lacking or non-performant. How does that sound?

More details/notes below!

napari's Layer.features getter may not return a pandas.DataFrame in the future

The public API of the Layer.features getter doesn't guarantee that it returns a pandas.DataFrame, only a dataframe-like. That's not well defined at the moment, but we might aim to define a LayerFeaturesProtocol (in a similar spirit to LayerDataProtocol) that is implemented by some of the dataframe-like libraries (e.g. pandas, cudf, vaex, etc.). Right now, the underlying implementation actually uses a pandas.DataFrame, so that's what is returned by the getter.

Of course we can rely on that implementation detail for now, but I'd be worried about depending on too much of the very large pandas.DataFrame API. We should try to limit ourselves to the basics like shape, __getitem__, __setitem__, __delitem__, iloc, and maybe iat that are likely to supported by most implementations or would be easy to wrap/implement ourselves.

napari's Layer.features setter coerces the input and flattens a multi-index on the rows

We also do some coercion in the setter, including flattening the row-index to use the simple/default RangeIndex. We might want to support fancier row-indices (i.e. labels and multi-indices) in the future, though that may not be compatible with some of the design decisions of the Data APIs thoughts on a dataframe protocol, which we might rely on. Therefore, support for multi-index rows is not an immediate priority. Would be interesting if column multi-indexing is supported by the implementation in this PR, though that may also not play nicely with current thoughts about a simple dataframe protocol (instead a tidy-data structure is probably the safe play here).

Copies and references

Right now, magicgui's Table will in general make a copy of the underlying data Layer.features because when the input is a pandas.DataFrame it calls pandas.DataFrame.values, which only creates a view of the underlying array when all the columns share the same dtype, and when that dtype is supported by numpy (i.e. it's not a pandas extension type like CategoricalDtype).

I think the only concern here is the performance of initializing a table with duplicate data, mostly in terms of memory usage. I suspect in practice that won't be a big concern as I expect Layer.features to be small in comparison to Layer.data, but maybe I'm wrong.

tlambert03 commented 2 years ago

this appears to be inactive. Will close soon unless instructed otherwise.