man-group / dtale

Visualizer for pandas data structures
http://alphatechadmin.pythonanywhere.com
GNU Lesser General Public License v2.1

Show duplicates: Bug or wanted behaviour ? #840

Closed: legaultpierre closed this issue 5 months ago

legaultpierre commented 5 months ago

Hello !

First, thanks for your work, I just discovered your library and I already love it !

Context

I am using your tool on the MSLR-WEB10K > Fold1 > train dataset. I want to know what the duplicates are.

How the behaviour / error happened

Once the GUI is launched with the code in the "Code to reproduce" section, I navigate to Visualize > Duplicates > Show Duplicates > View Duplicates. After clicking "View Duplicates", I get the following error:

Traceback (most recent call last):
  File "/home/pierre/02-perso/02-toy-projects/.venv/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3791, in get_loc
    return self._engine.get_loc(casted_key)
  File "index.pyx", line 152, in pandas._libs.index.IndexEngine.get_loc
  File "index.pyx", line 181, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 7080, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 7088, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: None

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/pierre/02-perso/02-toy-projects/.venv/lib/python3.10/site-packages/dtale/views.py", line 119, in _handle_exceptions
    return func(*args, **kwargs)
  File "/home/pierre/02-perso/02-toy-projects/.venv/lib/python3.10/site-packages/dtale/views.py", line 1935, in get_duplicates
    return jsonify(results=duplicate_check.test())
  File "/home/pierre/02-perso/02-toy-projects/.venv/lib/python3.10/site-packages/dtale/duplicate_checks.py", line 39, in test
    return self.checker.check(data)
  File "/home/pierre/02-perso/02-toy-projects/.venv/lib/python3.10/site-packages/dtale/duplicate_checks.py", line 185, in check
    duplicates = df[group].reset_index().groupby(group).count()
  File "/home/pierre/02-perso/02-toy-projects/.venv/lib/python3.10/site-packages/pandas/core/frame.py", line 3893, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/pierre/02-perso/02-toy-projects/.venv/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3798, in get_loc
    raise KeyError(key) from err
KeyError: None

Code to reproduce

import dtale
import dtale.global_state as global_state

import os

import numpy as np  # needed for np.float32 in the load_svmlight_file call below
import pandas
from sklearn.datasets import load_svmlight_file

if __name__ == '__main__':
    dataset_folder = os.path.join(os.path.dirname(os.path.abspath(__file__)), "resources", "MSLR-WEB10K")

    fold = 1
    fold_path = os.path.join(dataset_folder, f"Fold{fold}")
    train_path = os.path.join(fold_path, "train.txt")
    X_train, y_train, qid_train = load_svmlight_file(
        train_path, query_id=True, dtype=np.float32
    )
    df = pandas.concat([pandas.DataFrame(y_train), pandas.DataFrame(qid_train), pandas.DataFrame(X_train.toarray())],
                       axis=1)
    df.columns = ["relevance_level", "query_id", *[f"feat_{i}" for i in range(0, 136)]]

    df['relevance_level'] = df['relevance_level'].astype('int')

    global_state.set_chart_settings({'scatter_points': 15000, '3d_points': 40000})
    dtale.show(df, subprocess=False)

Lib versions in env:

[tool.poetry.dependencies]
python = "~3.10"
pandas = "2.1.4"
scikit-learn = "1.3.2"
sweetviz = "^2.3.1"
# force kaleido to prevent error on installing dtale 0.2.1.post1
kaleido = "0.2.1"
dtale = "^3.9.0"

Question

Is this the intended behaviour (i.e. you want to force people to select some columns), or is it a bug?

If this is the intended behaviour, please consider adding the ability to show duplicate rows without having to select columns: in datasets with a lot of features, it is quite annoying to select them all!

Thanks in advance :-)

T-a-c-h-y-o-n commented 5 months ago

When there are so many variables, it can be really tiring to choose them one by one. It would be great if there were a "select all" option to see and delete duplicate rows.

aschonfeld commented 5 months ago

@legaultpierre when clicking "View Duplicates", did you select a column from the dropdown above the button? If you don't, you will hit this error. I can update the UI to disable the button until a column is selected, but for now selecting a column before clicking will solve your problem.
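For anyone reading the traceback, here is a minimal sketch of why an empty selection ends in `KeyError: None`. It assumes (this is not confirmed from the D-Tale source) that the unselected dropdown reaches the check as `group = None`; the small DataFrame is made up for illustration. Only the `df[group]` lookup is taken from the traceback.

```python
import pandas as pd

# Toy frame standing in for the real dataset (illustrative values only)
df = pd.DataFrame({"query_id": [1, 1, 2], "feat_0": [0.5, 0.5, 0.7]})

# Assumption: this is what the duplicate check receives when no column is selected
group = None

try:
    # Same lookup as in the traceback: pandas treats None as a single column label
    df[group]
    raised = None
except KeyError as exc:
    raised = exc

print(repr(raised))
```

Since no column is labelled `None`, pandas raises a `KeyError` from `columns.get_loc`, which is the last frame in the traceback above.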

@T-a-c-h-y-o-n I'll look into adding a "select all" option
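Until such an option exists, the same result can be approximated in plain pandas before handing the frame to D-Tale. This is a rough sketch using only standard pandas calls, not D-Tale's own implementation; the toy DataFrame and the variable names are illustrative.

```python
import pandas as pd

# Toy frame: rows 0 and 2 are identical across every column
df = pd.DataFrame(
    {
        "relevance_level": [0, 1, 0, 2],
        "query_id": [1, 1, 1, 2],
        "feat_0": [0.5, 0.3, 0.5, 0.9],
    }
)

# keep=False flags every member of a duplicate group, not only the later copies
all_cols_dupes = df[df.duplicated(keep=False)]

# "Select all": group on every column and keep combinations seen more than once
group = list(df.columns)
counts = df.groupby(group).size()
dupe_counts = counts[counts > 1]

# Drop the repeated rows, keeping the first occurrence of each
deduped = df.drop_duplicates()

print(all_cols_dupes)
print(dupe_counts)
print(len(deduped))
```

On a frame with 138 columns, like the one in this issue, `list(df.columns)` avoids having to pick each column by hand.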

aschonfeld commented 5 months ago

@legaultpierre just released v3.10.0 to pypi (should be on conda-forge soon) with this update to hide the "View Duplicates" button included.

Also, if you haven't already, please put your ⭐ on the repo when you get a sec. Thanks! 🙏

legaultpierre commented 1 month ago

Hello @aschonfeld, sorry for the (very) late response, work life has been a little bit overwhelming! Thanks a lot for the changes, that's what I needed!