jmcarpenter2 / swifter

A package which efficiently applies any function to a pandas dataframe or series in the fastest available manner
MIT License

swifter.groupby() does not support dropna=False #202

Open yangyxt opened 2 years ago

yangyxt commented 2 years ago

I found that a swifter groupby-apply chain raises an error while sorting the index if I set dropna=False in the groupby step.

Here is the error log:

Traceback (most recent call last):
  File "/paedyl01/disk1/yangyxt/ngs_scripts/acmg_automated_anno.py", line 76, in wrapper
    result = func(*args, **kwargs)
  File "/paedyl01/disk1/yangyxt/ngs_scripts/acmg_automated_anno.py", line 484, in BP2_PM3_compound_with_patho
    return df.swifter.groupby([gene_col], as_index=False, dropna=False).apply(check_compound_per_gene,
  File "/home/yangyxt/anaconda3/envs/dask/lib/python3.9/site-packages/swifter/swifter.py", line 661, in apply
    return self._ray_apply(func, *args, **kwds)
  File "/home/yangyxt/anaconda3/envs/dask/lib/python3.9/site-packages/swifter/swifter.py", line 650, in _ray_apply
    return pd.concat(ray.get(apply_chunks), axis=self._axis).sort_index()
  File "/home/yangyxt/anaconda3/envs/dask/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/home/yangyxt/anaconda3/envs/dask/lib/python3.9/site-packages/pandas/core/frame.py", line 6447, in sort_index
    return super().sort_index(
  File "/home/yangyxt/anaconda3/envs/dask/lib/python3.9/site-packages/pandas/core/generic.py", line 4685, in sort_index
    indexer = get_indexer_indexer(
  File "/home/yangyxt/anaconda3/envs/dask/lib/python3.9/site-packages/pandas/core/sorting.py", line 94, in get_indexer_indexer
    indexer = nargsort(
  File "/home/yangyxt/anaconda3/envs/dask/lib/python3.9/site-packages/pandas/core/sorting.py", line 417, in nargsort
    indexer = non_nan_idx[non_nans.argsort(kind=kind)]
TypeError: '<' not supported between instances of 'int' and 'tuple'
ERROR:2022-09-28 13:40:29,310:wrapper:83:Exception raised in main_anno_process. exception: '<' not supported between instances of 'int' and 'tuple'

The dataframe passed to swifter.groupby() has an ordinary numerical index, from 0 to len(df). The groupby column may have some rows with NA values, and I do wish to keep them. I guess that is why this issue happened. I'm not sure whether this can be fixed or optimized. Please take a look.

jmcarpenter2 commented 2 years ago

Hey @yangyxt

Thanks for raising this issue. I tried to look into it and test with a synthetic dataframe. I included a NaN in the groups and didn't encounter this issue.

[Screenshot: 2022-09-28, 12:37:30 PM]

Looking more closely at your error message, it looks as though you may have a tuple in your groupby column.

TypeError: '<' not supported between instances of 'int' and 'tuple'

Can you check if the column gene_col is entirely of type int only?
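
A check like the one requested here could be sketched as follows. This is only a minimal illustration in plain Python: the helper names `label_type_counts` and `is_sortable` are hypothetical and not part of swifter or pandas, and the sample labels are made up to mirror the int/tuple mix named in the error message.

```python
from collections import Counter


def label_type_counts(labels):
    """Count how many labels of each Python type appear in an iterable."""
    return Counter(type(label).__name__ for label in labels)


def is_sortable(labels):
    """Return True if the labels can be ordered with '<', False on TypeError."""
    try:
        sorted(labels)
        return True
    except TypeError:
        return False


# Hypothetical labels: plain integers mixed with one tuple.
mixed = [0, 1, 2, ("a", 1)]

print(label_type_counts(mixed))  # tallies of 'int' and 'tuple' labels
print(is_sortable(mixed))        # False: an int cannot be compared with a tuple
print(is_sortable([0, 1, 2]))    # True: all labels are ints
```

Because both helpers accept any iterable, they could presumably be applied to the groupby column (for example `df[gene_col]`) or to the index of the concatenated result to see whether more than one label type is present.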

fiskus2 commented 1 year ago

Hi @jmcarpenter2, I have the same issue, but in my case it is unrelated to dropna. After a lot of debugging I can confirm that this error occurs under the following circumstances:

Some of these requirements seem very arbitrary, so it may just be a sporadic error. Below is a script that produces the error. I have tested it on two different machines. However, I have also had other scripts that produced the error on one machine, but not the other.

import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import swifter
import platform
import ray
import psutil
import multiprocessing

print(pd.__version__)
print(swifter.__version__)
print(ray.__version__)
print(platform.python_version())
print(platform.platform())
print(psutil.virtual_memory().total/1000000000, 'GB')
print(multiprocessing.cpu_count())

def foo(group):
    group = group.sort_values('sort_col')
    return group

data = []
row1 = ['a', 1, 1, datetime(2023, 1, 1)]
row2 = ['b', 2, 2, datetime(2023, 1, 1)]
cols = ['group_col1', 'group_col2', 'sort_col', 'timestamp_col']

data = [row1]*1 + [row2]*5000   #This works: [row1]*17 + [row2]*5000
df = pd.DataFrame(data, columns=cols)

df.swifter.groupby(['group_col1', 'group_col2']).apply(foo)

Output:

1.3.5
1.3.4
2.1.0
3.7.5
Windows-10-10.0.19041-SP0
34.358714368 GB
8
  0%|                                                                                            | 0/2 [00:00<?, ?it/s]
2023-01-27 16:15:27,963 INFO worker.py:1528 -- Started a local Ray instance.
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:07<00:00,  3.54s/it]
Traceback (most recent call last):
  File ".\swifter_error.py", line 30, in <module>
    df.swifter.groupby(['group_col1', 'group_col2']).apply(foo)
  File "C:\Users\qxz2a5z\AppData\Roaming\Python\Python37\site-packages\swifter\swifter.py", line 661, in apply
    return self._ray_apply(func, *args, **kwds)
  File "C:\Users\qxz2a5z\AppData\Roaming\Python\Python37\site-packages\swifter\swifter.py", line 650, in _ray_apply
    return pd.concat(ray.get(apply_chunks), axis=self._axis).sort_index()
  File "C:\Users\qxz2a5z\AppData\Roaming\Python\Python37\site-packages\pandas\util\_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\qxz2a5z\AppData\Roaming\Python\Python37\site-packages\pandas\core\frame.py", line 6402, in sort_index
    key,
  File "C:\Users\qxz2a5z\AppData\Roaming\Python\Python37\site-packages\pandas\core\generic.py", line 4545, in sort_index
    target, level, ascending, kind, na_position, sort_remaining, key
  File "C:\Users\qxz2a5z\AppData\Roaming\Python\Python37\site-packages\pandas\core\sorting.py", line 92, in get_indexer_indexer
    target, kind=kind, ascending=ascending, na_position=na_position
  File "C:\Users\qxz2a5z\AppData\Roaming\Python\Python37\site-packages\pandas\core\sorting.py", line 391, in nargsort
    return items.argsort(ascending=ascending, kind=kind, na_position=na_position)
  File "C:\Users\qxz2a5z\AppData\Roaming\Python\Python37\site-packages\pandas\core\arrays\base.py", line 633, in argsort
    mask=np.asarray(self.isna()),
  File "C:\Users\qxz2a5z\AppData\Roaming\Python\Python37\site-packages\pandas\core\sorting.py", line 403, in nargsort
    indexer = non_nan_idx[non_nans.argsort(kind=kind)]
TypeError: '<' not supported between instances of 'tuple' and 'int'

jmcarpenter2 commented 1 year ago

Thank you for this very clear and reproducible code and logging! I will look into this shortly.

jmcarpenter2 commented 1 year ago

I tried running this code locally and did not run into the issue. The only major difference I see between our environments is that yours is Windows. I am going to start testing this code on Windows machines as part of my CI. This is also related to #175 and #148, and potentially #176.

[Screenshot: 2023-03-24, 11:19:10 AM]

jmcarpenter2 commented 1 year ago

Added Windows CI, but it didn't uncover anything :/