Open yangyxt opened 2 years ago
Hey @yangyxt
Thanks for raising this issue. I tried to look into it and test with a synthetic dataframe. I included a NaN in the groups and didn't encounter this issue.
Looking more closely at your error message, it looks as though you may have a tuple in your groupby column.
TypeError: '<' not supported between instances of 'int' and 'tuple'
Can you check whether the column gene_col contains only int values?
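One way to run that check, as a minimal sketch: count the Python types that actually occur among the column's values. `count_types` is a hypothetical helper written for this illustration (it is not part of pandas or swifter), and the sample values are made up.

```python
from collections import Counter


def count_types(values):
    """Count how many values of each Python type an iterable holds."""
    return Counter(type(v).__name__ for v in values)


# Stand-in for the values of df[gene_col]; made up for illustration.
sample = [1, 2, (3, 4), float("nan")]
print(count_types(sample))
```

With a real dataframe, the call would be `count_types(df[gene_col])`. More than one entry in the result means the column holds mixed types; note that a NaN is itself counted as a float.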
Hi @jmcarpenter2 I have the same issue, but it is unrelated to dropna in my case. After lots of debugging I can confirm that this error occurs under the following circumstances:
Some of these requirements seem very arbitrary, so it may just be a sporadic error. Below is a script that produces the error. I have tested it on two different machines. However, I have also had other scripts that produced the error on one machine, but not the other.
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import swifter
import platform
import ray
import psutil
import multiprocessing
print(pd.__version__)
print(swifter.__version__)
print(ray.__version__)
print(platform.python_version())
print(platform.platform())
print(psutil.virtual_memory().total/1000000000, 'GB')
print(multiprocessing.cpu_count())
def foo(group):
    group = group.sort_values('sort_col')
    return group
data = []
row1 = ['a', 1, 1, datetime(2023, 1, 1)]
row2 = ['b', 2, 2, datetime(2023, 1, 1)]
cols = ['group_col1', 'group_col2', 'sort_col', 'timestamp_col']
data = [row1]*1 + [row2]*5000 #This works: [row1]*17 + [row2]*5000
df = pd.DataFrame(data, columns=cols)
df.swifter.groupby(['group_col1', 'group_col2']).apply(foo)
Output:
1.3.5
1.3.4
2.1.0
3.7.5
Windows-10-10.0.19041-SP0
34.358714368 GB
8
0%| | 0/2 [00:00<?, ?it/s]
2023-01-27 16:15:27,963 INFO worker.py:1528 -- Started a local Ray instance.
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:07<00:00, 3.54s/it]
Traceback (most recent call last):
File ".\swifter_error.py", line 30, in <module>
df.swifter.groupby(['group_col1', 'group_col2']).apply(foo)
File "C:\Users\qxz2a5z\AppData\Roaming\Python\Python37\site-packages\swifter\swifter.py", line 661, in apply
return self._ray_apply(func, *args, **kwds)
File "C:\Users\qxz2a5z\AppData\Roaming\Python\Python37\site-packages\swifter\swifter.py", line 650, in _ray_apply
return pd.concat(ray.get(apply_chunks), axis=self._axis).sort_index()
File "C:\Users\qxz2a5z\AppData\Roaming\Python\Python37\site-packages\pandas\util\_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "C:\Users\qxz2a5z\AppData\Roaming\Python\Python37\site-packages\pandas\core\frame.py", line 6402, in sort_index
key,
File "C:\Users\qxz2a5z\AppData\Roaming\Python\Python37\site-packages\pandas\core\generic.py", line 4545, in sort_index
target, level, ascending, kind, na_position, sort_remaining, key
File "C:\Users\qxz2a5z\AppData\Roaming\Python\Python37\site-packages\pandas\core\sorting.py", line 92, in get_indexer_indexer
target, kind=kind, ascending=ascending, na_position=na_position
File "C:\Users\qxz2a5z\AppData\Roaming\Python\Python37\site-packages\pandas\core\sorting.py", line 391, in nargsort
return items.argsort(ascending=ascending, kind=kind, na_position=na_position)
File "C:\Users\qxz2a5z\AppData\Roaming\Python\Python37\site-packages\pandas\core\arrays\base.py", line 633, in argsort
mask=np.asarray(self.isna()),
File "C:\Users\qxz2a5z\AppData\Roaming\Python\Python37\site-packages\pandas\core\sorting.py", line 403, in nargsort
indexer = non_nan_idx[non_nans.argsort(kind=kind)]
TypeError: '<' not supported between instances of 'tuple' and 'int'
Thank you for this very clear and reproducible code and logging! I will look into this shortly
I tried running this code locally and did not run into the issue. The only major difference I see between our environments is that yours is Windows. I am going to start testing this code on Windows machines as part of my CI. This may also be related to #175 and #148, and potentially #176.
Added Windows CI, but it didn't uncover anything :/
I found that the swifter groupby-apply chain raises an error when it tries to sort the index if I set dropna to False in the groupby step.
Here is the error log:
Traceback (most recent call last):
File "/paedyl01/disk1/yangyxt/ngs_scripts/acmg_automated_anno.py", line 76, in wrapper
result = func(*args, **kwargs)
File "/paedyl01/disk1/yangyxt/ngs_scripts/acmg_automated_anno.py", line 484, in BP2_PM3_compound_with_patho
return df.swifter.groupby([gene_col], as_index=False, dropna=False).apply(check_compound_per_gene,
File "/home/yangyxt/anaconda3/envs/dask/lib/python3.9/site-packages/swifter/swifter.py", line 661, in apply
return self._ray_apply(func, *args, **kwds)
File "/home/yangyxt/anaconda3/envs/dask/lib/python3.9/site-packages/swifter/swifter.py", line 650, in _ray_apply
return pd.concat(ray.get(apply_chunks), axis=self._axis).sort_index()
File "/home/yangyxt/anaconda3/envs/dask/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "/home/yangyxt/anaconda3/envs/dask/lib/python3.9/site-packages/pandas/core/frame.py", line 6447, in sort_index
return super().sort_index(
File "/home/yangyxt/anaconda3/envs/dask/lib/python3.9/site-packages/pandas/core/generic.py", line 4685, in sort_index
indexer = get_indexer_indexer(
File "/home/yangyxt/anaconda3/envs/dask/lib/python3.9/site-packages/pandas/core/sorting.py", line 94, in get_indexer_indexer
indexer = nargsort(
File "/home/yangyxt/anaconda3/envs/dask/lib/python3.9/site-packages/pandas/core/sorting.py", line 417, in nargsort
indexer = non_nan_idx[non_nans.argsort(kind=kind)]
TypeError: '<' not supported between instances of 'int' and 'tuple'
ERROR:2022-09-28 13:40:29,310:wrapper:83:Exception raised in main_anno_process. exception: '<' not supported between instances of 'int' and 'tuple'
The dataframe passed to swifter.groupby() has an ordinary numerical index, from 0 to len(df). The groupby column may have some rows with NA values, and I do want to keep them. I guess that's why this issue happens. I'm not sure whether this can be fixed or optimized. Please take a look.
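Both tracebacks in this thread end in a `sort_index()` call that fails while comparing an int with a tuple, which suggests the concatenated result carries index labels of both kinds. The sketch below reproduces only that comparison failure in plain pandas, without swifter or ray. The frames and the `try_sort_index` helper are hypothetical and made up for illustration; the sketch does not explain how mixed labels would arise inside swifter.

```python
import pandas as pd


def try_sort_index(frame):
    """Return the sorted frame, or the TypeError message if sorting fails."""
    try:
        return frame.sort_index()
    except TypeError as exc:
        return str(exc)


# Made-up frame whose index mixes integer labels with a tuple label.
mixed = pd.DataFrame({"value": [10, 20, 30]}, index=[1, ("a", 1), 0])
outcome = try_sort_index(mixed)
print(outcome)

# The same data with integer-only labels sorts normally.
clean = pd.DataFrame({"value": [10, 20, 30]}, index=[1, 2, 0])
print(try_sort_index(clean).index.tolist())
```

If the result of the swifter call can be captured before the sort, inspecting the types of its index labels in the same way may show which chunk contributes the tuple labels.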