Closed aaronspring closed 3 years ago
Thank you for a thorough report, @aaronspring! Are you able to open the CSV externally via pandas? If so, can you post the output of `df.info()` here?
Yes, that works, but I get a warning:

```python
>>> df = pd.read_csv('catalogue.csv.gz')
/work/mh0727/m300524/conda-envs/mistral/lib/python3.8/site-packages/IPython/core/interactiveshell.py:3146: DtypeWarning: Columns (1) have mixed types. Specify dtype option on import or set low_memory=False.
  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2943214 entries, 0 to 2943213
Data columns (total 11 columns):
 #   Column          Dtype
---  ------          -----
 0   experiment_id   object
 1   time_range      object
 2   stream_id       object
 3   institution_id  object
 4   source_id       object
 5   model_id        object
 6   grid_label      object
 7   path            object
 8   member_id       int64
 9   dcpp_init_year  float64
 10  variable_id     object
dtypes: float64(1), int64(1), object(9)
memory usage: 247.0+ MB
```
It seems that your system's Linux kernel allows overcommitting memory, and NumPy is taking advantage of that by over-allocating more memory than is actually needed.
I am going to look into ways to get unique values with (1) limited involvement of NumPy and (2) a smaller memory footprint.
I will get back to you once I have a working solution.
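For illustration only (this is not the actual fix that landed, just a sketch of the set-based idea): a plain Python set deduplicates incrementally, so memory stays proportional to the number of *unique* values rather than the number of rows, unlike `np.unique` over a fully flattened list.

```python
import itertools
import pandas as pd

def unique_values(df: pd.DataFrame, col: str) -> list:
    """Collect the unique values of one column without building a huge array."""
    values = df[col].dropna().tolist()
    # Flatten one level, since cells may hold lists (multi-variable assets).
    flat = itertools.chain.from_iterable(
        v if isinstance(v, (list, tuple, set)) else [v] for v in values
    )
    # The set grows with the number of distinct values only.
    return sorted(set(flat))

# Tiny demo frame standing in for the 2.9-million-row catalog.
demo = pd.DataFrame({"variable_id": [["tas", "pr"], ["tas"], None, "psl"]})
print(unique_values(demo, "variable_id"))  # ['pr', 'psl', 'tas']
```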
@aaronspring,
When you get a chance, do you mind trying https://github.com/intake/intake-esm/pull/313 and letting me know how it goes?
```shell
python -m pip install git+https://github.com/andersy005/intake-esm.git@fix-memory-error
```
Great, the error doesn't occur anymore. The warning remains.
Awesome! Thank you, @aaronspring
Regarding the warning: my understanding is that pandas encounters a mix of dtypes in one of your columns during its dtype-inference heuristics. My recommendation is to enforce the dtype (if you already know it) via `csv_kwargs`:

```python
import ast
import intake

col = intake.open_esm_datastore(
    "path/catalogue.json",
    csv_kwargs={
        "converters": {"variable_id": ast.literal_eval},
        "dtype": {"problematic_column": "string"},
    },
)
```
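For illustration, here is the same dtype override at the plain-pandas level, on synthetic data (`time_range` stands in for whichever column the warning flags): forcing a single dtype up front sidesteps pandas' chunked dtype inference, which is what emits the `DtypeWarning`.

```python
import io
import pandas as pd

# Synthetic stand-in for a column whose values look numeric in some rows
# and string-like in others.
csv = io.StringIO("time_range\n2000\n1850-1859\n2000\n")

# Declaring the dtype means pandas never has to guess it per chunk.
df = pd.read_csv(csv, dtype={"time_range": "string"})
print(df["time_range"].dtype)  # string
```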
Description

I created a csv with 2,943,214 entries (= files) of non-CMORized data: multi-variable one-year files with up to 200 variables each. When opening the catalog, I get a MemoryError because of the unique checking.

What I Did
```
            f'{self.esmcol_data["id"]} catalog with {len(self)} dataset(s) from {len(self.df)} asset(s):\n\t{text}'

/work/mh0727/m300524/conda-envs/mistral/lib/python3.8/site-packages/intake_esm/core.py in nunique(self)
    760         """
    761
--> 762         uniques = self.unique(self.df.columns.tolist())
    763         nuniques = {}
    764         for key, val in uniques.items():

/work/mh0727/m300524/conda-envs/mistral/lib/python3.8/site-packages/intake_esm/core.py in unique(self, columns)
    818
    819         """
--> 820         return _unique(self.df, columns)
    821
    822     def to_dataset_dict(

/work/mh0727/m300524/conda-envs/mistral/lib/python3.8/site-packages/intake_esm/search.py in _unique(df, columns)
     16     for col in columns:
     17         values = df[col].dropna().values
---> 18         uniques = np.unique(list(_flatten_list(values))).tolist()
     19         info[col] = {'count': len(uniques), 'values': uniques}
     20     return info

<__array_function__ internals> in unique(*args, **kwargs)

/work/mh0727/m300524/conda-envs/mistral/lib/python3.8/site-packages/numpy/lib/arraysetops.py in unique(ar, return_index, return_inverse, return_counts, axis)
    259
    260     """
--> 261     ar = np.asanyarray(ar)
    262     if axis is None:
    263         ret = _unique1d(ar, return_index, return_inverse, return_counts)

/work/mh0727/m300524/conda-envs/mistral/lib/python3.8/site-packages/numpy/core/_asarray.py in asanyarray(a, dtype, order)
    136
    137     """
--> 138     return array(a, dtype, copy=False, order=order, subok=True)
    139
    140

MemoryError: Unable to allocate 21.5 GiB for an array with shape (134126174,) and data type ...
```

Solution?
Use `dask.dataframe` instead of `pandas.DataFrame`? Is the unique call strictly needed at import?
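A minimal sketch of the chunked idea behind that question (pure pandas here; `dask.dataframe` would do the equivalent per partition): stream the CSV in chunks and keep a running set of uniques, so no single in-memory array ever holds every value at once.

```python
import io
import pandas as pd

# Synthetic stand-in for the 2.9-million-row catalogue CSV.
csv = io.StringIO("variable_id\n" + "\n".join(["tas", "pr", "tas", "psl"] * 5))

# Each chunk is deduplicated independently; only distinct values accumulate.
uniques = set()
for chunk in pd.read_csv(csv, chunksize=4):
    uniques.update(chunk["variable_id"].unique())

print(sorted(uniques))  # ['pr', 'psl', 'tas']
```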