intake / intake-esm

An intake plugin for parsing an Earth System Model (ESM) catalog and loading assets into xarray datasets.
https://intake-esm.readthedocs.io
Apache License 2.0

intake.open_esm_datastore(huge_multi_var_catalogue) MemoryError #312

Closed aaronspring closed 3 years ago

aaronspring commented 3 years ago

Description

I created a CSV catalogue with 2,943,214 entries (one entry per file) of non-CMORized data: multi-variable, 1-year files with up to 200 variables each. When opening it, I get a MemoryError caused by the unique-value checking.

What I Did

```python
import intake  # most recent intake-esm version
import ast

col = intake.open_esm_datastore(
    "path/catalogue.json",
    csv_kwargs={"converters": {"variable_id": ast.literal_eval}},
)
```
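
For context, `variable_id` holds a whole list of variables per file in this catalogue, so each CSV cell is a stringified Python list and the `ast.literal_eval` converter turns it back into a real list. A small illustration (the cell content below is made up):

```python
import ast

cell = "['tas', 'pr', 'psl']"  # hypothetical variable_id cell from the CSV
print(ast.literal_eval(cell))  # -> ['tas', 'pr', 'psl']
```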
```python
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
/work/mh0727/m300524/conda-envs/mistral/lib/python3.8/site-packages/IPython/core/formatters.py in __call__(self, obj)
    916                 method = get_real_method(obj, self.print_method)
    917                 if method is not None:
--> 918                     method()
    919                     return True
    920

/work/mh0727/m300524/conda-envs/mistral/lib/python3.8/site-packages/intake_esm/core.py in _ipython_display_(self)
    535         from IPython.display import HTML, display
    536
--> 537         contents = self._repr_html_()
    538         display(HTML(contents))
    539

/work/mh0727/m300524/conda-envs/mistral/lib/python3.8/site-packages/intake_esm/core.py in _repr_html_(self)
    524         Mainly for IPython notebook
    525         """
--> 526         uniques = pd.DataFrame(self.nunique(), columns=['unique'])
    527         text = uniques._repr_html_()
    528         output = f'...{self.esmcol_data["id"]} catalog with {len(self)} dataset(s) from {len(self.df)} asset(s):...{text}'

/work/mh0727/m300524/conda-envs/mistral/lib/python3.8/site-packages/intake_esm/core.py in nunique(self)
    760         """
    761
--> 762         uniques = self.unique(self.df.columns.tolist())
    763         nuniques = {}
    764         for key, val in uniques.items():

/work/mh0727/m300524/conda-envs/mistral/lib/python3.8/site-packages/intake_esm/core.py in unique(self, columns)
    818
    819         """
--> 820         return _unique(self.df, columns)
    821
    822     def to_dataset_dict(

/work/mh0727/m300524/conda-envs/mistral/lib/python3.8/site-packages/intake_esm/search.py in _unique(df, columns)
     16     for col in columns:
     17         values = df[col].dropna().values
---> 18         uniques = np.unique(list(_flatten_list(values))).tolist()
     19         info[col] = {'count': len(uniques), 'values': uniques}
     20     return info

<__array_function__ internals> in unique(*args, **kwargs)

/work/mh0727/m300524/conda-envs/mistral/lib/python3.8/site-packages/numpy/lib/arraysetops.py in unique(ar, return_index, return_inverse, return_counts, axis)
    259
    260     """
--> 261     ar = np.asanyarray(ar)
    262     if axis is None:
    263         ret = _unique1d(ar, return_index, return_inverse, return_counts)

/work/mh0727/m300524/conda-envs/mistral/lib/python3.8/site-packages/numpy/core/_asarray.py in asanyarray(a, dtype, order)
    136
    137     """
--> 138     return array(a, dtype, copy=False, order=order, subok=True)
    139
    140

MemoryError: Unable to allocate 21.5 GiB for an array with shape (134126174,) and data type ...
```
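
For scale, some back-of-the-envelope arithmetic from the numbers in the traceback (my own estimate, not part of the original report): flattening the list-valued `variable_id` column of all 2,943,214 rows yields the 134,126,174 elements in the failing allocation, about 45 variable names per file on average, and `np.unique` must first materialise them as one fixed-width unicode array. 21.5 GiB over 134,126,174 elements works out to roughly 170 bytes per element, consistent with a unicode dtype a few dozen characters wide (4 bytes per character). A synthetic sketch of the mechanism:

```python
import numpy as np

# Synthetic stand-in for the flattened variable_id column: many short
# strings, as produced by list(_flatten_list(values)) in search.py.
flat = ["var%03d" % (i % 200) for i in range(450_000)]

# np.unique first copies everything into one fixed-width unicode array;
# the width is that of the longest string, so memory scales with
# n_elements * 4 bytes * max_len, regardless of how few uniques exist.
arr = np.asanyarray(flat)
print(arr.shape, arr.dtype, f"{arr.nbytes / 1e6:.1f} MB")
print(len(np.unique(arr)), "unique values")
```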

Solution?

andersy005 commented 3 years ago

Thank you for a thorough report, @aaronspring! Are you able to open the CSV externally via pandas? If so, can you post the content of df.info() here?

aaronspring commented 3 years ago

Yes, that works, but I get a warning; see below:


```python
df = pd.read_csv('catalogue.csv.gz')
/work/mh0727/m300524/conda-envs/mistral/lib/python3.8/site-packages/IPython/core/interactiveshell.py:3146: DtypeWarning: Columns (1) have mixed types.Specify dtype option on import or set low_memory=False.
  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2943214 entries, 0 to 2943213
Data columns (total 11 columns):
 #   Column          Dtype  
---  ------          -----  
 0   experiment_id   object 
 1   time_range      object 
 2   stream_id       object 
 3   institution_id  object 
 4   source_id       object 
 5   model_id        object 
 6   grid_label      object 
 7   path            object 
 8   member_id       int64  
 9   dcpp_init_year  float64
 10  variable_id     object 
dtypes: float64(1), int64(1), object(9)
memory usage: 247.0+ MB
```

andersy005 commented 3 years ago

It seems that your system's Linux kernel allows overcommitting memory and NumPy is abusing that by over-allocating the amount of memory needed.

I am going to look into ways to compute unique values with (1) less involvement of NumPy and (2) a smaller memory footprint:

https://github.com/intake/intake-esm/blob/ae422c60608826ef1cb3f65d200df8d8d5ec0269/intake_esm/search.py#L18
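
A minimal sketch of one such direction (my illustration only, not necessarily what the eventual fix does): accumulate uniques in a Python set so the flat unicode array is never allocated:

```python
def unique_without_numpy(df, columns):
    # Same contract as intake_esm.search._unique, sketched without np.unique:
    # a Python set grows with the number of *distinct* values, not with the
    # ~134M flattened elements.
    info = {}
    for col in columns:
        seen = set()
        for value in df[col].dropna():
            if isinstance(value, (list, tuple, set)):
                seen.update(value)  # list-valued cells, e.g. variable_id
            else:
                seen.add(value)
        # sorted() assumes each column holds mutually comparable values
        uniques = sorted(seen)
        info[col] = {'count': len(uniques), 'values': uniques}
    return info
```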

I will get back to you once I have a working solution

andersy005 commented 3 years ago

@aaronspring,

When you get a chance, do you mind trying https://github.com/intake/intake-esm/pull/313 and letting me know how it goes?

```
python -m pip install git+https://github.com/andersy005/intake-esm.git@fix-memory-error
```

aaronspring commented 3 years ago

Great, the error doesn't occur anymore. The warning remains.

andersy005 commented 3 years ago

Awesome! Thank you, @aaronspring

Regarding the warning: it is my understanding that pandas is encountering a mix of dtypes in one of your columns during its dtype-inference heuristics. My recommendation is to enforce the dtype (if you already know it) via csv_kwargs:

```python
import ast
import intake

col = intake.open_esm_datastore(
    "path/catalogue.json",
    csv_kwargs={
        "converters": {"variable_id": ast.literal_eval},
        "dtype": {"problematic_column": "string"},  # the actual column name goes here
    },
)
```
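
If it is not obvious which column is problematic, one way (my suggestion, not from the thread) to find out is to load the CSV once with plain pandas and count the Python types in the column the warning flagged (column 1, i.e. time_range, in the df.info() listing above):

```python
import pandas as pd

df = pd.read_csv("catalogue.csv.gz", low_memory=False)

# A mix of, e.g., str and float (NaN) here is what triggers the DtypeWarning.
print(df.iloc[:, 1].map(type).value_counts())
```

After reopening the datastore with the dtype enforced, you can confirm the result via `col.df.dtypes`.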