scikit-hep / hist

Histogramming for analysis powered by boost-histogram
https://hist.readthedocs.io
BSD 3-Clause "New" or "Revised" License

[BUG] MemoryError: std::bad_alloc #335

Open reckoner opened 3 years ago

reckoner commented 3 years ago

Describe the bug

The file I'm trying to process is not that big, but I keep getting this error.

Steps to reproduce

>>> import pandas as pd
>>> from hist import Hist
>>> x = pd.read_csv('agaricus-lepiota.csv')
>>> Hist.from_columns(x, x.columns)

Attached: agaricus-lepiota.csv (the data file that triggers the error).

henryiii commented 3 years ago
>>> import pandas as pd
>>> import math
>>> 
>>> x = pd.read_csv('/Users/henryschreiner/Downloads/agaricus-lepiota.csv')
>>> lens = [len(set(x[s])) for s in x]
>>> # [2, 6, 4, 10, 2, 9, 2, 2, 2, 12, 2, 5, 4, 4, 9, 9, 1, 4, 3, 5, 9, 6, 7]
>>> tot = math.prod(lens)
>>> # You need:
>>> print(f"\nYou need {tot*8/1000**3:,.0f} GB of memory to store this histogram, assuming 8 bytes each cell (doubles)")

You need 1,950,397 GB of memory to store this histogram, assuming 8 bytes each cell (doubles)
reckoner commented 3 years ago

What about integers instead of floats, or some smaller type?

jpivarski commented 3 years ago

The issue is not enough memory, so int64 and double are equivalent (8 bytes each), and int32 and float are equivalent (4 bytes each). To make real savings, you'd have to go down to int16 or int8. But those small integer types are slower in math operations and (more importantly) roll over quickly:

root [0] int16_t x = 32765
(short) 32765
root [1] ++x
(short) 32766
root [2] ++x
(short) 32767
root [3] ++x
(short) -32768   // negative?!?

You probably want to consider unequal bin sizes or otherwise sparsifying the space (i.e., using fewer bins to represent the same data), which ties directly to the physics application, rather than dealing with these technical computing issues.
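
For illustration, a minimal pandas sketch of that "fewer bins" idea, using the attached CSV; the lump_rare helper and the 5% threshold are hypothetical, just to show how collapsing rare categories shrinks each axis before histogramming:

import pandas as pd

x = pd.read_csv("agaricus-lepiota.csv")

def lump_rare(col, min_frac=0.05):
    # Replace categories rarer than min_frac of rows with a single "other" bucket.
    freq = col.value_counts(normalize=True)
    keep = freq[freq >= min_frac].index
    return col.where(col.isin(keep), "other")

x_small = x.apply(lump_rare)                      # applied column by column
print([x_small[c].nunique() for c in x_small])    # far fewer bins per axis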

reckoner commented 3 years ago

@jpivarski Thanks for your reply. BTW, I am a big fan of your work on Awkward Array. This is not a physics application, so I'm not sure what else to do here. The data I have is categorical, and I have a lot of it: 300-400 columns, with 5-10 categories each!

henryiii commented 3 years ago

I don't think there's much issue with rollover in such a large space. We do have an Unlimited storage type, which starts at 8-bit ints and grows as needed, but using 8x less memory per cell doesn't really have any impact on the problem. There has been some thought given to sparse data storage, but we don't have that yet. We also have a hard limit of 32 axes, since 2^32 cells is already more memory than anyone should be using, but also for technical reasons we can't leave it[^1] unlimited and compute a fill method.

[^1]: "It" being the number of axes present in the histogram.
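
For reference, a minimal sketch of choosing a storage type when building a histogram by hand; the axis names and fill values below are made up, and Int64/Unlimited are assumed to follow the usual hist/boost-histogram storage API:

import hist

h = hist.Hist(
    hist.axis.StrCategory([], name="col_a", growth=True),
    hist.axis.StrCategory([], name="col_b", growth=True),
    storage=hist.storage.Int64(),   # or hist.storage.Unlimited()
)
h.fill(col_a=["x", "y", "x"], col_b=["p", "p", "q"])
print(h.view())   # integer counts, 8 bytes per cell for Int64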

What sort of analysis are you planning to do with the data? Usually histograms are used to reduce data, not expand it. Once you know what you are doing, you might find that one histogram per 1-4 columns tells you what you need to know much better than anything built from the entire dataset.

reckoner commented 3 years ago

We are doing a reduction on some medical "encounter" data (i.e., not images or other sensor data). Your point is on target: I would love to find 1-4 columns that can summarize the data, but I don't have a model that tells me which four or five columns those should be, so I have to paw around for meaningful subsets (i.e., smaller histograms) that have enough data to be statistically significant.

Thank you!

henryiii commented 3 years ago

Because your dataset is small compared to the histogram size, doing Hist.from_columns(x, ["col1", "col2", "col3"]) and then playing with that is probably much better than trying to do the whole thing and then taking projections. If your dataset were large and your histogram small, then doing it in that order would be much better.
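
Something like the following, where the column names are placeholders for whichever few columns you care about:

import pandas as pd
from hist import Hist

x = pd.read_csv("agaricus-lepiota.csv")

# Histogram only a few columns at a time...
h = Hist.from_columns(x, ["col1", "col2", "col3"])

# ...and reduce further by projecting onto the axes of interest.
h2 = h.project("col1", "col2")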

eduardo-rodrigues commented 3 years ago

Hi @reckoner, just a thought: what if you started by visualising the data in high-ish dimensions to look for patterns? See https://inspirehep.net/literature/1832216 for an example. I also wonder whether it would be worth making profile histograms to check which variables are more constant than others, or follow some rather simple model, so you could effectively "factor those out".

jpivarski commented 3 years ago

The CSV file has 23 columns, and each column has 2 through 12 unique values: mean of 5.2, stdev of 3.1. Say you actually have 400 columns but roughly the same number of unique values in each column. If you were to split the categorical values into indicator variables (0 or 1 for each unique value), that would most likely be 2080 boolean variables, and at most (95% C.L.!) 4560 boolean variables. As a bitmask, that's 260 (at most 570) bytes per row. Unless you have more than 20 million rows, representing the whole thing in memory won't be a problem.
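
A rough pandas/NumPy sketch of that indicator encoding and bitmask size estimate, using the attached 23-column CSV as a stand-in for the real 300-400 column dataset:

import numpy as np
import pandas as pd

x = pd.read_csv("agaricus-lepiota.csv")

# One boolean (indicator) column per unique value of each categorical column.
indicators = pd.get_dummies(x)

# Pack each row of booleans into a bitmask: 8 indicators per byte.
bits = np.packbits(indicators.to_numpy(dtype=bool), axis=1)
print(indicators.shape, bits.nbytes / 1e6, "MB for the whole table")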

This is sounding like an excellent candidate for machine learning: maybe some clustering or other unsupervised learning if you don't have an objective function, or decision trees or neural nets if you do, rather than manually pawing around with histograms. Maybe also the high-dimensional visualization that @eduardo-rodrigues recommended, or self-organizing maps or principal component analysis to bring it down to those 1-4 quantities (likely a combination of many columns in each quantity, linear or non-linear).

You got so many people to chime in because it sounds like an interesting problem: it's the kind of thing where you can just open up the Scikit-Learn documentation and go shopping for a technique.

reckoner commented 3 years ago

@henryiii Excellent suggestion. I was just pondering that option. @eduardo-rodrigues Thanks for the reference. I will look into it. @jpivarski Are you saying that some of the bh.axis axes should be of type Boolean?

Let me look into those clustering techniques. I have done similar before but I need to preserve certain aspects for the statistical learning task at hand and not go too black-box. The ultimate consumers of this work are medical staff who need to be able to interpret and use the results in a non-research environment.

Thanks again, all!

jpivarski commented 3 years ago

It doesn't sound like a problem where histograms would help: they're best for reducing continuous variables (too much data: higher entropy than meaningful distinctions), but you have a huge number of non-continuous categories. So I'm not suggesting making boolean bh.axes because I'm not suggesting histograms at all.

Indicator variables are a common way of dealing with categorical data, especially if the number of unique values << the number of values, as it seems to be from your CSV file. Turning 1 column of N unique values into N boolean columns increases the number of columns even more, but not too much if the number of unique values is small. Then with all variables having boolean type, more ML techniques become applicable.

PCA, self-organizing maps, and maybe the thing @eduardo-rodrigues suggested (I only read the abstract) are more like "grey boxes": they reduce the data in a meaningful way but still show you images you can inspect, rather than just throwing it into an ML black box and hoping for the best. But even ML techniques have ways of checking up on what they're doing. Clustering has "explained variance", for instance: you can tell whether it's making meaningful clusters, even if they can't be visualized directly.
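
A rough scikit-learn sketch of those sanity checks (PCA explained variance and k-means inertia) on the one-hot-encoded CSV; the number of components and the choices of k are arbitrary:

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

x = pd.read_csv("agaricus-lepiota.csv")
X = pd.get_dummies(x).to_numpy(dtype=float)

# How much variance the first few principal components explain.
pca = PCA(n_components=4).fit(X)
print(pca.explained_variance_ratio_)

# Within-cluster sum of squares (inertia) for a few cluster counts:
# a crude check on whether the clustering is meaningful.
for k in (2, 4, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)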

But visualizing 200‒400 dimensions is already a non-starter, so...