janpipek / physt

Python histogram library - histograms as updateable, fully semantic objects with visualization tools. [P]ython [HYST]ograms.
MIT License

Memory efficiency problem #33

Open janpipek opened 7 years ago

janpipek commented 7 years ago

When creating a histogram from huge data, a temporarily huge amount of memory is allocated, even though no copy should be created.

Suspects:

janpipek commented 7 years ago

Better in 0.3.26, where the overhead is lowered by 66 % (by removing unnecessary flattens and weight allocations).

Sh4zKh4n commented 4 years ago

What would you class as huge data? I am interested in using the package; it looks like an interesting option. I have a 300 GB ND array (3-dim) that I hold as a chunked dask array. Every time I go to run a histogram, it falls over at the last hurdle. The only way I could see this working was by dynamically updating a file, which I started to write, but for me it wasn't trivial! Until I noticed this package and others.

So I just wondered what you (and the notes) would class as big data, as opposed to merely large?

janpipek commented 4 years ago

Hi @Sh4zKh4n, if you sequentially fill the histogram, you should not have a problem with a file of any size. The memory problem was more related to processing one big chunk at a time (which is impossible in your case anyway ;-)). fill_n is your friend. You don't even need to know the number of bins in advance, as I document in https://github.com/janpipek/physt/blob/master/doc/adaptive_histogram.ipynb .
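The chunk-by-chunk idea can be sketched without physt at all. The following NumPy stand-in (a sketch only, not physt's actual `fill_n` implementation, and with hypothetical chunk sizes) shows why peak memory stays at roughly one chunk: only the bin counts and the current chunk are ever resident.

```python
import numpy as np

# Fixed bin edges chosen up front. (physt's adaptive histograms remove
# even this requirement; plain NumPy needs edges fixed in advance.)
edges = np.linspace(-5.0, 5.0, 51)
counts = np.zeros(len(edges) - 1, dtype=np.int64)

rng = np.random.default_rng(42)
for _ in range(10):                      # pretend each chunk is read from disk
    chunk = rng.normal(size=100_000)     # stand-in for one dask chunk
    c, _ = np.histogram(chunk, bins=edges)
    counts += c                          # accumulate; peak memory ~ one chunk
```

With a chunked dask array, iterating over `array.blocks` and feeding each materialized block to `fill_n` (or to an accumulator like the one above) follows the same pattern.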

Let me know if you spot any problem or any ideas for improvement.

Sh4zKh4n commented 4 years ago

@janpipek So that's exactly the type of thing I was looking for. Is there a way to save to a file instead of holding everything in memory? Thanks, I'll have a go at it later on my data set (once I take a break from daddy day care duties; trying to do any coding with a 3-year-old and a 6-month-old is a nightmare.)

Sh4zKh4n commented 4 years ago

I should be clearer about that. What I mean is: can I dynamically update a table, like a pandas file, with values? That way I can come back to the analysis later and also keep the memory footprint down. Cheers, I do appreciate the quick response.

janpipek commented 4 years ago

If I understand correctly, you want to be able to calculate the histogram once (or in multiple steps) and then re-use it a few times. Sure, histograms can be stored and loaded to/from JSON format. An example of how I would approach huge data is here: https://github.com/janpipek/physt/blob/master/doc/interrupted-workflow.ipynb Hope that answers your question :-)
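The reason JSON persistence is cheap is that a filled histogram's state is tiny (edges plus counts), regardless of how much data was streamed into it. This hand-rolled round-trip is an illustration of that idea only, not physt's actual serialization format; `save_hist`/`load_hist` are hypothetical helpers:

```python
import json
import os
import tempfile
import numpy as np

def save_hist(path, edges, counts):
    # The full state of a fixed-bin histogram is just edges + counts,
    # so the file stays tiny no matter how much data was streamed in.
    with open(path, "w") as f:
        json.dump({"edges": list(edges), "counts": [int(c) for c in counts]}, f)

def load_hist(path):
    with open(path) as f:
        d = json.load(f)
    return np.asarray(d["edges"]), np.asarray(d["counts"], dtype=np.int64)

edges = np.linspace(0.0, 1.0, 11)
counts, _ = np.histogram(np.random.default_rng(0).random(1000), bins=edges)

path = os.path.join(tempfile.mkdtemp(), "hist.json")
save_hist(path, edges, counts)
edges2, counts2 = load_hist(path)
```

After the round-trip, `counts2` is identical to `counts`, so the analysis can be resumed later without touching the raw data again.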

Sh4zKh4n commented 4 years ago

So @janpipek, oh, that's nearly there. I'd like to combine the two solutions you have: dynamic updating and saving to file!
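Combining the two amounts to checkpointing the accumulated counts to disk after each chunk, so an interrupted run can resume where it left off. A minimal sketch of that loop, assuming a hypothetical checkpoint file and chunk source (this is not physt API, just the pattern):

```python
import json
import os
import tempfile
import numpy as np

# Hypothetical checkpoint location for this sketch.
CHECKPOINT = os.path.join(tempfile.mkdtemp(), "hist_checkpoint.json")
EDGES = np.linspace(-5.0, 5.0, 51)
N_CHUNKS = 5

def load_or_init():
    # Resume from a previous run if a checkpoint exists.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            d = json.load(f)
        return np.asarray(d["counts"], dtype=np.int64), d["n_chunks_done"]
    return np.zeros(len(EDGES) - 1, dtype=np.int64), 0

def checkpoint(counts, n_done):
    with open(CHECKPOINT, "w") as f:
        json.dump({"counts": counts.tolist(), "n_chunks_done": n_done}, f)

counts, start = load_or_init()
rng = np.random.default_rng(7)
for i in range(start, N_CHUNKS):      # pretend each chunk comes from the dask array
    chunk = rng.normal(size=10_000)
    c, _ = np.histogram(chunk, bins=EDGES)
    counts += c
    checkpoint(counts, i + 1)         # safe to interrupt after any chunk
```

The same pattern works with physt histograms in place of the raw counts array, using their JSON export in `checkpoint` instead of the hand-rolled dict.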