Open janpipek opened 7 years ago
Better in 0.3.26, where the overhead is lowered by 66 % (unnecessary flattens and weights).
What would you class as huge data? I am interested in using the package; it looks like an interesting option. I have a 300 GB N-dimensional array (3 dims) that I hold as a chunked dask array. Every time I go to run a histogram, it falls over at the last hurdle. The only way I could see this working was dynamically updating a file, which I started to write, but for me it wasn't trivial! Until I noticed this package and others.
So I just wondered what (in the notes) you would class as big, as opposed to large, data?
Hi @Sh4zKh4n, if you fill the histogram sequentially, you should not have a problem with a file of any size; the memory problem was more related to processing one big chunk at a time (which is impossible in your case anyway ;-)). `fill_n` is your friend. You don't even need to know the number of bins in advance, as I document in https://github.com/janpipek/physt/blob/master/doc/adaptive_histogram.ipynb .
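To make the sequential-fill idea concrete, here is a minimal pure-NumPy sketch of the same pattern: process the data chunk by chunk and accumulate counts into a fixed set of bins, so only one chunk is ever in memory at a time. (physt's `fill_n` adds conveniences such as adaptive binning on top of this; the chunk sizes and bin edges below are arbitrary illustration values, not anything from the thread.)

```python
import numpy as np

# Fixed binning chosen up front for the sketch; physt can also grow
# bins adaptively, which this simple version does not attempt.
edges = np.linspace(-5.0, 5.0, 51)            # 50 fixed-width bins
counts = np.zeros(len(edges) - 1, dtype=np.int64)

rng = np.random.default_rng(0)
for _ in range(10):                           # stand-in for iterating dask chunks
    chunk = rng.normal(size=100_000)          # only this chunk is in memory
    chunk_counts, _ = np.histogram(chunk, bins=edges)
    counts += chunk_counts                    # accumulate into the running totals
```

With a dask array you would loop over `array.blocks` (or `to_delayed()`) and feed each realized chunk to the same accumulation step, keeping the peak memory bounded by one chunk.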
Let me know if you spot any problem or any ideas for improvement.
@janpipek So that's exactly the type of thing I was looking for. Is there a way to save to a file instead of holding it in memory? Thanks, I'll have a go at it later on my data set (once I take a break from daddy day-care duties; trying to do any coding with a 3-year-old and a 6-month-old is a nightmare).
I should be clearer about that: what I mean is to dynamically update a table, like a pandas file, with values, so you can come back to the analysis later and also keep the memory footprint down. Cheers, I do appreciate the quick response.
If I understand correctly, you want to be able to calculate the histogram once (or in multiple steps) and then re-use it a few times. Sure, histograms can be stored to and loaded from JSON format. An example of how I would approach huge data is here: https://github.com/janpipek/physt/blob/master/doc/interrupted-workflow.ipynb Hope that answers your question :-)
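The interrupted-workflow idea can be sketched without physt at all: persist the histogram's state (bin edges plus counts) as JSON, reload it in a later session, and keep filling. physt has its own JSON format for this (as shown in the linked notebook); the helper names and file path below are made up for illustration.

```python
import json
import numpy as np

def save_histogram(path, edges, counts):
    """Persist a histogram's full state so a later session can resume it."""
    with open(path, "w") as f:
        json.dump({"edges": edges.tolist(), "counts": counts.tolist()}, f)

def load_histogram(path):
    """Restore edges and counts exactly as they were saved."""
    with open(path) as f:
        state = json.load(f)
    return np.array(state["edges"]), np.array(state["counts"], dtype=np.int64)

# Session 1: fill from some data, then save and stop.
edges = np.linspace(0.0, 1.0, 11)
counts, _ = np.histogram(np.random.default_rng(1).random(1000), bins=edges)
save_histogram("hist_state.json", edges, counts)

# Session 2 (possibly days later): reload and continue filling.
edges2, counts2 = load_histogram("hist_state.json")
more, _ = np.histogram(np.random.default_rng(2).random(500), bins=edges2)
counts2 += more
```

Because the saved state is just edges and counts, the memory footprint between sessions is the size of the histogram, not the size of the 300 GB source array.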
So @janpipek, oh, that's nearly there! I'd like to combine the two solutions you have: dynamic updating and saving to file!
When creating a histogram from huge data, a huge amount of memory is temporarily allocated, even though no copy should be created.
Suspects: