vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] vaex.dataframe.DataFrameLocal: memory increases after calling mean #1308

Open sungreong opened 3 years ago

sungreong commented 3 years ago

data.mean(columns)
[data[col].mean() for col in columns] 
data.mean(columns, delay=True)

I just want to compute statistics, but each of these calls increases memory usage.

I don't understand why.

JovanVeljanoski commented 3 years ago

Hi,

You probably just want to do df.mean(columns)

It might be a good idea to read through the tutorial to get a better sense of how vaex works. It does not always behave like pandas, even though many methods have the same names. It is a completely different implementation with different concepts.

Also, docstrings are your friend.

sungreong commented 3 years ago

Hi

What I'm curious about is this: once the computation has pulled data into memory, how do I get back to a state that doesn't use that memory?

df.mean(columns)  # memory increases by 6GB
??  # 6GB -> 0GB: how do I delete the cache?
JovanVeljanoski commented 3 years ago

Well you want to have your data in hdf5/arrow file format (parquet works too).

Vaex will cache some data to speed up computations. But that is just temporary. If the OS needs the memory for whatever reason, it will be released right away.
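
To illustrate why file-backed data behaves this way, here is a minimal sketch that uses `numpy.memmap` rather than vaex itself (a stand-in chosen for illustration; it does not show vaex internals). The helper `chunked_mean`, the file name `column.bin`, and the chunk size are all hypothetical. The file is mapped rather than loaded, only one chunk at a time is read into the process, and the pages that were touched stay in the OS page cache, which the OS can reclaim when it needs the memory.

```python
import os
import tempfile

import numpy as np


def chunked_mean(path, n_rows, chunk_rows=100_000, dtype=np.float64):
    """Mean of a 1-D array stored as raw binary on disk, read chunk by chunk."""
    # Map the file instead of loading it; no data is read at this point.
    data = np.memmap(path, dtype=dtype, mode="r", shape=(n_rows,))
    total = 0.0
    for start in range(0, n_rows, chunk_rows):
        # Only the pages backing this slice are read while summing.
        total += float(data[start:start + chunk_rows].sum())
    del data  # drop the mapping; the OS decides when cached pages are freed
    return total / n_rows


# Hypothetical demo data: one million float64 values written to a temp file.
n_rows = 1_000_000
path = os.path.join(tempfile.mkdtemp(), "column.bin")
np.arange(n_rows, dtype=np.float64).tofile(path)

result = chunked_mean(path, n_rows)
```

Memory reported for the process can look high after such a pass because cached pages are counted, even though nothing needs to be freed by hand.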

Also, for questions about usage, tips, and so on, it may be better to use the discussion board.

sungreong commented 3 years ago

Thank you. So what you're saying is that the memory is used only for a while, and the cache is released automatically afterwards?

Oh I didn't know the existence of the board. Where is the board?

Leo-Ji2020 commented 3 years ago

Vaex may use the memory only temporarily, but when it consumes more memory than the machine actually has, the job is killed and cannot finish. I submitted an issue about this several days ago [https://github.com/vaexio/vaex/issues/1304]. My current workaround is to cut my data domain into several small pieces, calculate them one by one, and concatenate the results.

sungreong commented 3 years ago

Sorry, can I get an example sample code?

Leo-Ji2020 commented 3 years ago

I'm not sure my example fits your problem, so I'll just describe the idea. Suppose you have a 2D domain of size (4000, 2000), and the result in each grid cell does not depend on any other cell. Then you can cut the domain into small pieces along one direction, for example pieces of size (500, 2000). I wrote a function to do the calculation. Suppose you want to calculate the median value in each grid cell; here is an example.

`
import numpy as np

def calc_slice_combination(Nslice, dataset, varname, bins, lim_slice, shp, selection=False, dtype='float32'):
    # Compute the approximate median on each slice, then concatenate along the first axis.
    arr = np.concatenate([dataset.median_approx(dataset[varname], binby=bins, limits=lim_slice[islice], shape=shp, selection=selection).astype(dtype) for islice in range(Nslice)], axis=0)
    return arr

size = [4000, 2000]
single_len = 500
Nslice = size[0] // single_len  # integer division, so it can be used as a loop count and index
bins = ['lon', 'lat']
# Use size - 0.9999 as the upper limit; it is used to group your data into the grid.
# You cannot use size - 1, because then the last row of your data is not included in the statistics.
lims = [[0, size[0]-0.9999], [0, size[1]-0.9999]]
intl_xlim = single_len  # not defined in the original post; assumed to be the slice length
lim_slice = [[[lims[0][0]+idx*intl_xlim, (idx+1)*intl_xlim-1+0.001], lims[1]] for idx in range(Nslice)]
shp = (single_len, size[1])
calc_slice_combination(Nslice, YOUR_DATASET, VARIABLE_COLUMN_NAME_IN_YOUR_DATASET, bins, lim_slice, shp)
`

Hope it helps.
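
The slice-and-concatenate idea can also be checked without vaex. The sketch below is plain NumPy and uses a per-cell mean (via `numpy.histogram2d`) instead of the approximate median, because NumPy has no binned median. The function names, grid size, and random data are all made up for illustration. It only shows that per-slice results concatenate to the full-grid result when cells are independent; how much memory this saves depends on the library doing the binning.

```python
import numpy as np


def binned_mean(x, y, values, limits, shape):
    """Mean of `values` in each (x, y) grid cell; NaN where a cell is empty."""
    sums, _, _ = np.histogram2d(x, y, bins=shape, range=limits, weights=values)
    counts, _, _ = np.histogram2d(x, y, bins=shape, range=limits)
    with np.errstate(invalid="ignore", divide="ignore"):
        return sums / counts


def binned_mean_in_slices(x, y, values, limits, shape, n_slices):
    """Same grid as binned_mean, computed one x-slice at a time."""
    (x_min, x_max), y_lim = limits
    edges = np.linspace(x_min, x_max, n_slices + 1)
    rows_per_slice = shape[0] // n_slices
    parts = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Points outside [lo, hi] are ignored by histogram2d via `range`.
        part = binned_mean(x, y, values, [[lo, hi], y_lim],
                           (rows_per_slice, shape[1]))
        parts.append(part)
    return np.concatenate(parts, axis=0)


# Made-up demo data on a 40 x 20 grid, split into 4 slices of 10 rows each.
rng = np.random.default_rng(0)
n = 10_000
x = rng.uniform(0, 40, n)
y = rng.uniform(0, 20, n)
values = rng.normal(size=n)

limits = [[0, 40], [0, 20]]
full = binned_mean(x, y, values, limits, (40, 20))
sliced = binned_mean_in_slices(x, y, values, limits, (40, 20), n_slices=4)
same = np.allclose(full, sliced, equal_nan=True)
```

One caveat of this sketch: a point lying exactly on an interior slice boundary would be counted in both neighbouring slices, which is the same edge problem the `-0.9999` and `+0.001` offsets above work around.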