Open sungreong opened 3 years ago
Hi,
You probably just want to do
`df.mean(columns)`
It might be a good idea to read through the tutorial to get a better sense of how vaex works. It is not always the same as pandas: even though many methods have the same names, it is a fully different implementation with different concepts.
Also, docstrings are your friend.
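For illustration, a minimal sketch using vaex's built-in example dataset (the columns `x` and `y` come from that dataset; your own column names will differ):

```python
import vaex

# vaex evaluates statistics lazily and out-of-core; the column is not
# materialized in memory the way a pandas operation would do it.
df = vaex.example()            # small built-in demo dataset
print(df.mean("x"))            # mean of a single expression
print(df.mean(["x", "y"]))     # means of several columns at once
```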
Hi
So what I'm curious about is: once the data has been loaded into memory, how do I get back to a state that doesn't use memory?
```python
df.mean(columns)  # increases memory usage by 6GB
??                # 6GB -> 0GB — how do I delete the cache?
```
Well, you want to have your data in hdf5/arrow file format (parquet works too).
Vaex will cache some data to speed up computations, but that is only temporary: if the OS needs the memory for whatever reason, it will be released right away.
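For example, a minimal sketch of the recommended workflow (the file names here are hypothetical):

```python
import vaex

# One-time conversion: read the CSV and write an hdf5 file alongside it.
df = vaex.from_csv("big_data.csv", convert="big_data.hdf5")

# From then on, open the hdf5 file directly; it is memory-mapped,
# so the data stays on disk and RAM usage stays low.
df = vaex.open("big_data.hdf5")
```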
Also, for questions about usage, tips, etc., it may be better to use the discussion board.
Thank you. So what you're saying is that memory is used temporarily, and the cache is automatically released after a while?
Oh, I didn't know that board existed. Where is it?
Vaex does use memory only temporarily; however, when it consumes more memory than the machine actually has, the job gets killed and cannot run successfully. I submitted this issue a few days ago [https://github.com/vaexio/vaex/issues/1304]. The solution I have taken is to cut my data domain into several small pieces, calculate them one by one, and concatenate the results.
Sorry, could I get some example code?
I'm not sure my example matches your problem; I'll just describe my approach. Suppose you have a 2D domain of size (4000, 2000), and the result in each grid cell has no relation to the other cells. You can then cut the domain into small pieces along one direction, like (500, 2000), and write a function that does the calculation per piece. Suppose you want to calculate the median value in each grid cell; here is an example.
```python
import numpy as np

def calc_slice_combination(Nslice, dataset, varname, bins, lim_slice, shp,
                           selection=False, dtype='float32'):
    # Compute the approximate median per grid cell, one slice at a time,
    # then stitch the slices back together along the first axis.
    arr = np.concatenate(
        [dataset.median_approx(dataset[varname], binby=bins,
                               limits=lim_slice[islice], shape=shp,
                               selection=selection).astype(dtype)
         for islice in np.arange(Nslice)],
        axis=0)
    return arr

size = [4000, 2000]
single_len = 500                 # width of one slice along the first axis
Nslice = size[0] // single_len
bins = ['lon', 'lat']
# Subtract a small epsilon from the upper limits; the limits are used to
# group your data into the grid. You cannot use size[0]-1, since the last
# row of your data would then not be included in the statistics.
lims = [[0, size[0] - 0.9999], [0, size[1] - 0.9999]]
intl_xlim = single_len           # interval length of one slice
lim_slice = [[[lims[0][0] + idx * intl_xlim, (idx + 1) * intl_xlim - 1 + 0.001], lims[1]]
             for idx in np.arange(Nslice)]
shp = (single_len, size[1])

calc_slice_combination(Nslice, YOUR_DATASET, VARIABLE_COLUMN_NAME_IN_YOUR_DATASET,
                       bins, lim_slice, shp)
```
Hope it can help.
I just want to compute statistics, but the memory usage keeps increasing...
I don't understand this...
Software information
Vaex version (`import vaex; vaex.__version__`): vaex-core 4.1.0, vaex-hdf5 0.7.0, vaex-ml 0.11.1