jcmgray / xyzpy

Efficiently generate and analyse high dimensional data.
http://xyzpy.readthedocs.io
MIT License

Accessing intermediate results #7

Closed AdrianSosic closed 5 years ago

AdrianSosic commented 5 years ago

Hi jcmgray,

this package is really fantastic; it solves exactly the problems that I've been struggling with for years! Thanks for your work!

I've only just started using the package, though, and I have a question concerning batch processing: is there any straightforward way to access intermediate results of the computation by storing them on disk? I've thought about two ways in particular:

1) Accessing the on-disk dataset created by the harvester. However, by default, the dataset is created only after all combos are evaluated. Is there some workaround or flag to set?

2) Using the crop functionality. However, I cannot reap the results during the computation, since it gives the following error: "This crop is not ready to reap yet - results are missing".

Any thoughts?

jcmgray commented 5 years ago

@AdrianSosic

this package is really fantastic; it solves exactly the problems that I've been struggling with for years! Thanks for your work!

Thanks! Glad it's useful.


With regard to your query: it would definitely be a nice feature to have, but nothing is implemented yet. For harvesting, if you want some kind of staged progress, you could manually loop over some of the values in combos:

combos = {'a': range(10), 'b': range(10)}
for a_val in combos.pop('a'):
    harvester.harvest_combos({**combos, 'a': [a_val]})

A kwarg like save_every={'a': 2} might be a nice API to do this automatically (i.e. slice the 'a' values in steps of 2).
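To make the save_every idea concrete, here is a minimal sketch in plain Python. The helper chunked_combos is hypothetical and not part of xyzpy; it only splits the values of one combo key into slices, each of which could then be passed to harvester.harvest_combos in turn, as in the loop above.

```python
def chunked_combos(combos, save_every):
    """Yield copies of ``combos`` with one key sliced into chunks.

    Hypothetical helper, not part of xyzpy. ``save_every`` is assumed to
    map exactly one key to a chunk size, e.g. ``{'a': 2}``.
    """
    (key, step), = save_every.items()
    values = list(combos[key])
    for start in range(0, len(values), step):
        # All other keys are passed through unchanged.
        yield {**combos, key: values[start:start + step]}


combos = {'a': range(10), 'b': range(10)}
chunks = list(chunked_combos(combos, {'a': 2}))

# Each chunk could then be harvested separately, for example:
#     for chunk in chunks:
#         harvester.harvest_combos(chunk)
```

With ten values of 'a' and a step of 2, this yields five chunks, so results would be saved five times during the run rather than once at the end.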

On the other hand, Crop feels like the more natural place to put functionality relating to this kind of persistence. One approach here would be a reap method that defaults to an all-nan result for missing results and, of course, doesn't delete the disk data afterwards. Every time you ran it, it would simply merge in whatever new data had arrived.
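The all-nan default can be illustrated without any xyzpy internals. The sketch below uses plain Python and a hypothetical helper, assemble_partial; it is not how Crop is implemented, only a picture of the idea that unfinished combos are reported as NaN and get filled in on a later call.

```python
import math


def assemble_partial(expected_keys, finished):
    """Return results in order, with NaN standing in for missing ones.

    Hypothetical helper: ``finished`` maps a combo to its result for
    the batches that have completed so far.
    """
    return [finished.get(key, math.nan) for key in expected_keys]


expected = [(1, 1), (1, 2), (1, 3)]  # all (a, b) combos that were sown
finished = {(1, 1): 2, (1, 2): 3}    # (1, 3) is still running

partial = assemble_partial(expected, finished)

# Later, once the last batch is done, the same call picks up the new data.
finished[(1, 3)] = 4
complete = assemble_partial(expected, finished)
```

Because nothing is deleted between calls, the function can be run repeatedly while the computation is still in progress.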

Would either of those suit?

AdrianSosic commented 5 years ago

Hey again, thanks for your immediate answer!

Concerning the first idea: Yes, in principle it solves the problem. Yet, it comes with a number of drawbacks:

Concerning the second idea: Yes, this would be a perfectly suited solution. Are you planning to add this feature in the future? Nevertheless, it would be nice to have a simple solution like the first one in addition, especially for working only on one machine.

jcmgray commented 5 years ago

For the moment I think it would make sense to add this functionality to Crop, both for the reasons you list and I think just conceptually.

It should also be quite easy; the only slightly tricky part may be inferring the sequence of shapes of the all-nan result. I might try adding something in the next few days, unless you want to give it a try?

AdrianSosic commented 5 years ago

I wish I could contribute but, as I said, I've just started using xarray/xyzpy and don't feel confident enough to work on the underlying code since I have not yet fully understood all details of the packages =/ Maybe I can help in the future when I have more experience with them!

jcmgray commented 5 years ago

Of course, no worries. I would like to use this functionality, so I will add it shortly. If you have any preferences or ideas for the API, let me know. I was thinking of something along the lines of:

crop.reap(allow_incomplete=True)

jcmgray commented 5 years ago

@AdrianSosic I've added this functionality in https://github.com/jcmgray/xyzpy/commit/2ce489545c5dee86ae8d1cc576dfbc00f018de5b. If you get a chance, let me know if it's working for you.

AdrianSosic commented 5 years ago

Hey, thanks a lot for adding the functionality. I think the allow_incomplete option should be fine. One issue that could cause problems is when the function itself returns np.nan for some inputs. An alternative might be to use masked arrays (or to provide the option to choose the default value).
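The ambiguity can be shown with a small plain-Python sketch. The helper assemble_with_mask is hypothetical and not part of xyzpy; it keeps an explicit "missing" mask next to the values, which is the same idea that masked arrays (for example numpy.ma) provide natively.

```python
import math


def assemble_with_mask(expected_keys, finished):
    """Return (values, missing) so that a computed NaN is not mistaken
    for a result that simply has not been produced yet.

    Hypothetical helper for illustration only.
    """
    values = [finished.get(key, math.nan) for key in expected_keys]
    missing = [key not in finished for key in expected_keys]
    return values, missing


# The function itself returned NaN for 'x1'; 'x2' has not finished yet.
finished = {'x0': 1.5, 'x1': math.nan}
values, missing = assemble_with_mask(['x0', 'x1', 'x2'], finished)
```

Both 'x1' and 'x2' show up as NaN in values, but only 'x2' is flagged in missing, so the two cases stay distinguishable.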

However, I am getting some unexpected behavior:

Any thoughts?

jcmgray commented 5 years ago

Hmm, if you could provide minimal examples that would be enormously helpful - as well as debugging they would be good starting points for unit tests.

With regard to using np.nan, this is really set because it's what pandas and xarray use for missing data. Things like merging datasets would be much trickier if another value was used.

AdrianSosic commented 5 years ago

Sure, here is an example:

Run the following code to let the seeds grow:

import xyzpy as xyz
import xarray as xr
from time import sleep

def fn(a, b):
    if b == 3:
        sleep(10000)
    y = xr.Dataset({'sum': a+b, 'diff': a-b})
    return y

combos = dict(
    a=[1],
    b=[1, 2, 3]
)

runner = xyz.Runner(fn, var_names=None)
harvester = xyz.Harvester(runner, 'test.h5')
crop = harvester.Crop(name='fn', batchsize=1)
crop.sow_combos(combos)
for i in range(1, 4):
    crop.grow(i)

If you then, while the code is running, access the intermediate results via

import xyzpy as xyz

c = xyz.Crop(name='fn', batchsize=1)
X = c.reap(allow_incomplete=True)

you get as output a one-element tuple containing a tuple of Datasets.

Moreover, if you remove the kwarg batchsize=1 in the latter crop, you receive the error "can't multiply sequence by non-int of type 'NoneType'".

jcmgray commented 5 years ago

Thanks for the example! I think both are fixed by automatically loading on-disk information if it exists for any new crop. I'll push an update once I have a test in place shortly.

jcmgray commented 5 years ago

@AdrianSosic, this should be fixed in https://github.com/jcmgray/xyzpy/commit/1eb2b1dd04f3ef15724aef745915ae118a493e37, let me know if it's not working for you.

AdrianSosic commented 5 years ago

@jcmgray, great, it seems to work perfectly, thanks! If I encounter any other issues, I will let you know!