Accessing intermediate results

AdrianSosic commented 5 years ago

Hi jcmgray,

this package is really fantastic, it solves exactly the problems that I've been struggling with for years! Thanks for your work!

I've just started using the package, though, and I have a question concerning batch processing: is there any straightforward way to access intermediate results of the computation by storing them on the disk? I've thought about two ways in particular:

1) Accessing the on-disk dataset created by the harvester. However, by default, the dataset is created only after all combos are evaluated. Is there some workaround / flag to set?
2) Using the crop functionality. However, I cannot reap the results during the computation since it gives the following error: This crop is not ready to reap yet - results are missing

Any thoughts?

jcmgray commented 5 years ago

@AdrianSosic

> this package is really fantastic, it solves exactly the problems that I've been struggling with for years! Thanks for your work!

Thanks! Glad it's useful.

With regard to your query - it would definitely be a nice feature to have, but nothing is implemented yet. For harvesting, if you want some kind of staged progress, you could manually loop over some of the values in combos:

combos = {'a': range(10), 'b': range(10)}
for a_val in combos.pop('a'):
    harvester.harvest_combos({**combos, 'a': [a_val]})

A kwarg like save_every={'a': 2} might be a nice API to do this automatically (i.e. slice the 'a' values in steps of 2).
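As a rough sketch of what that slicing could look like from the outside, assuming a hypothetical helper chunk_combos (it is not part of xyzpy, and save_every itself does not exist yet), the 'a' values can be split into runs of two and each run harvested separately:

```python
from itertools import islice


def chunk_combos(combos, key, step):
    """Yield copies of ``combos`` with ``combos[key]`` split into runs of ``step`` values.

    Hypothetical helper for illustration only, not part of xyzpy.
    """
    if step < 1:
        raise ValueError("step must be a positive integer")
    values = iter(combos[key])
    while True:
        chunk = list(islice(values, step))
        if not chunk:
            return
        # Every other parameter keeps its full set of values.
        yield {**combos, key: chunk}


combos = {'a': range(10), 'b': range(10)}
chunks = list(chunk_combos(combos, 'a', 2))

# Each chunk could then be passed to harvester.harvest_combos(chunk),
# as in the manual loop above, so results reach the disk after every
# two values of 'a' instead of only at the very end.
```

This keeps the looping in one small function, but it is still an external workaround rather than a built-in option.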

On the other hand, Crop feels like the more natural place to put functionality relating to this kind of persistence. One approach here would be a reap method that defaults to an all-nan result for missing results and of course doesn't delete the disk data afterwards. Every time you ran this, it would happily merge in only the new data.
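To illustrate the idea only (this is a plain-Python sketch, not how Crop stores or merges its results), a partial reap can be thought of as filling every combination that has no result yet with nan. The name assemble_partial and the dictionary layout are assumptions made for the example:

```python
import math
from itertools import product


def assemble_partial(combos, finished):
    """Return {combo_tuple: value}, using NaN where no result exists yet.

    combos   -- dict mapping parameter name to a list of values
    finished -- dict mapping combo tuples to the results computed so far

    Hypothetical helper for illustration only, not part of xyzpy.
    """
    names = list(combos)
    grid = product(*(combos[name] for name in names))
    # Missing combinations get NaN; the 'finished' store is left untouched,
    # so the same call can be repeated as more results arrive.
    return {point: finished.get(point, math.nan) for point in grid}


combos = {'a': [1], 'b': [1, 2, 3]}
finished = {(1, 1): 2, (1, 2): 3}  # (1, 3) is still running
partial = assemble_partial(combos, finished)
```

Calling it again after more results have been stored simply replaces the corresponding nan entries.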

Would either of those suit?

AdrianSosic commented 5 years ago

Hey again, thanks for your immediate answer!

Concerning the first idea: Yes, in principle it solves the problem. Yet, it comes with a number of drawbacks:

It's an external workaround, making the code less compact again and thus reducing the benefits of using xyzpy in the first place --> the suggested kwarg option would be nice
Also, in many scenarios, I would like to be able to access the results after each case, which would require as many external loops as there are dimensions in order to cover the entire product space of combinations --> in this case, there would be no more reason to use the package since its purpose is exactly to take over this task
More importantly, by external looping, I lose the built-in parallelization
If I manually parallelize the external loop(s), am I guaranteed that the package handles the file access correctly, i.e., that there will be no data loss when the different processes write to the same file?

Concerning the second idea: Yes, this would be a perfectly suited solution. Are you planning to add this feature in the future? Nevertheless, it would be nice to have a simple solution like the first one in addition, especially for working only on one machine.

jcmgray commented 5 years ago

For the moment I think it would make sense to add this functionality to Crop, both for the reasons you list and I think just conceptually.

Should also be quite easy, the only slightly tricky part maybe is inferring the sequence of shapes of the all-nan result. I might try adding something in the next few days, unless you want to give it a try?

AdrianSosic commented 5 years ago

I wish I could contribute but, as I said, I've just started using xarray/xyzpy and don't feel confident enough to work on the underlying code since I have not yet fully understood all details of the packages =/ Maybe I can help in the future when I have more experience with them!

jcmgray commented 5 years ago

Of course no worries, I would like to use this functionality so will add shortly. If you have any preference/ideas for the API let me know. I was thinking of something along the lines of:

crop.reap(allow_incomplete=True)

jcmgray commented 5 years ago

@AdrianSosic I've added this functionality in https://github.com/jcmgray/xyzpy/commit/2ce489545c5dee86ae8d1cc576dfbc00f018de5b. If you get a chance, let me know if it's working for you.

AdrianSosic commented 5 years ago

Hey, thanks a lot for adding the functionality. I think the allow_incomplete option should be fine. One issue that could cause problems is when the function itself returns np.nan for some inputs. An alternative might be to use masked arrays (or to provide the option to choose the default value).
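To make the ambiguity concrete, here is a small plain-Python sketch (the names with_mask, values and done are made up for illustration and are not xyzpy API). Keeping a separate "done" mask next to the nan-filled values is one way to tell a missing result from a function that genuinely returned nan:

```python
import math


def with_mask(points, finished):
    """Return (values, done) for the requested points.

    values -- results so far, with NaN for points not yet computed
    done   -- True where a result exists, even if that result is NaN

    Hypothetical helper for illustration only, not part of xyzpy.
    """
    values = [finished.get(point, math.nan) for point in points]
    done = [point in finished for point in points]
    return values, done


points = [(1, 1), (1, 2), (1, 3)]
# The function itself returned NaN for (1, 2); (1, 3) has not finished.
finished = {(1, 1): 2.0, (1, 2): math.nan}
values, done = with_mask(points, finished)
```

Here values alone cannot separate (1, 2) from (1, 3), whereas done can.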

However, I am getting some unexpected behavior:

One thing that is a bit weird is that, when I try to grow some sown combos using a new Crop object (e.g. one that was created on a different machine), I need to explicitly pass the correct batchsize information again. Otherwise, I get the error can't multiply sequence by non-int of type 'NoneType'. Shouldn't the batchsize be automatically loaded from the sown combos on the disk?
When I load the incomplete result using c.reap(allow_incomplete=True), I get a tuple of Datasets instead of a single merged Dataset.

Any thoughts?

jcmgray commented 5 years ago

Hmm, if you could provide minimal examples, that would be enormously helpful - as well as helping with debugging, they would be good starting points for unit tests.

With regard to using np.nan, this is really set because it's what pandas and xarray use for missing data. Things like merging datasets would be much trickier if another value was used.

AdrianSosic commented 5 years ago

Sure, here is an example:

Run the following code to let the seeds grow:

import xyzpy as xyz
import xarray as xr
from time import sleep

def fn(a, b):
    if b == 3:
        sleep(10000)
    y = xr.Dataset({'sum': a+b, 'diff': a-b})
    return y

combos = dict(
    a=[1],
    b=[1, 2, 3]
)

runner = xyz.Runner(fn, var_names=None)
harvester = xyz.Harvester(runner, 'test.h5')
crop = harvester.Crop(name='fn', batchsize=1)
crop.sow_combos(combos)
for i in range(1, 4):
    crop.grow(i)

If you then, while the code is running, access the intermediate results via

import xyzpy as xyz

c = xyz.Crop(name='fn', batchsize=1)
X = c.reap(allow_incomplete=True)

you get as output a one-element tuple containing a tuple of Datasets.

Moreover, if you remove the kwarg batchsize=1 in the latter crop, you receive the error can't multiply sequence by non-int of type 'NoneType'.

jcmgray commented 5 years ago

Thanks for the example! I think both are fixed by automatically loading on-disk information if it exists for any new crop. I'll push an update once I have a test in place shortly.

jcmgray commented 5 years ago

@AdrianSosic, this should be fixed in https://github.com/jcmgray/xyzpy/commit/1eb2b1dd04f3ef15724aef745915ae118a493e37, let me know if it's not working for you.

AdrianSosic commented 5 years ago

@jcmgray, great, seems to work perfectly, thanks! If I encounter any other issues, I will let you know!

jcmgray / xyzpy

Accessing intermediate results #7