Closed: AdrianSosic closed this issue 5 years ago
@AdrianSosic
this package is really fantastic, it solves exactly the problems that I've been struggling with for years! Thanks for your work!
Thanks! Glad it's useful.
With regards to your query - it would definitely be a nice feature to have, but nothing is implemented yet. For harvesting, if you want some kind of staged progress, you could manually loop over some of the values in `combos`:
```python
combos = {'a': range(10), 'b': range(10)}
for a_val in combos.pop('a'):
    harvester.harvest_combos({**combos, 'a': [a_val]})
```
A kwarg like `save_every={'a': 2}` might be a nice API to do this automatically (i.e. slice the `'a'` values in steps of 2).
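For illustration, here is a sketch of what such a (purely hypothetical, not yet implemented) `save_every` option could do internally - slice one dimension's values into chunks and harvest each chunk separately, so results land on disk after every chunk (the `harvester` call is commented out since it assumes the setup from the snippet above):

```python
# Hypothetical save_every behaviour: chunk the 'a' values in steps of 2.
combos = {'a': list(range(10)), 'b': list(range(10))}
save_every = 2

a_vals = combos.pop('a')
chunks = [a_vals[i:i + save_every] for i in range(0, len(a_vals), save_every)]
print(chunks[:2])  # [[0, 1], [2, 3]]

# Each chunk would then be harvested (and saved) on its own, e.g.:
# for chunk in chunks:
#     harvester.harvest_combos({**combos, 'a': chunk})
```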
On the other hand, `Crop` feels like the more natural place to put functionality relating to this kind of persistence. One approach here would be a `reap` method that defaults to an all-NaN result for missing entries and of course doesn't delete the on-disk data afterwards. Every time you ran this, it would happily merge in only the new data.
Would either of those suit?
Hey again, thanks for your immediate answer!
Concerning the first idea: yes, in principle it solves the problem. Yet, it comes with a number of drawbacks, especially in the multi-dimensional case, which would require as many external loops as there are dimensions in order to cover the entire product space of combinations --> in this case, there would be no more reason to use the package, since its purpose is exactly to take over this task.

Concerning the second idea: yes, this would be a perfectly suited solution. Are you planning to add this feature in the future? Nevertheless, it would be nice to have a simple solution like the first one in addition, especially when working only on one machine.
For the moment I think it would make sense to add this functionality to `Crop`, both for the reasons you list and, I think, just conceptually.
It should also be quite easy; the only slightly tricky part maybe is inferring the sequence of shapes of the all-NaN result. I might try adding something in the next few days, unless you want to give it a try?
I wish I could contribute but, as I said, I've just started using xarray/xyzpy and don't feel confident enough to work on the underlying code since I have not yet fully understood all details of the packages =/ Maybe I can help in the future when I have more experience with them!
Of course no worries, I would like to use this functionality so will add shortly. If you have any preference/ideas for the API let me know. I was thinking of something along the lines of:
```python
crop.reap(allow_incomplete=True)
```
@AdrianSosic I've added this functionality in https://github.com/jcmgray/xyzpy/commit/2ce489545c5dee86ae8d1cc576dfbc00f018de5b. If you get a chance, let me know if it's working for you.
Hey, thanks a lot for adding the functionality. I think the `allow_incomplete` option should be fine. One issue that could cause problems is when the function itself returns `np.nan` for some inputs. An alternative might be to use masked arrays (or to provide the option to choose the default value).
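For what it's worth, here's a minimal plain-numpy sketch (not tied to xyzpy) of how a masked array distinguishes a genuine NaN result from a missing one:

```python
import numpy as np
import numpy.ma as ma

# A masked array can tell apart "the function genuinely returned NaN"
# from "no result computed yet" -- something a plain NaN fill cannot.
values = np.array([1.0, np.nan, 0.0])     # the NaN here is a real result
missing = np.array([False, False, True])  # only the last entry is absent
results = ma.masked_array(values, mask=missing)

print(np.isnan(results[1]))     # genuine NaN result -> True
print(results[2] is ma.masked)  # missing result -> True
```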
However, I am getting some unexpected behavior:

1) When loading an existing `Crop` object (e.g. one that was created on a different machine), I need to explicitly pass the correct `batchsize` information again. Otherwise, I get the error `can't multiply sequence by non-int of type 'NoneType'`. Shouldn't the batchsize be automatically loaded from the sown combos on the disk?
2) When calling `c.reap(allow_incomplete=True)`, I get a tuple of Datasets instead of a single merged Dataset.

Any thoughts?
Hmm, if you could provide minimal examples that would be enormously helpful - as well as debugging they would be good starting points for unit tests.
With regard to using `np.nan`, this is really set because it's what `pandas` and `xarray` use for missing data. Things like merging datasets would be much trickier if another value was used.
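For instance, here is a minimal sketch (not xyzpy-specific) of how xarray fills in NaN when partial results are aligned to the full coordinate grid:

```python
import xarray as xr

# Partial results covering only b=1 and b=2; aligning them to the full
# grid [1, 2, 3] makes xarray fill the uncomputed b=3 entry with NaN.
partial = xr.Dataset({'sum': ('b', [2.0, 3.0])}, coords={'b': [1, 2]})
full = partial.reindex(b=[1, 2, 3])

print(full['sum'].values)  # [ 2.  3. nan]
```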
Sure, here is an example:
Run the following code to let the seeds grow:
```python
import xyzpy as xyz
import xarray as xr
from time import sleep

def fn(a, b):
    if b == 3:
        sleep(10000)
    y = xr.Dataset({'sum': a + b, 'diff': a - b})
    return y

combos = dict(
    a=[1],
    b=[1, 2, 3],
)

runner = xyz.Runner(fn, var_names=None)
harvester = xyz.Harvester(runner, 'test.h5')
crop = harvester.Crop(name='fn', batchsize=1)
crop.sow_combos(combos)

for i in range(1, 4):
    crop.grow(i)
```
If you then, while the code is running, access the intermediate results via

```python
import xyzpy as xyz

c = xyz.Crop(name='fn', batchsize=1)
X = c.reap(allow_incomplete=True)
```

you get as output a one-element tuple containing a tuple of Datasets.
Moreover, if you remove the kwarg `batchsize=1` in the latter crop, you receive the error `can't multiply sequence by non-int of type 'NoneType'`.
Thanks for the example! I think both are fixed by automatically loading on-disk information if it exists for any new crop. I'll push an update once I have a test in place shortly.
@AdrianSosic, this should be fixed in https://github.com/jcmgray/xyzpy/commit/1eb2b1dd04f3ef15724aef745915ae118a493e37, let me know if it's not working for you.
@jcmgray, great, seems to work perfectly, thanks! When I encounter any other issues, I will let you know!
Hi jcmgray,
this package is really fantastic, it solves exactly the problems that I've been struggling with for years! Thanks for your work!

I've just started using the package, though, and I have a question concerning batch processing: is there any straightforward way to access intermediate results of the computation by storing them on the disk? I've thought about two ways in particular:

1) Accessing the on-disk dataset created by the harvester. However, by default, the dataset is created only after all combos are evaluated. Is there some workaround / flag to set?
2) Using the crop functionality. However, I cannot reap the results during the computation since it gives the following error: `This crop is not ready to reap yet - results are missing`
Any thoughts?