logicalclocks / maggy

Distribution transparent Machine Learning experiments on Apache Spark
https://maggy.ai
Apache License 2.0

lagom / executor / driver refactor #89

Closed amacati closed 3 years ago

amacati commented 3 years ago

Refactor the experiment function, executors and drivers.

This PR disentangles the hyperparameter optimization, ablation study and distributed training code. As more features are added to Maggy, the code becomes confusing unless it is split into distinct units. As a first step, the lagom function and its associated functionality were rewritten. Because changes to the lagom function affect the whole module, the changes are substantial. To make them traceable, I give a short summary of each file with its respective changes. It will be easiest to review the files of the PR in the order they are given in this summary.
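One way to picture this kind of separation is to dispatch on the type of the experiment configuration, so that each experiment kind gets its own code path. The following is only a minimal sketch of that idea, not the code in this PR: the config classes, `select_driver` and `run_experiment` are hypothetical names chosen for illustration, and the "drivers" are plain strings standing in for real driver objects.

```python
from dataclasses import dataclass
from functools import singledispatch


# Hypothetical config types, one per experiment kind (illustrative only).
@dataclass
class OptimizationConfig:
    num_trials: int


@dataclass
class AblationConfig:
    features: tuple


@dataclass
class DistributedConfig:
    num_workers: int


@singledispatch
def select_driver(config):
    """Fallback: reject config types that have no registered driver."""
    raise TypeError(f"Unsupported config type: {type(config).__name__}")


@select_driver.register
def _(config: OptimizationConfig):
    # A real implementation would construct an optimization driver here.
    return f"optimization driver ({config.num_trials} trials)"


@select_driver.register
def _(config: AblationConfig):
    return f"ablation driver ({len(config.features)} features)"


@select_driver.register
def _(config: DistributedConfig):
    return f"distributed driver ({config.num_workers} workers)"


def run_experiment(train_fn, config):
    """Hypothetical entry point: pick a driver by config type, then run."""
    driver = select_driver(config)
    return driver, train_fn()
```

Under these assumptions, `run_experiment(lambda: 0.9, AblationConfig(features=("a", "b")))` selects the ablation branch without the entry point containing any ablation-specific logic, which is the property the refactor is after.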

Disentangling the lagom function:

Refactoring the drivers

Miscellaneous changes

Known issues

Ablation tests are still ongoing. So far there seems to be an issue with the heartbeat.

RiccardoGrigoletto commented 3 years ago

It looks good to me. I tried it with TensorFlow in my VM and it works.

amacati commented 3 years ago

Why were there still calls to hopsutils and experiment_utils in the code? I'm pretty sure I converted those calls to EnvSing.

RiccardoGrigoletto commented 3 years ago

They were in the first commit, and you changed them in the next commit. I didn't see that at first, but once I did, I marked the comments as 'resolved'.

amacati commented 3 years ago

Gridsearch is tested and should work.