logicalclocks / maggy

Distribution transparent Machine Learning experiments on Apache Spark
https://maggy.ai
Apache License 2.0

Is maggy applicable to my use case? #69

Open blazejdolicki opened 3 years ago

blazejdolicki commented 3 years ago

Hi, I've just found this library and it seems great, but I wanted to quickly double-check whether it's applicable to my use case. Namely, I have a large amount of tabular data stored in Spark DataFrames (so the data is distributed across multiple machines) on Databricks, and I'm using a Spark ML model. Will I be able to run trials in parallel in such a setting using maggy?

moritzmeister commented 3 years ago

Hey!

Thanks for your interest! Maggy is very applicable to your use case. However, at this point in time it is very much tied to Hopsworks. If you want to try it out on Hopsworks, you can get access to a free demo instance on hopsworks.ai, or you can deploy an entire Hopsworks instance to your own AWS account.

We are working on making Maggy more general, but it will take a few more weeks for it to be ready for use on any Spark cluster. We are planning to release a standalone version of Maggy for the Data+AI Summit Europe in two weeks, or shortly thereafter.

Please come back and check the repo for any new releases!

In the meantime, if you want to know more about Maggy as a research project, we have some blog posts (here and here), and also a paper at the MLOps Workshop of this year's MLSys conference.

Hope that answers your questions! I will ping you here once we have made another release!

blazejdolicki commented 3 years ago

Thanks for a comprehensive response and all the references. In this case, I will wait for the standalone version. Looking forward to it!

blazejdolicki commented 3 years ago

Hi @moritzmeister, I was wondering what the status of Maggy is. Did you manage to make it standalone? :)

crakama commented 3 years ago

@moritzmeister I am also interested in this response.

moritzmeister commented 3 years ago

Hi @blazejdolicki, @crakama! Thanks for your interest! We're working on it but it's not there yet. Hope to get it done by mid January.

crakama commented 3 years ago

@moritzmeister Thank you for the response. Will it be published somewhere?

crakama commented 3 years ago

Any updates on this?

moritzmeister commented 3 years ago

Hey @crakama, sorry I must've missed your previous message! With increasing interest we are now working towards a major 1.0 release. So we are getting there but it will still take some time.

moritzmeister commented 3 years ago

Hi @blazejdolicki, hi @crakama,

it took us a while, but we just published a 1.0.0rc0 release candidate on PyPI. Give it a try on your Spark clusters. I suspect there are still a few bugs, but we are working on fixing them before the main release.

To get people started, there are a bunch more example notebooks now in https://github.com/logicalclocks/maggy/tree/master/examples

Also, we are redesigning/rewriting the documentation, so keep an eye on www.maggy.ai :) We also wrote some blog posts, which can serve as a kind of documentation until then.

Feel free to open new issues here on GitHub if you have specific questions or encounter any bugs. I am closing this issue here. Looking forward to any feedback you might have!

blazejdolicki commented 3 years ago

Thanks for letting us know!

crakama commented 3 years ago

Hi, unfortunately, when I install maggy with pip I get this error:

ERROR: Could not find a version that satisfies the requirement maggy==1.0.0rc0 (from versions: 0.0.1, 0.1, 0.1.1, 0.2, 0.2.1, 0.2.2, 0.3.0, 0.3.1, 0.3.2, 0.3.3, 0.4.0, 0.4.1, 0.4.2, 0.5.0, 0.5.1, 0.5.2, 0.5.3)
ERROR: No matching distribution found for maggy==1.0.0rc0

moritzmeister commented 3 years ago

Hi @crakama,

that is very strange. What kind of Python environment are you using, and which version of pip? The release is on PyPI: https://pypi.org/project/maggy/1.0.0rc0/. Alternatively, I also uploaded the wheel to the release tag here on GitHub: https://github.com/logicalclocks/maggy/releases/tag/1.0.0rc0

I just tested it with Python 3.8.8 and pip 21.1.1 and I can install it as expected:

(test-maggy) moritzmeister@ ~ () $ pip install maggy==1.0.0rc0
Collecting maggy==1.0.0rc0
  Downloading maggy-1.0.0rc0-py3-none-any.whl (154 kB)
     |████████████████████████████████| 154 kB 8.2 MB/s 
Collecting scikit-optimize==0.7.4
  Using cached scikit_optimize-0.7.4-py2.py3-none-any.whl (80 kB)
Collecting numpy==1.19.2
  Using cached numpy-1.19.2-cp38-cp38-manylinux2010_x86_64.whl (14.5 MB)
Collecting statsmodels==0.11.0
  Using cached statsmodels-0.11.0-cp38-cp38-manylinux1_x86_64.whl (8.7 MB)
Collecting scipy==1.4.1
  Using cached scipy-1.4.1-cp38-cp38-manylinux1_x86_64.whl (26.0 MB)
Collecting scikit-learn>=0.19.1
  Downloading scikit_learn-0.24.2-cp38-cp38-manylinux2010_x86_64.whl (24.9 MB)
     |████████████████████████████████| 24.9 MB 13.3 MB/s 
Collecting pyaml>=16.9
  Using cached pyaml-20.4.0-py2.py3-none-any.whl (17 kB)
Collecting joblib>=0.11
  Using cached joblib-1.0.1-py3-none-any.whl (303 kB)
Collecting patsy>=0.5
  Using cached patsy-0.5.1-py2.py3-none-any.whl (231 kB)
Collecting pandas>=0.21
  Downloading pandas-1.2.4-cp38-cp38-manylinux1_x86_64.whl (9.7 MB)
     |████████████████████████████████| 9.7 MB 8.5 MB/s 
Collecting python-dateutil>=2.7.3
  Using cached python_dateutil-2.8.1-py2.py3-none-any.whl (227 kB)
Collecting pytz>=2017.3
  Using cached pytz-2021.1-py2.py3-none-any.whl (510 kB)
Collecting six
  Downloading six-1.16.0-py2.py3-none-any.whl (11 kB)
Collecting PyYAML
  Using cached PyYAML-5.4.1-cp38-cp38-manylinux1_x86_64.whl (662 kB)
Collecting threadpoolctl>=2.0.0
  Using cached threadpoolctl-2.1.0-py3-none-any.whl (12 kB)
Installing collected packages: six, numpy, threadpoolctl, scipy, PyYAML, pytz, python-dateutil, joblib, scikit-learn, pyaml, patsy, pandas, statsmodels, scikit-optimize, maggy
Successfully installed PyYAML-5.4.1 joblib-1.0.1 maggy-1.0.0rc0 numpy-1.19.2 pandas-1.2.4 patsy-0.5.1 pyaml-20.4.0 python-dateutil-2.8.1 pytz-2021.1 scikit-learn-0.24.2 scikit-optimize-0.7.4 scipy-1.4.1 six-1.16.0 statsmodels-0.11.0 threadpoolctl-2.1.0
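
For anyone hitting the same "No matching distribution found" error, the environment details asked about above (Python version and pip version) can be collected with a short script. This is only a minimal sketch: the function name `environment_report` is made up for illustration, it assumes Python 3.8 or newer (for `importlib.metadata`), and it does not diagnose the cause by itself. It just gathers the values to paste into a reply. Run it with the same interpreter that pip installs into.

```python
import platform
import sys
from importlib import metadata  # standard library from Python 3.8 onwards


def environment_report():
    """Collect the interpreter and pip details requested in the thread.

    Hypothetical helper for illustration; not part of maggy or pip.
    """
    try:
        # Version of pip installed for *this* interpreter.
        pip_version = metadata.version("pip")
    except metadata.PackageNotFoundError:
        pip_version = "not installed"
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "executable": sys.executable,
        "pip": pip_version,
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Comparing the reported values with the working combination above (Python 3.8.8, pip 21.1.1) is a reasonable first check, since pip can hide releases that do not declare support for the running interpreter.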