h2oai / wave-ml

Automatic Machine Learning (AutoML) for Wave Apps
Apache License 2.0

Feedback: h2o_wave.ml #21

Closed mtanco closed 3 years ago

mtanco commented 3 years ago

Started here: https://github.com/h2oai/wave/blob/feat/common-api/py/examples/h2o3.py without reading extensive docs or anything, just feedback from initial usage.

System Setup

git clone https://github.com/h2oai/wave.git
cd wave
git fetch
git checkout feat/common-api

cd py
make setup

./venv/bin/pip install h2o
./venv/bin/pip install datatable

./venv/bin/pip install ipykernel
./venv/bin/python -m ipykernel install

./venv/bin/jupyter notebook

# Server is ./waved from the GitHub release 0.9.1 which is maybe the wrong thing to do?

Notes

  1. Inconsistency of data interactions. build_model(train_path, target=target) requires a file path (pandas is not an option), but predict = model.predict(test_df) cannot take a path and has to be the actual data. I don't personally care which, but all methods should accept the same "type" of data, whether that is a file path or a data object.

  2. Example didn't work. I hit data type issues when predicting with a datatable-to-tuples object: predict = model.predict(test_dt.to_tuples())

    OSError: Job with key $03017f00000132d4ffffffff$_93bf92eec9079a23639ed057aaa44ec3 failed with an exception: java.lang.IllegalArgumentException: Test/Validation dataset has categorical column 'Unemployment' which is real-valued in the training data
    stacktrace: 
    java.lang.IllegalArgumentException: Test/Validation dataset has categorical column 'Unemployment' which is real-valued in the training data

    Switching to pandas fixed this, but we should be careful with enum vs. numeric handling to avoid these kinds of errors.

  3. What makes this modeling package easier? I've only tried this one example so far, but I'm not yet seeing why this is easier for me to use than "regular" AutoML. Perhaps it's more about when DAI is in the mix too.

  4. We need to make visualizing easier.

page = site['/predictions']

n_train = 10
n_test = 10
n_total = n_train + n_test
v = page.add('example', ui.plot_card(
    box='1 1 4 5',
    title='Line',
    data=data('date price', n_total),
    plot=ui.plot([ui.mark(type='line', x_scale='time', x='=date', y='=price', y_min=0)])
))

# We are taking the last `n_train` values from the train set.
values_train = [(train_dt[-i, 'Date'], train_dt[-i, 'Fuel_Price']) for i in reversed(range(1, n_train + 1))]
values_test = [(test_dt[i, 'Date'], predict[i][0]) for i in range(n_test)]

v.data = values_train + values_test

page.save()

This is a lot. I am new to the whole data buffers thing, but I can't just look at this and know how to use it on my data. It is going to take some time to sit down and understand what values_train is so that I can figure out how to show data in cards the way I want to.

I personally see Predictions -> Wave App as the most important differentiator. Our customers can build models; let's make the "turn models into beautiful apps" part even easier.

I will read the docs and do more work now, but these are my initial thoughts.
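The enum vs. numeric mismatch in note 2 happens when the train and test frames parse a column to different types. In H2O-3 the usual fix is to coerce the column on the H2OFrame (asfactor()/asnumeric()); as a rough illustration of the idea only, here is a stdlib-only sketch that coerces test rows to the types seen in the training rows (align_types is a hypothetical name, not wave-ml API):

```python
from typing import List, Tuple

def align_types(train_rows: List[Tuple], test_rows: List[Tuple]) -> List[Tuple]:
    """Cast every test column to the Python type of the matching train column."""
    if not train_rows or not test_rows:
        return test_rows
    # Take the training types from the first row; a fuller version would
    # scan the whole column and honour an explicit schema instead.
    col_types = [type(v) for v in train_rows[0]]
    return [tuple(t(v) for t, v in zip(col_types, row)) for row in test_rows]
```

This kind of guard inside predict() would have turned the Java-side IllegalArgumentException into either a clean cast or an early, readable Python error.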

geomodular commented 3 years ago

Thank you @mtanco for your time and review!

Server is ./waved from the GitHub release 0.9.1 which is maybe the wrong thing to do?

Should be ok in terms of the API but will be doing rebase to keep up with the head.

Inconsistency of data interactions

I had both take file paths at first but changed it. Agreed, the two functions have different data inputs and should be normalised. I suppose it would be even better to allow several sources (Python objects, pandas, file paths, datatable) for both functions, is that right?

Example didn't work ... Switching to pandas fixed this

Oh I see: Test/Validation dataset has categorical column 'Unemployment' which is real-valued in the training data. You had a different type in the train and test datasets. Maybe because of the split?

What did you do with pandas that solved it?

What makes this modeling package easier? Perhaps it's more about when DAI is in the mix too.

At this point it has no extra value. From my perspective: if a Wave app is deployed to the cloud, the API will take care of provisioning DAI for you. The process will be hidden behind these simple API calls. Otherwise, data scientists would be exposed to the provisioning API, which is not their task to take care of.

There will also be a deploy() function, not yet covered for H2O-3, and I'm starting to integrate DAI, so one API will cover both backends.

We need to make visualizing easier. ... I can't just look at this and know how to use it on my data.

Agreed, it looks cryptic and was inconvenient to write. There are two aspects here:

  1. The inputs to ui.plot_card() should follow the structure declared by data('date price', n_total).
  2. The conversion from datatable to the structure declared in 1.

I'm not sure what should be changed, though. For the first bullet, this is how we declare and what we expect on the input in Wave in general. For the second, we might provide some useful helper functions later, e.g. tail(train_dt, 50, 'Date', 'Fuel_Price') would give you the last 50 rows of those columns.
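The proposed tail() helper doesn't exist yet; its semantics might look like this, sketched against a plain dict-of-lists stand-in for a datatable Frame:

```python
from typing import Dict, List, Sequence, Tuple

def tail(frame: Dict[str, Sequence], n: int, *columns: str) -> List[Tuple]:
    """Return the last n rows of the named columns as row tuples,
    in the shape that data(..., rows=...) accepts."""
    length = len(frame[columns[0]])
    start = max(0, length - n)
    return [tuple(frame[c][i] for c in columns) for i in range(start, length)]
```

With a helper like this, the values_train comprehension in the original example collapses to tail(train_dt, n_train, 'Date', 'Fuel_Price').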

cc @lo5

lo5 commented 3 years ago

I should make this more clear in the documentation.

You only need the size in the data(fields=..., size=...) usage if you intend to use data buffers, and data buffers are useful only for situations where you need low-latency monitoring.

In the above example, since all you want to do is display a static plot, you can simplify it to:

page = site['/predictions']
v = page.add('example', ui.plot_card(
    box='1 1 4 5',
    title='Line',
    data=data('date price', rows=values_train + values_test),
    plot=ui.plot([ui.mark(type='line', x_scale='time', x='=date', y='=price', y_min=0)])
))

or (and this should be quicker for column-oriented frames like those in pandas or datatable):

page = site['/predictions']
v = page.add('example', ui.plot_card(
    box='1 1 4 5',
    title='Line',
    data=data('date price', columns=[df[col].to_list() for col in df.columns]),
    plot=ui.plot([ui.mark(type='line', x_scale='time', x='=date', y='=price', y_min=0)])
))
mtanco commented 3 years ago

@geomodular I used the walmart_train.csv and walmart_test.csv standard datasets to match what was in the example notebook. I did not do anything special regarding loading into pandas vs. datatable, so I don't have a good idea of what the problem might have been.

Here is my notebook of working through your notebook: https://github.com/h2oai/wave/blob/MT/common-api-testing/py/mt_ml_testing.ipynb

And the major notes from it:

At this point it has no extra value. From my perspective: if a Wave app is deployed to the cloud, the API will take care of provisioning DAI for you. The process will be hidden behind these simple API calls. Otherwise, data scientists would be exposed to the provisioning API, which is not their task to take care of.

This makes total sense and is nice. It was also nice that it spun up H2O-3 for me automatically. I think some advanced users might want a say in which version of H2O-3 or DAI is used; DAI especially, for LTS vs. bleeding edge.

geomodular commented 3 years ago

@mtanco I updated the branch and the example. It's simpler. Please take a look if you have time.

model.predict(test_dt.to_tuples()) failed for me

I made some fixes there, please take a look.

model.predict(test_path) doesn't work which is inconsistent

This works now; model.predict() accepts both a Python object and a file path.

There doesn't seem to be much to do with the model object. I would like to see the confusion matrix or standard regression validation metrics. Model.Type only says it's H2O-3; I want to know which algo it is (GBM, RF, etc.)

I talked to @lo5 about this and the plan is to develop model.explain() call. The output of the function is yet to be determined.

I'm also getting weird predictions sometimes (not sure what's happening). I hope you will see it and tell me what's happening instantly 😂 .

mtanco commented 3 years ago

New feedback (should I make each of these a separate issue?)

Should h2o_wave come with h2o? Or do we need to have h2o in the requirements.txt of examples?

Does this doc string mean model type like "h2o-3" vs. "Driverless AI"? As an analytics person, model type to me means "which machine learning algorithm," like Random Forest vs. XGBoost. WDYT about "model_environment" or something like that?

If model_type not specified the function will determine correct model based on a current environment.

In general, this model is going to be bad because this is a time series problem, and H2O AutoML doesn't handle that automatically. Normally you would look at average past sales to predict the future, but H2O-3 only looks at the current row.
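A common workaround for row-at-a-time learners on time-series data is to engineer lag features before calling build_model(), so each row carries some history. A stdlib-only sketch of the idea (add_lag_mean is a hypothetical helper, not wave-ml API; rows are assumed ordered by date with the target as the last element):

```python
from typing import List, Optional, Tuple

def add_lag_mean(rows: List[Tuple], n: int = 4) -> List[Tuple]:
    """Append the mean of the previous n target values as an extra
    feature column, so a row-at-a-time model can still see history."""
    out = []
    for i, row in enumerate(rows):
        history = [r[-1] for r in rows[max(0, i - n):i]]
        mean: Optional[float] = sum(history) / len(history) if history else None
        out.append(row + (mean,))
    return out
```

For the Walmart data, each store's rows would need this applied separately, since the series are per store.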


This chart looks funky, though. I think it's not the predictions but that each date has multiple rows (one per store), but I am still looking into that. It would be easier to tell once explain is available and we can see things like model.varimp(), which would show us the important variables in the model.

I have a few more comments, but I think they will be stand-alone issues and need more thought first.

geomodular commented 3 years ago

New feedback (should I make each of these a separate issue?)

Up to you, but I think this thread is good enough.

Should h2o_wave come with h2o? Or do we need to have h2o in the requirements.txt of examples?

This will be handled by the h2o SDK. If a user wants to install the predictive backend, they should include the ml option:

pip install h2o-wave[ml]

Does this doc string mean model type like "h2o-3" vs. "Driverless AI"?

Exactly.

As an analytics person, model type to me means "which machine learning algorithm," like Random Forest vs. XGBoost. WDYT about "model_environment" or something like that?

I see, will think about it.

Thanks @mtanco !

geomodular commented 3 years ago

In general, this model is going to be bad because this is a time series problem, and H2O AutoML doesn't handle that automatically.

Does build_model() have to know that a problem is a time series? Or should it be recognized automagically? Should we cover it at all? cc @lo5

This chart looks funky, though. I think it's not the predictions but that each date has multiple rows (one per store).

I see. Let's just use a dataset and a problem you are certain will work. Can you suggest something I can use in this internal example?

One other question: is it standard that test data doesn't contain the target/predicted column at all? I'm not asking from a DS point of view but rather technically. I think we need to cover both cases.

geomodular commented 3 years ago

The feedback and thread were very useful and served their purpose.

Closing. Thanks @mtanco and @lo5!