Closed mtanco closed 3 years ago
Thank you @mtanco for your time and review!
The server is `./waved` from the GitHub release 0.9.1, which is maybe the wrong thing to do?
Should be ok in terms of the API, but I will rebase to keep up with the head.
> Inconsistency of data interactions
I had both as file paths at first but changed it. Agreed, both functions have different data inputs and should be normalised. I suppose it would be even better to allow several sources (Python object, pandas, file path, datatable) for both functions; is that right?
> Example didn't work ... Switching to pandas fixed this
Oh I see: Test/Validation dataset has categorical column 'Unemployment' which is real-valued in the training data
You had different types in the train and test datasets. Maybe because of the split?
What did you do with pandas that solved it?
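For reference, a minimal sketch of the kind of coercion that resolves this mismatch. The `align_types` helper is hypothetical (not part of any API here), and it assumes the training column parsed as floats while the test split parsed as strings, which is what makes H2O see it as categorical:

```python
def align_types(train_col, test_col):
    # Hypothetical fix for the 'Unemployment' mismatch: the training column
    # parsed as numeric, but the test split parsed as strings, so the test
    # column looks categorical. Coerce the test values back to float.
    if all(isinstance(v, float) for v in train_col) and any(isinstance(v, str) for v in test_col):
        return [float(v) for v in test_col]
    return list(test_col)

fixed = align_types([8.106, 7.808], ["8.2", "7.5"])  # -> [8.2, 7.5]
```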
> What makes this modeling package easier? Perhaps it's more about when DAI is in the mix too.
At this point it has no extra value. From my perspective: if a wave app is deployed to the cloud, the API will take care of provisioning DAI for you. The process will be hidden behind these simple API calls. Otherwise, the DS would be exposed to the provisioning API, which is not their task to take care of.
There will also be a `deploy()` function, not yet covered for H2O-3, and I'm also starting to integrate DAI, so one API will cover both backends.
> We need to make visualizing easier. ... I can't just look at this and know how to use it on my data.
Agree, it looks cryptic and it was inconvenient to write. There are two aspects here:
- `ui.plot_card()`
- the structure declared in `data('date price', n_total)`

But I'm not sure what should be changed. For the first bullet, this is how we declare and what we expect on the input in Wave in general. For the second bullet, we might provide some useful functions later, e.g. `tail(train_dt, 50, 'Date', 'Fuel_Price')` will give you the last 50 rows.
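To make the idea concrete, here is a rough sketch of what such a `tail` helper could do. This function does not exist yet, and the list-of-tuples row format plus the `header` argument are my assumptions:

```python
def tail(rows, n, *fields, header=None):
    # Hypothetical helper: return the last n rows, restricted to the given
    # fields. `rows` is a list of tuples and `header` names its columns.
    idx = [header.index(f) for f in fields] if fields else range(len(header))
    return [tuple(row[i] for i in idx) for row in rows[-n:]]

data_rows = [("2010-02-05", 2.572), ("2010-02-12", 2.548), ("2010-02-19", 2.514)]
last_two = tail(data_rows, 2, "Date", "Fuel_Price", header=["Date", "Fuel_Price"])
```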
cc @lo5
I should make this clearer in the documentation. You only need `size` in the `data(fields=..., size=...)` usage if you intend to use data buffers, and data buffers are useful only in situations where you need low-latency monitoring.
In the above example, since all you want to do is display a static plot, you can simplify it to:
```python
page = site['/predictions']
v = page.add('example', ui.plot_card(
    box='1 1 4 5',
    title='Line',
    data=data('date price', rows=values_train + values_test),
    plot=ui.plot([ui.mark(type='line', x_scale='time', x='=date', y='=price', y_min=0)])
))
```
or (and this should be quicker for column-oriented frames like those in pandas or datatable):
```python
page = site['/predictions']
v = page.add('example', ui.plot_card(
    box='1 1 4 5',
    title='Line',
    data=data('date price', columns=[df[col].to_list() for col in df.columns]),
    plot=ui.plot([ui.mark(type='line', x_scale='time', x='=date', y='=price', y_min=0)])
))
```
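As a side note on `rows=` vs. `columns=`, the two calls carry the same data, just transposed. A plain-Python sketch (no Wave imports; the tiny dict-of-lists "frame" stands in for pandas/datatable):

```python
# A tiny column-oriented "frame", standing in for a pandas/datatable frame.
frame = {"date": ["2010-02-05", "2010-02-12"], "price": [24924.50, 46039.49]}

# What data(..., columns=...) expects: one list per field.
columns = [list(v) for v in frame.values()]

# What data(..., rows=...) expects: one tuple per record (the transpose).
rows = list(zip(*columns))
```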
@geomodular I used the standard `walmart_train.csv` and `walmart_test.csv` datasets to match what was in the example notebook. I did not do anything special in regard to loading into pandas vs. datatable, so I don't have a good idea of what the problem might have been.
Here is my notebook of working through your notebook: https://github.com/h2oai/wave/blob/MT/common-api-testing/py/mt_ml_testing.ipynb
And the major notes from it:
> At this point it has no extra value. From my perspective: if a wave app is deployed to the cloud, the API will take care of provisioning DAI for you. The process will be hidden behind these simple API calls. Otherwise, the DS would be exposed to the provisioning API, which is not their task to take care of.
This makes total sense and is nice. It was also nice that it spun up H2O-3 for me automatically. I think some advanced users might want a say in which version of H2O-3 or DAI is used, DAI especially for LTS vs. bleeding edge.
@mtanco I updated the branch and the example. It's simpler. Please take a look if you have time.
> `model.predict(test_dt.to_tuples())` failed for me
I made some fixes there, please take a look.
> `model.predict(test_path)` doesn't work, which is inconsistent
This works now; `model.predict()` is able to swallow a Python object as well as a file path.
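The dual behaviour can be pictured with a small dispatcher. This is only a sketch of the idea, not the real `h2o_wave.ml` code, and the return values are illustrative:

```python
import os

def predict(data):
    # Sketch: accept either a file path or an in-memory object, mirroring
    # the updated model.predict(). A real implementation would load the
    # file into a frame and score it; here we only branch on input type.
    if isinstance(data, (str, os.PathLike)):
        return ("path", str(data))
    return ("rows", len(data))
```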
> There doesn't seem to be much to do with the model object. I would like to see the confusion matrix or standard regression validation metrics. `Model.Type` only says it's H2O-3; I want to know what algo it is (gbm, rf, etc.)
I talked to @lo5 about this and the plan is to develop a `model.explain()` call. The output of the function is yet to be determined.
I'm also getting weird predictions sometimes (not sure what's happening). I hope you will see it and tell me what's happening instantly 😂 .
New feedback (should I make these each separate issues?)
Should h2o_wave come with h2o? Or do we need to have h2o in the requirements.txt of examples?
Does this doc string mean model type like "h2o-3" vs. "Driverless AI"? As an analytics person model type to me means "which machine learning algorithm" like Random Forest vs. XGBoost. WDYT about "model_environment" or something like that?
> If `model_type` not specified the function will determine correct model based on a current environment.
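That environment-based selection might look roughly like this. The env-var name `DRIVERLESS_AI_ADDRESS` and the returned labels are assumptions for illustration, not the actual implementation:

```python
import os

def determine_model_type():
    # Hypothetical: choose the backend from the current environment.
    # Assumes a Driverless AI deployment advertises itself via an env
    # var; otherwise fall back to H2O-3 AutoML.
    if os.environ.get("DRIVERLESS_AI_ADDRESS"):
        return "dai"
    return "h2o-3"
```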
In general, this model is going to be bad because this is a time series problem but H2O AutoML doesn't do that automatically. So normally you would look at the average past sales to predict the future, but h2o-3 only looks at the current row.
This chart looks funky though, and I think it's not the predictions but that each date has multiple rows (one per store), but I am still looking into that. It would be easier to tell once that `explain()` is available and we can see things like `model.varimp()`, which would show us the important variables in the model.
I have a few more comments but I think they will be stand-alone issues and need more thought first.
> New feedback (should I make these each separate issues?)
Up to you, but I think this thread is good enough.
> Should h2o_wave come with h2o? Or do we need to have h2o in the requirements.txt of examples?
This will be handled by the h2o SDK. If the user wants to install the predictive backend they should include the `ml` option:

```
pip install h2o-wave[ml]
```
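Packaging-wise, that maps to an extras entry in the package metadata. A sketch (the dependency names listed under the `ml` extra are illustrative, not the actual pins):

```python
# setup.py sketch: an "ml" extra is what makes `pip install h2o-wave[ml]`
# pull in the predictive backend. The package list here is illustrative.
extras_require = {"ml": ["h2o", "driverlessai"]}

def deps_for(extra):
    # Resolve which extra dependencies an install option brings in.
    return extras_require.get(extra, [])
```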
> Does this doc string mean model type like "h2o-3" vs. "Driverless AI"?
Exactly.
> As an analytics person model type to me means "which machine learning algorithm" like Random Forest vs. XGBoost. WDYT about "model_environment" or something like that?
I see, will think about it.
Thanks @mtanco !
> In general, this model is going to be bad because this is a time series problem but H2O AutoML doesn't do that automatically.
Is time series a problem that `build_model()` has to know about? Or should it be recognized automagically? Should we cover it at all? cc @lo5
> This chart looks funky though, and I think it's not the predicts but that each date has multiple rows (one per store).
I see. Let's just use a dataset and a problem you are certain will work. Can you suggest something I can use in this internal example?
I have one other question. Is it standard that test data doesn't contain the target/predicted column at all? I'm not asking from a DS point of view but rather technically. I think we need to cover both cases.
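Covering both cases could be as simple as checking the header. A stdlib sketch (the `split_target` helper name and the list-of-tuples row format are my assumptions):

```python
def split_target(header, rows, target):
    # Handle test data that may or may not contain the target column:
    # return (feature_rows, target_values_or_None).
    if target not in header:
        return [tuple(r) for r in rows], None
    i = header.index(target)
    features = [tuple(v for j, v in enumerate(r) if j != i) for r in rows]
    return features, [r[i] for r in rows]
```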
The feedback and thread were very useful and served their purpose.
Closing. Thanks @mtanco and @lo5!
Started here:
https://github.com/h2oai/wave/blob/feat/common-api/py/examples/h2o3.py
without reading extensive docs or anything, just feedback from initial usage.

System Setup
Notes
Inconsistency of data interactions
`build_model(train_path, target=target)` requires a file path (pandas is not an option), but `predict = model.predict(test_df)` cannot take a path and has to be the actual data. I don't personally care which, but all methods should take the same "type" of data, whether that is a file path or a data object.

Example didn't work
I had data type issues when predicting with a datatable to-tuples object:
`predict = model.predict(test_dt.to_tuples())`
Switching to pandas fixed this, but we should be careful with enum vs. numeric handling to avoid these types of errors.
What makes this modeling package easier?
I've only tried this one example, but so far I am not seeing why this is easier for me to use than "regular" AutoML. Perhaps it's more about when DAI is in the mix too.
We need to make visualizing easier.
This is a lot. I am new to the whole data buffers thing, but I can't just look at this and know how to use it on my data. It is going to take some time to sit down and understand what `values_train` is so that I can figure out how to show data in cards the way I want to.

I personally see Predictions -> Wave App as the most important differentiator. Our customers can build models; let's make the "turn models into beautiful apps" part even easier.
I will read the docs and do more work now, but these are my initial thoughts.