Recipes for H2O Driverless AI
About Driverless AI
H2O Driverless AI is Automatic Machine Learning for the Enterprise. Driverless AI automates feature engineering, model building, visualization and interpretability.
About BYOR
BYOR stands for Bring Your Own Recipe and is a key feature of Driverless AI. It allows domain scientists to solve their problems faster and with more precision.
What are Custom Recipes?
Custom recipes are Python code snippets that can be uploaded into Driverless AI at runtime, like plugins. No need to restart Driverless AI. Custom recipes can be provided for transformers, models and scorers. During training of a supervised machine learning modeling pipeline (aka experiment), Driverless AI can then use these code snippets as building blocks, in combination with all built-in code pieces (or instead of). By providing your own custom recipes, you can gain control over the optimization choices that Driverless AI makes to best solve your machine learning problems.
Best Practices for Recipes
Security
-
Recipes are meant to be built by people you trust and each recipe should be code-reviewed before going to production.
-
Assume that a user with access to Driverless AI has access to the data inside that instance.
- Apart from securing access to the instance via private networks, various methods of authentication are possible. Local authentication provides the most control over which users have access to Driverless AI.
- Unless the
config.toml
setting enable_dataset_downloading=false
is set, an authenticated user can download all imported datasets as .csv via direct APIs.
-
When recipes are enabled (enable_custom_recipes=true
, the default), be aware that:
- The code for the recipes runs as the same native Linux user that runs the Driverless AI application.
- Recipes have explicit access to all data passing through the transformer/model/scorer API
- Recipes have implicit access to system resources such as disk, memory, CPUs, GPUs, network, etc.
- An H2O-3 Java process is started in the background, for use by all recipes using H2O-3. Anyone with access to the Driverless AI instance can browse the file system, see models and data through the H2O-3 interface.
-
Enable automatic detection of forbidden or dangerous code constructs in a custom recipe with custom_recipe_security_analysis_enabled = tr ue
. Note the following:
- When
custom_recipe_security_analysis_enabled
is enabled, do not use modules specified in the banlist. Specify the banlist with the cu stom_recipe_import_banlist
config option.
- For example:
custom_recipe_import_banlist = ["shlex", "plumbum", "pexpect", "envoy", "commands", "fabric", "subprocess", "os.system", "system"]
(default)
- When
custom_recipe_security_analysis_enabled
is enabled, code is also checked for dangerous calls like eval()
, exec()
and other in
secure calls (regex patterns) defined in custom_recipe_method_call_banlist
. Code is also checked for other dangerous constructs defined as
regex patterns in the custom_recipe_dangerous_patterns
config setting.
- Security analysis is only performed on recipes that are uploaded after the
custom_recipe_security_analysis_enabled
config option is en
abled.
- To specify a list of modules that can be imported in custom recipes, use the
custom_recipe_import_allowlist
config option.
- The
custom_recipe_security_analysis_enabled
config option is disabled by default.
-
Best ways to control access to Driverless AI and custom recipes:
- Control access to the Driverless AI instance
- Use local authentication to specify exactly which users are allowed to access Driverless AI
- Run Driverless AI in a Docker container, as a certain user, with only certain ports exposed, and only certain mount points mapped
- To disable all recipes: Set
enable_custom_recipes=false
in the config.toml, or add the environment variable DRIVERLESS_AI_ENABLE_CUSTOM_RECIPES=0
at startup of Driverless AI. This will disable all custom transformers, models and scorers.
- To disable new recipes: To keep all previously uploaded recipes enabled and disable the upload of any new recipes, set
enable_custom_recipes_upload=false
or DRIVERLESS_AI_ENABLE_CUSTOM_RECIPES_UPLOAD=0
at startup of Driverless AI.
Safety
- Driverless AI automatically performs basic acceptance tests for all custom recipes unless disabled
- More information in the FAQ
Performance
- Use fast and efficient data manipulation tools like
datatable
, sklearn
, numpy
or pandas
instead of Python lists, for-loops etc.
- Use disk sparingly, delete temporary files as soon as possible
- Use memory sparingly, delete objects when no longer needed
Reference Guide
Sample Recipes
Go to Recipes for Driverless 1.7.0
1.7.1
1.8.0
1.8.1
1.8.2
1.8.3
1.8.4
1.8.5
1.8.6
1.8.7
1.8.8
1.8.9
1.8.10
1.9.0
1.9.1
1.9.2
1.9.3
1.10.0
1.10.1
1.10.2
1.10.3
1.10.4
1.10.4.1
1.10.4.2
1.10.4.3
1.10.5
Count: 277
- AIR-GAPPED_INSTALLATIONS
- DATA
- EXPLAINERS
- DOC
- API
- IMAGES
- EXPLAINERS
- ale_explainer.py [Accumulated Local Effects (ALE) explainerNote:This example repurposes the Partial Dependence format render data. As such, the label"Average Prediction of {response}" is used for the y-axis instead of "ALE of {response}".]
- morris_sensitivity_explainer.py [Morris Sensitivity Analysis Explainer]
- EXAMPLES
- TEMPLATES
- template_dt_explainer.py [Decision Tree explainer which can be used to create explainer with global and local decision tree explanations.]
- template_featimp_explainer.py [Feature importance explainer template which can be used create explainer with global and local feature importance explanations.]
- template_md_explainer.py [Markdown report with raster image chart explainer template which can be used to create explainer with global report explanations.]
- template_md_featimp_summary_explainer.py [Markdown report with summary feature importance chart explainer template which can be used to create explainer with global report explanations.]
- template_md_vega_explainer.py [Markdown report with Vega chart explainer template which can be used to create explainer which creates global report explanations.]
- template_pd_explainer.py [PD and ICE explainer template which can be used to create example with partial dependence (global) and individual conditional explanations (local) explanations.]
- template_scatter_plot_explainer.py [Scatter plot explainer template which can be used to create explainer with global and local explanations.]
- NOTEBOOKS
- HOW_TO_WRITE_A_RECIPE
- INDIVIDUALS
- MODELS
- RECIPES
- amazon.py [Recipe for Kaggle Competition: Amazon.com - Employee Access Challenge]
- REFERENCE
- SCORERS
- huber_loss.py [Huber Loss for Regression or Binary Classification. Robust loss, combination of quadratic loss and linear loss.]
- scorer_template.py [Template base class for a custom scorer recipe.]
- CLASSIFICATION
- f3_score.py [F3 Score]
- f4_score.py [F4 Score]
- probF.py [Probabilistic F Score with optimized threshold]
- BINARY
- average_mcc.py [Averaged Matthews Correlation Coefficient (averaged over several thresholds, for imbalanced problems). Example how to use Driverless AI's internal scorer.]
- balanced_accuracy.py [balanced_accuracy_score]
- brier_loss.py [Brier Loss]
- cost.py [Using hard-coded dollar amounts x for false positives and y for false negatives, calculate the cost of a model using:
(x * FP + y * FN) / N
]
- cost_access_to_data.py [Same as CostBinary, but provides access to full Data]
- cost_smooth.py [Using hard-coded dollar amounts x for false positives and y for false negatives, calculate the cost of a model using:
(1 - y_true) * y_pred * fp_cost + y_true * (1 - y_pred) * fn_cost
]
- fair_auc.py [Custom scorer for detecting and reducing bias in machine learning models.]
- kolmogorov_smirnov.py [Kolmogorov-Smirnov scorer recipe.If you need to print debug messages into DAI log, uncomment lines with logger and loggerinfo.Starting 1.10.2 - DAI handles exceptions raised by custom scorers.Default DAI behavior is to continue experiment in case of Scorer failure.To enable forcing experiment to fail, in case of scorer error, set following parameters in DAI: - skip_scorer_failures=false (Disabled) - skip_model_failures=false (Disabled)]
- logloss_with_costs.py [Logloss with costs associated with each type of 4 outcomes - typically applicable to fraud use case]
- marketing_campaign.py [Computes the mean profit per outbound marketing letter, given a fraction of the population addressed, and fixed cost and reward]
- profit.py [Profit Scorer for binary classification]
- MULTICLASS
- REGRESSION
- WAPE_scorer.py [Weighted Absoluted Percent Error]
- asymmetric_mae.py [MAE with a penalty that differs for positive and negative errors]
- auuc.py [Area under uplift curve]
- cosh_loss.py [Hyperbolic Cosine Loss]
- explained_variance.py [Explained Variance. Fraction of variance that is explained by the model.]
- gamma_deviance.py [Gamma Deviance scorer recipe.This is same as Tweedie Deviance scorer with power=2If you need to print debug messages into DAI log, uncomment lines with logger and loggerinfo.Starting 1.10.2 - DAI handles exceptions raised by custom scorers.Default DAI behavior is to continue experiment in case of Scorer failure.To enable forcing experiment to fail, in case of scorer error, set following parameters in DAI: - skip_scorer_failures=false (Disabled) - skip_model_failures=false (Disabled)]
- largest_error.py [Largest error for regression problems. Highly sensitive to outliers.]
- log_mae.py [Log Mean Absolute Error for regression]
- mean_absolute_scaled_error.py [Mean Absolute Scaled Error for time-series regression]
- mean_squared_log_error.py [Mean Squared Log Error for regression]
- median_absolute_error.py [Median Absolute Error for regression]
- pearson_correlation.py [Pearson Correlation Coefficient for regression]
- poisson_deviance.py [Poisson Deviance scorer recipe.]
- quantile_loss.py [Quantile Loss regression]
- r2_by_tgc.py [Custom R2 scorer computes R2 on each time series, then averages them out for the final score.]
- rmse_with_x.py [Custom RMSE Scorer that also gets X (original features) - for demo/testing purposes only]
- top_decile.py [Median Absolute Error for predictions in the top decile]
- tweedie_deviance.py [Tweedie Deviance scorer recipe.User inputs can be provided through recipe_dict in config.To pass power parameterrecipe_dict = "{'power':2.0}"The default value is 1.5If you need to print debug messages into DAI log, uncomment lines with logger and loggerinfo.Starting 1.10.2 - DAI handles exceptions raised by custom scorers.Default DAI behavior is to continue experiment in case of Scorer failure.To enable forcing experiment to fail, in case of scorer error, set following parameters in DAI: - skip_scorer_failures=false (Disabled) - skip_model_failures=false (Disabled)]
- TRANSFORMERS
- how_to_debug_transformer.py [Example how to debug a transformer outside of Driverless AI (optional)]
- how_to_test_from_py_client.py [Testing a BYOR Transformer the PyClient - works on 1.7.0 & 1.7.1-17]
- transformer_template.py [Template base class for a custom transformer recipe.]
- ANOMALY
- [isolation_forest.py](./transformers/anomaly /isolation_forest.py) [H2O-3 Distributed Scalable Machine Learning Transformers (IF)]
- AUGMENTATION
- germany_landers_holidays.py [Returns a flag for whether a date falls on a holiday for each of Germany's Bundeslaender. ]
- holidays_this_week.py [Returns the amount of US holidays for a given week]
- ipaddress_features.py [Parses IP addresses and networks and extracts its properties.]
- is_ramadan.py [Returns a flag for whether a date falls on Ramadan in Saudi Arabia]
- singapore_public_holidays.py [Flag for whether a date falls on a public holiday in Singapore.]
- usairportcode_origin_dest.py [Transformer to parse and augment US airport codes with geolocation info.]
- usairportcode_origin_dest_geo_features.py [Transformer to augment US airport codes with geolocation info.]
- uszipcode_features_database.py [Transformer to parse and augment US zipcodes with info from zipcode database.]
- uszipcode_features_light.py [Lightweight transformer to parse and augment US zipcodes with info from zipcode database.]
- DATETIME
- datetime_diff_transformer.py [Difference in time between two datetime columns]
- datetime_encoder_transformer.py [Converts datetime column into an integer (milliseconds since 1970)]
- days_until_dec2020.py [Creates new feature for any date columns, by computing the difference in days between the date value and 31st Dec 2020]
- EXECUTABLES
- pe_data_directory_features.py [Extract LIEF features from PE files]
- pe_exports_features.py [Extract LIEF features from PE files]
- pe_general_features.py [Extract LIEF features from PE files]
- pe_header_features.py [Extract LIEF features from PE files]
- pe_imports_features.py [Extract LIEF features from PE files]
- pe_normalized_byte_count.py [Extract LIEF features from PE files]
- pe_section_characteristics.py [Extract LIEF features from PE files]
- DATA
- GENERIC
- count_missing_values_transformer.py [Count of missing values per row]
- missing_flag_transformer.py [Returns 1 if a value is missing, or 0 otherwise]
- specific_column_transformer.py [Example of a transformer that operates on the entire original frame, and hence on any column(s) desired.]
- GEOSPATIAL
- geodesic.py [Calculates the distance in miles between two latitude/longitude points in space]
- myhaversine.py [Computes miles between first two _latitude and _longitude named columns in the data set]
- HIERARCHICAL
- firstNCharCVTE.py [Target-encode high cardinality categorical text by their first few characters in the string ]
- log_scale_target_encoding.py [Target-encode numbers by their logarithm]
- IMAGE
- image_ocr_transformer.py [Convert a path to an image to text using OCR based on tesseract]
- image_url_transformer.py [Convert a path to an image (JPG/JPEG/PNG) to a vector of class probabilities created by a pretrained ImageNet deeplearning model (Keras, TensorFlow).]
- NLP
- continuous_TextTransformer.py [Creates a TF-IDF based text transformation that can be continuously updated with new data and vocabulary.] ✓ MOJO Enabled
- fuzzy_text_similarity_transformers.py [Row-by-row similarity between two text columns based on FuzzyWuzzy]
- text_binary_count_transformer.py [Explainable Text transformer that uses binary counts of words using sklearn's CountVectorizer]
- text_char_tfidf_count_transformers.py [Character level TFIDF and Count followed by Truncated SVD on text columns]
- text_embedding_similarity_transformers.py [Row-by-row similarity between two text columns based on pretrained Deep Learning embedding space]
- text_lang_detect_transformer.py [Detect the language for a text value using Google's 'langdetect' package]
- text_meta_transformers.py [Extract common meta features from text]
- text_named_entities_transformer.py [Extract the counts of different named entities in the text (e.g. Person, Organization, Location)]
- text_pos_tagging_transformer.py [Extract the count of nouns, verbs, adjectives and adverbs in the text]
- text_preprocessing_transformer.py [Preprocess the text column by stemming, lemmatization and stop word removal]
- text_readability_transformers.py [ Custom Recipe to extract Readability features from the text data]
- text_sentiment_transformer.py [Extract sentiment from text using pretrained models from TextBlob]
- text_similarity_transformers.py [Row-by-row similarity between two text columns based on common N-grams, Jaccard similarity, Dice similarity and edit distance.]
- text_spelling_correction_transformers.py [Correct the spelling of text column]
- text_topic_modeling_transformer.py [Extract topics from text column using LDA]
- text_url_summary_transformer.py [Extract text from URL and summarizes it]
- vader_text_sentiment_transformer.py [Extract sentiment from text using lexicon and rule-based sentiment analysis tool called VADER]
- NUMERIC
- boxcox_transformer.py [Box-Cox Transform]
- clusterdist_all.py [Cluster Distance for all columns]
- count_negative_values_transformer.py [Count of negative values per row]
- count_positive_values_transformer.py [Count of positive values per row]
- exp_diff_transformer.py [Exponentiated difference of two numbers]
- factor_analysis_transformer.py [Factor Analysis Transformer]
- log_transformer.py [Converts numbers to their Logarithm] ✓ MOJO Enabled
- ohe.py [One-Hot Encoding for categorical columns]
- pca_transformer.py [Principal Component Analysis (PCA) Transformer]
- product.py [Products together 3 or more numeric features]
- random_transformer.py [Creates random numbers]
- round_transformer.py [Rounds numbers to 1, 2 or 3 decimals]
- square_root_transformer.py [Converts numbers to the square root, preserving the sign of the original numbers]
- sum.py [Adds together 3 or more numeric features]
- truncated_svd_all.py [Truncated SVD for all columns]
- yeojohnson_transformer.py [Yeo-Johnson Power Transformer]
- OUTLIERS
- h2o3-dl-anomaly.py [Anomaly score for each row based on reconstruction error of an H2O-3 deep learning autoencoder]
- quantile_winsorizer.py [Winsorizes (truncates) univariate outliers outside of a given quantile threshold]
- twosigma_winsorizer.py [Winsorizes (truncates) univariate outliers outside of two standard deviations from the mean.]
- RECOMMENDATIONS
- matrixfactorization.py [Collaborative filtering features using various techniques of Matrix Factorization for recommendations.Recommended for large data]
- SIGNAL_PROCESSING
- signal_processing.py [This custom transformer processes signal files to create features used by DriverlessAI to solve a regression problem]
- SPEECH
- audio_MFCC_transformer.py [Extract MFCC and spectrogram features from audio files]
- azure_speech_to_text.py [An example of integration with Azure Speech Recognition Service]
- STRING
- simple_grok_parser.py [Extract column data using grok patterns]
- strlen_transformer.py [Returns the string length of categorical values]
- to_string_transformer.py [Converts numbers to strings]
- user_agent_transformer.py [A best effort transformer to determine browser device characteristics from a user-agent string]
- SURVIVAL
- dummy-pretransformer.py [Dummy Pre-Transformer to use as a template for custom pre-transformer recipes. This transformer consumes all features at once, adds 'pre:' to the names and passes them down to transformer level and GA as-is.]
- h2o-3-coxph-pretransformer.py [Pre-transformer utilizing survival analysis modeling using CoxPH (Cox proportional hazard) using H2O-3 CoxPH function. It adds risk score produced by CoxPH model and drops stop_column feature used for survival modeling along with actual target as event.]
- TARGETENCODING
- ExpandingMean.py [CatBoost-style target encoding. See https://youtu.be/d6UMEmeXB6o?t=818 for short explanation]
- leaky_mean_target_encoder.py [Example implementation of a out-of-fold target encoder (leaky, not recommended)]
- TIMESERIES
- auto_arima_forecast.py [Auto ARIMA transformer is a time series transformer that predicts target using ARIMA models.]
- general_time_series_transformer.py [Demonstrates the API for custom time-series transformers.]
- parallel_auto_arima_forecast.py [Parallel Auto ARIMA transformer is a time series transformer that predicts target using ARIMA models.In this implementation, Time Group Models are fitted in parallel]
- parallel_prophet_forecast.py [Parallel FB Prophet transformer is a time series transformer that predicts target using FBProphet models.]
- parallel_prophet_forecast_using_individual_groups.py [Parallel FB Prophet transformer is a time series transformer that predicts target using FBProphet models.This transformer fits one model for each time group column values and is significantly fasterthan the implementation available in parallel_prophet_forecast.py.]
- serial_prophet_forecast.py [Transformer that uses FB Prophet for time series prediction.Please see the parallel implementation for more information]
- time_encoder_transformer.py [converts the Time Column to an ordered integer]
- trading_volatility.py [Calculates Historical Volatility for numeric features (makes assumptions on the data)]