Test failures on M1 Mac

siboehm commented 2 years ago

When running the (non-benchmark) test suite, 11 out of 90 tests fail on an ARM M1 MBP, while the x86 CI keeps passing. This seems to be related to floating-point NaN handling on ARM; I haven't looked closely yet.

tests/test_categoricals.py::test_predict_pandas_categorical                   
tests/test_categoricals.py::test_pure_categorical_prediction 
tests/test_nans.py::test_zero_as_missing_categorical[1-True]                          
tests/test_nans.py::test_zero_as_missing_categorical[3-True]                          
tests/test_nans.py::test_zero_as_missing_categorical[5-True]                          
tests/test_nans.py::test_zero_as_missing_categorical[7-True]                          
tests/test_nans.py::test_zero_as_missing_categorical[9-True]                          
tests/test_nans.py::test_zero_as_missing_categorical[11-True]                          
tests/test_nans.py::test_lightgbm_nan_pred_inconsistency                          
tests/test_nans.py::test_nan_prediction_categorical

siboehm commented 2 years ago

This is exactly the same issue as https://github.com/dmlc/treelite/issues/277

Minimally reproducible example:

```
tree
version=v3
num_class=1
num_tree_per_iteration=1
label_index=0
max_feature_idx=0
objective=regression
feature_names=Column_0
feature_infos=-1:0:1:2
tree_sizes=322

Tree=0
num_leaves=2
num_cat=1
split_feature=0
split_gain=500
threshold=0
decision_type=1
left_child=-1
right_child=-2
leaf_value=6.9999999920527145 6.5000000039736436
leaf_weight=30 60
leaf_count=30 60
internal_value=6.66667
internal_weight=0
internal_count=90
cat_boundaries=0 1
cat_threshold=1
is_linear=0
shrinkage=1


end of trees

feature_importances:
Column_0=1

parameters:
[boosting: gbdt]
[objective: regression]
[metric: l2]
[tree_learner: serial]
[device_type: cpu]
[data: ]
[valid: ]
[num_iterations: 1]
[learning_rate: 0.1]
[num_leaves: 31]
[num_threads: 0]
[deterministic: 0]
[force_col_wise: 0]
[force_row_wise: 0]
[histogram_pool_size: -1]
[max_depth: -1]
[min_data_in_leaf: 20]
[min_sum_hessian_in_leaf: 0.001]
[bagging_fraction: 1]
[pos_bagging_fraction: 1]
[neg_bagging_fraction: 1]
[bagging_freq: 0]
[bagging_seed: 3]
[feature_fraction: 1]
[feature_fraction_bynode: 1]
[feature_fraction_seed: 2]
[extra_trees: 0]
[extra_seed: 6]
[early_stopping_round: 0]
[first_metric_only: 0]
[max_delta_step: 0]
[lambda_l1: 0]
[lambda_l2: 0]
[linear_lambda: 0]
[min_gain_to_split: 0]
[drop_rate: 0.1]
[max_drop: 50]
[skip_drop: 0.5]
[xgboost_dart_mode: 0]
[uniform_drop: 0]
[drop_seed: 4]
[top_rate: 0.2]
[other_rate: 0.1]
[min_data_per_group: 100]
[max_cat_threshold: 32]
[cat_l2: 10]
[cat_smooth: 10]
[max_cat_to_onehot: 4]
[top_k: 20]
[monotone_constraints: ]
[monotone_constraints_method: basic]
[monotone_penalty: 0]
[feature_contri: ]
[forcedsplits_filename: ]
[refit_decay_rate: 0.9]
[cegb_tradeoff: 1]
[cegb_penalty_split: 0]
[cegb_penalty_feature_lazy: ]
[cegb_penalty_feature_coupled: ]
[path_smooth: 0]
[interaction_constraints: ]
[verbosity: 1]
[saved_feature_importance_type: 0]
[linear_tree: 0]
[max_bin: 255]
[max_bin_by_feature: ]
[min_data_in_bin: 3]
[bin_construct_sample_cnt: 200000]
[data_random_seed: 1]
[is_enable_sparse: 1]
[enable_bundle: 1]
[use_missing: 1]
[zero_as_missing: 0]
[feature_pre_filter: 1]
[pre_partition: 0]
[two_round: 0]
[header: 0]
[label_column: ]
[weight_column: ]
[group_column: ]
[ignore_column: ]
[categorical_feature: 0]
[forcedbins_filename: ]
[precise_float_parser: 0]
[objective_seed: 5]
[num_class: 1]
[is_unbalance: 0]
[scale_pos_weight: 1]
[sigmoid: 1]
[boost_from_average: 1]
[reg_sqrt: 0]
[alpha: 0.9]
[fair_c: 1]
[poisson_max_delta_step: 0.7]
[tweedie_variance_power: 1.5]
[lambdarank_truncation_level: 30]
[lambdarank_norm: 1]
[label_gain: ]
[eval_at: ]
[multi_error_top_k: 1]
[auc_mu_weights: ]
[num_machines: 1]
[local_listen_port: 12400]
[time_out: 120]
[machine_list_filename: ]
[machines: ]
[gpu_platform_id: -1]
[gpu_device_id: -1]
[gpu_use_dp: 0]
[num_gpu: 1]

end of parameters

pandas_categorical:null
```


```python
import os
import lleaves
import lightgbm as lgb
import numpy.testing as npt

# Dump the generated LLVM IR and assembly for inspection.
os.environ["LLEAVES_PRINT_UNOPTIMIZED_IR"] = "1"
os.environ["LLEAVES_PRINT_ASM"] = "1"

# The model text from this comment, saved to a file.
model_path = "faulty_model.txt"
llvm_model = lleaves.Model(model_file=model_path)
llvm_model.compile()
lgbm_model = lgb.Booster(model_file=model_path)

# A single row whose only (categorical) feature is NaN.
data = [[float("NaN")]]
npt.assert_equal(llvm_model.predict(data), lgbm_model.predict(data))
```

For the NaN input, lleaves returns 7.0 while LightGBM returns 6.5, so the assertion fails.

siboehm commented 2 years ago

This happens because we cast the floating-point inputs to int using LLVM's fptosi. For NaN this yields a poison value, which just happened to work out correctly on the x86 backend. Instead, the cast to int should be moved into the decision node, with a NaN check performed before the cast.
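
For illustration, here is a minimal Python sketch of the intended decision-node logic (NaN check first, cast second) for the single categorical split in the reproducer. It is not the lleaves implementation: the names `find_in_bitset` and `categorical_decision` are made up for this example, the leaf values and `cat_threshold` are copied from the model file, and routing NaN and negative categories to the right child is my understanding of LightGBM's categorical split, so treat that as an assumption. Infinities are not handled.

```python
import math

# Values copied from the reproducer model (single tree, two leaves).
LEFT_LEAF = 6.9999999920527145   # leaf_value[0], reached via left_child=-1
RIGHT_LEAF = 6.5000000039736436  # leaf_value[1], reached via right_child=-2
CAT_THRESHOLD = [1]              # cat_threshold=1: only bit 0 (category 0) is set


def find_in_bitset(bitset, pos):
    """Return True if bit `pos` is set in a list of 32-bit words."""
    word = pos // 32
    if word >= len(bitset):
        return False
    return (bitset[word] >> (pos % 32)) & 1 == 1


def categorical_decision(fval):
    """Hypothetical decision node: check for NaN *before* casting to int."""
    if math.isnan(fval):
        # Assumption: missing values go to the right child.
        return RIGHT_LEAF
    int_fval = int(fval)  # safe now, fval is known not to be NaN
    if int_fval < 0:
        # Assumption: negative categories also go right.
        return RIGHT_LEAF
    return LEFT_LEAF if find_in_bitset(CAT_THRESHOLD, int_fval) else RIGHT_LEAF


print(categorical_decision(float("nan")))  # right leaf, matching LightGBM's 6.5
print(categorical_decision(0.0))           # left leaf (category 0 is in the bitset)
```

With the check placed before the cast, the result no longer depends on what the backend's float-to-int conversion returns for NaN. A conversion that yields 0 for NaN would select category 0 and hence the left leaf, which matches the 7.0 seen on the M1.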
