microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

difference in result between machines (tried nthread) #1882

Closed germayneng closed 5 years ago

germayneng commented 5 years ago

Good day,

I realized that on different machines, I am not able to reproduce the exact same results from LightGBM. On the same machine, running the script multiple times gives me the same results, so I do not think the issue is in the script itself.

I set up an EC2 instance on AWS, created a virtual env, and installed from requirements.txt (so the modules match my local virtual environment).

There is still some sort of randomness in the predictions. Running on AWS multiple times gives the same results, except that they are simply different from the local ones.

In sum: I created 3 environments: local base, local venv, and AWS venv. Running the script gives the same results within an environment, but the results differ across environments. The seed is already set and nthread = 1.

Some params:

```python
params = {
    'nthread': 1,
    'learning_rate': 0.01,
    'num_leaves': 80,
    'max_depth': 7,
    'boosting': 'gbdt',
    'objective': 'regression',
    'metric': 'rmse',
    'colsample_bytree': 0.5,
    'feature_fraction_seed': 1,
    'subsample': 0.7,
    'subsample_freq': 3,
    'min_child_weight': 0,
    'verbose': 0,
    'min_gain_to_split': 3,  # gamma
    'boost_from_average': False,
    'seed': 12,
}
```

and my requirements.txt to ensure lightgbm is the same version:

```
pandas==0.23.0
numpy==1.14.3
lightgbm==2.1.2
pmdarima==1.0.0
patsy==0.5.0
statsmodels==0.9.0
scipy==1.1.0
tqdm==4.23.4
scikit-learn==0.19.1
```

It would be good to hear what may have caused this. Thanks!
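To rule out any accidentally unset seed, it can help to pin every randomness-related parameter in one place. The sketch below is a hypothetical helper (`deterministic_params` is not part of LightGBM); `bagging_seed` and `data_random_seed` are real LightGBM parameter names per its docs, though whether pinning them suffices on 2.1.2 is an assumption.

```python
def deterministic_params(base):
    """Return a copy of `base` with every seed-related key pinned, so
    runs in different environments start from the same RNG state.
    Hypothetical helper -- not a LightGBM API."""
    fixed = dict(base)
    fixed.update({
        'nthread': 1,              # single thread avoids nondeterministic summation order
        'seed': 12,                # master seed
        'feature_fraction_seed': 1,
        'bagging_seed': 3,         # assumption: pins row subsampling (subsample/subsample_freq)
        'data_random_seed': 1,     # assumption: pins data partitioning
    })
    return fixed

params = deterministic_params({'learning_rate': 0.01, 'objective': 'regression'})
print(params)
```

Even with all seeds pinned, identical results are only guaranteed on identical binaries and environments, as discussed below.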

Laurae2 commented 5 years ago

Do you have exactly the same environment? Such as identical gcc, identical OS version, identical kernel.

germayneng commented 5 years ago

hi laurae,

For the case of AWS vs. local, I cannot guarantee that I have the same OS, etc., but for base vs. venv with the same LightGBM version, will that guarantee the same results?

edit: Are there more modules I should include in requirements.txt to ensure similar environments?

Laurae2 commented 5 years ago

No, you need at least an identical OS and an identical compiler (and probably an identical kernel now due to cache trashing from security mitigations) to guarantee identical random number generation.
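Before wiping anything, it may be worth dumping a quick fingerprint of each environment and diffing them side by side. This is a minimal stdlib-only sketch; note that `platform.python_compiler()` reports the compiler that built CPython, which is only a proxy for (not the same as) the compiler that built the LightGBM native library.

```python
import platform

def env_fingerprint():
    """Collect the environment facts mentioned above (OS, kernel, compiler)
    so two machines can be compared before blaming the model."""
    return {
        'os': platform.platform(),               # distro / OS version string
        'kernel': platform.release(),            # kernel release
        'python': platform.python_version(),
        'compiler': platform.python_compiler(),  # compiler that built CPython (proxy only)
        'machine': platform.machine(),           # architecture, e.g. x86_64
    }

for key, value in sorted(env_fingerprint().items()):
    print(f'{key}: {value}')
```

Running this in (base), (local venv), and (aws venv) and diffing the output would show immediately which of the factors above actually differ.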

germayneng commented 5 years ago

Thank you for the prompt reply.

So basically the random number generation is the factor that differs across environments? Are there any methods to ensure the compiler and kernel are the same?

edit: By compiler, do you mean the Python version? If so, I have the same Python version (3.6.5) in the venvs.

edit2:

So basically I have:

- (base)
- (local1)
- (local2)
- (aws1)

all set up using the requirements.txt above, with the same Python. Only local1 and local2 give the same results; base and aws1 differ.

Laurae2 commented 5 years ago

If getting exactly identical results is critical for your task, install the correct OS and the corresponding kernel with identical OS packages from the same repositories to have identical environments (this means fully wiping and reinstalling from scratch either your local machine or the remote machine). If using Ubuntu, there is no guarantee a setup script will produce identical OS environments, due to only partial control over updates pulled from apt repositories. For instance, 2 identical RHEL / SUSE installations (Enterprise distributions, stable) give the same results, while 2 identical Ubuntu installations might not.

guolinke commented 5 years ago

You can try the latest version as well. I remember we fixed a bug for consistency (std::sort -> std::stable_sort). BTW, what is the result when sampling (row and column) is removed?
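The std::sort vs. std::stable_sort distinction matters because with duplicate keys an unstable sort may order ties differently on different platforms or compilers, while a stable sort preserves the original relative order. A small NumPy illustration of the same idea (the `kind` argument mirrors the stable/unstable choice; this is an analogy, not LightGBM's actual code):

```python
import numpy as np

# Argsort with duplicate keys: only a stable sort guarantees tie order.
keys = np.array([3.0, 1.0, 3.0, 1.0, 2.0])

stable = np.argsort(keys, kind='stable')      # ties keep input order: deterministic
unstable = np.argsort(keys, kind='quicksort') # tie order is implementation-defined

print(stable)  # ties at indices 1/3 and 0/2 stay in input order
# `unstable` yields the same sorted keys, but the tie order may vary
# across NumPy versions or platforms -- the same class of bug that
# std::stable_sort fixed in LightGBM.
```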

germayneng commented 5 years ago

@guolinke I believe 2.1.2 is the latest? Also, do you mean removing

```python
'colsample_bytree': 0.5,
'subsample': 0.7,
'subsample_freq': 3,
```

Edit: Yes, I removed the settings above but am still not able to replicate across environments/machines. Tested (base) vs. (local venv1); it will probably differ on AWS as well.

guolinke commented 5 years ago

The latest is 2.2.2

germayneng commented 5 years ago

Thanks @guolinke and @Laurae2

Managed to isolate the issue. It is not with LightGBM: with the params as above, the results can be reproduced, even across lightgbm 2.2.2 vs. 2.1.2.

The issue was another module, pmdarima, which caused a slight variation in floating-point numbers that was hidden inside the dataset (in the preprocessing stage), so LightGBM was training on a slightly different dataset (off by a few floating-point digits).
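This kind of upstream float drift can be caught before training by hashing the preprocessed arrays on each machine and comparing digests. A minimal sketch (`fingerprint` is a hypothetical helper, not from any of the libraries in this thread):

```python
import hashlib
import numpy as np

def fingerprint(arr, decimals=None):
    """Hash an array's raw bytes; run this on each machine right before
    training and compare the hex digests. If they differ, the discrepancy
    is upstream of LightGBM (here it was pmdarima's preprocessing output)."""
    a = np.ascontiguousarray(arr, dtype=np.float64)
    if decimals is not None:
        # Optionally round first, to tolerate tiny float noise
        a = np.round(a, decimals)
    return hashlib.sha256(a.tobytes()).hexdigest()

X = np.array([[1.0, 2.0], [3.0, 4.000000001]])
print(fingerprint(X))     # exact bytes: sensitive to any float difference
print(fingerprint(X, 6))  # rounded to 6 decimals: ignores sub-1e-6 drift
```

Comparing the exact digest across (base), (local venv), and (aws venv) would have pointed at the preprocessing step directly, without retraining at all.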