scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

Random Forest predict() does not produce reproducible results. random_state=42 #28920

Open aedavids opened 2 months ago

aedavids commented 2 months ago

Describe the bug

If I load my pre-trained model and a set of samples and call predict() multiple times, I get different predicted classes. Here are some sample results. I am using a Jupyter notebook. I have tried restarting the kernel multiple times and also just re-running the cell multiple times.

auc: {0: 0.476, 1: 0.524} pred: [0 0 0 1 1 1 1 1 1 1 0 0 1 0 0 0 0 0 1]
auc: {0: 0.613, 1: 0.387} pred: [1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1]
auc: {0: 0.762, 1: 0.238} pred: [1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0]
auc: {0: 0.589, 1: 0.411} pred: [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

I trained a random forest with the following parameters:

RandomForestClassifier(max_depth=7, max_features=1, max_samples=0.9,
                       n_estimators=50, random_state=42)

The model was saved using joblib. I load the model as follows:

model = joblib.load(modelPath)

I make predictions as follows:

predictions  = model.predict(XNP)

yProbability = model.predict_proba(XNP)

yNP:
[0 0 0 0 1 0 0 0 0 1 1 1 0 0 0 0 1 1 1]

XNP = np.array([[ 16,   9,   0,   0,   5,   0, 104,   1,   1,   1],
           [ 19,   4,   0,   0,   4,   0,  96,   0,   2,   0],
           [ 14,   7,   0,   0,   5,   0,  72,   0,   2,   0],
           [ 29,   5,   0,   0,  11,   0, 108,   0,   1,   0],
           [ 16,   9,   0,   0,   6,   0,  80,   0,   1,   1],
           [ 49,  13,   0,   0,  20,   0, 198,   0,   5,   2],
           [ 45,   7,   0,   0,   7,   0, 163,   0,   1,   1],
           [ 47,  13,   0,   1,  10,   0, 229,   0,   4,   1],
           [ 17,  21,   0,   0,   2,   0,  61,   0,   5,   0],
           [ 56,  15,   0,   0,  12,   0, 362,   0,   4,   1],
           [ 14,   7,   0,   0,   8,   0, 113,   0,   1,   0],
           [  5,   3,   0,   0,   1,   0,  49,   0,   0,   0],
           [ 23,   7,   0,   0,   8,   0,  92,   0,   2,   0],
           [ 15,  12,   0,   0,   3,   0, 119,   0,   0,   1],
           [ 18,   4,   0,   0,   1,   0, 133,   0,   0,   0],
           [ 13,   3,   0,   0,   4,   0, 126,   0,   0,   0],
           [ 20,   3,   0,   0,   5,   0, 161,   0,   0,   0],
           [ 15,   6,   0,   0,   4,   0, 163,   0,   0,   0],
           [ 23,   4,   0,   0,   8,   0, 127,   0,   0,   2]])

I have tried calling random.seed().

Any suggestions would be greatly appreciated.

P.S. When I trained, I saved the label encoder, and I load it as follows. (This was to ensure the class numbers match the class names.)

def encoder2Dict(encoder: LabelEncoder) -> dict:
    '''
    key is class
    value is int
    '''
    values = encoder.transform(encoder.classes_)
    retDict = dict(zip(encoder.classes_, values))
    return retDict

def loadEncoder(path: str) -> LabelEncoder:
    '''
    arguments:
        path: file containing labelEncoder values saved as a dictionary
    '''
    encoder = LabelEncoder()
    encoderDict = loadDictionary(path)

    # Manually assign the sorted list of class labels to the classes_ attribute
    # The keys of the dictionary are sorted according to their corresponding values
    # dictionary.get(key) returns the value for that key
    encoder.classes_ = np.array(sorted(encoderDict, key=encoderDict.get))

    return encoder
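The round trip through a dictionary can be checked in isolation. The following is a minimal sketch, not the code used in the notebook: it keeps the mapping in memory instead of calling loadDictionary, and the function names and the "case"/"control" labels are hypothetical stand-ins.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder


def encoder_to_dict(encoder: LabelEncoder) -> dict:
    """Map each class label to its integer code."""
    codes = encoder.transform(encoder.classes_)
    return {str(label): int(code) for label, code in zip(encoder.classes_, codes)}


def encoder_from_dict(mapping: dict) -> LabelEncoder:
    """Rebuild an encoder whose classes_ are ordered by their saved integer code."""
    encoder = LabelEncoder()
    encoder.classes_ = np.array(sorted(mapping, key=mapping.get))
    return encoder


# Hypothetical labels, used only for illustration.
original = LabelEncoder().fit(["control", "case", "control", "case"])
mapping = encoder_to_dict(original)
restored = encoder_from_dict(mapping)

labels = ["case", "control", "case"]
round_trip_ok = np.array_equal(original.transform(labels), restored.transform(labels))
print(mapping, round_trip_ok)
```

If round_trip_ok is True, the restored encoder assigns the same integers as the original, so the encoder is unlikely to be the source of the varying predictions.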

I can make my trained model available.

Steps/Code to Reproduce

predictions = model.predict(XNP)

yProbability = model.predict_proba(XNP)

Expected Results

predict(X) == predict(X)

Actual Results

auc: {0: 0.476, 1: 0.524} pred: [0 0 0 1 1 1 1 1 1 1 0 0 1 0 0 0 0 0 1]
auc: {0: 0.613, 1: 0.387} pred: [1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1]
auc: {0: 0.762, 1: 0.238} pred: [1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0]
auc: {0: 0.589, 1: 0.411} pred: [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

Versions

System:
    python: 3.11.4 | packaged by conda-forge | (main, Jun 10 2023, 18:08:17) [GCC 12.2.0]
executable: /private/home/aedavids/miniconda3/envs/POC/bin/python
   machine: Linux-5.15.0-89-generic-x86_64-with-glibc2.35

Python dependencies:
      sklearn: 1.4.0
          pip: 23.3.1
   setuptools: 68.2.2
        numpy: 1.26.3
        scipy: 1.11.4
       Cython: None
       pandas: 2.2.0
   matplotlib: 3.7.1
       joblib: 1.4.0
threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /private/home/aedavids/miniconda3/envs/extraCellularRNA/lib/libopenblasp-r0.3.27.so
        version: 0.3.27
threading_layer: pthreads
   architecture: Haswell
    num_threads: 128

       user_api: openmp
   internal_api: openmp
         prefix: libgomp
       filepath: /private/home/aedavids/miniconda3/envs/extraCellularRNA/lib/libgomp.so.1.0.0
        version: None
    num_threads: 160
$ conda list scikit-learn
# packages in environment at /private/home/aedavids/miniconda3/envs/extraCellularRNA:
#
# Name                    Version                   Build  Channel
scikit-learn              1.4.0           py311hc009520_0    conda-forge

$ python --version
Python 3.11.4

glemaitre commented 2 months ago

We will need the data to understand the cause, but I suspect that the issue is linked to random tie breaking.
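As a rough way to probe whether ties matter at prediction time, the sketch below looks at how close the two class probabilities are for each sample. It assumes synthetic random data, not the reporter's model, and the 1e-12 threshold is an arbitrary illustrative choice. For a single-output RandomForestClassifier, predict() returns the class with the highest averaged predict_proba() value, so a sample whose probabilities are exactly equal is the only place a tie could affect the predicted class.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: not the data from this report).
rng = np.random.default_rng(0)
X = rng.integers(0, 100, size=(200, 10))
y = rng.integers(0, 2, size=200)

model = RandomForestClassifier(max_depth=7, max_features=1, max_samples=0.9,
                               n_estimators=50, random_state=42).fit(X, y)

proba = model.predict_proba(X)
pred = model.predict(X)

# predict() picks the class with the largest averaged probability.
argmax_pred = model.classes_[np.argmax(proba, axis=1)]
matches_argmax = np.array_equal(pred, argmax_pred)

# Samples whose two class probabilities are (nearly) equal.
margin = np.abs(proba[:, 0] - proba[:, 1])
near_ties = np.flatnonzero(margin < 1e-12)
print(matches_argmax, len(near_ties))
```

Running the same check on the saved model and XNP would show whether any of the 19 samples sits on such a tie.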

betatim commented 2 months ago

Please also provide a short code snippet that we can copy and paste to reproduce the problem. From reading your original comment, it sounds like you are using more than just a RandomForestClassifier. Having a full snippet from start to finish makes sure we are all debugging the same thing.
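For illustration, a self-contained snippet of that kind might look like the sketch below. It assumes synthetic random data and a temporary file in place of the real model and samples, so it only shows the shape of a reproducer and does not reproduce the reported behaviour.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: not the data from this report).
rng = np.random.default_rng(0)
X = rng.integers(0, 100, size=(200, 10))
y = rng.integers(0, 2, size=200)

# Same hyperparameters as in the report.
model = RandomForestClassifier(max_depth=7, max_features=1, max_samples=0.9,
                               n_estimators=50, random_state=42)
model.fit(X, y)

# Save and reload with joblib, as described in the report.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.joblib")
    joblib.dump(model, path)
    loaded = joblib.load(path)

first = loaded.predict(X)
second = loaded.predict(X)

same_across_calls = np.array_equal(first, second)
same_as_before_save = np.array_equal(model.predict(X), first)
print(same_across_calls, same_as_before_save)
```

Replacing the synthetic X with XNP and the temporary file with the saved model path would turn this into a reproducer for the actual report.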

aedavids commented 1 month ago

Hi All

I am in the process of creating test code I can post. I have narrowed it down a bit. The problem happens in my Jupyter notebook. If I run the predict cell multiple times, I get the same results. If I restart the notebook, I get different results from the first run.

I wrote a small Python script. I cannot reproduce the error when I run it from the terminal.

I am going to try to figure out how to isolate the problem in my notebook. I will post the test notebook.
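One way to narrow this down, sketched here under the assumption that the input array itself might differ between kernel restarts, is to print a fingerprint of the array right before calling predict() and compare it across restarts. The helper name array_fingerprint and the small arrays are hypothetical.

```python
import hashlib

import numpy as np


def array_fingerprint(a: np.ndarray) -> str:
    """Hash dtype, shape, and raw bytes so any change in the input shows up."""
    a = np.ascontiguousarray(a)
    h = hashlib.sha256()
    h.update(str(a.dtype).encode())
    h.update(str(a.shape).encode())
    h.update(a.tobytes())
    return h.hexdigest()


# Hypothetical small arrays, used only for illustration.
X_a = np.array([[16, 9, 0], [19, 4, 0]])
X_b = X_a.copy()        # identical content
X_c = X_a[:, ::-1]      # same values, columns in reversed order

print(array_fingerprint(X_a) == array_fingerprint(X_b))
print(array_fingerprint(X_a) == array_fingerprint(X_c))
```

If the fingerprint of XNP changes after a restart, the difference comes from the code that builds the input, not from the model. If it stays the same, the model file and the prediction step are the next things to compare.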

Hopefully I can upload a zip file with the test code and my trained model

Kind regards

Andy