[Bug]: The prediction module does not load the prediction table

vladimirsim commented 5 days ago

JASP Version

0.19.1

Commit ID

84d54b934fa27731bb9eec44a4aa5f7ab0744dfd

JASP Module

Machine Learning

What analysis are you seeing the problem on?

Machine Learning > Prediction

What OS are you seeing the problem on?

Windows 11

Bug Description

This bug report is probably related to bug report #2978, which I submitted 3 weeks ago and which is now closed. I am using JASP 0.19.2.0, a nightly build. I open a training dataset (CSV file), train a Random forest model, and save the trained model. Then I open a test dataset that has exactly the same format as the training dataset (I know this for sure because both were split from the same worksheet). I load the trained model and try to get predictions for the test dataset, but the Prediction table does not load. The error says that a predictor in the test dataset has a different format than in the training dataset. But this is simply not true.

Expected Behaviour

The Prediction table should load.

Steps to Reproduce

Open JASP and load the IndivLoansTrainingSample.csv file (cannot attach it because it is confidential)
Review the uploaded data file. JASP automatically assigns a type to each variable. It assigns Ordinal type to some variables, but it seems that Ordinal is not acceptable in the Random forest model, is it? I think this is a bug.
Optional: Change the type of some variables. For example, from Ordinal to Nominal or from Nominal to Scale. If JASP erroneously considers a Scale variable as Nominal, it will have a huge effect on the Random Forest model, will it not?
Open the Machine Learning module and train a Random forest model. (I am attaching the trained model)
Open another instance of JASP and load the IndivLoansTestSample.csv file (cannot attach it because it is confidential)
Review the uploaded data. Make sure that all variables in the Test dataset are exactly the same format as in the Training dataset. Change data types if needed.
Load the Machine Learning>Prediction>Prediction module
Load the trained model
Build the prediction table by picking the right predictors from the trained model. When you add all of the required predictors from the trained model (not all variables in the dataset were used to train the model), you will get a message that a variable is in a different format: stop('Type of predictors in new data do not match that of the training data.')
The bug is still there even if I do not alter the type of any variable in the training and test sets. Even if I pick only a few of the variables whose type JASP recognized correctly, I still get the same error, which is absurd because the predictors' types were assigned automatically by JASP and were the same in both sets. I checked it many times!
I have a suggestion: in the Machine Learning>Prediction>Prediction module, when I load the trained model, the prediction table tells me which predictors it expects me to load. Why don't you add the type of variable which JASP expects for each predictor? Let's say Length (Scale), TypeOfBondage (Nominal), etc. This will save a lot of nerves!
Also, if Ordinal variables are not acceptable in the Random Forest algorithm, why don't you prevent their usage? ...

Log (if any)

-------- Application Info --------
JASP Version: JASP 0.19.2
Build Branch: HEAD
Build Date: Nov 26 2024 18:09:03 (Netherlands)
Last Commit: 84d54b934fa27731bb9eec44a4aa5f7ab0744dfd

-------- Basic Info --------
Operating System: Windows 11 Version 23H2
Product Version: 11
Kernel Type: winnt
Kernel Version: 10.0.22631
Architecture: x86_64
Install Path: D:/Program Files/JASP
Platfotm Name: windows
System Local: bg_BG

-------- Extra Info --------
Current code page
Active code page: 437
Active code page: 65001

Host Name: SHOSHOCI
OS Name: Microsoft Windows 11 Pro
OS Version: 10.0.22631 N/A Build 22631
OS Manufacturer: Microsoft Corporation
OS Configuration: Standalone Workstation
OS Build Type: Multiprocessor Free
Registered Owner: 359898893538
Registered Organization:
Product ID: 00330-52813-47920-AAOEM
Original Install Date: 31.1.2023 г., 12:22:04
System Boot Time: 27.11.2024 г., 9:44:36
System Manufacturer: LENOVO
System Model: 82LM
System Type: x64-based PC
Processor(s): 1 Processor(s) Installed.
01: AMD64 Family 23 Model 104 Stepping 1 AuthenticAMD ~2100 Mhz
BIOS Version: LENOVO G5CN64WW(V2.10), 6.10.2022 г.
Windows Directory: C:\Windows
System Directory: C:\Windows\system32
Boot Device: \Device\HarddiskVolume1
System Locale: en-us;English (United States)
Input Locale: en-us;English (United States)
Time Zone: (UTC+02:00) Helsinki, Kyiv, Riga, Sofia, Tallinn, Vilnius
Total Physical Memory: 15 706 MB
Available Physical Memory: 8 855 MB
Virtual Memory: Max Size: 16 730 MB
Virtual Memory: Available: 7 726 MB
Virtual Memory: In Use: 9 004 MB
Page File Location(s): C:\pagefile.sys
Domain: WORKGROUP
Logon Server: \SHOSHOCI
Hotfix(s): 5 Hotfix(s) Installed.

                       [02]: KB5012170
                       [03]: KB5027397
                       [04]: KB5046633
                       [05]: KB5044620

Network Card(s): 2 NIC(s) Installed.
01: Realtek 8822CE Wireless LAN 802.11ac PCI-E NIC
Connection Name: Wi-Fi
Status: Media disconnected
[02]: Realtek USB GbE Family Controller
Connection Name: Ethernet
DHCP Enabled: Yes
DHCP Server: 192.168.1.1
IP address(es)

                             [02]: fe80::f73:9fc3:5374:b2df
                             [03]: fda9:de81:d862:0:bdaa:acda:e64e:528a
                             [04]: fda9:de81:d862:0:d3ed:28cb:8c0e:2133

Hyper-V Requirements: A hypervisor has been detected. Features required for Hyper-V will not be displayed.

JASP 2024-11-27 14_21_05 Desktop.log JASP 2024-11-27 14_21_05 Engine 1.log

More Debug Information

This is the error message which I get when the Prediction table fails to load:

This analysis terminated unexpectedly.

Error in randomForest:::predict.randomForest(model, newdata = dataset): Type of predictors in new data do not match that of the training data.

Stack trace

analysis(jaspResults = jaspResults, dataset = dataset, options = options)

.mlPredictionsTable(model, dataset, options, jaspResults, ready, position = 2)

.mlPredictionsState(model, dataset, options, jaspResults, ready)

createJaspState(.mlPredictionGetPredictions(model, dataset))

jaspStateR$new(object = object, dependencies = dependencies)

initialize(...)

.mlPredictionGetPredictions(model, dataset)

.mlPredictionGetPredictions.randomForest(model, dataset)

randomForest:::predict.randomForest(model, newdata = dataset)

stop('Type of predictors in new data do not match that of the training data.')

To receive assistance with this problem, please report the message above at: https://jasp-stats.org/bug-reports

Final Checklist

[X] I have included a screenshot showcasing the issue, if possible.
[X] I have included a JASP file (zipped) or data file that causes the crash/bug, if applicable.
[X] I have accurately described the bug, and steps to reproduce it.

vladimirsim commented 5 days ago

And here is the trained model. Sorry, I forgot to attach it to my original post! 27112024RFIndivLoansFincaJor.zip

koenderks commented 5 days ago

I'm afraid that without the dataset we cannot do a deep dive into this problem. Can you change the values in the data so that they are unrecognisable and then attach it?

I do see that the error comes from randomForest:::predict.randomForest, particularly from the line

if (!all(object$forest$ncat == cat.new)) 
      stop("Type of predictors in new data do not match that of the training data.")

which checks whether the number of categories of each nominal predictor variable in the training set (object$forest$ncat) equals the number of categories of the same predictor in the prediction data (cat.new).
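To illustrate what that comparison does, here is a minimal base-R sketch. It is not the randomForest source: the helper `count_categories` and the toy data are made up, and it assumes the count is the number of levels for unordered factors and 1 for every other column.

```r
# Hypothetical helper (not part of randomForest): count categories per column,
# assuming unordered factors contribute their number of levels and all other
# columns contribute 1.
count_categories <- function(df) {
  sapply(df, function(x) if (is.factor(x) && !is.ordered(x)) length(levels(x)) else 1)
}

# Toy data (made up): the training set has three categories, the new data two.
train <- data.frame(city = factor(c("A", "B", "C")), amount = c(10, 20, 30))
new   <- data.frame(city = factor(c("B", "C")),      amount = c(15, 25))

ncat_train <- count_categories(train)  # plays the role of object$forest$ncat
cat_new    <- count_categories(new)    # plays the role of cat.new

# FALSE here corresponds to the condition under which the quoted check stops.
levels_match <- all(ncat_train == cat_new)
```

In this toy case the counts differ (3 versus 2 for `city`) even though every category in the new data was seen during training.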

Do you have categories of the nominal variables in the training data that do not occur in the prediction data, or vice versa?

vladimirsim commented 5 days ago

Do you have categories of the nominal variables in the training data that do not occur in the prediction data, or vice versa?

Yes, I do. The training set is 'broader' than the prediction set. I have checked it variable by variable and made sure that all categories in the prediction set also appear in the training set. For example, for the nominal variable 'city', the training set might contain loans from cities A, B and C, while the prediction set might contain loans only from cities B and C. But I don't expect this to be a problem. I suppose a problem would arise if it were the other way around, that is, if I wanted a prediction for a category on which the model was not trained, right?

koenderks commented 5 days ago

I wouldn’t expect this to be a problem either, but it is in the randomForest code ;) Could you verify whether this is the problem by adding some rows with those missing levels of the nominal variables to the prediction data set?

vladimirsim commented 5 days ago

You were right, @koenderks ! In the prediction set, I removed 2 nominal variables which had fewer categories than in the training set, and the prediction table loaded. The prediction module worked well and I was able to generate and export predictions to a csv file. But I still believe that this is a bug. I came upon another failure while working with the prediction set: when I checked the Explain predictions checkbox, I got the following error: no applicable method for 'predict' applied to an object of class 'randomForest'. Please see the screenshot.

I do not understand what this error is due to. I am attaching the log files, too.

JASP Log files vladimirsim.zip

koenderks commented 5 days ago

I think we resolved this last bug that you report in https://github.com/jasp-stats/jaspMachineLearning/pull/393.

But I agree, the check in randomForest for equal levels seems a bit overkill to me. I guess we could fix it by manually assigning the factors in the prediction data the same levels as those in the training data, even though they do not exist in the prediction data. @vandenman Do you see any problems with this?
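A rough sketch of that idea in base R, for illustration only. The helper `align_factor_levels`, the `training_levels` list and the toy data are hypothetical, and this is not necessarily how the fix is implemented in jaspMachineLearning.

```r
# Hypothetical helper: re-declare each factor in `new_data` with the levels
# seen during training, so levels that are unused in the new data are kept
# and the level counts match those of the training data.
align_factor_levels <- function(new_data, training_levels) {
  for (name in names(training_levels)) {
    if (name %in% names(new_data)) {
      new_data[[name]] <- factor(new_data[[name]], levels = training_levels[[name]])
    }
  }
  new_data
}

# Toy example (made up): training saw cities A, B and C; new data has B and C.
training_levels <- list(city = c("A", "B", "C"))
new <- data.frame(city = factor(c("B", "C")), amount = c(15, 25))

aligned <- align_factor_levels(new, training_levels)
n_levels_after <- nlevels(aligned$city)  # level "A" is present but unused
```

One caveat of this approach is that any value in the new data that is absent from the training levels would become NA, so that case would still need its own handling.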

koenderks commented 4 days ago

Here is a reproducible example:

Random forest regression model trained on the iris dataset to predict Sepal.Length based on Sepal.Width, Petal.Length, Petal.Width and Species. model.jaspML.zip

Prediction dataset where Species has only 2 factor levels: prediction_data.csv

koenderks commented 4 days ago

This pull request should fix the issue! I confirmed that the explain predictions table also works in the latest version of JASP (0.19.2, coming out soon).
