haghish / mlim

mlim: single and multiple imputation with automated machine learning

java.lang.ArrayIndexOutOfBoundsException: Index 57 out of bounds for length 57 #1

Closed kstawiski closed 2 years ago

kstawiski commented 2 years ago

Hi, thank you for your constant work on this package. It is awesome. I have had no problems using it so far, and it gives very good results. However, I have a problem with my latest dataset and cannot figure out what is causing it. This is the error:

> dane_elnet <- mlim(as.data.frame(x_miss), m=1, seed = 2022, tuning_time = 900, algos = c("ELNET"), report = "imputation_ELNET.md")

Random Forest preimputation in progress...

data 1, iteration 1 (RAM = 138.897 GiB):

  |                                                                                                                      |   0%
21:40:03.915: GLM_1_AutoML_1_20220908_214002 [GLM def_1] failed: java.lang.ArrayIndexOutOfBoundsException: Index 57 out of bounds for length 57
21:40:03.935: Empty leaderboard.
AutoML was not able to build any model within a max runtime constraint of 900 seconds, you may want to increase this value before retrying.
21:50:18.391: New models will be added to existing leaderboard mlim@@TNM (leaderboard frame=null) with already 0 models.
21:50:18.769: GLM_2_AutoML_2_20220908_215018 [GLM def_1] failed: java.lang.ArrayIndexOutOfBoundsException: Index 57 out of bounds for length 57
21:50:18.785: Empty leaderboard.
AutoML was not able to build any model within a max runtime constraint of 900 seconds, you may want to increase this value before retrying.
21:20:07.760: New models will be added to existing leaderboard mlim@@TNM (leaderboard frame=null) with already 0 models.
21:20:07.990: GLM_3_AutoML_3_20220909_212007 [GLM def_1] failed: java.lang.ArrayIndexOutOfBoundsException
21:20:08.1: Empty leaderboard.
AutoML was not able to build any model within a max runtime constraint of 900 seconds, you may want to increase this value before retrying.
connection to JAVA server failed...

Error in value[[3L]](cond) : Java server crashed. perhaps a RAM problem?
In addition: Warning message:
In .automl.fetch_state(project_name) :
  The leaderboard contains zero models: try running AutoML for longer (the default is 1 hour).

The dataset seems to be cleaned and formatted nicely. I also cannot figure out what "Index 57 out of bounds for length 57" is referring to.

> str(x_miss)
'data.frame':   1641 obs. of  42 variables:
 $ Wiek               : num  69.6 73.1 76.5 65.1 63.3 48.4 68.8 69.5 78.2 71.1 ...
 $ EBRT_BT            : Factor w/ 2 levels "BT BOOST","EBRT": 2 2 2 2 2 2 2 2 2 2 ...
 $ GGG                : num  2 5 1 4 1 3 2 1 5 1 ...
 $ cores              : num  6 6 10 6 6 NA NA 6 12 6 ...
 $ cores_positive     : num  6 1 6 5 2 NA NA 2 12 1 ...
 $ cores_positive_proc: num  1 0.167 0.6 0.833 0.333 ...
 $ max_prccancer      : num  NA 50 100 100 NA NA 100 50 NA 20 ...
 $ TURP               : Factor w/ 2 levels "0_No","1_Yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ V_prostata         : num  31.9 35.7 67.2 38.5 26.5 35.9 NA 60 NA 156 ...
 $ MR_pre_EPE         : Factor w/ 2 levels "0_No","1_Yes": 2 NA NA NA NA NA 1 NA NA NA ...
 $ MR_pre_SVI         : Factor w/ 2 levels "0_No","1_Yes": 2 NA NA NA NA NA 1 NA NA NA ...
 $ PSA_density        : num  0.72 0.59 1.29 1.09 1.53 4.74 NA 0.25 NA 0.07 ...
 $ TNM                : Factor w/ 7 levels "T1c","T2a","T2b",..: 7 3 3 6 1 6 NA 3 5 2 ...
 $ ZUBROD             : num  0 1 1 0 0 0 0 0 0 0 ...
 $ PSAmax             : num  23 21 86.4 41.8 40.5 ...
 $ Risk_Group         : Factor w/ 4 levels "1_low_IR","2_high_IR",..: 4 4 3 4 3 4 3 2 4 2 ...
 $ ADT_pre_RT         : Factor w/ 2 levels "0_No","1_Yes": 2 2 2 2 2 2 2 2 2 1 ...
 $ ADT_intractu_RT    : Factor w/ 2 levels "0_No","1_Yes": 2 2 2 2 2 2 2 2 2 1 ...
 $ ADT_ADJ            : Factor w/ 2 levels "0_No","1_Yes": 1 2 2 2 2 2 2 2 2 1 ...
 $ ADT_typ            : Factor w/ 5 levels "0_brak","analog",..: 3 3 3 3 3 3 3 3 3 1 ...
 $ ADT_czas_pre_RT    : num  100 110 104 92 74 76 101 90 113 1 ...
 $ ADT_ADJ_CZAS       : num  0 2.66 51.12 74.8 9.63 ...
 $ ADT_czas_suma      : num  5.81 11.4 59.1 80.35 13.7 ...
 $ czas_PSA_pre       : num  NA NA 0.76 8.74 1.38 NA NA 6.44 NA 0.66 ...
 $ PSA_pre_RT         : num  0.08 0.04 0 41.8 23.28 ...
 $ czas_RT            : num  38 40 62 50 61 39 50 50 58 46 ...
 $ DCp                : num  38 42 44 54 58 60 60 62 72 68 ...
 $ N_RT               : num  1 1 1 1 1 1 0 1 1 0 ...
 $ DCn                : num  38 42 44 50 44 50 0 50 43.2 0 ...
 $ DCbt               : num  0 0 0 0 0 0 0 0 0 0 ...
 $ BTfx               : num  0 0 0 0 0 0 0 0 0 0 ...
 $ BED3               : num  63.3 70 73.3 90 96.7 ...
 $ BED1_5             : num  88.7 98 102.7 126 135.3 ...
 $ FU                 : num  0.5 4 53.2 102.2 11.6 ...
 $ Zgon               : num  1 1 1 1 1 1 1 1 1 1 ...
 $ BC                 : num  0 0 0 0 0 1 0 0 1 1 ...
 $ MFS_24             : num  1 0 1 1 0 1 1 1 1 1 ...
 $ FFM                : num  0 0 0 0 0 1 0 0 0 1 ...
 $ BC_czas            : num  0.53 3.98 53.15 102.23 11.63 ...
 $ FFM_czas           : num  0.53 3.98 53.15 102.23 11.63 ...
 $ MFS_czas_24        : num  23.72 3.98 53.15 102.76 11.63 ...
 $ OS                 : num  23.7 112.5 53.2 102.8 84.3 ...

I have also not found any suspicious variables...

> caret::nearZeroVar(x_miss)
integer(0)

All other packages seem to handle describing the missing values well:

naniar::vis_miss(x_miss, sort_miss = T)

[image: output of the vis_miss call above]

Any suggestion would be really helpful. Thank you in advance. Konrad

haghish commented 2 years ago

It seems that ELNET turns factors into dummy variables, and some combinations of these dummy variables (sometimes even a single variable) prevent the model from converging. The result is this odd error, which is not helpful at all. According to my tests, it can happen even when the data has no missing values at all. I will have to look into this in more detail in the future, but based on what I have investigated so far, this is not a bug.

When I tried solution 1 (below) in my tests, I got a different ELNET error (failed: java.lang.AssertionError: Multinomial coefficents cannot be null), which makes it much clearer that the model does not converge!
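To make "turns factors into dummy variables" concrete, here is a minimal base R sketch on a toy data frame. The data and column names are hypothetical (not the reporter's dataset), and `model.matrix` only illustrates the general idea of one-hot expansion. Whether H2O's GLM encodes factors in exactly this way is an assumption.

```r
# Toy data (hypothetical): one numeric column and two factors
toy <- data.frame(
  age   = c(69.6, 73.1, 76.5, 65.1, 63.3, 48.4),
  turp  = factor(c("0_No", "0_No", "1_Yes", "0_No", "1_Yes", "0_No")),
  stage = factor(c("T1c", "T2a", "T2b", "T1c", "T2a", "T2b"))
)

# Ask for full indicator (one-hot) coding: one column per factor level
is_fac <- sapply(toy, is.factor)
full_contrasts <- lapply(toy[is_fac], contrasts, contrasts = FALSE)

# "- 1" drops the intercept so only data-derived columns remain
onehot <- model.matrix(~ . - 1, data = toy, contrasts.arg = full_contrasts)

# 3 original variables become 1 + 2 + 3 = 6 model columns
n_vars <- ncol(toy)
n_cols <- ncol(onehot)
print(colnames(onehot))
```

Here the number of model columns grows with the number of factor levels, so a data frame with many multi-level factors yields a noticeably wider design matrix than `str()` suggests.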

Solutions

  1. Add "RF" or any other supported algorithm in addition to ELNET.
  2. In the new version, when this error happens, that variable is simply ignored, so it is up to you to add a secondary imputation variable.
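A sketch of solution 1, reusing the arguments from the original call and adding "RF" to `algos`. The report file name is a hypothetical choice, and the call is guarded so that it only runs when the mlim package is installed and an `x_miss` data frame exists in the session.

```r
# Same settings as the failing call, with "RF" added next to "ELNET"
mlim_args <- list(
  m = 1,
  seed = 2022,
  tuning_time = 900,
  algos = c("ELNET", "RF"),
  report = "imputation_ELNET_RF.md"  # hypothetical file name
)

# Run only if mlim is available and the data frame from the issue is loaded
if (requireNamespace("mlim", quietly = TRUE) && exists("x_miss")) {
  dane_elnet_rf <- do.call(
    mlim::mlim,
    c(list(as.data.frame(x_miss)), mlim_args)
  )
}
```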