microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

[R-package] R package crashes on windows when loaded together with {fansi} or anything that depends on it #4464

Closed dfalbel closed 3 years ago

dfalbel commented 3 years ago

This is probably related to:

Description

Using lightgbm while parsnip is loaded crashes the R session with: Exited with status -1073741819.
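
For context, that exit status is the signed 32-bit form of a Windows NTSTATUS code. The sketch below (base R only, variable names are illustrative) converts it to the usual hexadecimal form, which comes out as 0xC0000005, the access-violation code and roughly the Windows counterpart of a segfault:

```r
# Exit status reported by the crashed R session (signed 32-bit view).
exit_status <- -1073741819

# Reinterpret as an unsigned 32-bit value; doubles represent this exactly.
unsigned_status <- exit_status + 2^32

# sprintf("%X") only accepts values in integer range, so format the two
# 16-bit halves separately and paste them together.
high <- as.integer(unsigned_status %/% 65536)
low <- as.integer(unsigned_status %% 65536)
ntstatus_hex <- sprintf("0x%04X%04X", high, low)

print(ntstatus_hex)
```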

Reproducible example

Calling:

library(parsnip)
library(lightgbm)
data(agaricus.train, package='lightgbm')
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)
model <- lgb.cv(
       params = list(
       objective = "regression", 
       metric = "l2"
       ) , 
data = dtrain
)

Environment info

I am using the dev version of LightGBM as suggested in https://github.com/microsoft/LightGBM/issues/4007#issuecomment-869080432. The error only occurs on Windows.

Here's a GitHub Actions run that shows the behavior. This step shows that it works fine if parsnip is not loaded: https://github.com/curso-r/treesnip/runs/3037580458?check_suite_focus=true#step:9:1 and this step shows the error message: https://github.com/curso-r/treesnip/runs/3037580458?check_suite_focus=true#step:10:21

I could also reproduce it locally on a Windows machine, but I am not sure of the best way to get a stack trace. Let me know if I can help with further debugging.

jameslamb commented 3 years ago

Thanks for the report and for using LightGBM @dfalbel !

I'll look into this as soon as possible.

shiyu1994 commented 3 years ago

Thanks for reporting that! I tested the example on my Windows 10 machine, but could not reproduce the error with the latest master of LightGBM. The script runs successfully and gives the correct output.

D:\Projects\Test-LightGBM\issues\4464>Rscript test.R
Loading required package: R6
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000203 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 214
[LightGBM] [Info] Number of data points in the train set: 4342, number of used features: 107
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000235 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 214
[LightGBM] [Info] Number of data points in the train set: 4342, number of used features: 107
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000233 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 214
[LightGBM] [Info] Number of data points in the train set: 4342, number of used features: 107
[LightGBM] [Info] Start training from score 0.479503
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Info] Start training from score 0.486872
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Info] Start training from score 0.479963
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[1]:  valid's l2:0.20319+0.000285721"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[2]:  valid's l2:0.165525+0.000574973"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[3]:  valid's l2:0.134908+0.000749561"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[4]:  valid's l2:0.110093+0.000922039"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[5]:  valid's l2:0.0899506+0.00102425"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[6]:  valid's l2:0.0736391+0.00109707"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[7]:  valid's l2:0.0603161+0.00110421"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[8]:  valid's l2:0.0495564+0.0011122"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[9]:  valid's l2:0.0407735+0.00109775"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[10]:  valid's l2:0.0334856+0.000970481"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[11]:  valid's l2:0.0275363+0.000849414"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[12]:  valid's l2:0.0226964+0.000810414"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[13]:  valid's l2:0.0187499+0.000778224"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[14]:  valid's l2:0.0155632+0.000752949"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[15]:  valid's l2:0.0129092+0.000674334"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[16]:  valid's l2:0.0107217+0.00059569"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[17]:  valid's l2:0.00896862+0.000529726"
[1] "[18]:  valid's l2:0.00752793+0.000477273"
[1] "[19]:  valid's l2:0.00635069+0.000426635"
[1] "[20]:  valid's l2:0.00538788+0.000374957"
[1] "[21]:  valid's l2:0.00457729+0.000358718"
[1] "[22]:  valid's l2:0.00392614+0.000342183"
[1] "[23]:  valid's l2:0.00336796+0.000321502"
[1] "[24]:  valid's l2:0.00293595+0.000296263"
[1] "[25]:  valid's l2:0.00256147+0.000286571"
[1] "[26]:  valid's l2:0.00226154+0.000270801"
[1] "[27]:  valid's l2:0.00200893+0.000267164"
[1] "[28]:  valid's l2:0.00180272+0.000258289"
[1] "[29]:  valid's l2:0.00162937+0.000243161"
[1] "[30]:  valid's l2:0.00147082+0.000247366"
[1] "[31]:  valid's l2:0.00135177+0.000243783"
[1] "[32]:  valid's l2:0.00123631+0.000236365"
[1] "[33]:  valid's l2:0.00115095+0.000232943"
[1] "[34]:  valid's l2:0.00108048+0.000230003"
[1] "[35]:  valid's l2:0.00100614+0.000221995"
[1] "[36]:  valid's l2:0.000946318+0.000225411"
[1] "[37]:  valid's l2:0.000897226+0.000226746"
[1] "[38]:  valid's l2:0.00083996+0.000221873"
[1] "[39]:  valid's l2:0.00080131+0.000214173"
[1] "[40]:  valid's l2:0.000766308+0.000204982"
[1] "[41]:  valid's l2:0.000739083+0.000206319"
[1] "[42]:  valid's l2:0.000703267+0.000217342"
[1] "[43]:  valid's l2:0.000665415+0.000219365"
[1] "[44]:  valid's l2:0.000628061+0.000211541"
[1] "[45]:  valid's l2:0.000592348+0.000206586"
[1] "[46]:  valid's l2:0.000559744+0.000202899"
[1] "[47]:  valid's l2:0.000523506+0.000197195"
[1] "[48]:  valid's l2:0.00049955+0.000193268"
[1] "[49]:  valid's l2:0.000474157+0.00019095"
[1] "[50]:  valid's l2:0.000456001+0.000187329"
[1] "[51]:  valid's l2:0.000435425+0.000185016"
[1] "[52]:  valid's l2:0.000418417+0.000177465"
[1] "[53]:  valid's l2:0.000406039+0.000170052"
[1] "[54]:  valid's l2:0.000389491+0.000166786"
[1] "[55]:  valid's l2:0.00037612+0.000163033"
[1] "[56]:  valid's l2:0.000366619+0.000158277"
[1] "[57]:  valid's l2:0.000352018+0.000152857"
[1] "[58]:  valid's l2:0.00034011+0.000147394"
[1] "[59]:  valid's l2:0.000328484+0.000141576"
[1] "[60]:  valid's l2:0.000315826+0.000136087"
[1] "[61]:  valid's l2:0.000306009+0.000131264"
[1] "[62]:  valid's l2:0.000295355+0.000126489"
[1] "[63]:  valid's l2:0.000285643+0.000121594"
[1] "[64]:  valid's l2:0.000274675+0.000114519"
[1] "[65]:  valid's l2:0.00026645+0.000109754"
[1] "[66]:  valid's l2:0.000257835+0.000106091"
[1] "[67]:  valid's l2:0.000248925+0.000100773"
[1] "[68]:  valid's l2:0.00024091+9.73169e-05"
[1] "[69]:  valid's l2:0.00023334+9.3604e-05"
[1] "[70]:  valid's l2:0.00022485+8.88266e-05"
[1] "[71]:  valid's l2:0.000218256+8.58562e-05"
[1] "[72]:  valid's l2:0.000210262+8.06131e-05"
[1] "[73]:  valid's l2:0.000204809+7.71541e-05"
[1] "[74]:  valid's l2:0.000198144+7.26759e-05"
[1] "[75]:  valid's l2:0.000192143+7.10996e-05"
[1] "[76]:  valid's l2:0.000185914+6.73203e-05"
[1] "[77]:  valid's l2:0.000180159+6.46153e-05"
[1] "[78]:  valid's l2:0.000175122+6.20043e-05"
[1] "[79]:  valid's l2:0.000169991+5.93545e-05"
[1] "[80]:  valid's l2:0.000165344+5.86973e-05"
[1] "[81]:  valid's l2:0.000160885+5.60808e-05"
[1] "[82]:  valid's l2:0.00015688+5.39479e-05"
[1] "[83]:  valid's l2:0.000152405+5.12417e-05"
[1] "[84]:  valid's l2:0.000148674+4.95601e-05"
[1] "[85]:  valid's l2:0.000144452+4.74966e-05"
[1] "[86]:  valid's l2:0.000140023+4.55681e-05"
[1] "[87]:  valid's l2:0.000135932+4.28883e-05"
[1] "[88]:  valid's l2:0.000131253+4.12862e-05"
[1] "[89]:  valid's l2:0.000127097+3.83581e-05"
[1] "[90]:  valid's l2:0.000123491+3.69218e-05"
[1] "[91]:  valid's l2:0.000119873+3.54353e-05"
[1] "[92]:  valid's l2:0.000116105+3.4529e-05"
[1] "[93]:  valid's l2:0.000113005+3.29312e-05"
[1] "[94]:  valid's l2:0.000110071+3.14197e-05"
[1] "[95]:  valid's l2:0.000107318+3.01238e-05"
[1] "[96]:  valid's l2:0.00010479+2.94182e-05"
[1] "[97]:  valid's l2:0.000102076+2.87784e-05"
[1] "[98]:  valid's l2:0.000100284+2.76207e-05"
[1] "[99]:  valid's l2:9.81008e-05+2.6275e-05"
[1] "[100]:  valid's l2:9.56005e-05+2.55846e-05"

So I think more details about the versions of R and Rtools would help identify the cause.

dfalbel commented 3 years ago

Hi @shiyu1994 thanks for taking a look at this.

Here's the sessionInfo() of the system on which I can reproduce the error:

Loading required package: R6
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server x64 (build 17763)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] lightgbm_3.2.1.99 R6_2.5.0          parsnip_0.1.6    

loaded via a namespace (and not attached):
 [1] lattice_0.20-44   tidyr_1.1.3       fansi_0.5.0       utf8_1.2.1       
 [5] crayon_1.4.1      dplyr_1.0.7       grid_4.1.0        jsonlite_1.7.2   
 [9] lifecycle_1.0.0   magrittr_2.0.1    pillar_1.6.1      rlang_0.4.11     
[13] data.table_1.14.0 Matrix_1.3-3      vctrs_0.3.8       generics_0.1.0   
[17] ellipsis_0.3.2    tools_4.1.0       glue_1.4.2        purrr_0.3.4      
[21] compiler_4.1.0    pkgconfig_2.0.3   tidyselect_1.1.1  tibble_3.1.2 

This is using master lightgbm too. Here's a link to the GHA run that reproduces the failure: https://github.com/curso-r/treesnip/runs/3116229035?check_suite_focus=true

dfsnow commented 3 years ago

I'm seeing the same issue. I'm guessing this may be related to #4007 and #4259. Some further details:

No crash

Running a clean install of the script below in a new project with renv enabled works for 3.2.1.99. See sessionInfo() below.

library(lightgbm)
data(agaricus.train, package='lightgbm')
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)
model <- lgb.cv(
  params = list(
    objective = "regression"
    , metric = "l2"
  )
  , data = dtrain
)

Session Info

```
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices datasets utils methods base

other attached packages:
[1] lightgbm_3.2.1.99 R6_2.5.0

loaded via a namespace (and not attached):
[1] compiler_4.1.0 Matrix_1.3-3 tools_4.1.0 grid_4.1.0 data.table_1.14.0
[6] jsonlite_1.7.2 renv_0.13.2 lattice_0.20-44
```

Installing parsnip and loading it after lightgbm likewise does not result in a crash.

library(lightgbm)
library(parsnip)

data(agaricus.train, package='lightgbm')
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)
model <- lgb.cv(
  params = list(
    objective = "regression"
    , metric = "l2"
  )
  , data = dtrain
)

Session Info

```
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices datasets utils methods base

other attached packages:
[1] parsnip_0.1.7 lightgbm_3.2.1.99 R6_2.5.0

loaded via a namespace (and not attached):
[1] magrittr_2.0.1 tidyselect_1.1.1 lattice_0.20-44 rlang_0.4.11 fansi_0.5.0
[6] dplyr_1.0.7 tools_4.1.0 hardhat_0.1.6 grid_4.1.0 data.table_1.14.0
[11] utf8_1.2.1 ellipsis_0.3.2 tibble_3.1.2 lifecycle_1.0.0 crayon_1.4.1
[16] Matrix_1.3-3 purrr_0.3.4 tidyr_1.1.3 vctrs_0.3.8 glue_1.4.2
[21] compiler_4.1.0 pillar_1.6.1 generics_0.1.0 jsonlite_1.7.2 renv_0.13.2
[26] pkgconfig_2.0.3
```

Crash

However, loading parsnip before lightgbm results in a crash at the lgb.cv step.

library(parsnip)
library(lightgbm)

data(agaricus.train, package='lightgbm')
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)
model <- lgb.cv(
  params = list(
    objective = "regression"
    , metric = "l2"
  )
  , data = dtrain
)

Session Info

```
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices datasets utils methods base

other attached packages:
[1] lightgbm_3.2.1.99 R6_2.5.0 parsnip_0.1.7

loaded via a namespace (and not attached):
[1] magrittr_2.0.1 tidyselect_1.1.1 lattice_0.20-44 rlang_0.4.11 fansi_0.5.0
[6] dplyr_1.0.7 tools_4.1.0 parallel_4.1.0 hardhat_0.1.6 grid_4.1.0
[11] data.table_1.14.0 utf8_1.2.1 ellipsis_0.3.2 tibble_3.1.2 lifecycle_1.0.0
[16] crayon_1.4.1 Matrix_1.3-3 purrr_0.3.4 tidyr_1.1.3 vctrs_0.3.8
[21] glue_1.4.2 compiler_4.1.0 pillar_1.6.1 generics_0.1.0 jsonlite_1.7.2
[26] renv_0.13.2 pkgconfig_2.0.3
```

Notes

Edit

Did a quick trip through the Imports of parsnip, loading each library before lightgbm 1-by-1. The following libraries cause crashes:

dplyr (1.0.7)
hardhat (0.1.6)
tibble (3.1.2)
tidyr (1.1.3)

While the following cause no issues:

generics (0.1.0)
globals (0.14.0)
glue (1.4.2)
lifecycle (1.0.0)
magrittr (2.0.1)
prettyunits (1.1.1)
purrr (0.3.4)
rlang (0.4.11)
stats
utils
vctrs (0.3.8)

I then traveled through the dependencies of tibble and dplyr to find the lowest level library call that will cause a crash. Seems like fansi may be the actual culprit. The script below causes a crash for me in a fresh environment with lightgbm 3.2.1 (from CRAN) and 3.2.1.99 (from GitHub)

library(fansi)
library(lightgbm)

data(agaricus.train, package='lightgbm')
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)
model <- lgb.cv(
  params = list(
    objective = "regression"
    , metric = "l2"
  )
  , data = dtrain
)

Session Info

```
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] lightgbm_3.2.1.99 R6_2.5.0 fansi_0.5.0

loaded via a namespace (and not attached):
[1] magrittr_2.0.1 tidyselect_1.1.1 lattice_0.20-44 rlang_0.4.11 stringr_1.4.0
[6] dplyr_1.0.6 tools_4.1.0 grid_4.1.0 parallel_4.1.0 data.table_1.14.0
[11] audio_0.1-7 utf8_1.2.1 DBI_1.1.1 ellipsis_0.3.2 assertthat_0.2.1
[16] tibble_3.1.2 lifecycle_1.0.0 crayon_1.4.1 Matrix_1.3-3 beepr_1.3
[21] purrr_0.3.4 vctrs_0.3.8 glue_1.4.2 ccao_0.5.1 stringi_1.6.2
[26] compiler_4.1.0 pillar_1.6.1 generics_0.1.0 jsonlite_1.7.2 pkgconfig_2.0.3
```

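
The one-by-one search described above could be scripted roughly as follows. This is only a sketch: `build_repro_script()` and `run_in_fresh_session()` are hypothetical helper names, and each candidate runs in its own Rscript process so that a crash only takes down the child session. A non-zero exit status would mark a package whose loading before lightgbm triggers the crash.

```r
# Hypothetical helper: lines of a script that attaches `pkg` before
# {lightgbm} and then runs the lgb.cv() example from this thread.
build_repro_script <- function(pkg) {
  c(
    sprintf("library(%s)", pkg),
    "library(lightgbm)",
    "data(agaricus.train, package = 'lightgbm')",
    "dtrain <- lgb.Dataset(agaricus.train$data, label = agaricus.train$label)",
    "model <- lgb.cv(params = list(objective = 'regression', metric = 'l2'), data = dtrain)"
  )
}

# Hypothetical helper: run the script in a fresh, vanilla R process and
# return that process's exit status (0 means it finished without crashing).
run_in_fresh_session <- function(pkg) {
  script <- tempfile(fileext = ".R")
  on.exit(unlink(script))
  writeLines(build_repro_script(pkg), script)
  rscript <- file.path(R.home("bin"), "Rscript")
  system2(rscript, c("--vanilla", shQuote(script)), stdout = FALSE, stderr = FALSE)
}

# Example usage (needs {lightgbm} and the candidate packages installed):
# candidates <- c("dplyr", "tibble", "fansi")
# statuses <- vapply(candidates, run_in_fresh_session, integer(1))
```
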
jameslamb commented 3 years ago

Thanks to everyone participating for your help and investigation!

I am planning to test some theories about this tomorrow when I have some time and easy access to a Windows environment.


Here's a link to the GHA run that reproduces the failure

@dfalbel , I looked at the definition of that GHA job (https://github.com/curso-r/treesnip/actions/runs/1049626089/workflow). I noticed that there's a call of remotes::install_deps() in a stage earlier than the Install dev lightgbm step. Since {lightgbm} is a dependency of {treesnip}, that step is going to install {lightgbm} from CRAN.

Seen in the logs for that step: https://github.com/curso-r/treesnip/runs/3116229035?check_suite_focus=true.

trying URL 'https://cloud.r-project.org/bin/windows/contrib/4.1/lightgbm_3.2.1.zip'
Content type 'application/zip' length 3335183 bytes (3.2 MB)

downloaded 3.2 MB

If possible, could you try adding remove.packages("lightgbm") to the beginning of the install dev lightgbm step in that GHA job, and let me know if the issue still persists? I'm wondering if there's something being left behind from the CRAN install that is conflicting with the build from source.
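
A quick way to tell which build ends up installed is to compare version numbers, since the CRAN release in those logs is 3.2.1 and the source build is 3.2.1.99. A minimal sketch, where `is_dev_build()` is a hypothetical helper and not part of any package:

```r
# Hypothetical helper: TRUE when `version` is newer than the CRAN release
# seen in the logs (3.2.1), i.e. it looks like a development build.
is_dev_build <- function(version, cran_release = "3.2.1") {
  package_version(version) > package_version(cran_release)
}

print(is_dev_build("3.2.1.99"))
print(is_dev_build("3.2.1"))

# In the CI job the input would come from the installed package:
# is_dev_build(as.character(utils::packageVersion("lightgbm")))
```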


@dfalbel and @dfsnow if you have time, could you also confirm whether you are using RStudio, and if so, whether your examples also produce this issue when that code is stored in a script and run with Rscript --vanilla test-code.R?

I suspect that both of you are using RStudio but @shiyu1994 did not in https://github.com/microsoft/LightGBM/issues/4464#issuecomment-883023409, so I'd like to see if that is relevant.

dfalbel commented 3 years ago

Hi @jameslamb, thanks for looking at this!

I have added the remove.packages("lightgbm") call and the error still persists: https://github.com/curso-r/treesnip/runs/3154997458?check_suite_focus=true#step:11:50. I think install.packages ultimately always removes the existing package folder before installing the package again.

For the second question, I can confirm that the error happens both in RStudio and in a vanilla R session:

$ Rscript --vanilla R/test.R
Loading required package: R6
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] lightgbm_3.2.1.99 R6_2.5.0          parsnip_0.1.6

loaded via a namespace (and not attached):
 [1] lattice_0.20-44   tidyr_1.1.3       fansi_0.5.0       utf8_1.2.1
 [5] crayon_1.4.1      dplyr_1.0.7       grid_4.1.0        jsonlite_1.7.2
 [9] lifecycle_1.0.0   magrittr_2.0.1    pillar_1.6.1      rlang_0.4.11
[13] data.table_1.14.0 Matrix_1.3-3      vctrs_0.3.8       generics_0.1.0
[17] ellipsis_0.3.2    tools_4.1.0       glue_1.4.2        purrr_0.3.4
[21] compiler_4.1.0    pkgconfig_2.0.3   tidyselect_1.1.1  tibble_3.1.2
Segmentation fault

jameslamb commented 3 years ago

I was able to reproduce this issue today using the latest master of LightGBM.

environment info and install instructions

I installed `{lightgbm}` from source on Windows 10 like so:

```shell
git clone --recursive git@github.com:microsoft/LightGBM.git
cd LightGBM
Rscript --vanilla -e "remove.packages('lightgbm')"
Rscript --vanilla -e "install.packages(c('R6', 'data.table', 'jsonlite'), repos = 'https://cran.r-project.org')"
Rscript --vanilla -e "install.packages(c('fansi'), repos = 'https://cran.r-project.org')"
sh build-cran-package.sh
R CMD INSTALL lightgbm_3.2.1.99.tar.gz
```

I'm using Rtools40 downloaded on May 9, 2020, so not the newest one. As far as I can tell from https://cran.r-project.org/bin/windows/Rtools/history.html, it isn't possible to access previous versions of Rtools.

Output of `sessionInfo()`.

```shell
Rscript -e "library(fansi); library(lightgbm); sessionInfo()"
```

```text
Loading required package: R6
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] lightgbm_3.2.1.99 R6_2.5.0 fansi_0.5.0

loaded via a namespace (and not attached):
[1] compiler_4.1.0 Matrix_1.3-3 grid_4.1.0 data.table_1.14.0
[5] jsonlite_1.7.2 lattice_0.20-44
```

Thanks to the helpful contributions of @dfalbel and @dfsnow so far, I was able to reduce this to an even smaller reproducible example, cutting out lgb.cv().

Running the script below with Rscript --vanilla test.R produces a segfault when dtrain$construct() is called.

library(fansi)
library(lightgbm)
data(agaricus.train, package='lightgbm')
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)
dtrain$construct()

Next, I'll add a ton of logging to dataset construction to try to narrow down the issue further. I also plan to inspect fansi::.onLoad() and fansi::.onAttach(). Updates to follow!

jameslamb commented 3 years ago

Alright, I added a bunch of log statements and I think I've narrowed down the place where this segfault is being thrown.

I'm able to reproduce the issue on master using this further-simplified example that uses a standard R matrix instead of loading the agaricus dataset.

library(fansi)
library(lightgbm)

dtrain <- lgb.Dataset(
    data = matrix(rnorm(1000), nrow = 100)
    , label = rnorm(100)
)
dtrain$construct()

The segfault is being thrown from calls to Network::num_machines() in the Dataset loader. Right here:

https://github.com/microsoft/LightGBM/blob/fdc582ea6ba13faf15ee6707c7c7542790c8821d/src/io/dataset_loader.cpp#L626

I pushed a branch with all the extra logging and with those calls skipped: https://github.com/jameslamb/LightGBM/tree/misc/investigating-dataset-segfault. On that branch, the reproducible example runs successfully and does not produce a segfault.

Next, I'm going to try to figure out how the behavior of this code is changed by loading {fansi} and by the order of package loading.

jameslamb commented 3 years ago

I'm convinced that the root of the problem is related to the way that R loads DLLs, and that @dfsnow is right that {lightgbm} and {fansi} are in conflict with each other somehow.

If {dplyr} is loaded before {lightgbm} but then the fansi DLL is unloaded before loading {lightgbm}, the reproducible example does not produce a segfault, and Dataset construction succeeds.

library(dplyr)
dyn.unload(file.path(.libPaths()[1], "fansi", "libs", "x64", "fansi.dll"))
library(lightgbm)
dtrain <- lgb.Dataset(
    data = matrix(rnorm(1000), nrow = 100)
    , label = rnorm(100)
)
dtrain$construct()

If {fansi}'s DLL is unloaded after loading {lightgbm}, that script produces a segfault at dtrain$construct().
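
The DLL path in that script is hard-coded to the first library in .libPaths(). As a sketch of a more general lookup, the path can be read back from getLoadedDLLs(). `loaded_dll_path()` is a hypothetical helper name; it assumes the DLL is registered under the package name, which is the usual case.

```r
# Hypothetical helper: return the on-disk path of a loaded DLL, or NA when
# no DLL with that name is loaded in the current session.
loaded_dll_path <- function(name) {
  dlls <- getLoadedDLLs()
  if (!name %in% names(dlls)) {
    return(NA_character_)
  }
  dlls[[name]][["path"]]
}

# "stats" ships with R and is normally loaded in a default session.
print(loaded_dll_path("stats"))

# With {fansi} loaded, the unload step could then be written as:
# dyn.unload(loaded_dll_path("fansi"))
```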

This finding plus the finding from https://github.com/microsoft/LightGBM/issues/4464#issuecomment-886244523 that commenting out Network::num_machines() causes Dataset construction to succeed has led me to this working theory:

Something in {fansi}'s DLL conflicts with lightgbm.dll or IPHLPAPI.DLL or WS2_32.dll (two libraries linked in with {lightgbm} to support distributed training).

I'm going to investigate this more closely with dumpbin and listdlls to see if I can identify the conflicts. I'm also going to try changing some details of {fansi} based on the advice in "Writing R Extensions", especially https://cran.r-project.org/doc/manuals/R-exts.html#Controlling-visibility.

Updates to follow!

jameslamb commented 3 years ago

Just to rule out another possibility like "loading any other package with compiled code before {lightgbm} is problematic"...I tried loading some other packages with compiled code before {lightgbm} and trying to construct a Dataset. These did not produce a segfault or any other issues.

I attempted {data.table} and {RPostgreSQL}, and checked that those packages' DLLs were loaded by running getLoadedDLLs().

jameslamb commented 3 years ago

@dfalbel @dfsnow thanks very much for your patience. I think I found the problem and have a fix up. Could you please try installing from my branch and let me know if it seems to resolve the issue?

git clone --recursive https://github.com/microsoft/LightGBM.git --branch fix/network-setup
cd LightGBM
sh build-cran-package.sh
R CMD INSTALL lightgbm_3.2.1.99.tar.gz

dfsnow commented 3 years ago

@jameslamb Your branch works for me. It fixes the crash with both {parsnip} and {fansi}, using the following test script:

library(fansi)
library(lightgbm)
data(agaricus.train, package='lightgbm')
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)
model <- lgb.cv(
  params = list(
    objective = "regression", 
    metric = "l2"
  ) , 
  data = dtrain
)

Also works with renv. Thanks for the quick turnaround!

dfalbel commented 3 years ago

Hey @jameslamb ! Thanks very much for the investigation and fix. I can confirm that this works great!

vidarsumo commented 3 years ago

@dfalbel @dfsnow thanks very much for your patience. I think I found the problem and have a fix up. Could you please try installing from my branch and let me know if it seems to resolve the issue?

git clone --recursive https://github.com/microsoft/LightGBM.git --branch fix/network-setup
cd LightGBM
sh build-cran-package.sh
R CMD INSTALL lightgbm_3.2.1.99.tar.gz

When I run this

git clone --recursive https://github.com/microsoft/LightGBM.git --branch fix/network-setup
cd LightGBM
sh build-cran-package.sh

I get file not found

Removing files not needed for CRAN
Removing unknown pragmas in headers
File not found - *.h
File not found - *.h.bak

jameslamb commented 3 years ago

Some versions of the unix tools for Windows might have slightly different behavior. Can you try commenting out the uses of find in build-cran-package.sh?

vidarsumo commented 3 years ago

I commented this out (if I understood you correctly): find . -name '*.h.bak' -o -name '*.hpp.bak' -o -name '*.cpp.bak' -exec rm {} \;
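
If the find on PATH is not the Unix one, the cleanup that this line is meant to perform (deleting the .bak files left by sed) can be approximated in base R. This is a sketch under that assumption, not part of build-cran-package.sh, and `remove_sed_backups()` is a hypothetical name:

```r
# Hypothetical helper: delete *.h.bak, *.hpp.bak and *.cpp.bak files below
# `root` and return the paths that were actually removed.
remove_sed_backups <- function(root = ".") {
  backups <- list.files(
    root,
    pattern = "\\.(h|hpp|cpp)\\.bak$",
    recursive = TRUE,
    full.names = TRUE
  )
  removed <- file.remove(backups)
  backups[removed]
}

# Example usage from the directory where the backups were created:
# remove_sed_backups(".")
```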

Then I ran sh build-cran-package.sh and got

Removing files not needed for CRAN
Removing unknown pragmas in headers
File not found - *.h
Changing lib_lightgbm to lightgbm
Cleaning sed backup files
* checking for file 'lightgbm_r/DESCRIPTION' ... OK
* preparing 'lightgbm':
* checking DESCRIPTION meta-information ... OK
* cleaning src
* checking for LF line-endings in source and make files and shell scripts
* checking for empty or unneeded directories
* looking to see if a 'data/datalist' file should be added
* building 'lightgbm_3.2.1.99.tar.gz'
Warning: file 'lightgbm/cleanup' did not have execute permissions: corrected
Warning: file 'lightgbm/configure' did not have execute permissions: corrected

I tried to run R CMD INSTALL lightgbm_3.2.1.99.tar.gz but got:

* installing to library 'C:/Users/vidar/Documents/R/win-library/4.0'
* installing *source* package 'lightgbm' ...
** using staged installation
checking whether MM_PREFETCH works...no
checking whether MM_MALLOC works...no
** libs

*** arch - i386
C:/rtools40/usr/mingw_32/bin/g++  -std=gnu++11 -I"C:/PROGRA~1/R/R-4.0.1/include" -DNDEBUG -I./include -DEIGEN_MPL2_ONLY -DUSE_SOCKET -DLGB_R_BUILD      -fopenmp -pthread   -O2 -Wall  -mfpmath=sse -msse2 -mstackrealign -c boosting/boosting.cpp -o boosting/boosting.o
sh: C:/rtools40/usr/mingw_32/bin/g++: No such file or directory
make: *** [C:/PROGRA~1/R/R-4.0.1/etc/i386/Makeconf:229: boosting/boosting.o] Error 127
ERROR: compilation failed for package 'lightgbm'
* removing 'C:/Users/vidar/Documents/R/win-library/4.0/lightgbm'
* restoring previous 'C:/Users/vidar/Documents/R/win-library/4.0/lightgbm'

jameslamb commented 3 years ago

C:/rtools40/usr/mingw_32/bin/g++: No such file or directory

That doesn't look specific to {lightgbm}. I expect that if you run install.packages("xgboost", type = "source", repos = "https://cran.r-project.org") (for example), you will hit a similar error.

When using R 4.x on Windows, if you plan to install packages from source it's expected that you have installed Rtools (click here to get it) at C:/rtools40. You may not have encountered Rtools before if you are not a package developer and have only installed packages from CRAN, since CRAN publishes precompiled packages for Windows.

jameslamb commented 3 years ago

Also, I just noticed that the error message is about building the 32-bit version of the library (arch - i386).

If you DO have Rtools installed but during the installation you chose to only install the 64-bit components, then use R CMD INSTALL --no-multiarch to skip building the 32-bit version of {lightgbm}.

vidarsumo commented 3 years ago

I do have Rtools 4.0 installed, but there is no mingw_64 folder under C:/rtools40/usr/. I tried R CMD INSTALL lightgbm_3.2.1.99.tar.gz --no-multiarch and got this error: C:/rtools40/usr/mingw_64/bin/g++: No such file or directory

/mingw_64/bin/g++ does exist but not under /usr/. It's located in the root /rtools40/

jameslamb commented 3 years ago

Oh! I see now. I think you might have downloaded Rtools35 and installed it in directory C:/rtools40, and that might be failing because you're mixing R 4.x and Rtools35.

Rtools35 has folders (from the root of Rtools) named mingw_32/ and mingw_64/, while Rtools40 has MinGW stuff in /usr/mingw32 and /usr/mingw64.

I know this for sure because we use those paths in this project's CI:

https://github.com/microsoft/LightGBM/blob/dc09d1b41f95f413a4dcb648478fdebf862c49ea/.ci/test_r_package_windows.ps1#L68-L73

https://github.com/microsoft/LightGBM/blob/dc09d1b41f95f413a4dcb648478fdebf862c49ea/.ci/test_r_package_windows.ps1#L76-L79

You might need to visit https://cran.r-project.org/bin/windows/Rtools/ and get the newest version of Rtools.

And you might find some of the discussion about similar issues (a path for Rtools being assumed and hard-coded into some versions of R) at https://stackoverflow.com/questions/39090983/rcpp-rtools-installed-but-error-message-g-not-found.

Can you also please try installing another package requiring compilation from source?

Rscript -e "install.packages('data.table', type = 'source', repos = 'https://cran.r-project.org')"

I expect you'll experience this same problem doing that, and if you do then I think that would confirm that this isn't an issue with {lightgbm} specifically but with your local setup generally.
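
Before attempting a source install, it can also help to check from inside R which build tools are visible at all. A rough sketch using only base R; note that it inspects PATH only, while R may call the compiler through an absolute path from its own configuration (as in the error message above), so a hit here is necessary context rather than proof that compilation will work.

```r
# Tools that compiling an R package from source typically needs.
build_tools <- c("make", "gcc", "g++")

# Sys.which() returns the full path of each tool, or "" when it is not on PATH.
tool_paths <- Sys.which(build_tools)
missing_tools <- names(tool_paths)[!nzchar(tool_paths)]

print(tool_paths)
if (length(missing_tools) > 0) {
  message("Not found on PATH: ", paste(missing_tools, collapse = ", "))
}
```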

vidarsumo commented 3 years ago

After solving a problem related to rtools everything works now :)

This was installed for R-4.0.x even though I have R-4.1.x installed. Is this not supported for R 4.1.x?

jameslamb commented 3 years ago

This was installed for R-4.0.x even though I have R-4.1.x installed. Is this not supported for R 4.1.x?

I'm not sure what you mean by this statement, sorry. If you have multiple versions of R on your system, please examine the PATH environment variable to see which version(s) are on PATH and in which order.

You might also try the following from a command prompt to inspect which version of R is first on your PATH.

# version of R
Rscript --version

# where the R executables are
Rscript -e "print(R.home())"

# where packages will be installed to / loaded from
Rscript -e "print(.libPaths())"
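
The same PATH inspection can be done from within R. A small sketch, assuming the executable is named Rscript.exe on Windows and Rscript elsewhere, that lists every PATH entry containing an Rscript executable in the order the shell would search them:

```r
# Split PATH into its entries using the platform's separator (";" on Windows).
path_entries <- strsplit(Sys.getenv("PATH"), .Platform$path.sep, fixed = TRUE)[[1]]

# Name of the Rscript executable on this platform.
exe <- if (.Platform$OS.type == "windows") "Rscript.exe" else "Rscript"

# Keep only the entries that contain an Rscript executable; the first one
# listed is the installation a bare `Rscript` command resolves to.
r_dirs <- path_entries[file.exists(file.path(path_entries, exe))]
print(r_dirs)
```
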
vidarsumo commented 3 years ago

I have multiple versions. 4.0.1 was first on PATH. And running Rscript --version gave this:

Rscript --version
R scripting front-end version 4.0.1 (2020-06-06)

Didn't know about it. Thanks for the help.

jameslamb commented 3 years ago

Now that #4496 has been merged, I believe this issue has been resolved.

Thanks so much to everyone involved here for your help with reproducible examples and debugging ideas!

github-actions[bot] commented 1 year ago

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.