Closed by rnmourao 4 years ago
What are the values of features and label?
Also, for the data in that link, the columns are all numeric. It would be helpful to have the code that you use to go between the file and the data used.
df <- read.csv(url("https://github.com/rnmourao/r_3.6.1-caret-classificacao/raw/master/dados/train.csv"))
str(df)
#> 'data.frame': 980 obs. of 48 variables:
#> $ quantidadeMesContaCorrente : num 0.118 0.559 0.294 0.471 0.118 ...
#> $ valorSolicitadoCredito : num 0.102 0.42 0.142 0.369 0.155 ...
#> $ valorTaxaComprometimentoRenda : num 0.333 0.333 0.667 0.333 0.333 ...
#> $ textoAnoResidencia : num 0.667 1 1 0.333 1 ...
#> $ numeroIdade : num 0.536 0.464 0.607 0.286 0.75 ...
#> $ quantidadeCreditoAnterior : num 0 0 0 0 0 ...
#> $ quantidadeAvalista : int 1 1 0 0 0 0 0 1 0 1 ...
#> $ indicadorPosseTelefone : int 0 0 0 1 0 0 0 1 1 0 ...
#> $ indicadorTrabalhadorEstrangeiro : int 1 1 1 1 1 1 1 1 1 1 ...
#> $ indicadorInadimplente : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ textoSaldoContaCorrente_X.....0. : int 0 1 0 0 0 1 1 0 0 1 ...
#> $ textoSaldoContaCorrente_X.0..200. : int 0 0 0 1 0 0 0 0 0 0 ...
#> $ textoSaldoContaCorrente_X.200.... : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ textoHistoricoCredito_conta.critica...outros.creditos.existentes..nao.neste.banco. : int 1 0 0 0 0 0 0 0 1 0 ...
#> $ textoHistoricoCredito_historico.de.atrasos : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ textoHistoricoCredito_sem.emprestimos.anteriores...todos.os.creditos.anteriores.pagos.em.dia.: int 0 0 0 0 0 0 1 0 0 0 ...
#> $ textoHistoricoCredito_todos.os.creditos.neste.banco.pagos.em.dia : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ textoFinalidadeCredito_carro..novo. : int 0 0 0 0 0 1 0 0 1 0 ...
#> $ textoFinalidadeCredito_carro..usado. : int 0 0 0 1 0 0 0 0 0 0 ...
#> $ textoFinalidadeCredito_educacao : int 1 0 0 0 0 0 0 0 0 0 ...
#> $ textoFinalidadeCredito_eletrodomesticos : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ textoFinalidadeCredito_moveis.equipamento : int 0 1 1 0 0 0 0 0 0 0 ...
#> $ textoFinalidadeCredito_negocios : int 0 0 0 0 0 0 1 0 0 0 ...
#> $ textoFinalidadeCredito_reciclagem.educacional : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ textoFinalidadeCredito_reforma : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ textoInvestimento_X.100..500. : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ textoInvestimento_X.1000.... : int 0 0 0 0 1 0 0 0 0 0 ...
#> $ textoInvestimento_X.500..1000. : int 0 0 1 0 0 0 0 1 0 1 ...
#> $ textoInvestimento_X0 : int 0 0 0 0 0 0 1 0 0 0 ...
#> $ textoAnoEmprego_X.0..1. : int 0 0 0 0 0 0 1 0 0 0 ...
#> $ textoAnoEmprego_X.4..7. : int 1 1 0 0 1 0 0 0 0 0 ...
#> $ textoAnoEmprego_X.7.... : int 0 0 1 0 0 0 0 1 0 0 ...
#> $ textoAnoEmprego_X0 : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ textoSexoEstadoCivil_homem..casado.viuvo : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ textoSexoEstadoCivil_homem..divorciado.separado : int 0 0 0 0 1 0 0 0 0 0 ...
#> $ textoSexoEstadoCivil_mulher..divorciado.separado.casado : int 0 0 0 0 0 1 0 0 0 0 ...
#> $ indicadorAvalista_cofiador : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ indicadorAvalista_fiador : int 0 1 0 0 0 0 0 0 0 0 ...
#> $ textoGarantia_X : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ textoGarantia_imovel : int 1 0 0 0 1 0 0 0 0 1 ...
#> $ textoGarantia_investimentos...seguro.de.vida : int 0 1 1 0 0 0 0 0 0 0 ...
#> $ textoOutroCredito_banco : int 0 0 0 0 0 0 1 0 0 0 ...
#> $ textoOutroCredito_lojas : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ textoNaturezaResidencia_alugado : int 0 0 0 1 0 1 0 0 0 1 ...
#> $ textoNaturezaResidencia_de.favor : int 0 1 0 0 0 0 0 0 0 0 ...
#> $ textoEmprego_desempregado.empregado.nao.especializado...nao.residente : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ textoEmprego_empregado.nao.especializado...residente : int 1 0 0 0 1 0 0 0 0 0 ...
#> $ textoEmprego_gerente.autonomo.empregado.altamente.especializado.forcas.armadas : int 0 0 0 1 0 0 0 0 0 0 ...
Created on 2020-01-02 by the reprex package (v0.3.0)
Hi Max,
The label is indicadorInadimplente. I used all other attributes as features.
This commit has the error.
_01preparacao.ipynb has data preparation. _02modelagem.ipynb has modeling and RFE.
This job required me to write all the explanations and data in Portuguese. However, I believe the flow of notebooks is quite straightforward. If you have any doubts, please contact me.
I can't reproduce it:
library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2
df <-
  read.csv(
    url(
      "https://github.com/rnmourao/r_3.6.1-caret-classificacao/raw/master/dados/train.csv"
    )
  )
df$indicadorInadimplente <- factor(df$indicadorInadimplente)
set.seed(364525)
lrRFE <- rfe(
  x = df[, names(df) != "indicadorInadimplente"],
  y = df$indicadorInadimplente,
  sizes = c(1:10, 15, 30),
  rfeControl = rfeControl(functions = lrFuncs, method = "cv")
)
lrRFE
#>
#> Recursive feature selection
#>
#> Outer resampling method: Cross-Validated (10 fold)
#>
#> Resampling performance over subset size:
#>
#> Variables Accuracy Kappa AccuracySD KappaSD Selected
#> 1 0.6143 0.2286 0.03395 0.06789
#> 2 0.6765 0.3531 0.03504 0.07007
#> 3 0.6745 0.3490 0.03549 0.07099
#> 4 0.6673 0.3347 0.03270 0.06539
#> 5 0.6531 0.3061 0.05158 0.10317
#> 6 0.6541 0.3082 0.05861 0.11722
#> 7 0.6480 0.2959 0.05777 0.11555
#> 8 0.6469 0.2939 0.06145 0.12290
#> 9 0.6582 0.3163 0.06221 0.12442
#> 10 0.6684 0.3367 0.04595 0.09190
#> 15 0.6837 0.3673 0.04614 0.09228
#> 30 0.7194 0.4388 0.03826 0.07651
#> 47 0.7255 0.4510 0.05323 0.10646 *
#>
#> The top 5 variables (out of 47):
#> textoSaldoContaCorrente_X.....0., textoSaldoContaCorrente_X.0..200., valorSolicitadoCredito, textoSaldoContaCorrente_X.200...., textoInvestimento_X.1000....
# or, using the formula interface
lrRFE <- rfe(
  indicadorInadimplente ~ .,
  data = df,
  sizes = c(1:10, 15, 30),
  rfeControl = rfeControl(functions = lrFuncs, method = "cv")
)
lrRFE
#>
#> Recursive feature selection
#>
#> Outer resampling method: Cross-Validated (10 fold)
#>
#> Resampling performance over subset size:
#>
#> Variables Accuracy Kappa AccuracySD KappaSD Selected
#> 1 0.6143 0.2286 0.04634 0.09268
#> 2 0.6765 0.3531 0.04083 0.08166
#> 3 0.6735 0.3469 0.03967 0.07933
#> 4 0.6663 0.3327 0.04139 0.08279
#> 5 0.6663 0.3327 0.03820 0.07639
#> 6 0.6561 0.3122 0.04462 0.08924
#> 7 0.6622 0.3245 0.04178 0.08357
#> 8 0.6561 0.3122 0.04930 0.09860
#> 9 0.6694 0.3388 0.04171 0.08343
#> 10 0.6765 0.3531 0.04410 0.08820
#> 15 0.6724 0.3449 0.04625 0.09250
#> 30 0.7296 0.4592 0.05573 0.11147
#> 47 0.7316 0.4633 0.05590 0.11180 *
#>
#> The top 5 variables (out of 47):
#> textoSaldoContaCorrente_X.....0., textoSaldoContaCorrente_X.0..200., valorSolicitadoCredito, textoSaldoContaCorrente_X.200...., textoInvestimento_X.1000....
set.seed(364525)
ldaRFE <- rfe(
  x = df[, names(df) != "indicadorInadimplente"],
  y = df$indicadorInadimplente,
  sizes = c(1:10, 15, 30),
  rfeControl = rfeControl(functions = ldaFuncs, method = "cv")
)
ldaRFE
#>
#> Recursive feature selection
#>
#> Outer resampling method: Cross-Validated (10 fold)
#>
#> Resampling performance over subset size:
#>
#> Variables Accuracy Kappa AccuracySD KappaSD Selected
#> 1 0.6143 0.2286 0.03395 0.06789
#> 2 0.6061 0.2122 0.04144 0.08287
#> 3 0.6337 0.2673 0.03709 0.07418
#> 4 0.6245 0.2490 0.04584 0.09167
#> 5 0.6347 0.2694 0.04609 0.09218
#> 6 0.6531 0.3061 0.03695 0.07390
#> 7 0.6724 0.3449 0.04261 0.08521
#> 8 0.6878 0.3755 0.03794 0.07587
#> 9 0.6969 0.3939 0.03968 0.07936
#> 10 0.7000 0.4000 0.03670 0.07339
#> 15 0.7010 0.4020 0.03006 0.06012
#> 30 0.7347 0.4694 0.03535 0.07070 *
#> 47 0.7265 0.4531 0.03495 0.06991
#>
#> The top 5 variables (out of 30):
#> textoSaldoContaCorrente_X.....0., quantidadeMesContaCorrente, textoHistoricoCredito_conta.critica...outros.creditos.existentes..nao.neste.banco., textoSaldoContaCorrente_X.0..200., numeroIdade
# or, using the formula interface
ldaRFE <- rfe(
  indicadorInadimplente ~ .,
  data = df,
  sizes = c(1:10, 15, 30),
  rfeControl = rfeControl(functions = ldaFuncs, method = "cv")
)
ldaRFE
#>
#> Recursive feature selection
#>
#> Outer resampling method: Cross-Validated (10 fold)
#>
#> Resampling performance over subset size:
#>
#> Variables Accuracy Kappa AccuracySD KappaSD Selected
#> 1 0.6143 0.2286 0.04634 0.09268
#> 2 0.6071 0.2143 0.03915 0.07831
#> 3 0.6429 0.2857 0.03818 0.07636
#> 4 0.6286 0.2571 0.04594 0.09187
#> 5 0.6439 0.2878 0.05452 0.10903
#> 6 0.6612 0.3224 0.04659 0.09317
#> 7 0.6714 0.3429 0.03593 0.07186
#> 8 0.6786 0.3571 0.03341 0.06683
#> 9 0.6949 0.3898 0.04288 0.08575
#> 10 0.6837 0.3673 0.05022 0.10044
#> 15 0.6969 0.3939 0.03880 0.07759
#> 30 0.7306 0.4612 0.04388 0.08775
#> 47 0.7316 0.4633 0.04787 0.09575 *
#>
#> The top 5 variables (out of 47):
#> textoSaldoContaCorrente_X.....0., quantidadeMesContaCorrente, textoHistoricoCredito_conta.critica...outros.creditos.existentes..nao.neste.banco., numeroIdade, textoGarantia_X
Created on 2020-01-02 by the reprex package (v0.3.0)
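A side note on why the "# or" runs above do not match the first run of each pair exactly: the seed is set only before the first call, so the second call continues from a different random number state and probably draws different folds. The sketch below uses only base R to illustrate the idea. The fold assignment via sample() is a stand-in I made up for illustration, not how caret builds its resamples.

```r
# Hypothetical stand-in for fold assignment: shuffle fold labels 1..10
# over 980 rows (the number of rows in train.csv above).
fold_labels <- rep(1:10, length.out = 980)

set.seed(364525)
folds_a <- sample(fold_labels)   # first call, right after set.seed()
folds_b <- sample(fold_labels)   # second call, no reseed: RNG state has moved on

set.seed(364525)
folds_c <- sample(fold_labels)   # reseeded: reproduces the first assignment

print(identical(folds_a, folds_c))  # same seed, same draw
print(identical(folds_a, folds_b))  # different RNG state, so this almost certainly differs
```

Calling set.seed() again before each rfe() call should make the two interfaces easier to compare, assuming they resample the same way internally.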
It worked! Thanks! I'll check my code again.
Hi,
I can't use RFE for Logistic Regression (lrFuncs or caretFuncs + glm):
Output:
Error in {: task 1 failed - "undefined columns selected"
The same code using ldaFuncs works well:
My sessionInfo():
A sample dataset (some warning messages about collinearity appear due to the sample size; to reproduce the error without these warnings, please use the entire set at https://github.com/rnmourao/r_3.6.1-caret-classificacao/blob/master/dados/train.csv):
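For anyone who lands here with the same message: in base R, "undefined columns selected" is raised when a data frame is indexed with a column name it does not have. The sketch below shows one way to find which requested names are missing. The toy data frame and its column names are made up for illustration, and this is not a diagnosis of the problem reported in this issue.

```r
# Toy data; the column names are hypothetical, not from the issue's dataset.
toy <- data.frame(a = 1:3, b = 4:6, y = c(0, 1, 0))
wanted <- c("a", "b", "c")   # "c" is not a column of toy

# Indexing with a missing name raises the error; capture its message.
msg <- tryCatch(
  {
    toy[, wanted]
    NA_character_
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# List the requested names that are absent from the data frame.
missing_cols <- setdiff(wanted, names(toy))
print(missing_cols)
```

Running the same setdiff() check on the predictor names passed to rfe() against names(df) might help narrow down where a mismatch comes from.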