laresbernardo / lares

Analytics & Machine Learning R Sidekick
https://laresbernardo.github.io/lares/

Not a valid input: #28

Closed davidfgeorge closed 3 years ago

davidfgeorge commented 3 years ago

Good day to you. Very impressive package, which I am starting to familiarise myself with. When I use corr_var with any existing data frame column name as the target variable, I receive: 'not a valid input: column_name was transformed or does not exist.' The function does, however, produce output. Am I getting something wrong or misunderstanding? Also, I am not sure about the significance of the 'dummy' argument and the reference to 'dummy' in the resulting bar chart. Many thanks, and hoping you stay well. David.

laresbernardo commented 3 years ago

Hi @davidfgeorge Glad to know you're starting to use the library! This message is shown when the var used to correlate is not numerical. What the function does is run one-hot-encoding to create dummy variables (1s and 0s) and selects one of them. So if you want to select one of the possible variables you should run something like corr_var(df, myvariable_category). Here's a reproducible example with the Titanic dataset and the passengers' classes (categorical):

library(lares)
data("dft")
corr_var(dft, Pclass, top = 10) # It will run correlations for Pclass_1

Note the messages displayed after showing the results (you may want to run corr_var(dft, Pclass_1, ...) instead to select the specific category you want).

Automatically using 'Pclass_1
Warning message:
In corr_var(dft, Pclass) : Maybe you meant one of: 'Pclass_1', 'Pclass_2'

Also note that the function ignores one of the categorical values as it will be redundant. In case you need corr_var to show all possible categories (instead of n-1), then use corr_var(df, ..., redundant = TRUE). In our case, it will show the 3 possible categories ("1", "2", "3").
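
To make the one-hot-encoding and the "redundant" category concrete, here is a minimal base-R sketch. It is not how lares implements it internally; the toy data frame and the `Pclass_` column naming are assumptions that only imitate the Titanic example above.

```r
# Toy data standing in for the Titanic example (values are made up)
toy <- data.frame(
  Pclass = factor(c(1, 2, 3, 1, 3, 2)),
  Fare   = c(71.3, 13.0, 7.9, 53.1, 8.1, 26.0)
)

# One 0/1 column per category; "- 1" drops the intercept so every level appears
dummies <- model.matrix(~ Pclass - 1, data = toy)
colnames(dummies) <- paste0("Pclass_", levels(toy$Pclass))

# Each row has exactly one 1, so any one column is implied by the others.
# That is why one category can be dropped as redundant (n-1 columns).
all_sum_to_one <- all(rowSums(dummies) == 1)

# Correlate one specific category with a numeric column
cor_class1 <- cor(dummies[, "Pclass_1"], toy$Fare)
```

Picking `Pclass_1` explicitly here plays the same role as calling corr_var(dft, Pclass_1, ...) rather than letting the function choose a category for you.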

davidfgeorge commented 3 years ago

Hi there Bernardo, I hope that you, your family and all those close to you are safe and well.

Thank you for replying so quickly.

All of the column data values are either ‘int’ or ‘num’, with no categoricals:

df_str(test_data, return = "names")

$cols
 [1] "count_expertise_area"            "count_industry"                  "count_references"
 [4] "count_software_versions"         "count_sub_expertise_area"        "count_written_references"
 [7] "grade_communication"             "grade_documentation_quality"     "grade_knowledge_functional_area"
[10] "grade_quality_work"              "grade_speed_understanding"       "invite_score"
[13] "profile_score"                   "purchase_price"                  "reference_score"
[16] "skills_score"                    "stars"                           "total_score_percent"

$nums
 [1] "count_expertise_area"            "count_industry"                  "count_references"
 [4] "count_software_versions"         "count_sub_expertise_area"        "count_written_references"
 [7] "grade_communication"             "grade_documentation_quality"     "grade_knowledge_functional_area"
[10] "grade_quality_work"              "grade_speed_understanding"       "invite_score"
[13] "profile_score"                   "purchase_price"                  "reference_score"
[16] "skills_score"                    "stars"                           "total_score_percent"

$char
character(0)

$factor
character(0)

$logic
character(0)

$time
character(0)

$allnas
character(0)

corr_var(test_data,   # name of dataset
         purchase_price, # name of variable to focus on
         top = 5                  # display top 5 correlations
)
Warning message:
In corr_var(test_data, purchase_price, top = 3) :
  Not a valid input: purchase_price was transformed or does not exist.

Any ideas? Many thanks, David.

laresbernardo commented 3 years ago

Could you please share the data (or a sample) you are using so I can debug? There must be an error somewhere, but I have no idea where. I don't see exactly why the function is behaving like that with your dataset.

davidfgeorge commented 3 years ago

Please see the attached .csv file. Many thanks, David.

laresbernardo commented 3 years ago

Hi! I didn't receive the attachment. Could you please upload it somewhere and share the link so I can download it? I think GitHub does allow file attachments.

davidfgeorge commented 3 years ago

Hi. I can attach files on GitHub, but not .csv, as that format is not supported. I will attach .xlsx and .txt instead.

davidfgeorge commented 3 years ago

Here you are. test_data.txt test_data.xlsx

laresbernardo commented 3 years ago

Hi @davidfgeorge I've just tested your file and no error or warning was shown. The only issue I see is that it's showing 6 results instead of 5 (I'll fix that on my side). To reproduce:

library(lares)
test_data <- read.file("test_data.xlsx")  # read the attached file
corr_var(test_data,      # name of dataset
         purchase_price, # name of variable to focus on
         top = 5         # display top 5 correlations
)

Outcome:

[Screenshot: 2020-11-01, 6:50:01 p.m.]

davidfgeorge commented 3 years ago

Well, no idea what is happening. As per my original message, I do get a result, but also this error message every time the function is run with any variable.

corr_var(test_data,      # name of dataset
         purchase_price, # name of variable to focus on
         top = 3         # display top 3 correlations
)
Warning message:
In corr_var(test_data, purchase_price, top = 3) :
  Not a valid input: purchase_price was transformed or does not exist.

Looking at the code I can see the ‘if/else’ clause containing the message – looks like it will always be displayed?

davidfgeorge commented 3 years ago

Using identical data, the error/warning message still occurs. Are you using a different version of the function? I am using 4.9.7. Please see my latest email. Thank you.

laresbernardo commented 3 years ago

Hi @davidfgeorge That warning message (it is not an error message, is it?) only appears when the variable name is not textually present in your input. The logic behind it runs as follows: if there's no variable (column name) containing that name, it throws an error. If there's a variable that was transformed (by one-hot-encoding), it warns the user and selects one automatically. If you select it manually with the variable_category pattern, no warning or message is shown. Check the code here.

I'm trying one more thing on version 4.9.8 to see if it gets fixed. I still have no idea what's happening on your computer, as I can't replicate the issue, but we'll figure it out. Please update the library and try again in a fresh session. Also, share your sessionInfo() with me.

Oh, and if you reply directly on GitHub, you can format your answers without posting your personal signature and contact information ;)
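
The three cases described above can be sketched roughly as follows. This is only an illustration of the described behaviour, not the code in the package: `resolve_target` is a hypothetical helper, and matching on a `name_` prefix is an assumption about how encoded columns are named.

```r
# Hypothetical helper, NOT the actual code in lares
resolve_target <- function(cols, var) {
  # Case 1: the name is present as-is, so no message is needed
  if (var %in% cols) {
    return(var)
  }
  # Case 2: look for one-hot-encoded columns such as "Pclass_1"
  # (assumption: encoded columns are named <variable>_<category>)
  candidates <- cols[startsWith(cols, paste0(var, "_"))]
  # Case 3: nothing matches at all, which is an error
  if (length(candidates) == 0) {
    stop("Not a valid input: ", var, " does not exist.")
  }
  warning(
    "Automatically using '", candidates[1], "'. Maybe you meant one of: ",
    paste0("'", candidates, "'", collapse = ", ")
  )
  candidates[1]
}
```

With an all-numeric data frame like the one in this thread, a call such as resolve_target(c("purchase_price", "stars"), "purchase_price") falls into the first case and should return silently.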

davidfgeorge commented 3 years ago

I can see that the code block '# Check if main variable exists' referenced here: https://github.com/laresbernardo/lares/blob/master/R/correlations.R#L204 is different from the code supplied in 4.9.7, which would explain the consistent warning message. I have installed 4.9.8 and do not receive the message now.

corr_var(test_data,      # name of dataset
         purchase_price, # name of variable to focus on
         top = 3         # display top 3 correlations
)

[image]

Thanks, and stay well.
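
For anyone landing here with the same warning, a quick version comparison may help before digging further. The sketch below assumes, based only on this thread, that 4.9.7 shows the warning and 4.9.8 does not; `needs_update` is a hypothetical helper, not part of lares.

```r
# Hypothetical helper: TRUE when the installed version predates the one
# that no longer showed the warning in this thread (4.9.8)
needs_update <- function(installed, fixed_in = "4.9.8") {
  package_version(installed) < package_version(fixed_in)
}

old_version <- needs_update("4.9.7")
new_version <- needs_update("4.9.8")

# With lares installed, the real version can be passed in like this:
# needs_update(as.character(utils::packageVersion("lares")))
```

If it returns TRUE, updating the package and retrying in a fresh session, as suggested above, is the first thing to try.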