Open philipp-baumann opened 5 years ago
Or is it an issue in the original Rulequest source code and escaping for values is not supported, despite being mentioned in the Rulequest overview webpage?
Yeah, it's their original limitation.
Do you want to PR or should I just make the change?
Hi @topepo, sorry, I just saw this now; it's been a long time. If you have time to just make the change, that would be great! Otherwise, if this is an original limitation, a note in the README would also suffice. Thanks for making this pkg. Cheers
Hi Max,
some colleagues observed that `caret::train()` with `method = "cubist"` errors when some special characters in factor values are present in predictors, tracing back to `Cubist::cubist()`.

I really like Cubist because of its speed and straight-forward way of interpreting results. Thanks a lot for your energy invested in this nice and clean R port!
I thought I'd have a look into the issue to figure out a possible solution. Below is some testing of standard ASCII characters, some of them with special roles in Rulequest Cubist, and non-ASCII umlauts, to diagnose the errors, and a suggestion for resolving a part of the issue:
Based on the errors above, escaping of the following characters does not work: `","`, `":"`, `";"`, `"|"`, , `"ä"`. However, `"."` works fine. I was quite surprised, because according to this info page of Rulequest, escaping should work for comma, colon, period, and vertical bar. Here is the output from the current escaping helper:

My guess is that `"."` is not a problem because C Cubist parses the values in the data file correctly due to separation by comma, and escaping has no effect.
Here is the session info output:
I made a commit in the forked repo here to fix a part of the issue.
The new `escapes()` helper only escapes `","`, `":"`, and `"|"`. This resolves issues with umlaut parsing (no `fixed = TRUE` in `gsub()`), and Cubist now works when factor variables contain these. This change lets `Cubist::cubist()` compute successfully for the semicolon `";"` character in factors, but unfortunately not for the remaining special characters. I was not able to figure out how to get escaping of `","`, `":"`, and `"|"` working.

I have no experience in C (yet). Do you have any ideas why escaping fails for these reserved Cubist characters? Or is it an issue in the original Rulequest source code and escaping for values is not supported, despite being mentioned in the Rulequest overview webpage? Maybe this is also locale specific and depends on the encoding conversion between C Cubist files and R objects.
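For illustration, a helper of the kind described above could be sketched in base R as follows. This is not the code from the commit or from the Cubist package: the name `escape_reserved()` is hypothetical, and the sketch only prefixes the three reserved characters with a backslash. Whether the C code then honours these escapes is exactly the open question here.

```r
# Hypothetical helper (not the package's escapes() function):
# prefix the reserved characters ",", ":" and "|" with a backslash.
# A regular expression with a character class is used, so no
# fixed = TRUE is needed; all other characters are left untouched.
escape_reserved <- function(x) {
  # "\\\\" in the replacement yields one literal backslash,
  # "\\1" re-inserts the matched reserved character.
  gsub("([,:|])", "\\\\\\1", x)
}

values <- c("a,b", "soil:clay", "x|y", "a.b;c")
escaped <- escape_reserved(values)
# Print the raw characters (cat() does not add R's own string escapes)
cat(escaped, sep = "\n")
```

Because only the character class `[,:|]` is matched, periods, semicolons, and non-ASCII letters such as umlauts pass through unchanged.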
Would be great to fully support escaping, because these special characters are quite common. If there is no easy solution, I think it would be helpful to include checks in `Cubist::cubist()` and let it error with an informative message when these characters are in factors or character columns of the predictor data frame.

Thanks for your help, looking forward to your insight into this issue.
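The check suggested above could look roughly like the following base R sketch. The function names `has_reserved()` and `check_predictors()` are hypothetical and not part of the Cubist package, and the default set of reserved characters is an assumption taken from the characters that still fail in this report.

```r
# Hypothetical helper: TRUE for each value containing a reserved character.
has_reserved <- function(values, reserved) {
  hit <- rep(FALSE, length(values))
  for (ch in reserved) {
    # fixed = TRUE treats "|" and friends as literal characters
    hit <- hit | grepl(ch, values, fixed = TRUE)
  }
  hit
}

# Hypothetical pre-flight check: stop with an informative message if any
# factor or character column holds values with reserved characters.
check_predictors <- function(x, reserved = c(",", ":", "|")) {
  stopifnot(is.data.frame(x))
  for (nm in names(x)) {
    col <- x[[nm]]
    if (!is.factor(col) && !is.character(col)) next
    values <- unique(as.character(col))
    values <- values[!is.na(values)]
    bad <- values[has_reserved(values, reserved)]
    if (length(bad) > 0) {
      stop("Column '", nm, "' contains values with reserved characters (",
           paste(reserved, collapse = " "), "): ",
           paste(bad, collapse = "; "), call. = FALSE)
    }
  }
  invisible(x)
}

predictors <- data.frame(site = c("A,1", "B"), depth = c(10, 20),
                         stringsAsFactors = TRUE)
msg <- tryCatch(check_predictors(predictors),
                error = function(e) conditionMessage(e))
print(msg)
```

Calling such a check before the data are written out for the C code would turn an obscure parsing failure into an error that names the offending column and values.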
Cheers, Philipp