Closed Hadsga closed 5 years ago
Some general notes:
1) Is your R locale set to Russian? If you are on Windows, you can use Sys.setlocale(locale = "Russian");
2) If you are on Windows, is your Windows administrative locale set to Russian? (Sometimes changing the locale solves some problems).
3) Could you provide us with a minimal reproducible example of R code (please, follow these guidelines)?
4) You may try to transliterate your Cyrillic names into Latin ones, e.g., this example.
Sys.setlocale("LC_ALL","Russian").
dput(sample_data)
structure(list(city = c("Сергиев", "Новосибирск", "Красноярск",
"Химки", "Курск", "Москва", "Волжский", "Уфа", "Коломна", "Москва"
), item_cnt_month = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -10L))
dummy_data = createDummyFeatures(sample_data, target = "item_cnt_month")
task = makeRegrTask(data = dummy_data, target = "item_cnt_month")
4. That works, thanks.
I'm working in the "English_United States.1252" R locale on Windows 10. With your example, I got this error:
Error in (function (cn, x) :
Unsupported feature type (character) in column 'city'.
These lines solved the issue:
sample_data$city <- as.factor(sample_data$city) # convert to factor
sample_data <- as.data.frame(sample_data) # convert to pure data frame
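The same conversion can be applied to every character column at once. This is a minimal base-R sketch, assuming a small stand-in data frame (the ASCII city names are placeholders, not the original data):

```r
# Stand-in for sample_data; the values are illustrative placeholders.
sample_data <- data.frame(
  city = c("Moscow", "Ufa", "Kursk"),
  item_cnt_month = c(1, 1, 1),
  stringsAsFactors = FALSE
)

# Find the character columns and convert each one to a factor.
is_chr <- vapply(sample_data, is.character, logical(1))
sample_data[is_chr] <- lapply(sample_data[is_chr], factor)

# Drop any tibble classes so that only a plain data frame remains.
sample_data <- as.data.frame(sample_data)

str(sample_data)
```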
The next problem was missing values in the target variable:
Error in mlr::makeRegrTask(data = dummy_data, target = "item_cnt_month") :
Assertion on 'item_cnt_month' failed: Contains missing values (element 7).
And this is not a language-dependent issue.
But if I switch to the Russian locale, I can reproduce your issue. Has switching the administrative locale to Russian helped you? If you are working in Russian, it is worth doing.
My bad. In the original data, city is converted into a factor and there are no NAs in the target variable.
However, if I change the administrative locale, the column city looks like this (this is done without Sys.setlocale(locale = "Russian")):
<U+0422><U+044E><U+043C><U+0435><U+043D><U+044C>
If I try to use Sys.setlocale(locale = "Russian") (i.e. without "LC_ALL"), I get this error:
Error in Sys.setlocale("Russian") : invalid 'category' argument
I think the best option is to use your suggestion number 4.
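A note on the "invalid 'category' argument" error: the call shown in the error message is Sys.setlocale("Russian"), with no argument name. The first formal argument of Sys.setlocale() is category, so an unnamed "Russian" is taken as the category rather than the locale. A small sketch that shows this without changing the session locale:

```r
# The first formal argument is 'category', the second is 'locale'.
arg_names <- names(formals(Sys.setlocale))
print(arg_names)

# An unnamed "Russian" is matched to 'category'. It is not a valid category
# name, so the call stops before any locale is changed.
res <- tryCatch(
  Sys.setlocale("Russian"),
  error = function(e) e
)
print(inherits(res, "error"))
print(conditionMessage(res))

# Naming the argument avoids the problem. On Windows this would be:
# Sys.setlocale(category = "LC_ALL", locale = "Russian")
```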
I have summarized the main points of the conversation above in this example. I switched my Windows locale to Russian and then used this R setup:
library(mlr)
Sys.setlocale(locale = "Russian")
If certain words, such as "Красноярск" (Krasnoyarsk city), are included as column names, as in this example:
dummy_data = structure(list(
item_cnt_month = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
city.Красноярск = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0)
),
class = "data.frame",
row.names = c("1","2", "3", "4", "5", "6", "7", "8", "9", "10"))
task = makeRegrTask(data = dummy_data, target = "item_cnt_month")
Then makeRegrTask() fails with this error:
Error in makeTask(type = type, data = data, weights = weights, blocking = blocking, :
Assertion on 'data' failed: Columns must be named according to R's variable naming conventions and may not contain special characters.
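In the spirit of suggestion 4, one workaround is to give the columns ASCII-only names before creating the task and to keep a lookup table for mapping them back. This is a minimal base-R sketch. The helper ascii_safe_names() and its placeholder prefix are hypothetical and not part of mlr, and the Cyrillic name is written with \u escapes so that the script parses the same way in any locale:

```r
# Hypothetical helper (not part of mlr): replaces column names that contain
# non-ASCII characters with placeholders and returns a lookup table.
ascii_safe_names <- function(df, prefix = "city") {
  old <- names(df)
  # A name counts as ASCII-only if all of its code points are below 128.
  # isTRUE() treats names that cannot be decoded as non-ASCII.
  is_ascii <- vapply(
    old,
    function(nm) isTRUE(all(utf8ToInt(enc2utf8(nm)) < 128L)),
    logical(1),
    USE.NAMES = FALSE
  )
  new <- old
  new[!is_ascii] <- paste0(prefix, "_", seq_len(sum(!is_ascii)))
  new <- make.unique(new)  # guard against clashes with existing names
  names(df) <- new
  list(
    data = df,
    lookup = data.frame(ascii = new, original = old, stringsAsFactors = FALSE)
  )
}

# "city.Красноярск", written with escapes.
cyr_name <- "city.\u041a\u0440\u0430\u0441\u043d\u043e\u044f\u0440\u0441\u043a"

dummy_data <- data.frame(
  item_cnt_month = c(1, 1, 1),
  x = c(0, 1, 0)
)
names(dummy_data)[2] <- cyr_name

safe <- ascii_safe_names(dummy_data)
print(names(safe$data))  # ASCII-only names to pass on to makeRegrTask()
print(safe$lookup)       # mapping back to the original names
```

Names made only of ASCII letters, digits, and underscores should satisfy the naming assertion, and the lookup table lets the results be relabelled with the original city names afterwards.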
If the whole dataset provided by @Hadsga is used except the column with "Красноярск", the code works as expected:
dummy_data = structure(list(
item_cnt_month = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
city.Волжский = c(0, 0, 0, 0, 0, 0, 1, 0, 0, 0),
city.Коломна = c(0, 0, 0, 0, 0, 0, 0, 0, 1, 0),
# city.Красноярск = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0),
city.Курск = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 0),
city.Москва = c(0, 0, 0, 0, 0, 1, 0, 0, 0, 1),
city.Новосибирск = c(0, 1, 0, 0, 0, 0, 0, 0, 0, 0),
city.Сергиев = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0),
city.Уфа = c(0, 0, 0, 0, 0, 0, 0, 1, 0, 0),
city.Химки = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 0)
),
class = "data.frame",
row.names = c("1","2", "3", "4", "5", "6", "7", "8", "9", "10"))
task = makeRegrTask(data = dummy_data, target = "item_cnt_month")
@pat-s, doesn't it seem like an encoding issue?
And doesn't it seem similar to Rapporter/pander#296 (to sum up: after R 3.4.0 was released, some encoding issues appeared and it was impossible to use the package with data in certain languages), which persisted for almost two years until it was solved by adding enc2native() in certain lines of code (see Rapporter/pander@06c2f65 for details)? These are just my ideas.
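To check whether encoding is involved, it can help to look at how R has marked a string. This is a minimal base-R sketch, assuming a string built with \u escapes (which R marks as UTF-8). enc2native() is the base function mentioned above and enc2utf8() is its counterpart; whether adding either of them would fix anything in mlr is not established here:

```r
# "Тюмень" written with escapes; R marks such strings as UTF-8.
x <- "\u0422\u044e\u043c\u0435\u043d\u044c"

print(Encoding(x))   # declared encoding of the string
print(utf8ToInt(x))  # code points, independent of the locale

# enc2native() converts to the encoding of the current locale. Characters
# that the locale cannot represent are typically shown in the <U+XXXX> form.
x_native <- enc2native(x)
print(x_native)
print(Encoding(x_native))

# enc2utf8() converts to UTF-8 and leaves UTF-8 strings unchanged.
same <- identical(enc2utf8(x), x)
print(same)
```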
@GegznaV Thanks for all your time here. TBH, I don't have much experience with encoding, and dealing with non-Latin characters is out of scope for us here.
If there is an easy canonical fix, I am happy to take a look at it.
Closing here since the issue is not really related to mlr.
I have a data set with Russian column names:
If I try to create a task I get this error: