mlr-org / mlr

Machine Learning in R
https://mlr.mlr-org.com

Task doesn't accept foreign language #2568

Closed Hadsga closed 5 years ago

Hadsga commented 5 years ago

I have a data set with Russian column names:

`glimpse(dat)
Observations: 1,545,898
Variables: 43
$ year                  <dbl> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 201...
$ month                 <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
$ shop_id               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ item_category_id      <int> 0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
$ item_id               <int> 16255, 5740, 5570, 5572, 5573, 5574, 5576, 56...
$ item_cnt_month        <dbl> 1.000000, 1.000000, 1.000000, 1.571429, 1.000...
$ item_cnt_month_lag    <dbl> NA, NA, NA, 1.666667, 1.000000, NA, 1.000000,...
$ item_price_lag        <dbl> NA, NA, NA, 1322, 560, NA, 2231, 2381, NA, 29...
$ item_cnt_month_lag2   <dbl> NA, NA, 1.177778, 1.177778, 1.177778, 1.17777...
$ item_price_lag2       <dbl> NA, NA, 1938.6889, 1938.6889, 1938.6889, 1938...
$ item_cnt_month_lag3   <dbl> 1.163781, 1.163781, 1.163781, 1.163781, 1.163...
$ item_price_lag3       <dbl> 531.262, 531.262, 531.262, 531.262, 531.262, ...
$ city.Адыгея           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Балашиха         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Волжский         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Вологда          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Воронеж          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Выездная         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Жуковский        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Интернет.магазин <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Казань           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Калуга           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Коломна          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Красноярск       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Курск            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Москва           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Мытищи           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Н.Новгород       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Новосибирск      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Омск             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.РостовНаДону     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Самара           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Сергиев          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.СПб              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Сургут           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Томск            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Тюмень           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Уфа              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Химки            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Цифровой         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Чехов            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ city.Якутск           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
$ city.Ярославль        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...`

If I try to create a task I get this error:

Error in makeTask(type = type, data = data, weights = weights, blocking = blocking, : Assertion on 'data' failed: Columns must be named according to R's variable naming conventions and may not contain special characters.

GegznaV commented 5 years ago

Some general notes:

  1. Is your R locale set to Russian? If you are on Windows, you can use Sys.setlocale(locale = "Russian").
  2. If you are on Windows, is your Windows administrative locale set to Russian? (Changing the locale sometimes solves such problems.)
  3. Could you provide a minimal reproducible example of R code (please follow these guidelines)?
  4. You could try transliterating your Cyrillic names into Latin ones, e.g., as in this example.
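The transliteration idea in note 4 can be sketched in base R as follows. This is only a minimal illustration and not the example linked above: the helper name translit_names() and the simplified letter mapping are assumptions made here, and a dedicated transliteration package may handle more cases.

```r
# Hypothetical helper (not part of mlr): transliterate Russian Cyrillic
# letters in column names to Latin, then make the names syntactic.

# Build the alphabet from code points so the script does not depend on the
# encoding of the source file or on the current locale.
cyr_lower <- c(intToUtf8(0x0430:0x044F, multiple = TRUE), intToUtf8(0x0451))
cyr_upper <- c(intToUtf8(0x0410:0x042F, multiple = TRUE), intToUtf8(0x0401))

# Simplified, informal mapping (an assumption, not an official standard).
# The last entry ("yo") belongs to the letter appended above (U+0451).
lat_lower <- c("a", "b", "v", "g", "d", "e", "zh", "z", "i", "y", "k", "l",
               "m", "n", "o", "p", "r", "s", "t", "u", "f", "kh", "ts", "ch",
               "sh", "shch", "", "y", "", "e", "yu", "ya", "yo")
# Capitalise only the first Latin letter for upper-case Cyrillic input.
lat_upper <- paste0(toupper(substr(lat_lower, 1, 1)), substring(lat_lower, 2))

translit_map <- c(lat_lower, lat_upper)
names(translit_map) <- c(cyr_lower, cyr_upper)

translit_names <- function(x) {
  out <- vapply(strsplit(enc2utf8(x), ""), function(chars) {
    hit <- match(chars, names(translit_map))
    found <- !is.na(hit)
    chars[found] <- unname(translit_map[hit[found]])
    paste(chars, collapse = "")
  }, character(1), USE.NAMES = FALSE)
  # Guarantee syntactically valid, unique names after the replacement.
  make.names(out, unique = TRUE)
}

# The second name spells "city.Красноярск" with Unicode escapes.
old <- c("item_cnt_month",
         "city.\u041a\u0440\u0430\u0441\u043d\u043e\u044f\u0440\u0441\u043a")
translit_names(old)
# [1] "item_cnt_month"   "city.Krasnoyarsk"
```

On a data frame this would be applied as names(dat) <- translit_names(names(dat)) before creating the task. Keeping a copy of the original names makes it possible to map results back to the Russian labels later.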

Hadsga commented 5 years ago
  1. I use Sys.setlocale("LC_ALL","Russian").
  2. I am not located in Russia, so I don't want to change the region. However, for a short period, it's no problem.
  3. 
    dput(sample_data)
    structure(list(city = c("Сергиев", "Новосибирск", "Красноярск", 
    "Химки", "Курск", "Москва", "Волжский", "Уфа", "Коломна", "Москва"
    ), item_cnt_month = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1)), class = c("tbl_df", 
    "tbl", "data.frame"), row.names = c(NA, -10L))

    dummy_data = createDummyFeatures(sample_data, target = "item_cnt_month")
    task = makeRegrTask(data = dummy_data, target = "item_cnt_month")



4. That works, thanks.

GegznaV commented 5 years ago

I'm working in "English_United States.1252" R locale on Windows 10. With your example, I got a message like this:

Error in (function (cn, x)  : 
  Unsupported feature type (character) in column 'city'.

These lines solved that issue:

sample_data$city <- as.factor(sample_data$city)   # convert to factor
sample_data <- as.data.frame(sample_data)         # convert to pure data frame

The next problem was missing values in the target variable:

Error in mlr::makeRegrTask(data = dummy_data, target = "item_cnt_month") : 
  Assertion on 'item_cnt_month' failed: Contains missing values (element 7).

And this is not a language-dependent issue.

But if I switch to the Russian locale, I can reproduce your issue. Has switching the administrative locale to Russian helped you? If you work in Russian, it is worth doing.

Hadsga commented 5 years ago

My bad. In the original data, city is converted into a factor and there are no NAs in the target variable. However, if I change the administrative locale, the column city looks like this (without calling Sys.setlocale(locale = "Russian")):

<U+0422><U+044E><U+043C><U+0435><U+043D><U+044C>

If I try to use Sys.setlocale(locale = "Russian") (i.e. without "LC_ALL") I get this error:

Error in Sys.setlocale("Russian") : invalid 'category' argument

I think the best option is to use your suggestion number 4.
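The "invalid 'category' argument" error quoted above is consistent with how Sys.setlocale() matches its arguments: its first formal argument is category and its second is locale, so a single unnamed string is treated as a category. A short sketch of that behaviour follows. The locale name "Russian" is the Windows-style name used in this thread, and on other platforms a different name would be needed (an assumption to check locally).

```r
# Sys.setlocale() takes `category` first and `locale` second.
arg_names <- names(formals(Sys.setlocale))
print(arg_names)

# An unnamed single string is therefore matched to `category`, which fails
# because "Russian" is not a locale category such as "LC_ALL" or "LC_CTYPE".
msg <- tryCatch(
  Sys.setlocale("Russian"),
  error = function(e) conditionMessage(e)
)
print(msg)

# Either of these forms passes "Russian" as the locale instead
# (left commented out because the name is platform specific):
# Sys.setlocale(locale = "Russian")
# Sys.setlocale("LC_ALL", "Russian")
```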

GegznaV commented 5 years ago

I have summarized the main points of the conversation above in this example. I switched my Windows locale to Russian and then used this R setup:

library(mlr)
Sys.setlocale(locale = "Russian")

If certain words, such as "Красноярск" (Krasnoyarsk city), are included as column names, as in this example:

dummy_data = structure(list(
  item_cnt_month   = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1), 
  city.Красноярск  = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0)
), 
class = "data.frame", 
row.names = c("1","2", "3", "4", "5", "6", "7", "8", "9", "10"))

task = makeRegrTask(data = dummy_data, target = "item_cnt_month")

Then makeRegrTask() fails with error:

Error in makeTask(type = type, data = data, weights = weights, blocking = blocking, :
Assertion on 'data' failed: Columns must be named according to R's variable naming conventions and may not contain special characters.

If the whole dataset provided by @Hadsga is used, except the line with "Красноярск", the code works as expected:

dummy_data = structure(list(
  item_cnt_month   = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1), 
  city.Волжский    = c(0, 0, 0, 0, 0, 0, 1, 0, 0, 0),
  city.Коломна     = c(0, 0, 0, 0, 0, 0, 0, 0, 1, 0), 
  # city.Красноярск  = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0),
  city.Курск       = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 0), 
  city.Москва      = c(0, 0, 0, 0, 0, 1, 0, 0, 0, 1),
  city.Новосибирск = c(0, 1, 0, 0, 0, 0, 0, 0, 0, 0),
  city.Сергиев     = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0),
  city.Уфа         = c(0, 0, 0, 0, 0, 0, 0, 1, 0, 0), 
  city.Химки       = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 0)
  ), 
  class = "data.frame", 
  row.names = c("1","2", "3", "4", "5", "6", "7", "8", "9", "10"))

task = makeRegrTask(data = dummy_data, target = "item_cnt_month")

@pat-s, doesn't it seem like an encoding issue?

And doesn't it seem similar to Rapporter/pander#296? To sum up: after R 3.4.0 was released, some encoding issues appeared that made it impossible to use the package with data in certain languages. The problem persisted for almost two years until it was solved by adding enc2native() in certain lines of code (see Rapporter/pander@06c2f65 for details). These are just my ideas.
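To look into the encoding hypothesis, the column names can be inspected directly in base R. The sketch below is a hypothetical diagnostic, not mlr's own check and not a confirmed fix: the helper name describe_names() is an assumption, and it only reports which names contain non-ASCII characters and how R has marked their encoding.

```r
# Hypothetical diagnostic (not mlr's own check): report which column names
# contain non-ASCII characters and how R has marked their encoding.
describe_names <- function(nms) {
  utf8 <- enc2utf8(nms)
  data.frame(
    name      = utf8,
    encoding  = Encoding(nms),   # "unknown", "UTF-8", "latin1" or "bytes"
    non_ascii = vapply(utf8, function(s) any(utf8ToInt(s) > 127L),
                       logical(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

# The second name spells "city.Курск" with Unicode escapes.
nms <- c("item_cnt_month", "city.\u041a\u0443\u0440\u0441\u043a")
info <- describe_names(nms)
info$non_ascii   # FALSE  TRUE
info$encoding    # "unknown" "UTF-8"

# enc2native() re-encodes to the session's native encoding; characters the
# locale cannot represent may be shown as <U+XXXX> escapes.
enc2native(nms)
```

Names flagged as non-ASCII are the candidates for renaming or transliteration before the data is passed to makeRegrTask().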

pat-s commented 5 years ago

@GegznaV Thanks for all your time here. TBH, I don't have much experience with encodings, and dealing with non-Latin characters is out of scope for us here.

If there is an easy canonical fix I am happy to take a look at it.

Closing here since the issue is not really related to mlr.