Encoding problem when downloading OpenML data set

ghost commented 10 years ago

I wanted to create an mlr task from a data set that I just uploaded to OpenML. Unfortunately, there seems to be a problem with some characters in the data set's description.

Do I have to change the description or is there another way to fix this?

R Log:
Downloading data set 'libras_move' from OpenML repository.
Intermediate files (XML and ARFF) will be stored in : C:\Users\tob\AppData\Local\Temp\RtmpU183Lr
Downloading file: C:\Users\tob\AppData\Local\Temp\RtmpU183Lr/data_set_desc_libras_move_v1.xml
from: http://openml.org/api/?f=openml.data.description&data.id=299
Input is not proper UTF-8, indicate encoding !
Bytes: 0xED 0x73 0x63 0x61
Error : 1: Input is not proper UTF-8, indicate encoding !
Bytes: 0xED 0x73 0x63 0x61

Error in parseXMLResponse(file, "Getting data set description", "data_set_description") :
Error in parsing XML for type data_set_description in file: C:\Users\tob\AppData\Local\Temp\RtmpU183Lr/data_set_desc_libras_move_v1.xml
All intermediate XML and ARFF files are now removed.

traceback()
5: stop(obj)
4: stopf("Error in parsing XML for type %s in file: %s", type, file)
3: parseXMLResponse(file, "Getting data set description", "data_set_description")
2: parseOpenMLDataSetDescription(file = fn.data.set.desc)
1: downloadOpenMLDataAsMlrTask("libras_move", clean.up = TRUE)

sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C LC_TIME=German_Germany.1252

attached base packages:
[1] grid stats graphics grDevices utils datasets methods base

other attached packages:
[1] ROCR_1.0-5 gplots_2.13.0 plyr_1.8.1 cluster_1.15.2 BatchJobs_1.4 R.matlab_3.0.1
[7] fail_1.2 OpenML_1.0 DMwR_0.4.1 lattice_0.20-29 mlr_2.2 BBmisc_1.7
[13] ParamHelpers_1.3 devtools_1.5

loaded via a namespace (and not attached):
[1] abind_1.4-0 bitops_1.0-6 brew_1.0-6 caTools_1.17 checkmate_1.2
[6] class_7.3-10 codetools_0.2-8 DBI_0.2-7 digest_0.6.4 evaluate_0.5.5
[11] gdata_2.13.3 gtools_3.4.1 httr_0.3 KernSmooth_2.23-12 memoise_0.2.1
[16] parallel_3.1.0 parallelMap_1.1 quantmod_0.4-0 R.methodsS3_1.6.1 R.oo_1.18.0
[21] R.utils_1.32.4 Rcpp_0.11.2 RCurl_1.95-4.1 rJava_0.9-6 rjson_0.2.14
[26] rpart_4.1-8 RSQLite_0.11.4 RWeka_0.4-23 RWekajars_3.7.11-1 sendmailR_1.1-2
[31] splines_3.1.0 stringr_0.6.2 survival_2.37-7 tools_3.1.0 whisker_0.3-2
[36] XML_3.98-1.1 xts_0.9-7 zoo_1.7-11

dominikkirchhoff commented 10 years ago

I can download the data and the task without any problem (using a Mac), so it seems to be an OS-related problem. I don't see another way than changing the description at the moment.

joaquinvanschoren commented 10 years ago

Also cannot reproduce the error (I have no Windows machines around). In any case, I have edited the description and removed some possibly offending characters...

Is this just a fluke, or will you have problems again with another dataset?

ghost commented 10 years ago

For me, the error occurred for the data sets "libras_move" and "yeast_ml8". For the first data set, the problem seems to be the letter "í" in one of the authors' names. Is it already possible to edit the description?
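The bytes reported in the log (0xED 0x73 0x63 0x61) are consistent with that diagnosis. The following is a minimal sketch in Python, not part of the OpenML R package, and it assumes the description was served as Latin-1 or Windows-1252 text rather than UTF-8. Under that assumption the four bytes are rejected as UTF-8 but decode cleanly as Latin-1, with 0xED mapping to "í":

```python
# Bytes copied from the parser error message in the log above.
raw = bytes([0xED, 0x73, 0x63, 0x61])


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if data is valid UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# 0xED opens a three-byte UTF-8 sequence, but 0x73 is not a continuation byte,
# so a strict UTF-8 decoder rejects the input.
utf8_ok = decodes_as_utf8(raw)

# Assumption: the text was really Latin-1 (Windows-1252 maps 0xED the same way).
as_latin1 = raw.decode("latin-1")
as_cp1252 = raw.decode("cp1252")

# ascii() keeps the output printable on consoles that cannot display "í".
print(utf8_ok, ascii(as_latin1), as_latin1 == as_cp1252)
```

This only shows that the bytes are plausible single-byte-encoded text. It does not establish which encoding the server actually used.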

joaquinvanschoren commented 10 years ago

OK, after reading this: http://stackoverflow.com/questions/13495133/debugging-encoding-problems-r-xml it seems that the problem might be related to a difference between the encoding of the XML and the encoding used in the database. The database was using utf8_unicode. I changed that to utf8_bin. Maybe that helps (but I doubt it). Also see the link for a clue on how to solve it.
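One client-side way to handle such a mismatch is to re-decode the response before parsing it. The sketch below is in Python rather than R and is not how the OpenML package works. The `<description>` element, the `parse_with_fallback` helper, and the `cp1252` fallback are all hypothetical, chosen only to illustrate the idea with the bytes from the error log:

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the downloaded description file: the four bytes
# from the error log wrapped in a made-up element, with no encoding declaration.
payload = b"<description>\xedsca</description>"


def parse_with_fallback(data: bytes, fallback: str = "cp1252") -> ET.Element:
    """Parse XML bytes; if the parser rejects them, decode the bytes with the
    fallback encoding (an assumption, not detected) and parse the text."""
    try:
        return ET.fromstring(data)
    except ET.ParseError:
        return ET.fromstring(data.decode(fallback))


root = parse_with_fallback(payload)

# ascii() keeps the output printable on consoles that cannot display "í".
print(root.tag, ascii(root.text))
```

The fallback is a guess: a parse error can have causes other than encoding, and a wrong fallback encoding silently yields wrong characters. Declaring the correct encoding on the server side remains the more robust fix.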

Yes, it is possible to edit the descriptions online (click the edit button), but that won't immediately affect the XML, since the changes are not immediately stored in the database. If you are still having problems, I'll make sure that the XML also includes recent changes tomorrow...

ghost commented 10 years ago

Hi Joaquin, thanks for the links. After looking at http://stackoverflow.com/questions/13525539/how-to-retrieve-a-very-long-xml-string-from-an-sql-database-with-r?lq=1 I still don't know how to apply that to my problem. What should the connection and the query look like?

Anyway, I don't actually need the description at all; I just want the data. Would it be a solution to have downloadOpenMLDataAsMlrTask() download and transform only the .arff file? Perhaps there could be a separate function to get the corresponding description.

Furthermore, I've just adjusted the description via the new edit function. As soon as the database is updated, I can check again.

joaquinvanschoren commented 10 years ago

Editing the description online will immediately change what is returned by the API in XML. This means that problematic characters can be removed manually.

Is this a sufficient solution for this issue?

ghost commented 10 years ago

As far as I'm concerned that's absolutely sufficient. Thanks!

openml / openml-r, issue #24