mlr-org / farff

a faster arff parser

Encoding bug? #28

Open giuseppec opened 8 years ago

giuseppec commented 8 years ago

There seems to be an encoding issue, at least on Windows (not sure whether this is caused by Java on Windows, by Windows itself, or by farff):

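    library(OpenML)    # getOMLConfig(), getOMLDataSet()
    library(farff)     # readARFF()
    library(testthat)  # expect_equal()

    oml.conf = getOMLConfig()
    cachedir = oml.conf$cachedir
    data.id = 376
    data.reader = "readr"
    # download the data set so that its ARFF file is in the cache
    getOMLDataSet(data.id)
    path = file.path(cachedir, "datasets", data.id, "dataset.arff")
    d1 = readARFF(path, data.reader = data.reader)
    d2 = RWeka::read.arff(path)
    # compare row by row; the loop stops at the first mismatch
    for (i in seq_len(nrow(d1))) {
      cat(i, fill = TRUE)
      expect_equal(d1$text[i], d2$text[i])
    }
    expect_equal(d1$text[7], d2$text[7])
    d1$text[7]
    d2$text[7]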

The first string mismatch occurs in row 7 of this data set and concerns the string ¤, which in RWeka is represented as ¤. I have experimented with the iconv function to convert the character to UTF-8, but it did not work. Does this work on other operating systems?

> d1$text[7]
[1] "Black Sheep Wall A&M, October 1989 cover 1. Black Sheep Wall (4:20) 2. [1]Broken Circle (Acoustic) (3:21) 3. [2]Notebook (Acoustic) (4:39) Known Formats UK (AM563) 7\" (1,2) UK (AMX563) 10\" (1,2,3) UK (AMCD563) CD (1,2,3) US (CD17875) CD (1) US (SP17801) 12\" (1,2,3) AU (?) 7\" (1,2) _________________________________________________________________ This is how I love you: I wish for a shade I can pull I feel so afraid of watching you grow up This love hurts to much And I try and build a wall So I don't have to see you fall And I pray Go away from my thoughts! Why do you keep coming back Over Black Sheep Wall? Oh, I'd love to hold you close But I play it cool And keep my thoughts in a jar Marked \"dangerous\" And everyone says, \"Never fear - All boys his age experiment with their lives\" But my eyes want to close you out I'll close you out Why do you keep coming back Over Black Sheep Wall? Brother Black Sheep, love is strong There's a shepherd out in every storm And he's not afraid of a little rain Why am I? Why do I keep building up This Black Sheep Wall? Oh, I love you so! Do you really know how much How deep? Black Sheep This is how I love you: With closed eyes With turned back With distance _________________________________________________________________ [3]\"Innocence Mission\" ¤ [4]Discography ¤ [5]Innocence Mission ¤ [6]Tony ¤ [7]NIWEB ¤ ¤ [8]comment References 1. file://localhost/research/ml/datasets/uci/raw/output/ucikdd/circle.html 2. file://localhost/research/ml/datasets/uci/raw/output/ucikdd/notebook.html 3. file://localhost/research/ml/datasets/uci/raw/output/ucikdd/innmiss.html 4. file://localhost/research/ml/datasets/uci/raw/output/ucikdd/discog.html 5. file://localhost/tony/IM 6. file://localhost/tony/ 7. file://localhost/ 8. file://localhost/tony/comment.html"

> d2$text[7]
[1] "Black Sheep Wall A&M, October 1989 cover 1. Black Sheep Wall (4:20) 2. [1]Broken Circle (Acoustic) (3:21) 3. [2]Notebook (Acoustic) (4:39) Known Formats UK (AM563) 7\" (1,2) UK (AMX563) 10\" (1,2,3) UK (AMCD563) CD (1,2,3) US (CD17875) CD (1) US (SP17801) 12\" (1,2,3) AU (?) 7\" (1,2) _________________________________________________________________ This is how I love you: I wish for a shade I can pull I feel so afraid of watching you grow up This love hurts to much And I try and build a wall So I don't have to see you fall And I pray Go away from my thoughts! Why do you keep coming back Over Black Sheep Wall? Oh, I'd love to hold you close But I play it cool And keep my thoughts in a jar Marked \"dangerous\" And everyone says, \"Never fear - All boys his age experiment with their lives\" But my eyes want to close you out I'll close you out Why do you keep coming back Over Black Sheep Wall? Brother Black Sheep, love is strong There's a shepherd out in every storm And he's not afraid of a little rain Why am I? Why do I keep building up This Black Sheep Wall? Oh, I love you so! Do you really know how much How deep? Black Sheep This is how I love you: With closed eyes With turned back With distance _________________________________________________________________ [3]\"Innocence Mission\" ¤ [4]Discography ¤ [5]Innocence Mission ¤ [6]Tony ¤ [7]NIWEB ¤ ¤ [8]comment References 1. file://localhost/research/ml/datasets/uci/raw/output/ucikdd/circle.html 2. file://localhost/research/ml/datasets/uci/raw/output/ucikdd/notebook.html 3. file://localhost/research/ml/datasets/uci/raw/output/ucikdd/innmiss.html 4. file://localhost/research/ml/datasets/uci/raw/output/ucikdd/discog.html 5. file://localhost/tony/IM 6. file://localhost/tony/ 7. file://localhost/ 8. file://localhost/tony/comment.html"

does not work
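
For reference, a minimal sketch of how such a mismatch can be inspected at the byte level. The two strings below are hypothetical stand-ins for d1$text[7] and d2$text[7], and the assumption that one side is Latin-1 is only a guess about the Windows setup, not something confirmed for this data set:

```r
# Hypothetical stand-ins: the currency sign (U+00A4) stored once as UTF-8
# bytes and once as Latin-1 bytes. These are NOT the actual values from the
# data set, only an illustration.
s_utf8 <- rawToChar(as.raw(c(0xc2, 0xa4)))
Encoding(s_utf8) <- "UTF-8"
s_latin1 <- rawToChar(as.raw(0xa4))
Encoding(s_latin1) <- "latin1"

# The raw bytes show which encoding each string really uses.
charToRaw(s_utf8)    # bytes c2 a4
charToRaw(s_latin1)  # byte a4

# iconv needs the correct source encoding; with it, the bytes match.
converted <- iconv(s_latin1, from = "latin1", to = "UTF-8")
identical(charToRaw(converted), charToRaw(s_utf8))
```

If iconv is called with the wrong `from` encoding, the bytes will not line up, which may be one reason an iconv attempt appears not to work.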

jakobbossek commented 8 years ago

No problems on OS X with this code.

berndbischl commented 7 years ago

Isn't this basically a readr issue? What happens if you parse a file containing a similar example directly with readr?
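
Something like the following could serve as such a test. This is only a sketch: the file contents and the column names `id` and `text` are made up for illustration, and the file is written as UTF-8, which is what readr assumes by default:

```r
library(readr)

# Write a tiny delimited file whose text field holds the currency sign
# (U+00A4) as UTF-8 bytes. Contents and column names are made up.
plain_file <- tempfile(fileext = ".csv")
writeBin(
  c(charToRaw("id,text\n1,"), as.raw(c(0xc2, 0xa4)), charToRaw("\n")),
  plain_file
)

# Parse it with readr alone, without farff in between.
plain <- read_delim(
  plain_file,
  delim = ",",
  col_types = cols(id = col_integer(), text = col_character())
)

# Inspect the bytes of the parsed string.
charToRaw(plain$text[1])
```

If the bytes differ between Windows and OS X here, the problem is in readr (or the platform) rather than in farff.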

berndbischl commented 7 years ago

@giuseppec can you try setting the locale parameter in read_delim to specify an encoding on our Windows system? Does this help?
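
A rough sketch of what I mean, assuming the file is stored as ISO-8859-1 (Latin-1). That encoding is only a guess and the file below is made up for illustration; the real encoding of the cached ARFF file would have to be checked first:

```r
library(readr)

# Made-up file: the currency sign stored as a single Latin-1 byte (0xa4).
latin1_file <- tempfile(fileext = ".csv")
writeBin(
  c(charToRaw("id,text\n1,"), as.raw(0xa4), charToRaw("\n")),
  latin1_file
)

# Tell readr which encoding the file uses; readr converts its output to UTF-8.
with_locale <- read_delim(
  latin1_file,
  delim = ",",
  locale = locale(encoding = "ISO-8859-1"),
  col_types = cols(id = col_integer(), text = col_character())
)

# After conversion the string should hold the UTF-8 bytes c2 a4.
charToRaw(with_locale$text[1])
```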