renatopp / liac-arff

A library for reading and writing ARFF files in Python
MIT License

Issue with separator space other than space #87

Closed glemaitre closed 5 years ago

glemaitre commented 5 years ago

We had a dataset containing the character \u3000, which is one of the Unicode space separators. The regular expression in the encoder only matches \\ while it should use \s to match all possible separators, and the encoder should then add quotes around the string to comply with the ARFF format.

I will open a PR to solve this issue.
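To illustrate the proposed behavior, here is a minimal sketch, not the library's actual encoder. The names `_NEEDS_QUOTES` and `quote_if_needed` are hypothetical, and the backslash-escaping rule is an assumption made only so that the quoted value stays unambiguous.

```python
import re

# Hypothetical pattern: any whitespace character or a comma forces quoting.
# For str patterns in Python 3, \s is Unicode-aware, so it also matches \u3000.
_NEEDS_QUOTES = re.compile(r"[\s,]")


def quote_if_needed(value: str) -> str:
    """Wrap value in single quotes if it contains whitespace or a comma."""
    if _NEEDS_QUOTES.search(value):
        # Assumed escaping: protect backslashes first, then single quotes.
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        return "'" + escaped + "'"
    return value


plain = quote_if_needed("abc")            # no whitespace, returned unchanged
regular = quote_if_needed("a b")          # regular space, gets quoted
ideographic = quote_if_needed("b\u3000e") # ideographic space, also gets quoted
```

With a check for a literal space only, the third value would be written unquoted, which is the situation described in this issue.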

mfeurer commented 5 years ago

Are you using the character as a separator instead of a comma? Or inside a string field?

According to the unit test in the PR it is the second case, but you would like it to be treated exactly like a regular space? Is there any reference in the ARFF manual regarding that? Also, do you know how WEKA behaves with respect to this character?

mfeurer commented 5 years ago

Related to #90.

glemaitre commented 5 years ago

Are you using the character as a separator instead of a comma? Or inside a string field?

It is just a Japanese string with this space inside, so this is the second case.

According to the unit test in the PR it is the second case, but you would like it to be treated exactly like a regular space?

It should be treated as a space separator. The ARFF format says that a string containing spaces should be quoted. I have no idea what WEKA does with those characters.

mfeurer commented 5 years ago

I just checked the behavior of WEKA with this file:

@RELATION name

@ATTRIBUTE A STRING 
@ATTRIBUTE B STRING

@DATA
a, b
b e, a

and after reading it and then saving it I got this output:

@relation name

@attribute A string
@attribute B string

@data
a,b
b e,a

It seems like WEKA doesn't care about the Japanese whitespace. Therefore, I'm not sure if we need to take any action on this right now. Maybe we should first ask about this on the WEKA mailing list?

glemaitre commented 5 years ago

It seems like WEKA doesn't care about the Japanese whitespace.

This is not only about Japanese whitespace. We also have the "BeerAdvocate" dataset, which encodes whitespace as \xa0. I think this is not that infrequent.
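As a quick check of which characters are involved, the sketch below compares a regular space with \xa0 and \u3000 using only the Python 3 standard library. The variable names are illustrative; the comparison with `re.ASCII` is included only to show what an ASCII-only whitespace match would miss.

```python
import re
import unicodedata

# Regular space, no-break space, and ideographic space.
samples = [" ", "\xa0", "\u3000"]

unicode_ws = re.compile(r"\s")          # Unicode-aware for str patterns in Python 3
ascii_ws = re.compile(r"\s", re.ASCII)  # restricted to ASCII whitespace

results = {}
for ch in samples:
    results[ch] = (
        unicodedata.name(ch),         # official Unicode character name
        bool(unicode_ws.search(ch)),  # matched by Unicode-aware \s?
        bool(ascii_ws.search(ch)),    # matched by ASCII-only \s?
    )

for ch, (name, is_unicode_ws, is_ascii_ws) in results.items():
    print(f"U+{ord(ch):04X} {name}: unicode={is_unicode_ws} ascii={is_ascii_ws}")
```

Both \xa0 and \u3000 count as whitespace under the Unicode-aware `\s` but not under the ASCII-only one.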

Therefore, I'm not sure if we need to take any action on this right now.

I strongly think that we should fix it now. Then we can also report it to the WEKA developers, since the format specifies that strings containing spaces should be quoted.

mfeurer commented 5 years ago

No matter what WEKA does, adding those quotes should not hurt anyway. Let's add this.