Closed glemaitre closed 5 years ago
Are you using the character as a separator instead of a comma? Or inside a string field?
According to the unit test in the PR, it is the second case, but you would like it to be treated exactly like a regular space? Is there any reference in the ARFF manual regarding that? Also, do you know how WEKA behaves with respect to this character?
Related to #90.
> Are you using the character as a separator instead of a comma? Or inside a string field?
It is just a Japanese string with this space inside, so this is the second case.
> According to the unit test in the PR, it is the second case, but you would like it to be treated exactly like a regular space?
It should be treated as a separator space. The ARFF format says that such a string should then be quoted. I have no idea what WEKA does with those characters.
I just checked the behavior of WEKA with this file:
@RELATION name
@ATTRIBUTE A STRING
@ATTRIBUTE B STRING
@DATA
a, b
b e, a
and after reading it and then saving it I got this output:
@relation name
@attribute A string
@attribute B string
@data
a,b
b e,a
It seems like WEKA doesn't care about the Japanese whitespace. Therefore, I'm not sure we need to take any action on this right now. Maybe we should first ask about this on the WEKA mailing list?
> It seems like WEKA doesn't care about the Japanese whitespace.
This is not limited to Japanese whitespace. We also have the "BeerAdvocate" dataset, which encodes whitespace with \xa0. I think this is not that infrequent.
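To illustrate the point about \xa0 and \u3000, here is a small Python 3 check (not taken from the library under discussion) comparing a pattern that only matches a literal space with `\s`. The names `CANDIDATES`, `describe`, and `report` are made up for this sketch.

```python
import re
import unicodedata

# Three characters that all render as blank space.
CANDIDATES = {
    "SPACE": " ",
    "NO-BREAK SPACE": "\xa0",
    "IDEOGRAPHIC SPACE": "\u3000",
}


def describe(ch):
    """Return (Unicode category, matches a literal space?, matches \\s?)."""
    return (
        unicodedata.category(ch),
        re.search(" ", ch) is not None,
        re.search(r"\s", ch) is not None,
    )


report = {name: describe(ch) for name, ch in CANDIDATES.items()}
for name, row in report.items():
    print(name, row)
```

All three characters are in the Unicode "Zs" (space separator) category, but only `\s` matches every one of them when the pattern is a `str` in Python 3.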
> Therefore, I'm not sure we need to take any action on this right now.
I strongly think that we should fix it now. Then we can also point this out to the WEKA developers, since they specified that strings with spaces should be quoted.
No matter what WEKA does, adding those brackets should not hurt anyway. Let's add this.
We had a dataset with the character \u3000, which is one of the separator spaces. The regular expression in the encoder only matches \\ while it should use \s to match all possible separators and add quotes around the string, to be compliant with the ARFF format. I will open a PR to solve this issue.
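A minimal sketch of the proposed behaviour, assuming Python 3 `str` values. The names `NEEDS_QUOTES` and `encode_string` and the exact character class are hypothetical and are not the encoder's actual code; the only point is that `\s` makes any Unicode whitespace trigger quoting.

```python
import re

# Hypothetical pattern: any whitespace (\s), quote, backslash, comma or percent sign.
NEEDS_QUOTES = re.compile(r"[\s'\"\\,%]")


def encode_string(value):
    """Wrap value in single quotes if it contains a character that needs quoting."""
    if NEEDS_QUOTES.search(value):
        # Escape backslashes first, then single quotes, before wrapping.
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        return "'" + escaped + "'"
    return value


print(encode_string("a"))
print(encode_string("b\u3000e"))
print(encode_string("b\xa0e"))
```

With a pattern that only lists the regular space, the last two values would be written unquoted.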