renatopp / liac-arff

A library for read and write ARFF files in Python
MIT License
99 stars 49 forks source link

Issues when string data has a character `\n` or `\r\n` #89

Closed glemaitre closed 5 years ago

glemaitre commented 5 years ago

I am not sure that this is something easy to handle. Currently, if the data contain a string with the escape character \n or\r\n, then the parser will fail. I would assume this due to the fact that those characters are used to split the data as well.

jnothman commented 5 years ago

Yes. Lack of support for newlines in strings is something I'm a bit sad about. I'm not sure it's clear how to represent this in arff.

glemaitre commented 5 years ago

Would it be possible to replace the following:

https://github.com/renatopp/liac-arff/blob/e801852210c2ed65084248f50e1c2ca09514c1a8/arff.py#L729-L731

wiht a regular expression matching the quotes. Something like:

import re
PATTERN = re.compile(r'''((?:[^\n"']|"[^"]*"|'[^']*')+)''')
data = """relation\n'a\nb'"""
PATTERN.split(data)[1::2]
['relation', "'a\nb'"]

However, I am pretty bad with regular expression so I might miss something there.

jnothman commented 5 years ago

You're right that this might be the main place to handle it. I doubt your regular expression is sufficient.

The problem is that arff doesn't seem to define this case. Or does it? For example we may want to require that new line characters appear as \n rather than ASCII newlines to keep the parsing simple. But that is not in the spec.

mfeurer commented 5 years ago

To the best of my knowledge this is not defined.

From trying to read and then write the same file with WEKA I figured out:

Real newlines do not work as WEKA then thinks that a new data point/instance starts.

WEKA reads and emits the following file without any issues:

@relation foobar

@attribute a string
@attribute cls {'class A','class B'}

@data
'A\nB','class A'
'B\nC','class B'

I suggest that you write an email to the WEKA user list to ask for a clarification on how to handle newlines.

glemaitre commented 5 years ago

I am getting a bit lost to see the difference of real newlines and \n.

we have a new issue here as liac-arff does not behave like WEKA behaves on the arff snippet above as it removes the backslash while reading the file.

I think that this is basically the issue that I am mentioning here. If I have a string with a newline, you will consider it as a real new line instead of being part of a string.

For example we may want to require that new line characters appear as \n rather than ASCII newlines to keep the parsing simple.

@jnothman I have pretty limited knowledge there. How does a file look like with ASCII newlines?

jnothman commented 5 years ago

If weka reads and emits 'A\nB' that suggests it is not interpreting the \ as \ which would be emitted as 'A\nB'. Implicitly, then, \n is being used as an escaped newline, and we should handle the same.

ASCII newline looks like a new line :) I mean "line feed" = character 0x0a (at least in *NIX) or perhaps DOS's "carriage return, line feed" = 0x0d0a.

jnothman commented 5 years ago

It basically means that we should probably follow java conventions for escaping, including \n,\r,\t,\uxxxx,\0,\0x,\0xx,\b,\f ( https://stackoverflow.com/questions/1367322/what-are-all-the-escape-characters). Currently we only handle the literal escapes, i.e. \ -> \, \" -> ", \' -> '.

mfeurer commented 5 years ago

Implicitly, then, \n is being used as an escaped newline, and we should handle the same.

Definitely.

It basically means that we should probably follow java conventions for escaping

I totally agree on that. The one time I looked for a port of the java tokenizer I couldn't find anything.

Ideally, we'd have some compatibility tests which read an arff file in WEKA and emit the arff file again via the command line which are triggered automatically on travis-ci, but I currently don't have the time to look into this :(