genfromtxt returns inconsistent output if column converters are used (migrated from Trac #1665)

thouis commented 12 years ago

Original ticket http://projects.scipy.org/numpy/ticket/1665 Reported 2010-11-08 by trac user dmoore, assigned to atmention:pierregm.

''Disclaimer: This was tested on 1.4.1 I don't know if it was addressed in 1.5 (I suspect not based on mailing list correspondence).''

Simple example of the issue:

>>> import numpy, StringIO
>>>
>>> s=StringIO.StringIO('q1,2\nq3,4')
>>> a=numpy.genfromtxt(s,delimiter=',',converters={0:lambda s:float(s[1:])})
>>> a
array([(1.0, 2.0), (3.0, 4.0)],
      dtype=[('f0', '|O4'), ('f1', '<f8')])

Notice that this is a 1d array of tuples. Since the data are effectively the same type, it would have been preferrable to receive 2d output. Note that specifying the dtype does not help:

>>> s=StringIO.StringIO('q1,2\nq3,4')
>>> a=numpy.genfromtxt(s,delimiter=',',converters={0:lambda s:float(s[1:])},dtype=float)
>>> a
array([(1.0, 2.0), (3.0, 4.0)],
      dtype=[('f0', '|O4'), ('f1', '<f8')])

dtype appears to be ignored.

Strangely, however, if the data column could be interpreted without the converter, then the output is produced as desired even if the converter is used:

>>> s=StringIO.StringIO('1,2\n3,4')
>>> a=numpy.genfromtxt(s,delimiter=',',converters={0:lambda s:float(s)})
>>> a
array([[ 1.,  2.],
       [ 3.,  4.]])

thouis commented 12 years ago

Comment in Trac by atmention:pierregm, 2010-11-11

As you've seen, the output of your converter is not detected as a float, but as an object. That's an unfortunate side effect of using a lambda function such as yours: what if your input string has only 1 character ? You end up taking the float of an empty string, which raises a ValueError. In practice, that's exactly what happens below the hood when genfromtxt tries to guess the output type of the converter. It tries a single value ('1'), fails, and decides that the result must be an object... Probably not the best strategy, as it crashes in your case. But yours is a buggy case anyway.

Try that instead of your lambda function def func(s): try: r = float(s[1:]) except ValueError: r = 1. return r

You could object that as the dtype is defined, it should take precedence over the output typeof the converter. Well, I assumed exactly the opposite: if the user took the time to define a converter, we should respect his/her choice and overwrite the dtype.

thouis commented 12 years ago

Comment in Trac by trac user dmoore, 2010-11-12

As you've seen, the output of your converter is not detected as a float, but as an object. That's an unfortunate side effect of using a lambda function such as yours: what if your input string has only 1 character ? You end up taking the float of an empty string, which raises a ValueError?. In practice, that's exactly what happens below the hood when genfromtxt tries to guess the output type of the converter. It tries a single value ('1'), fails, and decides that the result must be an object... Probably not the best strategy, as it crashes in your case. But yours is a buggy case anyway.

Sorry to disagree, but why is my example a buggy case? If I can guarantee that all of my input will convert correctly using my converter (i.e. all of my column 1 input ALWAYS consists of an alpha char following by an integer) no bugs occur. IMO genfromtxt shouldn't pass invalid values to the converter.

btw, the code I gave is only an example to illustrate the problem. My actual code is much more complicated and involves converting dates and various strings to numeric values on very large datasets. In all of those cases, having to handle the case when genfromtxt passes the value '1' makes no sense in the context of my data and cruds up my code. I can at least use that as a workaround for now, but it wasn't obvious genfromtxt worked like that from the docs.

You could object that as the dtype is defined, it should take precedence over the output typeof the converter. Well, I assumed exactly the opposite: if the user took the time to define a converter, we should respect his/her choice and overwrite the dtype.

Not sure why a caller would pass a dtype when using the converter if they didn't want the dtype to be operative. Here's a use case where respecting the dtype would be helpful: column 1:N = float, column N+1 = int, column N+2 = date. Converter is used to convert the date to an int type. Now say the user wants all types to be int using dtype = int. Currently, I would have to use a converter for every column.

thouis commented 12 years ago

Comment in Trac by atmention:pierregm, 2010-11-13

Sorry, I should have picked a better word than "buggy". It remains that your converter is not fail-proof: what if one of your input is only 1 character long (for whatever reason) ?

In any case, I have to agree with you; the current behavior of genfromtxt with converters is a bit shady. I'm working as I'm writing that, following a suggestion from Luis on the mailing-list.

For your last example: in that case, provided you define a converter date->integer, you could use a simple dtype=int to get what you want, without having to define N+2 converters. In that case, your converter would output an int, which would be compatible with the dtype.

thouis commented 12 years ago

Comment in Trac by atmention:pierregm, 2010-11-13

OK, I just committed a bug fix on git: instead of taking an arbitrary value to determine the type of the output of the converter, the first value (from the first line) is used. Note that if you have names in your first row, we revert to the arbitrary value ('1').

Please test and let me know how it goes before I can close the ticket.

thouis / numpy-trac-migration

genfromtxt returns inconsistent output if column converters are used (migrated from Trac #1665) #3749