Loading aligned dtype error (migrated from Trac #1619)

thouis commented 12 years ago

Original ticket http://projects.scipy.org/numpy/ticket/1619 Reported 2010-09-24 by trac user Ihor.Melnyk, assigned to unknown.

import numpy as np

t = np.dtype('i1, i4, i1', align=True)
d = np.zeros(1, t)

np.save("test.npy", d)
data = np.load("test.npy")

Traceback (most recent call last):
  File "D:\Projects\Cuda\Cuda_Git\pathwise\liinc\model\feeds\numpy_bug.py", line 8, in <module>
    data = np.load("test.npy")
  File "D:\Projects\Cuda\Cuda_Git\pathwise\pathwise\vendors\lib64\x64\python\numpy\lib\npyio.py", line 314, in load
    return format.read_array(fid)
  File "D:\Projects\Cuda\Cuda_Git\pathwise\pathwise\vendors\lib64\x64\python\numpy\lib\format.py", line 440, in read_array
    shape, fortran_order, dtype = read_array_header_1_0(fp)
  File "D:\Projects\Cuda\Cuda_Git\pathwise\pathwise\vendors\lib64\x64\python\numpy\lib\format.py", line 358, in read_array_header_1_0
    dtype = numpy.dtype(d['descr'])
ValueError: two fields with the same name

thouis commented 12 years ago

Attachment in Trac by trac user jpeel, 2010-12-29: 0001-BF-added-fix-for-loading-aligned-dtype.patch

thouis commented 12 years ago

Comment in Trac by trac user jpeel, 2010-12-29

I've come up with a possible solution, but first let me explain the real problem here.

The real problem here is that dtype() can't handle a list that describes an aligned dtype. For instance, in the example given above, the list that is put into dtype() when the file is loaded is

{{{[('f0', '|i1'), ('', '|V3'), ('f1', '<i4'), ('f2', '|i1')]}}}

on my machine. The {{{('', '|V3')}}} entry indicates that this is an aligned dtype. The first element in each tuple is the name of the field. When dtype finds a name equal to {{{''}}}, it currently sets the name of that field to {{{'f#ind'}}}, where #ind is the index of the tuple in the list. That means that the {{{('', '|V3')}}} entry in the above list has its name set to {{{'f1'}}}, which then causes an error when the next item also has the name {{{'f1'}}}.
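The naming collision can be modelled in a few lines of plain Python. This is only a sketch of the rule as described in this comment, not NumPy's actual C code; `assign_default_names` and `find_duplicates` are hypothetical helpers introduced for illustration.

```python
def assign_default_names(descr):
    """Model the rule described above: an empty name becomes 'f<index>',
    where <index> is the position of the tuple in the input list."""
    names = []
    for index, (name, _fmt) in enumerate(descr):
        names.append(name if name else f"f{index}")
    return names


def find_duplicates(names):
    """Return every name that appears more than once, in order of detection."""
    seen, dupes = set(), []
    for name in names:
        if name in seen:
            dupes.append(name)
        seen.add(name)
    return dupes


# The list that dtype() receives when the file from the report is loaded.
descr = [('f0', '|i1'), ('', '|V3'), ('f1', '<i4'), ('f2', '|i1')]
names = assign_default_names(descr)
print(names)                   # ['f0', 'f1', 'f1', 'f2']
print(find_duplicates(names))  # ['f1']
```

The padding entry sits at index 1, so it is renamed to 'f1' and clashes with the real field of that name.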

Okay, so the problem then is that dtype() currently doesn't know how to read in an aligned type correctly. Here is how I approached solving this.

1. If the align parameter doesn't equal 1 and the name of the first tuple is not equal to {{{''}}}, then set checkalign to 1 so that this input list could indicate an aligned array.
2. If checkalign, then trigger that a tuple deals with alignment when:
   a. a tuple's name is {{{''}}}
   b. a tuple's title is NULL
   c. a tuple has only two elements
   d. the datatype of the tuple is VOID
3. When we find a tuple that meets all of the above conditions,
   a. add the size of the VOID tuple to the totalsize
   b. set maxalign to MAX(maxalign, current element size + previous element size)
   c. adjust the name index accordingly
4. Make sure that the alignment is set correctly when checkalign==1.
5. Readjust the size of the nameslist tuple if necessary (if there were any aligning tuples, then nameslist was initialized to be too large).

3.c. refers to an additional index, ii, to keep track of the indices in the nameslist. This is necessary because i refers to the index in the input list.
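The core of these steps can be sketched in plain Python. This is an illustrative model under stated assumptions, not the attached patch: it skips the maxalign bookkeeping, does not model titles, and only understands simple format strings such as '|i1' or '|V3'. The helpers `field_size`, `is_padding` and `parse_aligned_descr` are hypothetical names.

```python
BYTE_ORDER_CHARS = '|<>='


def field_size(fmt):
    """Size in bytes of a simple format such as '|i1', '<i4' or '|V3'."""
    return int(fmt.lstrip(BYTE_ORDER_CHARS)[1:])


def is_padding(entry):
    """Two-element tuple with an empty name and a VOID type (titles not modelled)."""
    if len(entry) != 2:
        return False
    name, fmt = entry
    return name == '' and fmt.lstrip(BYTE_ORDER_CHARS).startswith('V')


def parse_aligned_descr(descr):
    """Skip padding tuples while keeping a separate index for real fields."""
    names, offsets = [], []
    total_size = 0
    ii = 0  # index into the names list, distinct from the position in descr
    for entry in descr:
        if is_padding(entry):
            total_size += field_size(entry[1])  # step 3.a
            continue                            # step 3.c: ii is not advanced
        name, fmt = entry
        names.append(name if name else f"f{ii}")
        offsets.append(total_size)
        total_size += field_size(fmt)
        ii += 1
    return names, offsets, total_size


descr = [('f0', '|i1'), ('', '|V3'), ('f1', '<i4'), ('f2', '|i1')]
print(parse_aligned_descr(descr))  # (['f0', 'f1', 'f2'], [0, 4, 8], 9)
```

Note that the total comes out as 9 here, because this input list carries no entry for padding after the last field.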

There might be a better approach than this, but this is one possible solution. Other approaches might involve changing the npy format or how aligned dtypes are shown. However, I think that those will be much harder to implement.

In the patch that I'm attaching, I've kept everything in the _convert_from_array_desc function, but it might be a good idea to break it up a bit or make some other changes to make the code more manageable.

thouis commented 12 years ago

Comment in Trac by atmention:rgommers, 2011-03-11

Charles Harris sent a message to the Numpy mailing list about this ticket on March 3rd, probably good to reply there. For completeness, his comments:

In [2]: np.dtype('i1, i4, i1', align=True)
Out[2]: dtype([('f0', '|i1'), ('', '|V3'), ('f1', '<i4'), ('f2', '|i1')])

In [3]: dtype([('f0', '|i1'), ('', '|V3'), ('f1', '<i4'), ('f2', '|i1')])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)

/home/charris/<ipython console> in <module>()

ValueError: two fields with the same name

Note that the second field in the dtype is inserted for alignment purposes and isn't
named. However, the list in the dtype cannot be used to define a dtype because the
empty name is replaced by 'f1', which conflicts with the following field. The patch
attached to the ticket has a rather complicated workaround for this that I would
rather avoid. I would prefer some sort of reserved name for alignment fields, maybe
something like '*0', '*1', etc. Another question I have is whether alignment is supposed to
be preserved across architectures, that is, should it be possible to pickle/savez on
one architecture and read things in on another and still have the elements aligned. I
suspect this isn't worth doing, but one way or the other should be decided.
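As a rough illustration of the reserved-name idea, the sketch below renames unnamed VOID entries to '*0', '*1', and so on. This scheme is only a proposal in this thread; NumPy is not claimed to implement it, and `reserve_padding_names` is a hypothetical helper.

```python
def reserve_padding_names(descr):
    """Give each unnamed VOID entry a reserved name '*<n>' so it cannot
    collide with the default 'f<n>' names of real fields."""
    named = []
    pad_count = 0
    for name, fmt in descr:
        if name == '' and fmt.lstrip('|<>=').startswith('V'):
            name = f"*{pad_count}"
            pad_count += 1
        named.append((name, fmt))
    return named


descr = [('f0', '|i1'), ('', '|V3'), ('f1', '<i4'), ('f2', '|i1')]
renamed = reserve_padding_names(descr)
print(renamed)
# [('f0', '|i1'), ('*0', '|V3'), ('f1', '<i4'), ('f2', '|i1')]
```

With this naming, every entry in the list has a unique name, so the list could in principle be passed back to a constructor that understands the reserved prefix.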

thouis commented 12 years ago

Comment in Trac by atmention:mwiebe, 2011-03-16

I think the bug here is in repr not producing a good string to reconstruct the data type. Rather than coming up with an awkward workaround, I would suggest repr print the following when the exact type would not be reconstructed with the normal repr:

dtype({'names': ['f0', 'f1', 'f2'], 'formats': [dtype('int8'), dtype('int32'), dtype('int8')], 'offsets': [0, 4, 8]})

Unfortunately, however, this is still not good enough! It produces a type with itemsize 9 instead of 12. I think it's also necessary to extend the dtype construction syntax to something like:

dtype({'names': ['f0', 'f1', 'f2'], 'formats': [dtype('int8'), dtype('int32'), dtype('int8')], 'offsets': [0, 4, 8]}, itemsize=12)

Also, to see why the proposed patch doesn't work, consider the following example.

>>> t = np.dtype('i4, i1', align=True)
>>> s = eval(repr(t))
>>> t
dtype([('f0', '<i4'), ('f1', '|i1')])
>>> s
dtype([('f0', '<i4'), ('f1', '|i1')])
>>> t.itemsize
8
>>> s.itemsize
5

Here the repr provides no clue for recovering the correct itemsize. Extending the dtype constructor to have an itemsize parameter is the only reasonable way I can think of to make this work in general.
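A minimal sketch of the offsets-plus-itemsize idea, assuming a NumPy version that accepts an 'itemsize' key in the dict specification. The variable names are illustrative only.

```python
import numpy as np

# An aligned dtype whose trailing padding is invisible in a plain field list.
aligned = np.dtype('i4, i1', align=True)

# Describe it explicitly: names, formats, byte offsets and the total itemsize.
spec = {
    'names': list(aligned.names),
    'formats': [aligned.fields[name][0] for name in aligned.names],
    'offsets': [aligned.fields[name][1] for name in aligned.names],
    'itemsize': aligned.itemsize,
}
rebuilt = np.dtype(spec)

# A plain list of (name, format) pairs loses the padding and packs tightly.
packed = np.dtype(list(zip(spec['names'], spec['formats'])))

print(aligned.itemsize, rebuilt.itemsize, packed.itemsize)
```

The rebuilt dtype keeps the itemsize of the aligned one, while the packed dtype is only 5 bytes wide.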

thouis commented 12 years ago

Comment in Trac by atmention:charris, 2011-03-16

What if the dtype included an alignment number? Then align=1 would be no alignment, align=4 would be on 4-byte boundaries, etc. I think that would provide enough information to recreate the structure.
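One possible reading of this proposal is sketched below in plain Python. It assumes that each field's natural alignment equals its size (true for small integer types) and that the alignment number caps it, similar to a C compiler's packing option. The function `layout` is a hypothetical helper, not a NumPy API.

```python
def round_up(value, multiple):
    """Smallest multiple of `multiple` that is >= value."""
    return -(-value // multiple) * multiple


def layout(sizes, align):
    """Compute field offsets and the total itemsize for the given field sizes,
    placing each field on a boundary of min(size, align) bytes."""
    offsets = []
    offset = 0
    max_align = 1
    for size in sizes:
        field_align = min(size, align)
        offset = round_up(offset, field_align)
        offsets.append(offset)
        offset += size
        max_align = max(max_align, field_align)
    # Trailing padding so that consecutive items stay aligned in an array.
    return offsets, round_up(offset, max_align)


# Field sizes for 'i1, i4, i1'.
print(layout([1, 4, 1], align=1))  # ([0, 1, 5], 6)
print(layout([1, 4, 1], align=4))  # ([0, 4, 8], 12)
```

Under these assumptions, the alignment number together with the field list is enough to recover both the offsets and the trailing padding for the example in this ticket.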

thouis commented 12 years ago

Comment in Trac by atmention:mwiebe, 2011-03-16

Yes, that would work for the specific case of aligned data types, but it's possible (from C) to create data types with sizes that match neither the tightly packed nor the aligned pattern, and I think it's desirable to support that.

I think at some point a function like 'dtype.restrict_to_fields(fields)' would be useful, which would make a data type exactly matching the dtype, but only having the fields mentioned. This would generate a more general data type, so the dtype specification needs to be more general to support the possibility of such a function.
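To make the point concrete, here is a plain-Python sketch of what such a function might do to a dict-style specification. `restrict_to_fields` is the hypothetical function named in this comment, modelled here on dictionaries rather than on real dtype objects.

```python
def restrict_to_fields(spec, fields):
    """Keep only the requested fields, preserving their original offsets
    and the original itemsize."""
    keep = [i for i, name in enumerate(spec['names']) if name in fields]
    return {
        'names': [spec['names'][i] for i in keep],
        'formats': [spec['formats'][i] for i in keep],
        'offsets': [spec['offsets'][i] for i in keep],
        'itemsize': spec['itemsize'],
    }


spec = {
    'names': ['f0', 'f1', 'f2'],
    'formats': ['i1', '<i4', 'i1'],
    'offsets': [0, 4, 8],
    'itemsize': 12,
}
sub = restrict_to_fields(spec, {'f0', 'f2'})
print(sub)
# {'names': ['f0', 'f2'], 'formats': ['i1', 'i1'], 'offsets': [0, 8], 'itemsize': 12}
```

The result has two one-byte fields at offsets 0 and 8 in a 12-byte item, a layout that is neither tightly packed nor what alignment rules alone would produce.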

thouis commented 12 years ago

Comment in Trac by atmention:mwiebe, 2011-03-16

On the other hand, also including an alignment number would be required for doing reasonable type promotions between structured types. If any of the inputs are aligned, it would make sense for the promoted type to be aligned as well, and it may have a different set of fields than any of the inputs.

thouis commented 12 years ago

Comment in Trac by atmention:pv, 2011-04-02

+1 for adding the itemsize parameter. This could be a workaround for #1790 if we decide not to include the trailing padding.

-1 for the align keyword, as it seems that it is not a very explicit specification of the data structure. I'm also not sure if it's portable if stored in .npy files. Native alignment can in principle be checked in UpdateFlags...

thouis commented 12 years ago

Comment in Trac by atmention:mwiebe, 2011-06-21

I believe the align keyword can be well-defined, and C compilers in general have a consistent approach. The reason I really want this in is for in-memory calculations, where operations combining or manipulating aligned structured arrays should produce new aligned structured arrays, not throw away the alignment and force lots of extra copying and buffering.

I'm not sure what the .npy format is doing; I wish it had defined a specification for the structure that is independent of NumPy and Python. I've fixed the repr issue, but the .npy format still breaks.

thouis commented 12 years ago

Comment in Trac by atmention:mwiebe, 2011-06-23

Repr now produces reliable strings:

>>> np.dtype('i1, i4, i1', align=True)
dtype({'names':['f0','f1','f2'], 'formats':['i1','<i4','i1'], 'offsets':[0,4,8], 'itemsize':12}, align=True)

but the original issue with the .npy file format is still there.

thouis / numpy-trac-migration

Loading aligned dtype error (migrated from Trac #1619) #3170