Open thouis opened 12 years ago
Attachment in Trac by trac user jpeel, 2010-12-29: 0001-BF-added-fix-for-loading-aligned-dtype.patch
Comment in Trac by trac user jpeel, 2010-12-29
I've come up with a possible solution, but first let me explain the real problem here.
The real problem here is that dtype() can't handle a list that describes an aligned dtype. For instance, in the example given above, the list that is put into dtype() when the file is loaded is
{{{[('f0', '|i1'), ('', '|V3'), ('f1', '<i4'), ('f2', '|i1')]}}}
on my machine. The {{{('', '|V3')}}} entry indicates that this is an aligned dtype. The first element in each tuple is the name of the field. When dtype finds a name equal to {{{''}}}, it currently sets the name of that field to {{{'f#ind'}}}, where #ind is the index of the tuple in the list. That means that {{{('', '|V3')}}} in the above list has its name set to {{{'f1'}}}, which then causes an error when the next item also has a name of {{{'f1'}}}.
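To make the collision concrete, here is a minimal pure-Python sketch of the naming rule as described above. The helper name assign_default_names is hypothetical and this is not the actual NumPy C code; it only mimics the "empty name becomes 'f' plus the list index" behaviour.

```python
def assign_default_names(descr):
    """Mimic the described rule: an empty name becomes 'f<index>'.

    Hypothetical helper for illustration only, not NumPy's implementation.
    """
    names = []
    for index, (name, _fmt) in enumerate(descr):
        if name == '':
            # The index counts every entry of the input list,
            # including the unnamed padding entry.
            name = 'f%d' % index
        if name in names:
            raise ValueError('two fields with the same name: %r' % name)
        names.append(name)
    return names


descr = [('f0', '|i1'), ('', '|V3'), ('f1', '<i4'), ('f2', '|i1')]
try:
    assign_default_names(descr)
except ValueError as exc:
    # The padding entry at index 1 was renamed to 'f1',
    # so the real 'f1' field that follows it collides.
    print(exc)
```

A list with only unnamed entries, such as {{{[('', '|i1'), ('', '<i4')]}}}, goes through this rule without trouble; the clash only appears when a padding entry sits in front of a field that already carries the default-style name.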
Okay, so the problem then is that dtype() currently doesn't know how to read in an aligned type correctly. Here is how I approached solving this.
3.c. refers to an additional index, ii, to keep track of the indices in the nameslist. This is necessary because i refers to the index in the input list.
There might be a better approach than this, but this is one possible solution. Other approaches might involve changing the npy format or how aligned dtypes are shown. However, I think that those will be much harder to implement.
In the patch that I'm attaching, I've kept everything in the _convert_from_array_desc function, but it might be a good idea to break it up a bit or make some other changes to make the code more manageable.
Comment in Trac by atmention:rgommers, 2011-03-11
Charles Harris sent a message to the Numpy mailing list about this ticket on March 3rd, probably good to reply there. For completeness, his comments:
In [2]: np.dtype('i1, i4, i1', align=True)
Out[2]: dtype([('f0', '|i1'), ('', '|V3'), ('f1', '<i4'), ('f2', '|i1')])
In [3]: dtype([('f0', '|i1'), ('', '|V3'), ('f1', '<i4'), ('f2', '|i1')])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/home/charris/<ipython console> in <module>()
ValueError: two fields with the same name
Note that the second field in the dtype is inserted for alignment purposes and isn't
named. However, the list in the dtype can not be used to define a dtype because the
empty name is replaced by 'f1', which conflicts with the following field. The patch
attached to the ticket has a rather complicated work around for this that I would
rather avoid. I would prefer some sort of reserved name for alignment fields, maybe
something like '*0', '*1', etc. Another question I have is if alignment is supposed to
be preserved across architectures, that is, should it be possible to pickle/savez on
one architecture and read things in on another and still have the elements aligned. I
suspect this isn't worth doing, but one way or the other should be decided.
Comment in Trac by atmention:mwiebe, 2011-03-16
I think the bug here is in repr not producing a good string to reconstruct the data type. Rather than coming up with an awkward workaround, I would suggest repr print the following when the exact type would not be reconstructed with the normal repr:
dtype({'names': ['f0', 'f1', 'f2'], 'formats': [dtype('int8'), dtype('int32'), dtype('int8')], 'offsets': [0, 4, 8]})
Unfortunately, however, this is still not good enough! It produces a type with itemsize 9 instead of 12. I think it's also necessary to extend the dtype construction syntax to something like:
dtype({'names': ['f0', 'f1', 'f2'], 'formats': [dtype('int8'), dtype('int32'), dtype('int8')], 'offsets': [0, 4, 8]}, itemsize=12)
Also, to see why the proposed patch doesn't work, consider the following example.
>>> t = np.dtype('i4, i1', align=True)
>>> s = eval(repr(t))
>>> t
dtype([('f0', '<i4'), ('f1', '|i1')])
>>> s
dtype([('f0', '<i4'), ('f1', '|i1')])
>>> t.itemsize
8
>>> s.itemsize
5
Here no clue is provided for recovering the correct itemsize. Extending the dtype constructor with an itemsize parameter is the only reasonable way I can think of to make this work in general.
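The arithmetic behind the two examples in this comment can be sketched in plain Python. The helper names packed_extent and round_up are made up for illustration, and the rounding assumes the struct is padded to a multiple of its largest field alignment (4 bytes here), as C compilers commonly do.

```python
def packed_extent(offsets, sizes):
    """Smallest itemsize that still contains every field."""
    return max(off + size for off, size in zip(offsets, sizes))


def round_up(value, multiple):
    """Round value up to the next multiple of `multiple`."""
    return -(-value // multiple) * multiple


# 'i1, i4, i1' with offsets [0, 4, 8]: names, formats and offsets
# alone only account for 9 bytes, while the aligned struct needs 12.
extent = packed_extent([0, 4, 8], [1, 4, 1])
print(extent, round_up(extent, 4))

# 'i4, i1': packed and aligned layouts share the offsets [0, 4],
# so only the trailing padding (5 vs 8 bytes) tells them apart.
extent = packed_extent([0, 4], [4, 1])
print(extent, round_up(extent, 4))
```

In the second case nothing in the field list differs between the two layouts, which is why some extra piece of information, such as an explicit itemsize, has to be stored.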
Comment in Trac by atmention:charris, 2011-03-16
What if the dtype included an alignment number? Then align=1, would be no alignment, align=4 would be on 4 byte boundaries, etc. I think that would provide enough information to recreate the structure.
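As a rough sketch of how an alignment number could be enough to rebuild the layout: the function below is hypothetical and assumes one particular reading of the suggestion, namely that each field is aligned to the smaller of its own size and the alignment number (similar to a C packing pragma), and that field sizes are powers of two.

```python
def round_up(value, multiple):
    """Round value up to the next multiple of `multiple`."""
    return -(-value // multiple) * multiple


def layout(sizes, align=1):
    """Return (offsets, itemsize) for fields of the given byte sizes.

    Assumption for illustration: a field's alignment is min(size, align),
    and the itemsize is padded to the largest alignment actually used.
    """
    offsets = []
    offset = 0
    max_alignment = 1
    for size in sizes:
        alignment = min(size, align)
        offset = round_up(offset, alignment)  # insert padding if needed
        offsets.append(offset)
        offset += size
        max_alignment = max(max_alignment, alignment)
    return offsets, round_up(offset, max_alignment)


print(layout([1, 4, 1], align=1))  # tightly packed 'i1, i4, i1'
print(layout([1, 4, 1], align=4))  # same fields on 4-byte boundaries
print(layout([4, 1], align=4))     # 'i4, i1' with trailing padding
```

Under these assumptions the field sizes plus the alignment number reproduce both the offsets and the itemsize for the two examples in this ticket.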
Comment in Trac by atmention:mwiebe, 2011-03-16
Yes, that would work for the specific case of aligned data types, but it's possible (from C) to create data types with sizes that match neither the tightly packed nor the aligned pattern, and I think it's desirable to support that.
I think at some point a function like 'dtype.restrict_to_fields(fields)' would be useful, which would make a data type exactly matching the dtype, but only having the fields mentioned. This would generate a more general data type, so to support the possibility of such a function it needs to be more general.
Comment in Trac by atmention:mwiebe, 2011-03-16
On the other hand, also including an alignment number would be required for doing reasonable type promotions between structured types. If any of the inputs are aligned, it would make sense for the promoted type to be aligned as well, and it may have a different set of fields than any of the inputs.
Comment in Trac by atmention:pv, 2011-04-02
+1 for adding the itemsize parameter. This could be a workaround for #1790 if we decide not to include the trailing padding.
-1 for the align keyword, as it does not seem to be a very explicit specification of the data structure. I'm also not sure if it's portable if stored in .npy files. Native alignment can in principle be checked in UpdateFlags...
Comment in Trac by atmention:mwiebe, 2011-06-21
I believe the align keyword can be well-defined, and C compilers in general have a consistent approach. The reason I really want this in is for in-memory calculations, where operations combining or manipulating aligned structured arrays should produce new aligned structured arrays, not throw away the alignment and force lots of extra copying and buffering.
I'm not sure what the .npy format is doing; I wish it had defined a NumPy- and Python-independent specification for the structure. I've fixed the repr issue, but the .npy format still breaks.
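One way to see what the platform's C compiler does with the 'i1, i4, i1' layout, without involving NumPy at all, is the standard-library ctypes module, which follows the native C struct rules. This is only an illustration; the class name Record is made up, and the exact numbers depend on the platform ABI.

```python
import ctypes


class Record(ctypes.Structure):
    # Same field sequence as np.dtype('i1, i4, i1', align=True).
    _fields_ = [
        ('f0', ctypes.c_int8),
        ('f1', ctypes.c_int32),
        ('f2', ctypes.c_int8),
    ]


# Offsets and total size as laid out by the native C rules.
offsets = [getattr(Record, name).offset for name, _ in Record._fields_]
print(offsets, ctypes.sizeof(Record))

# Sum of the raw field sizes, i.e. what a tightly packed layout needs.
print(sum(ctypes.sizeof(ctype) for _, ctype in Record._fields_))
```

On common platforms where a 32-bit integer is aligned to 4 bytes, this should agree with the offsets 0, 4, 8 and the itemsize of 12 discussed in this ticket, against 6 bytes for the packed fields.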
Comment in Trac by atmention:mwiebe, 2011-06-23
Repr now produces reliable strings:
>>> np.dtype('i1, i4, i1', align=True)
dtype({'names':['f0','f1','f2'], 'formats':['i1','<i4','i1'], 'offsets':[0,4,8], 'itemsize':12}, align=True)
but the original issue with the .npy file format is still there.
Original ticket http://projects.scipy.org/numpy/ticket/1619 Reported 2010-09-24 by trac user Ihor.Melnyk, assigned to unknown.