spacetelescope / stdatamodels

https://stdatamodels.readthedocs.io
Other
5 stars 24 forks source link

Assigning to `data` of a `BinTableHDU` risks corrupting the original array #215

Open braingram opened 11 months ago

braingram commented 11 months ago

If a record array with an unsigned dtype is assigned to the data attribute of a BinTableHDU the data in the original array is modified when the BinTableHDU is saved (as seen in the example below).

from astropy.io import fits
import numpy as np

tbl = np.array([1, 2, 3], dtype=[('a', '<u2')])
hdulist = fits.HDUList()
hdulist.append(fits.PrimaryHDU())
hdulist.append(fits.BinTableHDU())
hdulist[-1].data = tbl
assert tbl['a'][0] == 1, tbl['a'][0]
hdulist.writeto('foo.fits', overwrite=True)
assert tbl['a'][0] == 1, tbl['a'][0]

Fails with: AssertionError: 32769

This occurs in astropy 5.3 and a recently downloaded 6.0 nightly.

There are many places in stdatamodels where hdu.data is assigned. Given that the modification occurs when the file is saved (which may occur at some later point in the code) this seems likely to introduce insidious bugs.

Updating the assignments to hdu.data to either copy the assigned array and/or to define the array at the time of creation of the BinTableHDU (passing the array in via the data keyword of BinTableHDU.__init__ does not appear to have the same issue) should be considered.

braingram commented 11 months ago

There also appears to be a risk of the written data being corrupted as in the following example:

from astropy.io import fits
import numpy as np

tbl = np.array([(1, 'HOT', )], dtype=[('i', '>u2'), ('a', 'U4')])
hdulist = fits.HDUList()
hdulist.append(fits.PrimaryHDU())
hdulist.append(fits.BinTableHDU())
hdulist[-1].data = tbl
fr = hdulist[-1].data
hdulist.writeto('foo.fits', overwrite=True)

with fits.open('foo.fits') as ff:
    print(ff[-1].data)

Outputs:

WARNING: Unexpected extra padding at the end of the file.  This padding may not be preserved when saving changes. [astropy.io.fits.header]
[(32768, '\x00\x00\x10')]

Removing the 'a' column results in a correct output for 'i' (1) and removing 'i' results in the correct output for 'a' ('HOT'). However including both results in both items being corrupted.