fix FITS violations before loading files

astronasko commented 6 months ago

If you work with FITS files that do not conform with the standard, astropy throws a similar error during tellucleaning:

Traceback (most recent call last):
  File "lbl/recipes/lbl_telluclean.py", line 75, in main
    namespace = __main__(inst)
  File "lbl/recipes/lbl_telluclean.py", line 169, in __main__
    sci_image, sci_hdr = inst.load_science_file(filename)
  File "lbl/instruments/default.py", line 750, in load_science_file
    sci_hdr = self.load_header(science_file, kind='science fits file')
  File "lbl/instruments/harpsn.py", line 303, in load_header
    return io.LBLHeader.from_fits(hdr, filename)
  File "lbl/core/io.py", line 180, in from_fits
    new[key] = copy.deepcopy(header[key])
  File "/astropy/io/fits/header.py", line 177, in __getitem__
    value = card.value
  File "/astropy/io/fits/card.py", line 293, in value
    value = self._value = self._parse_value()
  File "/astropy/io/fits/card.py", line 764, in _parse_value
    raise VerifyError(
astropy.io.fits.verify.VerifyError: Unparsable card (TNG EXP_METER_B EXP CENTROID), fix it first with .verify('fix').

In this case some of my spectra have an unparsable card (TNG EXP_METER_B EXP CENTROID), which causes the fatal error. However, Astropy has some verification procedures that can "fix" the FITS, so that LBL accepts it. This PR aims to implement this.

From what I understand, the main IO functions in the project are load_fits and load_header in lbl/core/io.py. They rely on astropy's fits.getdata and fits.getheader respectively; but astropy's verification works only for high-level class objects.

The only solution I found so far is to define a common function that loads the HDU first, corrects it and gets wrapped by load_header and load_data:

def load_hdu(filename: str,
              kind: Union[str, None] = None,
              extnum: Optional[int] = None,
              extname: Optional[str] = None) -> fits.HDUList:
    # deal with no kind
    if kind is None:
        kind = 'fits file'
    # try to load fits file
    try:
        if extnum is not None:
            hdu = fits.open(filename, ext=extnum)
        elif extname is not None:
            hdu = fits.open(filename, extname=extname)
        else:
            hdu = fits.open(filename, extname=extname)
    except Exception as e:
        emsg = 'Cannot load {0}. Filename: {1} \n\t{2}: {3}'
        eargs = [kind, filename, type(e), str(e)]
        raise LblException(emsg.format(*eargs))
    # attempt to verify hdu https://github.com/njcuk9999/lbl/pull/44
    hdu.verify('silentfix')
    return hdu

def load_fits(filename: str,
              kind: Union[str, None] = None,
              extnum: Optional[int] = None,
              extname: Optional[str] = None) -> np.ndarray:
    hdu = load_hdu(filename, kind, extnum, extname)
    return hdu[0].data

def load_header(filename: str,
                kind: Union[str, None] = None,
                extnum: Optional[int] = None,
                extname: Optional[str] = None) -> fits.Header:
    hdu = load_hdu(filename, kind, extnum, extname)
    return hdu[0].header

It looks a bit ugly. Additionally, I am not sure if kind, extnum, extname will be carried over correctly to fits.open -- I dug in astropy's source code but could not figure it out. My tellucleaning works without error, and I am rather inclined to think that the astropy verification this won't affect the science (unless LBL looks for very specific keywords).

The global diff is quite convoluted; I suggest to look at the commit diffs one at a time, if you are interested.

njcuk9999 commented 5 months ago

Sorry for not getting on this sooner I was on leave until recently.

I'm not sure we can do this, the reason we return a np.array is that the hdu is left open in this form which is a really bad idea and leads to memory leaks...

a better way would to internally use something like:

with fits.open(xxxx) as hdu:
    # do stuff 
    # copy image using np.array

that way we know the hdu is closed and not left floating until the program exits (this is especially bad when we multi-process you can get a lot of files stuck open).

I'll try to figure something out in regards to this and make a patch.

njcuk9999 commented 5 months ago

Turns out your solution above also removes the inbuilt fits.getdata ability to "guess" the extension.

So I think we will have to go with a solution just to verify on opening the header, and in the case no extnum or extname is given assume it is the primary extension.

njcuk9999 commented 5 months ago

So this is my solution for just the load_header function:

def load_header(filename: str,
                kind: Union[str, None] = None,
                extnum: Optional[int] = None,
                extname: Optional[str] = None) -> fits.Header:
    """
    Standard way to load fits file header only

    :param filename: str, the filename
    :param kind: the kind (for error message)
    :param extnum: int, the extension number (if not given uses first)
    :param extname: the extension name (if not given uses extnum)

    :return: FITS header, the header
    """
    # deal with no kind
    if kind is None:
        kind = 'fits file'
    # try to load fits file
    try:
        if extnum is not None:
            with fits.open(filename) as hdu:
                hdu.verify('silentfix')
                raw_hdr = hdu[extnum].header
                header = raw_hdr.copy()

        elif extname is not None:
            with fits.open(filename) as hdu:
                hdu.verify('silentfix')
                raw_hdr = hdu[extname].header
                header = raw_hdr.copy()
        else:
            with fits.open(filename) as hdu:
                hdu.verify('silentfix')
                raw_hdr = hdu[0].header
                header = raw_hdr.copy()
    except Exception as e:
        emsg = 'Cannot load {0}. Filename: {1} \n\t{2}: {3}'
        eargs = [kind, filename, type(e), str(e)]
        raise LblException(emsg.format(*eargs))
    return header.copy()

I don't have a file with a bad header, I'd like to test one and make sure the load_fits function still works (it seems in your function that it should as it crashed only in load_header), I'll also need to test if the hdu.verify('silentfix') slows down reading, if it is the case I think you'll have to fix and re-save your files externally.

njcuk9999 commented 5 months ago

Okay sorry to spam this issue, but it does seem to be 3 to 4 times slower to open a header:

In [7]: %timeit load_header(path)
9.84 ms ± 33.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [8]: %timeit load_header_old(path)
2.89 ms ± 6.37 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

where load_header_old is the original function, so I'll have to make sure there isn't a place we are loading many headers (I don't think this is ever the case), but again I'd like to try on a "bad" file, maybe it will be even slower if it has to fix the header.

njcuk9999 commented 4 months ago

@astronasko could you send me a file that has a "bad" header?

astronasko commented 4 months ago

Hi Neil,Apologies for the radio silence — I will have very limited internet connection until the end of this week - from there on, I am to tackle the task you assigned to me.

astronasko commented 4 months ago

Hi @njcuk9999 ,

I tested it for ~10³ spectra of ~15 different stars, it works like a charm. The HDU verification is expected to take time because all columns need to be checked against the standard (iirc), but I don't think LBL's performance will suffer much. I had a look at io.load_header calls throughout LBL's source -- it seems to me that you run this function once per spectral file. Taking your benchmark into account, this would correspond to an additional overhead of <10s per thousand spectra.

I now revert to your solution, with a small change (I silentfix in one context manager because you use the latter in all three cases anyway).

njcuk9999 commented 4 months ago

Actually this is not quite correct, for other instruments we have to load headers from wave files and blaze files and other files so its probably nearer 5 header calls per object for some instruments - it is not desirable.

Also this benchmark is on a file that did not need fixing - I would still need a file that needing fixing to really know how long it would take.

In fact we actually lose some extra functionality of fits.getdata and fits.getheader where it "auto selects" the extension - it would be a real pain to write the auto select feature just for this problem - or indeed go through every line of code for every instrument and make sure we were getting the correct hdu.

I'm actually more tempted to put this down to "bad input data", rather than slow down everyone else for a few files.

Could a solution be to just fix the headers before using LBL? i.e. could you apply this function before adding the data to LBL? I would hope that if you did the hdu.verify('silentfix') and then re-save the fits file it would then function in LBL?

astronasko commented 4 months ago

If that's the case, I agree with you -- it would be better to fix bad headers beforehand, and not at runtime. Thank you for your attention to this issue.

njcuk9999 / lbl

fix FITS violations before loading files #45