Modelcube loading from files is slow

albireox commented 5 years ago

@zpace reported that doing batch operations with ModelCube loaded from files seems to be slower as compared with using raw astropy.fits. He agree to distill the issue to a simple script that we can test. We should profile the file loading and see if there is any obvious bottleneck. This is somehow related to #298.

zpace commented 5 years ago

I have run a quick test with DRP LOGCUBE instead of the DAP LOGCUBE (just b/c that's what it's handiest for me to script), and I'll try to summarize. I wrote a couple quick functions that respectively load through a bare wrapper around fits.open() and marvin.tools.Cube(), and profiled the result.

def get_fluxcube_fits(plate, ifu, MPL_V):
    '''a wrapper around fits.open()
    '''
    with m.load_drp_logcube(plate, ifu, MPL_V) as drp: # can replace with fits.open(drp_logcube_filename)
        return drp['FLUX'].data

def get_fluxcube_marvin(plate, ifu, **kwargs):
    '''a wrapper around marvin.tools.Cube
    '''
    drp_cube = marvin.tools.Cube('-'.join((plate, ifu)), **kwargs)
    return drp_cube.flux

def compare_two_loads(plate, ifu, MPL_V):
    '''runs both loaders
    '''
    fits_cube = get_fluxcube_fits(plate, ifu, MPL_V)
    marvin_cube = get_fluxcube_marvin(plate, ifu, release=MPL_V)
    return fits_cube, marvin_cube

%prun -D loadcube.prof compare_two_loads(plate, ifu, MPL_V)

I then loaded loadcube.prof (which github won't let me upload here) into snakeviz.

Here's a screengrab:

Screenshot_2019-07-19 loadcube

What jumped out at me is the large number of calls to zlib.Decompress.decompress (about 2/3 of the time taken by get_fluxcube_marvin()). Jose and I had talked in Ensenada about fits files having to be decompressed prior to use, so this makes sense. IIRC, individual HDUs are decompressed as needed.

Just to verify this, I wrote another bare wrapper that accessed every HDU of the LOGCUBE in a list comprehension, and profiled it:

def get_alldata_fits(plate, ifu, MPL_V):
    with m.load_drp_logcube(plate, ifu, MPL_V) as drp:
        data = [hdu.data for hdu in drp]
        return data

def compare_two_fullloads(plate, ifu, MPL_V):
    fits_data = get_alldata_fits(plate, ifu, MPL_V)
    marvin_cube = get_fluxcube_marvin(plate, ifu, release=MPL_V)

And sure enough, get_all_data_fits() takes 30-ish seconds. The culprit is once again the zlib decompression. I think users with lots of hard-disk space could profit from pre-extracting all the data files they plan to use, if marvin could "prioritize" ingesting uncompressed files over compressed files.

If there's any other tests that would be informative, please do let me know!

havok2063 commented 5 years ago

@zpace This is very useful. Just for clarification, your initial tests of opening up one extension with Marvin versus straight fits.open showed that Marvin took 2/3 longer to open up a single extension. Is that correct? But when you load the full fits hdu the times become equivalent? That might be due to the fact the Cube always loads the full hdu first before accessing cube.[extension].

Can you also profile astropy fits.open versus fitsio and see how they compare? Can you also unzip a manga cube and redo the profiling? If we take away the decompression, do they behave equivalently, or are there still overheads in Marvin cube loading?

zpace commented 5 years ago

@zpace This is very useful. Just for clarification, your initial tests of opening up one extension with Marvin versus straight fits.open showed that Marvin took 2/3 longer to open up a single extension. Is that correct? But when you load the full fits hdu the times become equivalent? That might be due to the fact the Cube always loads the full hdu first before accessing cube.[extension].

Yes, this is True.

Can you also profile astropy fits.open versus fitsio and see how they compare? Can you also unzip a manga cube and redo the profiling? If we take away the decompression, do they behave equivalently, or are there still overheads in Marvin cube loading?

I will take a stab at these in the next couple days.

zpace commented 5 years ago

I've made a couple changes to my ecosystem of fits-loader functions, which allows the user to specify which file extensions are preferred. The default behavior now tries .fits before .fits.gz, but it can be forced otherwise. I have uncompressed one DRP LOGCUBE and tried once again to load the data from the 'FLUX' extension.

Retrieval method	Uncompressed pref'd?	Flux cube retrieval time (ms)
`marvin.tools.Cube`	N	3556
`marvin.tools.Cube`	Y	ERROR
`astropy.io.fits`	N	1230
`astropy.io.fits`	Y	20

When I try to load the same uncompressed file through marvin, though:

OSError: filename /usr/data/minhas2/zpace/sdss/sas/mangawork/manga/spectro/redux/v2_5_3/8083/stack/manga-8083-12704-LOGCUBE.fits cannot be found: Not a gzipped file (b'SI')

I am using marvin version 2.3.2, which (I think) is the latest available on pyPI. Has there been a new update to allow uncompressed files?

It's pretty clear that at least for non-marvin loading, the decompression is the really expensive part. So users that can spare the extra storage overhead (about 2x, I think) could get some speed gains just by uncompressing all the FITS files they're using.

I'll try to look at fitsio soon, but in case I don't get to it today, I wanted to post what I have.

sdss / marvin

Modelcube loading from files is slow #665