keflavich opened this issue 1 year ago
Do we need to add a kwarg to set the temp directory to write to? Or write to the current directory like CASA?
I think this is a documentation need first. Writing to the current directory is not a better default option - it depends on the machine and the architecture of the storage system. But we should try to prevent writing temp files larger than the tmp drive can hold - frankly, I think dask should be doing this, but we will lose users if we don't come up with a solution.
Agreed. Looks like it can be set in a `.dask/config.yml` file, or from the command line (https://docs.dask.org/en/latest/configuration.html#yaml-files; https://stackoverflow.com/questions/40042748/how-to-specify-the-directory-that-dask-uses-for-temporary-files).
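As a minimal sketch, assuming dask's documented `temporary_directory` configuration key, the scratch location can also be set from within a script (the `/scratch/dask-tmp` path is a placeholder):

```python
import dask

# Point dask's spill/scratch files at a drive with enough free space
# instead of the default system /tmp.
dask.config.set({'temporary_directory': '/scratch/dask-tmp'})
```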
@ashleythomasbarnes Could you fill in more details about what you're trying? I think we can come to a solution but we need tracebacks and/or details about what went wrong.
I'm trying to create a mean spectrum of a large MUSE datacube (~60 GB), but this was filling up `/tmp` on the computer I was using. The code I am using with `spectral_cube.__version__ = '0.6.2'` is given below. I can check whether this is solved using the solutions from @e-koch.
```python
from astropy.io import fits
from spectral_cube import SpectralCube

infile = '../data/ngc0628c/muse/NGC0628-0.92asec.fits'
hdu = fits.open(infile)[1]
cube = SpectralCube.read(hdu)
cube.allow_huge_operations = True

# Mean over the two spatial axes -> 1D spectrum
spec_mean = cube.mean(axis=(1, 2))
```
@ashleythomasbarnes Thanks, that's helpful. Could you confirm that the cube is being read as a `DaskSpectralCube`?
There are a few workarounds for this. Some have to do with dask, as noted above, but another approach is to force a non-dask spectral cube and do `spec_mean = cube.mean(axis=(1,2), strategy='slice')`, which will do a channel-by-channel mean and therefore only load a small fraction of the cube into memory at any given time.
Or set the temporary directory to a location that has sufficient storage:

```
TMPDIR='mydir' python cube_script.py
```
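Alternatively, a sketch of doing the same from inside the script (the path is hypothetical; note that `TMPDIR` must be set before Python's `tempfile` module resolves its default directory, which it caches on first use):

```python
import os

os.environ['TMPDIR'] = '/scratch/tmp'  # hypothetical large-scratch path

import tempfile
print(tempfile.gettempdir())  # should now resolve to /scratch/tmp
```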
I don't think so @adamginsburg... I'm not explicitly using `use_dask=True` when loading, and this is the cube I'm using.
```
SpectralCube with shape=(3761, 1426, 1412):
n_x: 1412 type_x: RA---TAN unit_x: deg range: 24.133237 deg: 24.214712 deg
n_y: 1426 type_y: DEC--TAN unit_y: deg range: 15.741643 deg: 15.820816 deg
n_s: 3761 type_s: AWAV unit_s: Angstrom range: 4700.000 Angstrom: 9400.000 Angstrom
```
OK, then there's a different answer here. Try my suggestion, `spec_mean = cube.mean(axis=(1,2), how='slice')` (note: the keyword is `how`, not `strategy`).
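Putting the correction together, a sketch of the full call on a non-dask cube (filename taken from the snippet above):

```python
from spectral_cube import SpectralCube

cube = SpectralCube.read('../data/ngc0628c/muse/NGC0628-0.92asec.fits')
cube.allow_huge_operations = True

# how='slice' computes the mean channel by channel, so only one
# 2D plane of the cube is held in memory at a time.
spec_mean = cube.mean(axis=(1, 2), how='slice')
```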
The other thing you can do is pass the `memmap_dir` keyword or specify `TMPDIR` globally to force it to write somewhere else.
@e-koch real issue here, though: how did Ash hit a case where tempfiles were being used? Tempfiles are only created by the `parallel` versions of the code, which `.mean` doesn't access, afaict.
https://github.com/radio-astro-tools/spectral-cube/blob/master/spectral_cube/spectral_cube.py#L2922-L2924
Is it the memory mapping in `astropy.io.fits`?
No, that's not relevant - `fits`'s memory mapping just loads the file on disk; it won't create any new files in temp directories, at least afaik.
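For illustration, `astropy.io.fits` memory-maps the existing on-disk file (`memmap=True` is the default), so no scratch files are created:

```python
from astropy.io import fits

# Maps the file already on disk into memory lazily; nothing new is
# written to /tmp or any other temp directory.
hdu = fits.open('../data/ngc0628c/muse/NGC0628-0.92asec.fits', memmap=True)[1]
```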
@d-l-walker @ashleythomasbarnes please help fill in details!
The brief version is: running some code in this file: https://github.com/ACES-CMZ/reduction_ACES/blob/main/aces/joint_deconvolution/reproject_mosaic_funcs.py resulted in failures because the `/tmp` drive got filled up. This is almost certainly a side effect of `dask` caching files to `/tmp` directories. We need to add documentation about this problem, and/or do a filesystem size check before dumping things to `/tmp`.
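A sketch of what such a size check could look like (the helper and its error message are hypothetical, not existing spectral-cube API):

```python
import shutil
import tempfile

def check_tmp_space(required_bytes, tmpdir=None):
    """Hypothetical guard: fail early if the temp drive is too small."""
    tmpdir = tmpdir or tempfile.gettempdir()
    free = shutil.disk_usage(tmpdir).free
    if free < required_bytes:
        raise OSError(
            f"{tmpdir} has {free / 1e9:.1f} GB free but {required_bytes / 1e9:.1f} GB "
            "are needed; set TMPDIR or dask's temporary_directory to a larger drive."
        )

# e.g. before reducing a ~60 GB cube:
check_tmp_space(60e9)
```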