Closed Witchblade101 closed 3 months ago
As a follow-up, I tried running a different dataset and saving the output to a network disk that has terabytes of free space. I got essentially the same error:
Starting Channel 268 of 393
Using the following limb-darkening values:
u1, -0.02927
u2, 0.44406
=========================
Starting lsq fit.
Starting lnprob: -1008930.39282718
Verbose lsq results: message: Optimization terminated successfully.
success: True
status: 0
fun: 1008688.5593286802
x: [ 7.406e-02 1.001e+00 -3.425e-03 1.033e+03]
nit: 3
direc: [[ 0.000e+00 0.000e+00 0.000e+00 1.000e+00]
[ 0.000e+00 1.000e+00 0.000e+00 0.000e+00]
[ 0.000e+00 0.000e+00 1.000e+00 0.000e+00]
[ 1.142e-04 3.668e-06 1.323e-05 -6.143e-04]]
nfev: 180
Ending lnprob: -1008688.5593286802
Reduced Chi-squared: 103.51799706001802
LSQ RESULTS:
rp: 0.07405848004589306
c0: 1.0011990524216912
c1: -0.0034245462251475587
scatter_ppm: 1033.2616961306856
Completed lsq fit.
-------------------------
Starting emcee fit.
Calling lsqfitter first...
Starting lnprob: -1008688.5593286802
Verbose lsq results: message: Optimization terminated successfully.
success: True
status: 0
fun: 1008688.5593286802
x: [ 7.406e-02 1.001e+00 -3.425e-03 1.033e+03]
nit: 1
direc: [[ 1.000e+00 0.000e+00 0.000e+00 0.000e+00]
[ 0.000e+00 1.000e+00 0.000e+00 0.000e+00]
[ 0.000e+00 0.000e+00 1.000e+00 0.000e+00]
[ 0.000e+00 0.000e+00 0.000e+00 1.000e+00]]
nfev: 84
Ending lnprob: -1008688.5593286802
Reduced Chi-squared: 103.51799706001869
LSQ RESULTS:
rp: 0.07405848004589306
c0: 1.0011990524316912
c1: -0.0034245461265187587
scatter_ppm: 1033.2616961306856
No covariance matrix from LSQ - falling back on a step size based on the prior range
Starting lnprob: -1016523.8337508459
Traceback (most recent call last):
File "/System/Volumes/Data/astro/jtste/doug/JWST/eureka/t1e/Obs2/run_eureka_2.py", line 31, in <module>
s5_meta = s5.fitlc(eventlabel, ecf_path=ecf_path)
File "/Users/dlong/Eureka/src/eureka/S5_lightcurve_fitting/s5_fit.py", line 482, in fitlc
meta, params = fit_channel(meta, time_temp, flux, channel,
File "/Users/dlong/Eureka/src/eureka/S5_lightcurve_fitting/s5_fit.py", line 990, in fit_channel
lc_model.fit(model, meta, log, fitter='emcee')
File "/Users/dlong/Eureka/src/eureka/S5_lightcurve_fitting/lightcurve.py", line 173, in fit
fit_model = self.fitter_func(self, model, meta, log, **kwargs)
File "/Users/dlong/Eureka/src/eureka/S5_lightcurve_fitting/fitters.py", line 330, in emceefitter
pool = Pool(meta.ncpu)
File "/Users/dlong/miniconda3/envs/eureka/lib/python3.10/multiprocessing/context.py", line 119, in Pool
return Pool(processes, initializer, initargs, maxtasksperchild,
File "/Users/dlong/miniconda3/envs/eureka/lib/python3.10/multiprocessing/pool.py", line 196, in __init__
self._change_notifier = self._ctx.SimpleQueue()
File "/Users/dlong/miniconda3/envs/eureka/lib/python3.10/multiprocessing/context.py", line 113, in SimpleQueue
return SimpleQueue(ctx=self.get_context())
File "/Users/dlong/miniconda3/envs/eureka/lib/python3.10/multiprocessing/queues.py", line 341, in __init__
self._rlock = ctx.Lock()
File "/Users/dlong/miniconda3/envs/eureka/lib/python3.10/multiprocessing/context.py", line 68, in Lock
return Lock(ctx=self.get_context())
File "/Users/dlong/miniconda3/envs/eureka/lib/python3.10/multiprocessing/synchronize.py", line 162, in __init__
SemLock.__init__(self, SEMAPHORE, 1, 1, ctx=ctx)
File "/Users/dlong/miniconda3/envs/eureka/lib/python3.10/multiprocessing/synchronize.py", line 57, in __init__
sl = self._semlock = _multiprocessing.SemLock(
OSError: [Errno 28] No space left on device
(eureka) dlong@hikaru Obs2 %
/Users/dlong/miniconda3/envs/eureka/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 9980 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
Try turning off multiprocessing and rerunning the code. It may be crashing within that part of the code and we wouldn't know because of how multiprocessing works.
Assuming it does crash, try solving that issue and turning multiprocessing back on.
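To illustrate why serial runs are easier to debug, here is a minimal sketch (the `fit_channel` function below is a hypothetical stand-in for Eureka!'s per-channel fitting, not its actual API): when the work runs serially, a worker's exception surfaces immediately with a traceback pointing at the real failure, whereas inside a `multiprocessing.Pool` the traceback is relayed through pool internals and secondary errors (like resource exhaustion) can mask the original one.

```python
import multiprocessing as mp

def fit_channel(channel):
    # Hypothetical stand-in for a per-channel lightcurve fit.
    if channel == 2:
        raise ValueError(f"bad data in channel {channel}")
    return channel ** 2

def run(channels, ncpu=1):
    """Run serially when ncpu == 1 so worker errors surface directly."""
    if ncpu == 1:
        return [fit_channel(c) for c in channels]
    with mp.Pool(ncpu) as pool:
        return pool.map(fit_channel, channels)

if __name__ == "__main__":
    try:
        # Serial run: the traceback points straight at fit_channel.
        run(range(4), ncpu=1)
    except ValueError as err:
        print("caught:", err)
```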
Makes sense. I'll give that a try.
@Witchblade101, my first recommendation was going to be that you consider deleting any `...FluxData.h5` save files output during your Stage 3 analyses; these files can sometimes be helpful when debugging your analyses or comparing them against other pipelines, but they are quite large and can rapidly eat up your storage space. In the next (upcoming) Eureka! version release, we've set those FluxData files to not be created by default. There are many command-line tools you can use to search for and delete any existing `...FluxData.h5` save files, and in your Stage 3 ECF you can also set `save_fluxdata` to `False` to avoid such files being made in the future.
Second, there is nothing we can do on our end to resolve this `No space left on device` error raised by your operating system. While there may be ~5 GB available on your system, this is generally insufficient for many of the important tasks of your operating system and applications. In addition, when performing tasks that require more RAM than your computer physically has, the operating system will sometimes store that RAM data on your drive; this is called "swap" memory. I strongly recommend you delete unneeded files or move them to an external hard drive or cloud storage. However, looking at your `df -h` output, it looks to me like you actually have ~500 GB free and not just ~5 GB; if that is the case, then this is a very strange error indeed. As for your more recent attempt with the network disk, this error could still arise if you really did have only ~5 GB available locally, regardless of where you were saving the outputs; 5 GB of storage just isn't enough breathing room for most OSes.
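A quick standard-library way to check what is actually free on each filesystem; note that the system disk and the temporary directory can matter even when your outputs are written to a network disk, since the OS and Python place temporary data locally:

```python
import shutil
import tempfile

def free_gb(path):
    """Free space in GB on the filesystem containing `path`."""
    return shutil.disk_usage(path).free / 1e9

# Check the root filesystem and the temp dir separately; they may
# live on different volumes than where the outputs are saved.
for location in ("/", tempfile.gettempdir()):
    print(f"{location}: {free_gb(location):.1f} GB free")
```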
It is very unusual to me that in Stage 5 you're getting all those `CRDS - WARNING - Failed creating CRDS cache lock: [Errno 28] No space left on device` messages. CRDS is only relevant for Stages 1-3, but it's possible that with multiprocessing turned on each spawned process has to re-import Eureka! and as a result tries to import the CRDS package and gets that cache-lock error. I agree with Kevin that setting `ncpu` to `1` is the best way to troubleshoot this; troubleshooting with multiprocessing turned on is basically impossible because of how that package works. Once you have something that works, you can try turning multiprocessing back on.
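A minimal demonstration of that re-import behaviour (generic Python, nothing Eureka!-specific): since Python 3.8 the default start method on macOS is "spawn", which launches a fresh interpreter for each worker and re-imports the main module, so any module-level side effects (such as an import that tries to create a CRDS cache lock) run once per worker.

```python
import multiprocessing as mp

# Module-level code like this runs again in every spawned worker,
# because "spawn" starts a fresh interpreter that re-imports the module.
print("imported in process:", mp.current_process().name)

def work(x):
    return x + 1

if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)
    with mp.Pool(2) as pool:
        # The "imported in process" line above prints again per worker.
        print(pool.map(work, [1, 2, 3]))
```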
Finally, my last note would be that the fits from your most recent message using `lsq` appear exceptionally poor: you're getting deeply negative log-likelihoods and reduced chi-squared values nowhere near 1.0. This is highly unlikely to be the cause of the `No space left on device` error, but it does require thorough investigation before you continue with those fits. I strongly recommend checking the quality of all fits using the `lsq` fitter, to make sure you can get reasonable fits to your lightcurves before you move on to the far more time-intensive (and storage-consuming) `emcee` or `dynesty` samplers.
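As a rough guide to what those numbers imply (generic statistics, not Eureka! code): a reduced chi-squared of ~103, as in the log above, means the residual scatter is roughly sqrt(103) ≈ 10× larger than the per-point uncertainties assume, so either the model is missing real structure or the error bars are badly underestimated.

```python
import numpy as np

def reduced_chi2(data, model, err, n_free_params):
    """Reduced chi-squared: ~1 when the model fits and errors are realistic."""
    chi2 = np.sum(((data - model) / err) ** 2)
    return chi2 / (data.size - n_free_params)

rng = np.random.default_rng(0)
flux = 1.0 + rng.normal(0.0, 100e-6, 1000)  # flat lightcurve, 100 ppm noise
model = np.ones_like(flux)

print(reduced_chi2(flux, model, 100e-6, 4))  # ~1: errors well estimated
print(reduced_chi2(flux, model, 10e-6, 4))   # ~100: errors 10x too small
```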
It looks like it is something related to multiprocessing. After switching ncpu to 1 I got no more CRDS warnings or crashes due to lack of available disk space.
Starting Channel 392 of 393
Starting lsq fit.
Starting lnprob: 25306.134211625274
Verbose lsq results: message: Optimization terminated successfully.
success: True
status: 0
fun: -25311.01601500675
x: [ 6.993e-02 1.001e+00 1.067e-03 1.567e+00]
nit: 2
direc: [[ 1.000e+00 0.000e+00 0.000e+00 0.000e+00]
[ 0.000e+00 1.000e+00 0.000e+00 0.000e+00]
[ 0.000e+00 0.000e+00 1.000e+00 0.000e+00]
[ 0.000e+00 0.000e+00 0.000e+00 1.000e+00]]
nfev: 94
Ending lnprob: 25311.01601500675
Reduced Chi-squared: 1.000409751911406
LSQ RESULTS:
rp: 0.06992821433380171
c0: 1.0006942978011697
c1: 0.0010671513801134036
scatter_mult: 1.5666936233970443; 21259.428896367146 ppm
Starting emcee fit.
Calling lsqfitter first...
Starting lnprob: 25311.01601500675
Verbose lsq results: message: Optimization terminated successfully.
success: True
status: 0
fun: -25311.016952630034
x: [ 6.980e-02 1.001e+00 1.056e-03 1.567e+00]
nit: 1
direc: [[ 1.000e+00 0.000e+00 0.000e+00 0.000e+00]
[ 0.000e+00 1.000e+00 0.000e+00 0.000e+00]
[ 0.000e+00 0.000e+00 1.000e+00 0.000e+00]
[ 0.000e+00 0.000e+00 0.000e+00 1.000e+00]]
nfev: 62
Ending lnprob: 25311.016952630034
Reduced Chi-squared: 1.0004097205368443
LSQ RESULTS:
rp: 0.06980393030064233
c0: 1.0006905172597063
c1: 0.0010556368532090882
scatter_mult: 1.5666935237694661; 21259.427544459293 ppm
No covariance matrix from LSQ - falling back on a step size based on the prior range
Starting lnprob: 25289.035883316206
Running emcee burn-in... 100%| 1000/1000 [00:43<00:00, 23.09it/s]
Finished writing to /Users/dlong/DataAnalysis/JWST/eureka/t1e/Obs2/Stage5/S5_2024-07-08_t1e_2_run2/ap6_bg7/S5_emcee_samples_ch392.h5
Ending lnprob: 25311.001227579614
Mean acceptance fraction: 0.591
WARNING: Unable to estimate the autocorrelation time!
Reduced Chi-squared: 0.9983547591005765
EMCEE RESULTS:
rp: 0.0694942999224945 (+0.0037493809183321986, -0.003776110785145406)
c0: 1.000690491270263 (+0.00023669803976011927, -0.00024450649265395086)
c1: 0.0008652933562115306 (+0.003972529621944243, -0.0039819843810715745)
scatter_mult: 1.5683060739344559 (+0.010369781942364265, -0.010505495954588273); 21281.30922895871 (+140.71394597018502, -142.55553283185938) ppm
=========================
Saving results
Total time (min): 315.01
Well I'm glad you were able to get that analysis to complete! After doing some searching online (StackExchange, StackOverflow, Reddit, etc.), it seems clear to me that there is no issue on our end but rather an issue with your system (e.g. a full temporary directory or filenames that are too long). Since there's nothing we can do, I'm going to close this issue, and I recommend you check some of the online forums I mentioned above for solutions that might work in your situation and/or post a question on such a forum if you cannot find an already published solution.
Instrument
Light curve fitting (Stages 4-6)
What happened?
During Stage 5 light curve fitting I suddenly start getting warnings about "No space left on device". Output files keep being written, but eventually Eureka! crashes. A `df` run both when the warnings started appearing and again after the crash showed 5 GB still available. Rebooting and trying again with nothing running but iTerm and Eureka! produced the same results.
What operating system are you using?
MacOS Sonoma 14.5
What version of Python are you running?
Python 3.10.14