jmdobbs opened 2 weeks ago
Also, with --iter 0 (but not on an A40; I asked for 300 GB), I get the following different error, which certainly looks like OOM:
122/154 123/154 124/154 125/154 126/154 127/154 128/154 129/154 130/154 131/154 132/154 133/154 134/154 135/154 136/154 137/154 138/154 139/154 140/154 141/154 142/154 143/154 144/154 145/154 146/154 147/154 148/154 149/154 150/154 151/154 152/154 153/154 154/154
Commiting changes in fid_all...Unhandled exception.Unhandled exception.Unhandled exception. Unhandled exception. System.NotImplementedException: The method or operation is not implemented.
at MCore.MCore.WorkerDied(Object sender, EventArgs e)
at Warp.WorkerWrapper.ReportDeath()
at Warp.WorkerWrapper.<StartHeartbeat>b__14_0()System.NotImplementedException: The method or operation is not implemented.
at MCore.MCore.WorkerDied(Object sender, EventArgs e)
at Warp.WorkerWrapper.ReportDeath()
at Warp.WorkerWrapper.<StartHeartbeat>b__14_0()System.NotImplementedException: The method or operation is not implemented.
at MCore.MCore.WorkerDied(Object sender, EventArgs e)
at Warp.WorkerWrapper.ReportDeath()
at Warp.WorkerWrapper.<StartHeartbeat>b__14_0()System.NotImplementedException: The method or operation is not implemented.
at MCore.MCore.WorkerDied(Object sender, EventArgs e)
at Warp.WorkerWrapper.ReportDeath()
at Warp.WorkerWrapper.<StartHeartbeat>b__14_0()
/var/spool/slurm/job6874236/slurm_script: line 15: 734064 Aborted (core dumped) TF_FORCE_UNIFIED_MEMORY='1' XLA_PYTHON_CLIENT_MEM_FRACTION='4.0' MCore --population m3/fid_all_v3.population --perdevice_refine 1 --iter 0
Hi @jmdobbs - I'm not aware of how M has changed internally over time so can't really comment. I'll chat to Dimitry about it but he's out at a conference until the end of the week.
Could you post worker logs rather than the stdout please? They're more useful for seeing what's going on
Thanks @alisterburt. I had the output going into a different directory and forgot about the worker logs.
Interestingly, it doesn't appear to be memory related after all: it can't find the species (sometimes). As I mentioned earlier, it sometimes runs through with no issue.
System.Collections.Generic.KeyNotFoundException: The given key 'Warp.Sociology.Species' was not present in the dictionary.
at Warp.TiltSeries.<>c__DisplayClass100_2.<PerformMultiParticleRefinement>b__29(Double[] input)
at Accord.Math.Optimization.NonlinearObjectiveFunction.CheckGradient(Func`2 value, Double[] probe)
at Accord.Math.Optimization.BaseGradientOptimizationMethod.Maximize()
at Warp.TiltSeries.PerformMultiParticleRefinement(String workingDirectory, ProcessingOptionsMPARefine optionsMPA, Species[] allSpecies, DataSource dataSource, Image gainRef, DefectModel defectMap, Action`1 progressCallback)
at WarpWorker.WarpWorker.EvaluateCommand(NamedSerializableObject Command)
thanks, that's definitely a bug... if you have a reproducible example and can send files over that would make debugging easier!
Damn, I thought I had caught all multi-species bugs. Is this perfectly reproducible, or non-deterministic?
As a side note, I wonder why we're not seeing the code line numbers mentioned in the exception message. I just checked, and we're definitely shipping the corresponding .pdb files (nothing to do with atomic models) in our Conda package. @jmdobbs is this a custom build? Can you check if there is an MCore.pdb in the same directory as the MCore binary you're using?
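A minimal sketch of how one might check this, assuming the module simply puts the MCore binary on PATH (paths are illustrative):
# find the MCore binary the module provides, then look for its debug symbols next to it
MCORE_BIN=$(command -v MCore)
ls -l "$(dirname "$MCORE_BIN")"/MCore.pdb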
Sorry for the delay on this, I wanted to work with my nicely aligned subtomograms a bit :)
It is at least somewhat (mostly?) reproducible. I ran
module load warp/2.0.0dev20.20240710-foss-2022a-CUDA-11.8.0
MCore --population m3/fid_all_v3.population --perdevice_refine 1 --first_iteration_fraction 0.7 --refine_particles --refine_imagewarp 6x4 --refine_volumewarp 2x3x2x10 --refine_stageangles --ctf_defocusexhaustive
successfully, but then ran the following (the only difference being --ctf_defocus rather than --ctf_defocusexhaustive) three times, and it crashed each time:
MCore --population m3/fid_all_v3.population --perdevice_refine 1 --first_iteration_fraction 0.7 --refine_particles --refine_imagewarp 6x4 --refine_volumewarp 2x3x2x10 --refine_stageangles --ctf_defocus
I then split the population file into 3 populations containing single species only, where I was able to refine/reconstruct them individually without issue.
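For illustration, the per-species runs would look something like this (a sketch; the single-species population filenames and the exact flag set are illustrative):
# run the same refinement separately on each single-species population
for pop in m3/sp1.population m3/sp2.population m3/sp3.population; do
    MCore --population "$pop" --perdevice_refine 1 --first_iteration_fraction 0.7 --refine_particles --refine_imagewarp 6x4 --refine_volumewarp 2x3x2x10 --refine_stageangles --ctf_defocus
done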
It is a custom build, maybe @ThomasHoffmann77 can offer some clarity on this, but I don't see MCore.pdb in the dev20 directory. I don't see anything but the programs, actually, so I assume some stuff is in another location.
In our dev19 conda build I can see MCore.pdb. Unfortunately we can't have newer conda builds, for now, which is why I hadn't tried this on a conda build yet. With the dev19 conda build, I ran:
module load warp/2.0.0dev19-conda
WARP_DEBUG=1
MCore --population m3/fid_all_v3.population --perdevice_refine 1 --first_iteration_fraction 0.7 --refine_particles --refine_imagewarp 6x4 --refine_volumewarp 2x3x2x10 --refine_stageangles --ctf_defocus
The result was the same
System.Collections.Generic.KeyNotFoundException: The given key 'Warp.Sociology.Species' was not present in the dictionary.
at Warp.TiltSeries.<>c__DisplayClass100_2.<PerformMultiParticleRefinement>b__29(Double[] input) in /usr/share/miniconda/envs/package-build/conda-bld/warp_1720472789965/work/WarpLib/TiltSeries.cs:line 6895
at Accord.Math.Optimization.NonlinearObjectiveFunction.CheckGradient(Func`2 value, Double[] probe)
at Accord.Math.Optimization.BaseGradientOptimizationMethod.Maximize()
at Warp.TiltSeries.PerformMultiParticleRefinement(String workingDirectory, ProcessingOptionsMPARefine optionsMPA, Species[] allSpecies, DataSource dataSource, Image gainRef, DefectModel defectMap, Action`1 progressCallback) in /usr/share/miniconda/envs/package-build/conda-bld/warp_1720472789965/work/WarpLib/TiltSeries.cs:line 7513
at WarpWorker.WarpWorker.EvaluateCommand(NamedSerializableObject Command) in /usr/share/miniconda/envs/package-build/conda-bld/warp_1720472789965/work/WarpWorker/WarpWorker.cs:line 723
I am now running several M runs with different parameters and will report back on the results.
That's the line number I was looking for, thank you!
The latest commit should fix this bug. I also hope I fixed #177 with a previous commit, please let me know if that's the case.
(--ctf_defocusexhaustive didn't trigger the bug because it doesn't trigger anything unless --ctf_defocus is also set. I've clarified this in the help text now.)
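In other words (a sketch based on the explanation above; the exhaustive search only takes effect on top of the gradient-based defocus refinement):
# exhaustive defocus search takes effect: both flags set
MCore --population m3/fid_all_v3.population --refine_particles --ctf_defocus --ctf_defocusexhaustive
# --ctf_defocusexhaustive alone doesn't trigger anything
MCore --population m3/fid_all_v3.population --refine_particles --ctf_defocusexhaustive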
It is a custom build, maybe @ThomasHoffmann77 can offer some clarity on this, but I don't see MCore.pdb in the dev20 directory. I don't see anything but the programs, actually, so I assume some stuff is in another location.
yes, I can confirm that we build the HPC module without debug symbols, using the dotnet publish parameters "-p:DebugType=None -p:DebugSymbols=false" (see https://github.com/easybuilders/easybuild-easyconfigs/blob/5f9ef17f6c9658d64cbca326e55255bd61ecce02/easybuild/easyconfigs/w/warp/warp-2.0.0dev0_cmake.patch). I can build a debug module on demand.
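For comparison, a symbol-carrying build would simply drop those two properties; roughly (run from the project directory; the real invocation is in the patch linked above):
# current HPC module: debug symbols stripped
dotnet publish -c Release -p:DebugType=None -p:DebugSymbols=false
# debug module: keep the default portable .pdb files next to the binaries
dotnet publish -c Release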
The latest commit should fix this bug. I also hope I fixed #177 with a previous commit, please let me know if that's the case.
On dev22, the issue in #177 seems to be fixed, but I still get the same KeyNotFound exception when trying to refine ctf.
Oddly, I don't see the updated "only works in combination with ctf_defocus" help text either.
I see now that those changes are not in dev22, never mind!
thanks @ThomasHoffmann77 and @jmdobbs - 2.0.0dev23 just pushed: https://github.com/warpem/warp/actions/runs/9940747597
will close this in a few days assuming the problem is solved unless we hear back!
On dev23 I notice different behavior, but the error still occurs. Previously it errored out during the refinement, at tomostar 20/154; now it errors out only for defocusexhaustive, and only at the very end (during the committing changes part). See below:
submitted:
module purge
module load warp
MCore --population m3/fid_all_v3.population --perdevice_refine 1 --first_iteration_fraction 0.7 --refine_particles --refine_imagewarp 6x4 --refine_volumewarp 2x3x2x10 --refine_stageangles --ctf_defocus --ctf_defocusexhaustive
MCore --population m3/fid_all_v3.population --perdevice_refine 1 --first_iteration_fraction 0.7 --refine_particles --refine_imagewarp 6x4 --refine_volumewarp 2x3x2x10 --refine_stageangles --ctf_defocus
job log:
Please set the environment variable $IMOD_CALIB_DIR if appropriate.
Loading population... Done
Creating directories... Done
Spawning workers... Done
Preparing for refinement – this will take a few minutes per species
Preparing refinement requisites...
0/3 1/3 2/3 3/3
Performing refinement
Preparing population for data source fid_all...Done
Loading gain reference for fid_all... Done
Refining all series in data source...
0/154 1/154 2/154 3/154 ... 152/154 153/154 154/154
Commiting changes in fid_all...Unhandled exception.Unhandled exception. Unhandled exception. System.NotImplementedException: The method or operation is not implemented.
at MCore.MCore.WorkerDied(Object sender, EventArgs e)
at Warp.WorkerWrapper.ReportDeath()
at Warp.WorkerWrapper.<StartHeartbeat>b__14_0()System.NotImplementedException: The method or operation is not implemented.
at MCore.MCore.WorkerDied(Object sender, EventArgs e)
at Warp.WorkerWrapper.ReportDeath()
at Warp.WorkerWrapper.<StartHeartbeat>b__14_0()System.NotImplementedException: The method or operation is not implemented.
at MCore.MCore.WorkerDied(Object sender, EventArgs e)
at Warp.WorkerWrapper.ReportDeath()
at Warp.WorkerWrapper.<StartHeartbeat>b__14_0()
/var/spool/slurm/job6990181/slurm_script: line 12: 394279 Aborted (core dumped) MCore --population m3/fid_all_v3.population --perdevice_refine 1 --first_iteration_fraction 0.7 --refine_particles --refine_imagewarp 6x4 --refine_volumewarp 2x3x2x10 --refine_stageangles --ctf_defocus --ctf_defocusexhaustive
Loading population... Done
Creating directories... Done
Spawning workers... Done
Preparing for refinement – this will take a few minutes per species
Preparing refinement requisites...
0/3 1/3 2/3 3/3
Performing refinement
Preparing population for data source fid_all...Done
Loading gain reference for fid_all... Done
Refining all series in data source...
0/154 1/154 2/154 3/154 ... 152/154 153/154 154/154
Commiting changes in fid_all...Done
Saving intermediate refinement results for fid_all...Done
Finishing refinement
Gathering intermediate results, then reconstructing and filtering...
0/3 sp1: 4.08 Å
1/3 sp2: 6.94 Å
2/3 sp3: 5.36 Å
3/3
worker crash report:
System.Collections.Generic.KeyNotFoundException: The given key 'Warp.Sociology.Species' was not present in the dictionary.
at Warp.TiltSeries.<>c__DisplayClass100_2.<PerformMultiParticleRefinement>b__29(Double[] input)
at Accord.Math.Optimization.NonlinearObjectiveFunction.CheckGradient(Func`2 value, Double[] probe)
at Accord.Math.Optimization.BaseGradientOptimizationMethod.Maximize()
at Warp.TiltSeries.PerformMultiParticleRefinement(String workingDirectory, ProcessingOptionsMPARefine optionsMPA, Species[] allSpecies, DataSource dataSource, Image gainRef, DefectModel defectMap, Action`1 progressCallback)
at WarpWorker.WarpWorker.EvaluateCommand(NamedSerializableObject Command)
@jmdobbs apologies for the delay here, I've been traveling. Could you run the same command with the conda build which will give us line numbers when it crashes?
cc @dtegunov this seems like it might be the last of the multispecies/multiGPU shenanigans
@alisterburt no problem at all, thank you for all your work. Unfortunately I can't; I would have already done so, but for administrative reasons we can no longer have new conda builds. If the line number would be really helpful (let us know), Thomas has mentioned he could build a new debug module and I could retry with that.
huh, okay thanks for the insight - the line number was definitely useful for Dimitry last time so it might be worth moving forward.
Out of interest, what is the issue with the conda builds? I'm aware of the recent license debacle with Anaconda the company, but we build against and depend on packages from conda-forge, nothing from Anaconda's repositories...
@dtegunov is taking a look at this
Hey @jmdobbs! Unfortunately, this one evades my understanding. I thought it would be somewhere in the exhaustive search code since it's triggered by that flag, but I don't see any deviations from the gradient-based search code, which runs ok. Having the line numbers would really help. @ThomasHoffmann77 bundling the release build with debug symbols shouldn't affect the performance.
Thanks to Thomas' quick work, I've been able to do some tests on a dev23 debug module. However, I've been unable to reproduce the error. I wonder if it may be GPU specific somehow, so I've changed my script to keep track of the gpu type, and will report back if I manage to reproduce it consistently. Otherwise, it seems to work well, dev23 has mostly fixed the issue!
Thank you for checking! This is extremely unlikely to be influenced by any aspects of the GPU since it's deep in the C# business logic code. Is it possible the first run somehow still used an older build?
I can't exclude that possibility, but I think it's unlikely given the different behavior. Before dev23, it was 100% consistent that it errored out on TS20 (presumably the first place where there were 0 particles in one species), but here it crashed at the committing-results stage. I'll try a few different things to reproduce it again, but otherwise I guess this could be closed.
Hi,
I've been running multispecies refinement of 3 species at 2 Å/px, box ~240, and have been running into errors which I assume are likely to be memory related (given the large particles); see below. However, I'm only doing --perdevice_refine 1 and asking for loads of memory in the slurm script.
Note that this does not always crash: I had a System.NotImplementedException when I did just poses and image warp, but it ran through when I did poses, image warp, volume warp, angles, and ctf_defocusexhaustive (on the same GPU, with the same submission command). It may be random?
On the Windows version, this was doable even with much weaker GPUs, so maybe there is some way to limit the memory requirements? If this is memory related, that is.
I recently used the highest-memory GPU node we have, an A40 node with a single GPU, and --cpus-per-gpu 64 --mem-per-gpu 514286. See below for the error log.
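For completeness, a minimal sketch of the kind of Slurm submission described above (the flag combination shown is one of those mentioned; GRES and partition details are site-specific and illustrative):
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-gpu=64
#SBATCH --mem-per-gpu=514286
module load warp
MCore --population m3/fid_all_v3.population --perdevice_refine 1 --refine_particles --refine_imagewarp 6x4 --refine_volumewarp 2x3x2x10 --refine_stageangles --ctf_defocusexhaustive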