jmdobbs opened 2 weeks ago
Also, with --iter 0 (but not on an A40; I asked for 300 GB), I get the following different error, which certainly looks like OOM:
122/154 123/154 124/154 125/154 126/154 127/154 128/154 129/154 130/154 131/154 132/154 133/154 134/154 135/154 136/154 137/154 138/154 139/154 140/154 141/154 142/154 143/154 144/154 145/154 146/154 147/154 148/154 149/154 150/154 151/154 152/154 153/154 154/154
Commiting changes in fid_all...Unhandled exception.Unhandled exception.Unhandled exception. Unhandled exception. System.NotImplementedException: The method or operation is not implemented.
at MCore.MCore.WorkerDied(Object sender, EventArgs e)
at Warp.WorkerWrapper.ReportDeath()
at Warp.WorkerWrapper.<StartHeartbeat>b__14_0()System.NotImplementedException: The method or operation is not implemented.
at MCore.MCore.WorkerDied(Object sender, EventArgs e)
at Warp.WorkerWrapper.ReportDeath()
at Warp.WorkerWrapper.<StartHeartbeat>b__14_0()System.NotImplementedException: The method or operation is not implemented.
at MCore.MCore.WorkerDied(Object sender, EventArgs e)
at Warp.WorkerWrapper.ReportDeath()
at Warp.WorkerWrapper.<StartHeartbeat>b__14_0()System.NotImplementedException: The method or operation is not implemented.
at MCore.MCore.WorkerDied(Object sender, EventArgs e)
at Warp.WorkerWrapper.ReportDeath()
at Warp.WorkerWrapper.<StartHeartbeat>b__14_0()
/var/spool/slurm/job6874236/slurm_script: line 15: 734064 Aborted (core dumped) TF_FORCE_UNIFIED_MEMORY='1' XLA_PYTHON_CLIENT_MEM_FRACTION='4.0' MCore --population m3/fid_all_v3.population --perdevice_refine 1 --iter 0
Hi @jmdobbs - I'm not aware of how M has changed internally over time so can't really comment. I'll chat to Dimitry about it but he's out at a conference until the end of the week.
Could you post worker logs rather than the stdout please? They're more useful for seeing what's going on
Thanks @alisterburt. I had the output going into a different directory and forgot about the worker logs.
Interestingly, it doesn't appear to be memory related after all: it can't find the species (sometimes). As I mentioned earlier, it sometimes runs through with no issue.
System.Collections.Generic.KeyNotFoundException: The given key 'Warp.Sociology.Species' was not present in the dictionary.
at Warp.TiltSeries.<>c__DisplayClass100_2.<PerformMultiParticleRefinement>b__29(Double[] input)
at Accord.Math.Optimization.NonlinearObjectiveFunction.CheckGradient(Func`2 value, Double[] probe)
at Accord.Math.Optimization.BaseGradientOptimizationMethod.Maximize()
at Warp.TiltSeries.PerformMultiParticleRefinement(String workingDirectory, ProcessingOptionsMPARefine optionsMPA, Species[] allSpecies, DataSource dataSource, Image gainRef, DefectModel defectMap, Action`1 progressCallback)
at WarpWorker.WarpWorker.EvaluateCommand(NamedSerializableObject Command)
thanks, that's definitely a bug... if you have a reproducible example and can send files over that would make debugging easier!
Damn, I thought I had caught all multi-species bugs. Is this perfectly reproducible, or non-deterministic?
As a side note, I wonder why we're not seeing the code line numbers mentioned in the exception message. I just checked, and we're definitely shipping the corresponding .pdb files (nothing to do with atomic models) in our Conda package. @jmdobbs is this a custom build? Can you check if there is an MCore.pdb in the same directory as the MCore binary you're using?
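A minimal sketch of how one might check this, assuming the module simply puts the MCore binary on PATH (paths are illustrative):
# find the MCore binary the module provides, then look for its debug symbols next to it
MCORE_BIN=$(command -v MCore)
ls -l "$(dirname "$MCORE_BIN")"/MCore.pdb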
Sorry for the delay on this, I wanted to work with my nicely aligned subtomograms a bit :)
It is at least somewhat (mostly?) reproducible. I ran
module load warp/2.0.0dev20.20240710-foss-2022a-CUDA-11.8.0
MCore --population m3/fid_all_v3.population --perdevice_refine 1 --first_iteration_fraction 0.7 --refine_particles --refine_imagewarp 6x4 --refine_volumewarp 2x3x2x10 --refine_stageangles --ctf_defocusexhaustive
successfully, but then ran the following (the only difference being --ctf_defocus rather than --ctf_defocusexhaustive) three times, and it crashed each time:
MCore --population m3/fid_all_v3.population --perdevice_refine 1 --first_iteration_fraction 0.7 --refine_particles --refine_imagewarp 6x4 --refine_volumewarp 2x3x2x10 --refine_stageangles --ctf_defocus
I then split the population file into 3 populations containing single species only, where I was able to refine/reconstruct them individually without issue.
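For illustration, the per-species runs would look something like this (a sketch; the single-species population filenames and the exact flag set are illustrative):
# run the same refinement separately on each single-species population
for pop in m3/sp1.population m3/sp2.population m3/sp3.population; do
    MCore --population "$pop" --perdevice_refine 1 --first_iteration_fraction 0.7 --refine_particles --refine_imagewarp 6x4 --refine_volumewarp 2x3x2x10 --refine_stageangles --ctf_defocus
done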
It is a custom build, maybe @ThomasHoffmann77 can offer some clarity on this, but I don't see MCore.pdb in the dev20 directory. I don't see anything but the programs, actually, so I assume some stuff is in another location.
In our dev19 conda build I can see MCore.pdb. Unfortunately we can't have newer conda builds, for now, which is why I hadn't tried this on a conda build yet. With the dev19 conda build, I ran:
module load warp/2.0.0dev19-conda
WARP_DEBUG=1
MCore --population m3/fid_all_v3.population --perdevice_refine 1 --first_iteration_fraction 0.7 --refine_particles --refine_imagewarp 6x4 --refine_volumewarp 2x3x2x10 --refine_stageangles --ctf_defocus
The result was the same
System.Collections.Generic.KeyNotFoundException: The given key 'Warp.Sociology.Species' was not present in the dictionary.
at Warp.TiltSeries.<>c__DisplayClass100_2.<PerformMultiParticleRefinement>b__29(Double[] input) in /usr/share/miniconda/envs/package-build/conda-bld/warp_1720472789965/work/WarpLib/TiltSeries.cs:line 6895
at Accord.Math.Optimization.NonlinearObjectiveFunction.CheckGradient(Func`2 value, Double[] probe)
at Accord.Math.Optimization.BaseGradientOptimizationMethod.Maximize()
at Warp.TiltSeries.PerformMultiParticleRefinement(String workingDirectory, ProcessingOptionsMPARefine optionsMPA, Species[] allSpecies, DataSource dataSource, Image gainRef, DefectModel defectMap, Action`1 progressCallback) in /usr/share/miniconda/envs/package-build/conda-bld/warp_1720472789965/work/WarpLib/TiltSeries.cs:line 7513
at WarpWorker.WarpWorker.EvaluateCommand(NamedSerializableObject Command) in /usr/share/miniconda/envs/package-build/conda-bld/warp_1720472789965/work/WarpWorker/WarpWorker.cs:line 723
I am now running several M runs with different parameters and will report back on the results.
That's the line number I was looking for, thank you!
The latest commit should fix this bug. I also hope I fixed #177 with a previous commit, please let me know if that's the case.
(--ctf_defocusexhaustive didn't trigger the bug because it doesn't trigger anything unless --ctf_defocus is also set. I've clarified this in the help text now.)
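In other words (a sketch based on the explanation above; the exhaustive search only takes effect on top of the gradient-based defocus refinement):
# exhaustive defocus search takes effect: both flags set
MCore --population m3/fid_all_v3.population --refine_particles --ctf_defocus --ctf_defocusexhaustive
# --ctf_defocusexhaustive alone doesn't trigger anything
MCore --population m3/fid_all_v3.population --refine_particles --ctf_defocusexhaustive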
It is a custom build, maybe @ThomasHoffmann77 can offer some clarity on this, but I don't see MCore.pdb in the dev20 directory. I don't see anything but the programs, actually, so I assume some stuff is in another location.
yes, I can confirm that we build the HPC module without debug symbols, using the dotnet publish parameters "-p:DebugType=None -p:DebugSymbols=false" (see https://github.com/easybuilders/easybuild-easyconfigs/blob/5f9ef17f6c9658d64cbca326e55255bd61ecce02/easybuild/easyconfigs/w/warp/warp-2.0.0dev0_cmake.patch). I can build a debug module on demand.
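For comparison, a symbol-carrying build would simply drop those two properties; roughly (run from the project directory; the real invocation is in the patch linked above):
# current HPC module: debug symbols stripped
dotnet publish -c Release -p:DebugType=None -p:DebugSymbols=false
# debug module: keep the default portable .pdb files next to the binaries
dotnet publish -c Release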
The latest commit should fix this bug. I also hope I fixed #177 with a previous commit, please let me know if that's the case.
On dev22, the issue in #177 seems to be fixed, but I still get the same KeyNotFound exception when trying to refine ctf.
Oddly, I don't see the updated "only works in combination with ctf_defocus" help text either.
I see now that those changes are not in dev22, never mind!
thanks @ThomasHoffmann77 and @jmdobbs - 2.0.0dev23 just pushed: https://github.com/warpem/warp/actions/runs/9940747597
will close this in a few days assuming the problem is solved unless we hear back!
On dev23 I notice different behavior, but the error still occurs. Previously it errored out during the refinement, at tomostar 20/154; now it errors out only for defocusexhaustive, and only at the very end (during the committing changes part). See below:
submitted:
module purge
module load warp
MCore --population m3/fid_all_v3.population --perdevice_refine 1 --first_iteration_fraction 0.7 --refine_particles --refine_imagewarp 6x4 --refine_volumewarp 2x3x2x10 --refine_stageangles --ctf_defocus --ctf_defocusexhaustive
MCore --population m3/fid_all_v3.population --perdevice_refine 1 --first_iteration_fraction 0.7 --refine_particles --refine_imagewarp 6x4 --refine_volumewarp 2x3x2x10 --refine_stageangles --ctf_defocus
job log:
Please set the environment variable $IMOD_CALIB_DIR if appropriate.
Loading population... Done
Creating directories... Done
Spawning workers... Done
Preparing for refinement – this will take a few minutes per species
Preparing refinement requisites...
0/3 1/3 2/3 3/3
Performing refinement
Preparing population for data source fid_all...Done
Loading gain reference for fid_all... Done
Refining all series in data source...
0/154 1/154 2/154 3/154 ... 152/154 153/154 154/154
Commiting changes in fid_all...Unhandled exception.Unhandled exception. Unhandled exception. System.NotImplementedException: The method or operation is not implemented.
at MCore.MCore.WorkerDied(Object sender, EventArgs e)
at Warp.WorkerWrapper.ReportDeath()
at Warp.WorkerWrapper.<StartHeartbeat>b__14_0()System.NotImplementedException: The method or operation is not implemented.
at MCore.MCore.WorkerDied(Object sender, EventArgs e)
at Warp.WorkerWrapper.ReportDeath()
at Warp.WorkerWrapper.<StartHeartbeat>b__14_0()System.NotImplementedException: The method or operation is not implemented.
at MCore.MCore.WorkerDied(Object sender, EventArgs e)
at Warp.WorkerWrapper.ReportDeath()
at Warp.WorkerWrapper.<StartHeartbeat>b__14_0()
/var/spool/slurm/job6990181/slurm_script: line 12: 394279 Aborted (core dumped) MCore --population m3/fid_all_v3.population --perdevice_refine 1 --first_iteration_fraction 0.7 --refine_particles --refine_imagewarp 6x4 --refine_volumewarp 2x3x2x10 --refine_stageangles --ctf_defocus --ctf_defocusexhaustive
Loading population... Done
Creating directories... Done
Spawning workers... Done
Preparing for refinement – this will take a few minutes per species
Preparing refinement requisites...
0/3 1/3 2/3 3/3
Performing refinement
Preparing population for data source fid_all...Done
Loading gain reference for fid_all... Done
Refining all series in data source...
0/154 1/154 2/154 3/154 ... 152/154 153/154 154/154
Commiting changes in fid_all...Done
Saving intermediate refinement results for fid_all...Done
Finishing refinement
Gathering intermediate results, then reconstructing and filtering...
0/3 sp1: 4.08 Å
1/3 sp2: 6.94 Å
2/3 sp3: 5.36 Å
3/3
worker crash report:
System.Collections.Generic.KeyNotFoundException: The given key 'Warp.Sociology.Species' was not present in the dictionary.
at Warp.TiltSeries.<>c__DisplayClass100_2.<PerformMultiParticleRefinement>b__29(Double[] input)
at Accord.Math.Optimization.NonlinearObjectiveFunction.CheckGradient(Func`2 value, Double[] probe)
at Accord.Math.Optimization.BaseGradientOptimizationMethod.Maximize()
at Warp.TiltSeries.PerformMultiParticleRefinement(String workingDirectory, ProcessingOptionsMPARefine optionsMPA, Species[] allSpecies, DataSource dataSource, Image gainRef, DefectModel defectMap, Action`1 progressCallback)
at WarpWorker.WarpWorker.EvaluateCommand(NamedSerializableObject Command)
@jmdobbs apologies for the delay here, I've been traveling. Could you run the same command with the conda build which will give us line numbers when it crashes?
cc @dtegunov this seems like it might be the last of the multispecies/multiGPU shenanigans
@alisterburt no problem at all, thank you for all your work. Unfortunately I can't; I would have already done so, but for administrative reasons we can no longer have new conda builds. If the line number would be really helpful (let us know), Thomas has mentioned he could build a new debug module and I could retry with that.
huh, okay thanks for the insight - the line number was definitely useful for Dimitry last time so it might be worth moving forward.
Out of interest, what is the issue with the conda builds? I'm aware of the recent license debacle with Anaconda the company, but we build against and depend on packages from conda-forge, nothing from Anaconda's repositories...
@dtegunov is taking a look at this
Hey @jmdobbs! Unfortunately, this one evades my understanding. I thought it would be somewhere in the exhaustive search code since it's triggered by that flag, but I don't see any deviations from the gradient-based search code, which runs ok. Having the line numbers would really help. @ThomasHoffmann77 bundling the release build with debug symbols shouldn't affect the performance.
Thanks to Thomas' quick work, I've been able to do some tests on a dev23 debug module. However, I've been unable to reproduce the error. I wonder if it may be GPU specific somehow, so I've changed my script to keep track of the gpu type, and will report back if I manage to reproduce it consistently. Otherwise, it seems to work well, dev23 has mostly fixed the issue!
Thank you for checking! This is extremely unlikely to be influenced by any aspects of the GPU since it's deep in the C# business logic code. Is it possible the first run somehow still used an older build?
I can't exclude that possibility, but I think it's unlikely given the different behavior. Before dev23, it was 100% consistent that it errored out on TS20 (presumably the first place where there were 0 particles in one species), but here it crashed at the committing-results stage. I'll try a few different things to reproduce it again, but otherwise I guess this could be closed.
Hi,
I've been running multispecies refinement of 3 species at 2 Å/px, box ~240, and have been running into errors which I assume are likely to be memory related (given the large particles); see below. However, I'm only doing --perdevice_refine 1 and asking for loads of memory in the slurm script.
Note that this does not always crash: I had a System.NotImplementedException when I did just poses and image warp, but it ran through when I did poses, image warp, volume warp, angles, and ctf_defocusexhaustive (on the same GPU, with the same submission command). It may be random?
On the Windows version, this was doable even with much weaker GPUs, so maybe there is some way to limit the memory requirements? If this is memory related, that is.
I recently used the highest-memory GPU node we have, an A40 node with a single GPU, and --cpus-per-gpu 64 --mem-per-gpu 514286. See below for the error log.
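For completeness, a minimal sketch of the kind of Slurm submission described above (the flag combination shown is one of those mentioned; GRES and partition details are site-specific and illustrative):
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-gpu=64
#SBATCH --mem-per-gpu=514286
module load warp
MCore --population m3/fid_all_v3.population --perdevice_refine 1 --refine_particles --refine_imagewarp 6x4 --refine_volumewarp 2x3x2x10 --refine_stageangles --ctf_defocusexhaustive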