warpem / warp

https://warpem.github.io/warp/
GNU General Public License v3.0

fix for multi-species denoising during refinement broke initial denoiser training during species creation #177

Closed FlorianBeckOle closed 2 days ago

FlorianBeckOle commented 2 weeks ago

Hi

I followed the tutorial up to the create species step. The command below gives the following error. One thing I noticed is that the path /usr/share/miniconda/envs/package-build/conda-bld/warp_1720036028036/work/MTools/MTools.cs differs from my installation:

which MTools
/fs/pool/pool-bmapps/hpcl8/app/soft/WARP/2.0.0dev18/conda3/envs/warp/bin/MTools

Am I doing something wrong?

thanks

Florian

testWarp:> MTools create_species --population m/10491.population --name apoferritin --diameter 130 --sym O --temporal_samples 1 --half1 relion/Refine3D/job002/run_half1_class001_unfil.mrc --half2 relion/Refine3D/job006/run_half2_class001_unfil.mrc --mask m/mask_4apx.mrc --particles_relion relion/Refine3D/job002/run_data.star --angpix_resample 0.7894 --lowpass 10

Running command create_species with:
population = m/10491.population
name = apoferritin
diameter = 130
sym = O
temporal_samples = 1
half1 = relion/Refine3D/job002/run_half1_class001_unfil.mrc
half2 = relion/Refine3D/job006/run_half2_class001_unfil.mrc
mask = m/mask_4apx.mrc
angpix =
angpix_resample = 0.7894
lowpass = 10
particles_relion = relion/Refine3D/job002/run_data.star
particles_m =
angpix_coords =
angpix_shifts =
ignore_unmatched = False

Reading maps... Done
--angpix not specified, using 4.0000 A/px from half-map.
Resampling maps to 0.7894 A/px... Done
Padding or cropping half-maps to 2x molecule diameter... Done
Padding or cropping mask to 2x molecule diameter... Done
Processing half-maps... Done
Parsing particle table... Done
Calculating resolution and training denoiser model...
4/5: Training denoising: Preparing mask... done.

Preparing data:
4/5: Training denoising: Preparing map 0... Adjusting the number of iterations to 1500 to match batch size and number of maps.

4/5: Training denoising: 0/1500Unhandled exception. System.Exception: The loss function has reached an invalid value because something went wrong during training.
   at Warp.NoiseNet3DTorch.TrainOnVolumes(NoiseNet3DTorch network, Image[] halves1, Image[] halves2, Image[] masks, Single angpix, Single lowpass, Single upsample, Boolean dontFlatten, Boolean performTraining, Int32 niterations, Single startFrom, Int32 batchsize, Int32 gpuprocess, Action`1 progressCallback) in /usr/share/miniconda/envs/package-build/conda-bld/warp_1720036028036/work/WarpLib/NNModels/NoiseNet3DTorch.cs:line 819
   at Warp.Sociology.Species.CalculateResolutionAndFilter(Single fixedResolution, Action`1 progressCallback, Int32 gpuID) in /usr/share/miniconda/envs/package-build/conda-bld/warp_1720036028036/work/WarpLib/Sociology/Species.cs:line 1650
   at MTools.Commands.CreateSpecies.Run(Object options) in /usr/share/miniconda/envs/package-build/conda-bld/warp_1720036028036/work/MTools/Commands/CreateSpecies.cs:line 582
   at MTools.MTools.Run(Object options) in /usr/share/miniconda/envs/package-build/conda-bld/warp_1720036028036/work/MTools/MTools.cs:line 32
   at CommandLine.ParserResultExtensions.WithParsed[T](ParserResult`1 result, Action`1 action)
   at MTools.MTools.Main(String[] args) in /usr/share/miniconda/envs/package-build/conda-bld/warp_1720036028036/work/MTools/MTools.cs:line 21

alisterburt commented 2 weeks ago

Hi Florian,

Don't worry about the path. I'm not sure of the exact details, but the paths you see in errors are often the paths to files at build time rather than at runtime.

There should be some logs inside your m folder too, do they say anything useful?

I assume you tried running a number of times, do inputs look normal otherwise?

alisterburt commented 2 weeks ago

I ran this using the latest conda build yesterday without issue, what GPU are you running on?

FlorianBeckOle commented 2 weeks ago

Hi,

I tried a Quadro RTX 5000 and an NVIDIA A40.

best

Florian



FlorianBeckOle commented 2 weeks ago

My install:

MTools --version
MTools 2.0.0+952054dad7ef651712bb325b0d8e2702aceaf811



alisterburt commented 2 weeks ago

There should be some logs inside your m folder too, do they say anything useful?

I assume you tried running a number of times, do inputs look normal otherwise?

FlorianBeckOle commented 2 weeks ago

Hi,

Sorry, I did not find any logs:

ls -lrta m/species/apoferritin_3c57475c/
total 2
drwxr-xr-x 2 fbeck b_cryo-em_tech 4096 Jul 9 14:17 .
drwxr-xr-x 8 fbeck b_cryo-em_tech 4096 Jul 9 14:26 ..

ls -lrta m/species/
total 8
drwxr-xr-x 3 fbeck b_cryo-em_tech 4096 Jul 9 14:01 ..
drwxr-xr-x 2 fbeck b_cryo-em_tech 4096 Jul 9 14:01 apoferritin_b86952e9
drwxr-xr-x 2 fbeck b_cryo-em_tech 4096 Jul 9 14:04 apoferritin_fa9c7148
drwxr-xr-x 2 fbeck b_cryo-em_tech 4096 Jul 9 14:07 apoferritin_8c9b9e3f
drwxr-xr-x 2 fbeck b_cryo-em_tech 4096 Jul 9 14:11 apoferritin_4b240dba
drwxr-xr-x 2 fbeck b_cryo-em_tech 4096 Jul 9 14:17 apoferritin_3c57475c
drwxr-xr-x 8 fbeck b_cryo-em_tech 4096 Jul 9 14:26 .
drwxr-xr-x 2 fbeck b_cryo-em_tech 4096 Jul 9 14:26 apoferritin_dfd6a877

best

Florian



alisterburt commented 2 weeks ago

Those logs are there somewhere; without them I don't have enough info to help you debug.

FlorianBeckOle commented 2 weeks ago

Hi,

Is there any verbose flag I can set?

ls -R M | grep log

ls -R M
M:
10491.population  mask_4apx.mrc  species

M/species: apoferritin_3c57475c apoferritin_4b240dba apoferritin_8c9b9e3f apoferritin_b86952e9 apoferritin_dfd6a877 apoferritin_fa9c7148

M/species/apoferritin_3c57475c:

M/species/apoferritin_4b240dba:

M/species/apoferritin_8c9b9e3f:

M/species/apoferritin_b86952e9:

M/species/apoferritin_dfd6a877:

M/species/apoferritin_fa9c7148:

I checked with warp for comparison:

ls -R warp_tiltseries | grep log
logs
warp_tiltseries/logs:
TS_11.log
TS_17.log
TS_1.log
TS_23.log
TS_32.log



alisterburt commented 2 weeks ago

Setting WARP_DEBUG=1 will add some debug output, though I don't know whether there is any debug output for species creation.
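
For example, it could be set just for a single invocation of the command from the first post (a sketch with the same flags as above; whether create_species actually emits extra debug output is unconfirmed):

WARP_DEBUG=1 MTools create_species \
  --population m/10491.population --name apoferritin --diameter 130 --sym O \
  --temporal_samples 1 \
  --half1 relion/Refine3D/job002/run_half1_class001_unfil.mrc \
  --half2 relion/Refine3D/job006/run_half2_class001_unfil.mrc \
  --mask m/mask_4apx.mrc \
  --particles_relion relion/Refine3D/job002/run_data.star \
  --angpix_resample 0.7894 --lowpass 10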

jmdobbs commented 2 weeks ago

I can also add that I am having the same error with both our conda and our optimized modules. However, sometimes I also get an extra error earlier from the nvfuser library (in this case using the dev19 conda module):

Reading maps... Done
--angpix not specified, using 5.0000 A/px from half-map.
Resampling maps to 2.0000 A/px... Done
Padding or cropping half-maps to 2x molecule diameter... Done
Padding or cropping mask to 2x molecule diameter... Done
Processing half-maps... Done
Parsing particle table... Done
Calculating resolution and training denoiser model...
4/5: Training denoising[W interface.cpp:47] Warning: Loading nvfuser library failed with: Error in dlopen: /g/easybuild/x86_64/Rocky/8/rome/software/PyTorch/2.0.1-foss-2022a-CUDA-11.8.0/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so: undefined symbol: _ZN3c106ivalue14ConstantString6createENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE (function LoadingNvfuserLibrary)
4/5: Training denoising: Preparing mask... done.                                                                                                           

Preparing data:
4/5: Training denoising: Preparing map 0... Adjusting the number of iterations to 1500 to match batch size and number of maps.                             

4/5: Training denoising: 0/1500Unhandled exception. System.Exception: The loss function has reached an invalid value because something went wrong during training.
   at Warp.NoiseNet3DTorch.TrainOnVolumes(NoiseNet3DTorch network, Image[] halves1, Image[] halves2, Image[] masks, Single angpix, Single lowpass, Single upsample, Boolean dontFlatten, Boolean performTraining, Int32 niterations, Single startFrom, Int32 batchsize, Int32 gpuprocess, Action`1 progressCallback) in /usr/share/miniconda/envs/package-build/conda-bld/warp_1720472789965/work/WarpLib/NNModels/NoiseNet3DTorch.cs:line 819
   at Warp.Sociology.Species.CalculateResolutionAndFilter(Single fixedResolution, Action`1 progressCallback, Int32 gpuID) in /usr/share/miniconda/envs/package-build/conda-bld/warp_1720472789965/work/WarpLib/Sociology/Species.cs:line 1650
   at MTools.Commands.CreateSpecies.Run(Object options) in /usr/share/miniconda/envs/package-build/conda-bld/warp_1720472789965/work/MTools/Commands/CreateSpecies.cs:line 582
   at MTools.MTools.Run(Object options) in /usr/share/miniconda/envs/package-build/conda-bld/warp_1720472789965/work/MTools/MTools.cs:line 32
   at CommandLine.ParserResultExtensions.WithParsed[T](ParserResult`1 result, Action`1 action)
   at MTools.MTools.Main(String[] args) in /usr/share/miniconda/envs/package-build/conda-bld/warp_1720472789965/work/MTools/MTools.cs:line 21
Aborted (core dumped)

This is on dev19. If I run on dev14, there is no issue (with the exact same command and parameters). I think the issue appeared when the multi-species problem was fixed.
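
For reference, the missing symbol in that nvfuser warning can be demangled and checked with standard binutils (the library path below is a placeholder, not our actual install):

# demangle the symbol nvfuser is looking for
echo '_ZN3c106ivalue14ConstantString6createENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE' | c++filt
# -> c10::ivalue::ConstantString::create(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)

# see whether the libtorch that actually gets loaded exports it
nm -D /path/to/site-packages/torch/lib/libtorch_cpu.so | grep ConstantString

If the symbol is absent there, the warning just means libnvfuser_codegen.so was built against a different libtorch than the one it is loaded with, which may be unrelated to the training failure.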

alisterburt commented 2 weeks ago

huh, thanks for the confirmation @jmdobbs - I'll have to see if anything changed between then and now

alisterburt commented 2 weeks ago

@jmdobbs I can't reproduce and can't see any changes to the noisenet models themselves https://github.com/warpem/warp/commits/main/WarpLib/NNModels/NoiseNet3DTorch.cs

I haven't checked more deeply, maybe something called from there changed... Without a reproducible example I can't debug. If you could find between which releases it broke, that would narrow down the range of changes significantly.
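
One way to do that would be to pin individual dev builds in throwaway conda environments and rerun the same create_species command in each (a sketch only; the channel list and version pin below are assumptions, follow the actual install docs):

# hypothetical: try a specific dev build in a disposable environment
conda create -n warp-dev16 -c warpem -c conda-forge warp=2.0.0dev16
conda run -n warp-dev16 MTools --version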

jmdobbs commented 2 weeks ago

@alisterburt I can confirm that, on our system, dev15 works and dev17 does not. We don't have dev16 so I can't nail it down exactly. This issue is 100% consistent for us as far as I know.

The exact command I used is below, but I think this has come up in every case where we (me and others) have tried to create a species:

MTools create_species -p m_testing/test.population -n testing -d 300 --angpix_resample 2 --lowpass 15 --half1 /struct/mahamid/jdobbs/path/run_half1_class001_unfil.mrc --half2 /struct/mahamid/jdobbs/path/run_half2_class001_unfil.mrc --mask /struct/mahamid/jdobbs/path/mask.mrc --particles_relion /struct/mahamid/jdobbs/path/run_data.star

alisterburt commented 2 weeks ago

Thanks @jmdobbs

dev15 is fc90124
dev17 is c50b893

The commit between those two which I suspect is causing the issue is be93c32 which was a confirmed fix for #156
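
For anyone who wants to look, the commits in that range can be listed from a clone of the repository:

git clone https://github.com/warpem/warp
cd warp
git log --oneline fc90124..c50b893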

@jmdobbs is it correct to say that this commit fixed multi-species denoising during refinement but broke the initial denoiser training during species creation?

jmdobbs commented 2 weeks ago

Yes, that definitely matches what we've observed. E.g. two days ago I ran multi-species successfully on dev20 (though quite often it fails due to the issue in #179) using species I created with dev14, because species creation was not working on dev20.

alisterburt commented 2 weeks ago

You're a machine @jmdobbs - this is incredibly useful

alisterburt commented 1 week ago

should be closed by https://github.com/warpem/warp/commit/f855473fe83aeffb58bed2bca8c4b4eb29ee474b

alisterburt commented 2 days ago

assuming fixed, please reopen if necessary 🙂