Failure to run train - Githubissues

zhanghj59 commented 2 months ago

Hi,

I am trying to run drgnai train, but it fails with errors. I am not sure whether something is wrong with my input file. I am using the particles.star produced by RELION particle extraction. The file and the error message are as follows:

# version 30001

data_particles

loop_
_rlnCoordinateX #1
_rlnCoordinateY #2
_rlnAnglePsi #3
_rlnClassNumber #4
_rlnAutopickFigureOfMerit #5
_rlnImageName #6
_rlnMicrographName #7
_rlnOpticsGroup #8
_rlnCtfMaxResolution #9
_rlnCtfFigureOfMerit #10
_rlnDefocusU #11
_rlnDefocusV #12
_rlnDefocusAngle #13
_rlnCtfBfactor #14
_rlnCtfScalefactor #15
_rlnPhaseShift #16
_rlnAngleRot #17
_rlnAngleTilt #18
_rlnOriginXAngst #19
_rlnOriginYAngst #20
_rlnNormCorrection #21
_rlnLogLikeliContribution #22
_rlnMaxValueProbDistribution #23
_rlnNrOfSignificantSamples #24
_rlnGroupName #25
 2576.000000   212.000000   121.488377           12     0.602679 000001@Extract/job009/movies/FoilHole_4263761_Data_4219626_4219628_2
onCorr/job002/movies/FoilHole_4263761_Data_4219626_4219628_20230105_143619_fractions.mrc            1     7.797826     0.061968 16942
    0.000000     1.000000     0.000000     0.000000     0.000000     -7.60254    -10.76754     0.658928 14259.532281     0.296648
 4760.000000  1132.000000   -165.38662           12     0.255255 000002@Extract/job009/movies/FoilHole_4263761_Data_4219626_4219628_2
onCorr/job002/movies/FoilHole_4263761_Data_4219626_4219628_20230105_143619_fractions.mrc            1     7.797826     0.061968 16942
    0.000000     1.000000     0.000000     0.000000     0.000000     5.057460     -1.27254     0.624937 14329.113093     0.899794
 4616.000000  1352.000000    -97.88662           12     0.360992 000003@Extract/job009/movies/FoilHole_4263761_Data_4219626_4219628_2
onCorr/job002/movies/FoilHole_4263761_Data_4219626_4219628_20230105_143619_fractions.mrc            1     7.797826     0.061968 16942
    0.000000     1.000000     0.000000     0.000000     0.000000     -4.43754    11.387460     0.628000 14372.570548     0.382559
 1672.000000  1932.000000    -41.63662           12     0.377718 000004@Extract/job009/movies/FoilHole_4263761_Data_4219626_4219628_2
onCorr/job002/movies/FoilHole_4263761_Data_4219626_4219628_20230105_143619_fractions.mrc            1     7.797826     0.061968 16942
    0.000000     1.000000     0.000000     0.000000     0.000000     1.892460     -4.43754     0.627556 14394.861171     0.507937
 1820.000000  2168.000000    70.863377           12     0.415047 000005@Extract/job009/movies/FoilHole_4263761_Data_4219626_4219628_2
onCorr/job002/movies/FoilHole_4263761_Data_4219626_4219628_20230105_143619_fractions.mrc            1     7.797826     0.061968 16942
    0.000000     1.000000     0.000000     0.000000     0.000000     1.892460     -4.43754     0.624716 14379.502107     0.876613
 2476.000000  2432.000000    -97.88662           12     0.236454 000006@Extract/job009/movies/FoilHole_4263761_Data_4219626_4219628_2
onCorr/job002/movies/FoilHole_4263761_Data_4219626_4219628_20230105_143619_fractions.mrc            1     7.797826     0.061968 16942
    0.000000     1.000000     0.000000     0.000000     0.000000     1.892460    -10.76754     0.647760 14334.581067     0.411722
 4720.000000  2584.000000   -131.63662           12     0.582083 000007@Extract/job009/movies/FoilHole_4263761_Data_4219626_4219628_2
onCorr/job002/movies/FoilHole_4263761_Data_4219626_4219628_20230105_143619_fractions.mrc            1     7.797826     0.061968 16942
    0.000000     1.000000     0.000000     0.000000     0.000000     -1.27254     -4.43754     0.645655 14304.960065     0.448409
 4288.000000  2848.000000    87.738377           12     0.255089 000008@Extract/job009/movies/FoilHole_4263761_Data_4219626_4219628_2
onCorr/job002/movies/FoilHole_4263761_Data_4219626_4219628_20230105_143619_fractions.mrc            1     7.797826     0.061968 16942

Error message:

(drgnai) [zhanghj@mgmt proteasome-test]$ drgnai train drgnai_test1
(INFO) (reconstruct.py) (29-Aug-24 23:04:59) Using existing output directory which does not yet contain any drgnai output!.
(INFO) (reconstruct.py) (29-Aug-24 23:05:02) Number of available gpus: 4
(INFO) (reconstruct.py) (29-Aug-24 23:05:02) Use cuda True
(INFO) (reconstruct.py) (29-Aug-24 23:05:02) Will write tensorboard summaries in drgnai_test1/out/summaries
(INFO) (reconstruct.py) (29-Aug-24 23:05:02) Creating dataset
Traceback (most recent call last):
  File "/tools/miniconda3/envs/drgnai/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3802, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 165, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5745, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5753, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: '_rlnImageName'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/tools/miniconda3/envs/drgnai/lib/python3.9/site-packages/cryodrgnai/dataset.py", line 30, in load_particles
    particles = starfile.Starfile.load(mrcs_txt_star, relion31=relion31).get_particles(datadir=datadir,
  File "/tools/miniconda3/envs/drgnai/lib/python3.9/site-packages/cryodrgnai/starfile.py", line 90, in get_particles
    particles = self.df['_rlnImageName']
  File "/tools/miniconda3/envs/drgnai/lib/python3.9/site-packages/pandas/core/frame.py", line 3807, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/tools/miniconda3/envs/drgnai/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3804, in get_loc
    raise KeyError(key) from err
KeyError: '_rlnImageName'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tools/miniconda3/envs/drgnai/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3802, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 165, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5745, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5753, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: '_rlnImageName'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/tools/miniconda3/envs/drgnai/bin/drgnai", line 8, in <module>
    sys.exit(run_cryodrgn_ai())
  File "/tools/miniconda3/envs/drgnai/lib/python3.9/site-packages/cryodrgnai/command_line.py", line 170, in run_cryodrgn_ai
    args.func(args)
  File "/tools/miniconda3/envs/drgnai/lib/python3.9/site-packages/cryodrgnai/command_line.py", line 305, in train_experiment
    trainer = ModelTrainer(args.outdir, configs, args.load)
  File "/tools/miniconda3/envs/drgnai/lib/python3.9/site-packages/cryodrgnai/reconstruct.py", line 264, in __init__
    self.data = dataset.MRCData(
  File "/tools/miniconda3/envs/drgnai/lib/python3.9/site-packages/cryodrgnai/dataset.py", line 90, in __init__
    particles_real = load_particles(mrcfile, lazy=False, datadir=datadir, relion31=relion31)
  File "/tools/miniconda3/envs/drgnai/lib/python3.9/site-packages/cryodrgnai/dataset.py", line 35, in load_particles
    particles = starfile.Starfile.load(mrcs_txt_star, relion31=relion31).get_particles(datadir=datadir,
  File "/tools/miniconda3/envs/drgnai/lib/python3.9/site-packages/cryodrgnai/starfile.py", line 90, in get_particles
    particles = self.df['_rlnImageName']
  File "/tools/miniconda3/envs/drgnai/lib/python3.9/site-packages/pandas/core/frame.py", line 3807, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/tools/miniconda3/envs/drgnai/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3804, in get_loc
    raise KeyError(key) from err
KeyError: '_rlnImageName'

Thank you very much for your assistance.

michal-g commented 2 months ago

Hi, sorry to hear you are having trouble! To help figure out what's going on, can you tell us:

1) How you installed DRGN-AI and, if possible, the version of the installed package.
2) The contents of your drgnai_test1/configs.yaml file.

zhanghj59 commented 2 months ago

Thanks!

I installed DRGN-AI (version 1.0.1) using the following script:

The configs.yaml file:

[zhanghj@spgpu drgnai_test1]$ more configs.yaml
particles: /share/zhanglab/zhanghj_data/projects/proteasome-test/Extract/job009/particles.star
ctf: /share/zhanglab/zhanghj_data/projects/proteasome-test/Extract/job009/ctf.pkl
pose: null
quick_config:
  capture_setup: spa
  reconstruction_type: het
  pose_estimation: abinit
  conf_estimation: autodecoder

I have tried to put the particles.star and the ctf.pkl file in the Extract/job009/movies or the work_directory of relion, but it didn't work.

The conda version we use is conda 23.1.0.

I use the following script to activate drgnai:

[zhanghj@spgpu drgnai_test1]$ source /tools/miniconda3/etc/profile.d/conda.sh
[zhanghj@spgpu drgnai_test1]$ conda activate drgnai
(drgnai) [zhanghj@spgpu drgnai_test1]$ drgnai test
Installation was successful!

michal-g commented 2 months ago

Can you try adding the line relion31: True to your configs.yaml and running again?

We also recommend using v0.3.1 instead of v1.0.1, especially if doing ab-initio reconstruction, as we found a bug in the pose search algorithm (#8)! This can be retrieved from the top of the repository tree using git fetch; git pull from within your checked-out repo.
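
For context on the relion31 option: STAR files written by RELION 3.1 and later (marked by the "# version 30001" header) normally contain more than one data block, typically data_optics followed by data_particles, so a reader that expects a single table may not find _rlnImageName where it looks. Below is a minimal sketch, independent of DRGN-AI's own parser, for listing the blocks and column labels in a STAR file. The function name list_star_blocks and the example values and paths are hypothetical.

```python
def list_star_blocks(text):
    """Map each data block name to the column labels found inside it."""
    blocks = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("data_"):
            # A new block starts, e.g. "data_optics" or "data_particles".
            current = line
            blocks[current] = []
        elif line.startswith("_") and current is not None:
            # Labels look like "_rlnImageName #6"; keep only the label.
            blocks[current].append(line.split()[0])
    return blocks


# Made-up two-block file in the RELION 3.1 layout (values are illustrative).
example = """\
# version 30001

data_optics

loop_
_rlnOpticsGroup #1
_rlnImagePixelSize #2
1 1.978125

# version 30001

data_particles

loop_
_rlnImageName #1
_rlnOpticsGroup #2
000001@Extract/job009/stack.mrcs 1
"""

for name, labels in list_star_blocks(example).items():
    print(name, labels)
```

To inspect a real file, pass the result of open("particles.star").read() to the function and check that _rlnImageName appears under data_particles.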

zhanghj59 commented 2 months ago

Hi! @michal-g

I added the line relion31: True to my configs.yaml and ran again using v0.3.1, but it still failed.

Most of the time, the server crashed as the first round of training began, without any error message.

configs.yaml file:

particles: /share/zhanglab/zhanghj_data/projects/proteasome-test/Extract/job010/particles.star
ctf: /share/zhanglab/zhanghj_data/projects/proteasome-test/Extract/job010/ctf.pkl
pose: null
relion31: ture
quick_config:
  capture_setup: spa
  reconstruction_type: het
  pose_estimation: abinit
  conf_estimation: autodecoder

(WARNING) (reconstruct.py) (11-Sep-24 23:01:20) Output directory `out/` already exists here!.Renaming the old one to `old-out_005_abinit-het4`.
(INFO) (reconstruct.py) (11-Sep-24 23:01:25) Number of available gpus: 4
(INFO) (reconstruct.py) (11-Sep-24 23:01:25) Use cuda True
(INFO) (reconstruct.py) (11-Sep-24 23:01:25) Will write tensorboard summaries in drgnai_test1/out/summaries
(INFO) (reconstruct.py) (11-Sep-24 23:01:25) Creating dataset
(INFO) (dataset.py) (11-Sep-24 23:04:08) Loaded 144702 128x128 images
(INFO) (dataset.py) (11-Sep-24 23:04:08) Windowing images with radius 0.85
(INFO) (dataset.py) (11-Sep-24 23:04:09) Computing FFT
(INFO) (dataset.py) (11-Sep-24 23:04:09) Spawning 16 processes
(INFO) (dataset.py) (11-Sep-24 23:05:07) Symmetrizing image data
(INFO) (dataset.py) (11-Sep-24 23:05:21) Normalized HT by 0 +/- 102.6186752319336
(INFO) (dataset.py) (11-Sep-24 23:05:35) Normalized real space images by 0.011543781496584415 +/- 0.8028228878974915
(INFO) (reconstruct.py) (11-Sep-24 23:05:38) Loading ctf params from /share/zhanglab/zhanghj_data/projects/proteasome-test/Extract/job010/ctf.pkl
(INFO) (ctf.py) (11-Sep-24 23:05:38) Image size (pix)  : 128
(INFO) (ctf.py) (11-Sep-24 23:05:38) A/pix             : 1.978124976158142
(INFO) (ctf.py) (11-Sep-24 23:05:38) DefocusU (A)      : 16942.06640625
(INFO) (ctf.py) (11-Sep-24 23:05:38) DefocusV (A)      : 16821.533203125
(INFO) (ctf.py) (11-Sep-24 23:05:38) Dfang (deg)       : 32.922245025634766
(INFO) (ctf.py) (11-Sep-24 23:05:38) voltage (kV)      : 300.0
(INFO) (ctf.py) (11-Sep-24 23:05:38) cs (mm)           : 2.700000047683716
(INFO) (ctf.py) (11-Sep-24 23:05:38) w                 : 0.10000000149011612
(INFO) (ctf.py) (11-Sep-24 23:05:38) Phase shift (deg) : 0.0
(INFO) (reconstruct.py) (11-Sep-24 23:05:39) Building lattice
(INFO) (reconstruct.py) (11-Sep-24 23:05:39) Heterogeneous reconstruction with z_dim = 4
(INFO) (reconstruct.py) (11-Sep-24 23:05:39) Initializing model...
(INFO) (reconstruct.py) (11-Sep-24 23:05:39) DrgnAI(
  (pose_table): PoseTable()
  (conf_table): ConfTable()
  (hypervolume): HyperVolume(
    (mlp): ResidualLinearMLP(
      (main): Sequential(
        (0): Linear(in_features=388, out_features=256, bias=True)
        (1): ReLU()
        (2): ResidualLinear(
          (linear): Linear(in_features=256, out_features=256, bias=True)
        )
        (3): ReLU()
        (4): ResidualLinear(
          (linear): Linear(in_features=256, out_features=256, bias=True)
        )
        (5): ReLU()
        (6): ResidualLinear(
          (linear): Linear(in_features=256, out_features=256, bias=True)
        )
        (7): ReLU()
        (8): MyLinear(in_features=256, out_features=1, bias=True)
      )
    )
  )
)
(INFO) (reconstruct.py) (11-Sep-24 23:05:39) 2033641 parameters in model
(INFO) (reconstruct.py) (11-Sep-24 23:05:39) Model initialized. Moving to GPU...
(INFO) (reconstruct.py) (11-Sep-24 23:05:42) --- Training Starts Now ---
(INFO) (reconstruct.py) (11-Sep-24 23:05:42) Will pretrain on 10000 particles
(INFO) (reconstruct.py) (11-Sep-24 23:05:42) Will make a full summary at the end of this epoch
(INFO) (reconstruct.py) (11-Sep-24 23:05:57) # [Train Epoch: -1/103] [10112/144702 particles]
(INFO) (reconstruct.py) (11-Sep-24 23:05:58) # =====> SGD Epoch: -1 finished in 0:00:15.502928; total loss = 144509901.308861
(INFO) (analysis.py) (11-Sep-24 23:05:59) Explained variance ratio:
(INFO) (analysis.py) (11-Sep-24 23:05:59) [0.31302254 0.27805947 0.24219864 0.16671935]
(INFO) (reconstruct.py) (11-Sep-24 23:06:00) Will use pose search on 144702 particles
(INFO) (reconstruct.py) (11-Sep-24 23:06:00) Will make a full summary at the end of this epoch

Only once, it reported an error message.

I'm wondering whether this is a software environment configuration issue or a software version issue. Could you tell me an environment and software versions that are known to work?

Thank you very much for your assistance!

michal-g commented 1 month ago

Hi, can you try these things:

1) Double-checking the spelling of "True" in your configs.yaml, which has a typo in the message above.
2) Rerunning with CUDA_LAUNCH_BLOCKING=1 as discussed here, e.g. export CUDA_LAUNCH_BLOCKING=1; drgnai train drgnai_test1. This makes CUDA kernel launches synchronous, so any error is reported at the call that actually caused it.
3) Checking the version of the GPU drivers you have installed, e.g. using nvidia-smi, as this will help figure out if this is indeed a problem with the software environment!
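
On the spelling check: a YAML loader reads a misspelled value such as ture as the plain string "ture", not as a boolean, and usually does so without any warning. Below is a rough sketch of a check for this using only the standard library. It assumes relion31 is a top-level key that is meant to be a boolean; the helper name check_boolean_keys is hypothetical, and it only handles simple "key: value" lines without inline comments. How DRGN-AI itself treats a non-boolean value is not confirmed here.

```python
import re

# Spellings that YAML 1.1 loaders such as PyYAML resolve to booleans.
BOOL_WORDS = {
    form
    for word in ("true", "false", "yes", "no", "on", "off")
    for form in (word, word.capitalize(), word.upper())
}


def check_boolean_keys(yaml_text, keys=("relion31",)):
    """Return (line number, key, value) for listed keys whose value is not a YAML boolean."""
    problems = []
    for lineno, line in enumerate(yaml_text.splitlines(), start=1):
        # Only unindented "key: value" lines are considered (top-level keys).
        match = re.match(r"^(\w+):\s*(\S*)\s*$", line)
        if match and match.group(1) in keys and match.group(2) not in BOOL_WORDS:
            problems.append((lineno, match.group(1), match.group(2)))
    return problems


config = "particles: particles.star\npose: null\nrelion31: ture\n"
print(check_boolean_keys(config))  # [(3, 'relion31', 'ture')]
```

An empty list means every listed key carries a value that a YAML loader will read as a boolean.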

-Mike
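
For anyone who prefers to script the last two checks, here is a small sketch in Python. It assumes nvidia-smi is on the PATH when an NVIDIA driver is installed, and that the drgnai command and the drgnai_test1 directory are as in this thread; the helper names blocking_cuda_env and gpu_driver_info are hypothetical. The environment variable documented by CUDA is spelled CUDA_LAUNCH_BLOCKING.

```python
import os
import shutil
import subprocess


def blocking_cuda_env(base_env=None):
    """Copy an environment and enable synchronous CUDA kernel launches."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def gpu_driver_info():
    """Return GPU names and driver versions from nvidia-smi, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.strip() if result.returncode == 0 else None


# Example usage (left commented out so that importing this file runs nothing):
# print(gpu_driver_info())
# subprocess.run(["drgnai", "train", "drgnai_test1"], env=blocking_cuda_env())
```

Passing the modified environment only to the training process avoids leaving CUDA_LAUNCH_BLOCKING set for other jobs in the same shell, where it would slow them down.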

ml-struct-bio / drgnai

Failure to run train #6