lfads / lfads-run-manager

Matlab interface for Latent Factor Analysis via Dynamical Systems (LFADS)
https://lfads.github.io/lfads-run-manager
Apache License 2.0

minimal error message when debugging lfadsqueue #31

Open panichem opened 2 years ago

panichem commented 2 years ago

Hey Dan,

I ran into the same initial problem as charlesincharge (the tmux session encapsulation obscuring output while debugging python run_lfadsqueue.py). As you suggested, I tried running the relevant commands directly (in this case, as a shell script):

#!/bin/bash

export PYTHONPATH=$PYTHONPATH:/oak/stanford/groups/tirin/data/RigE/singleTrialDynamics/toolboxes/models/research/lfads
export PYTHONPATH=$PYTHONPATH:/oak/stanford/groups/tirin/data/RigE/singleTrialDynamics/toolboxes/lfads-run-manager/src
export PATH=$PATH:/oak/stanford/groups/tirin/data/RigE/singleTrialDynamics/toolboxes/models/research/lfads
ml python/2.7
ml py-h5py/2.7.1_py27
ml py-tensorflow/1.5.0_py27
ml viz
ml py-numpy
ml py-scipy/1.1.0_py27
ml py-matplotlib
ml matlab

PATH=/share/software/user/restricted/matlab/R2020a/bin:/share/software/user/open/cups/2.2.4/bin:/share/software/user/open/gconf/2.9.91/bin:/share/software/user/open/orbit/2.14.19/bin:/share/software/user/open/libidl/0.8.14/bin:/share/software/user/open/gtk+/2.24.30/bin:/share/software/user/open/gdk-pixbuf/2.36.8/bin:/share/software/user/open/gobject-introspection/1.52.1/bin:/share/software/user/open/libtiff/4.0.8/bin:/share/software/user/open/libjpeg-turbo/1.5.1/bin:/share/software/user/open/pango/1.40.10/bin:/share/software/user/open/harfbuzz/1.4.8/bin:/share/software/user/open/icu/59.1/bin:/share/software/user/open/cairo/1.14.10/bin:/share/software/user/open/glib/2.52.3/bin:/share/software/user/open/py-matplotlib/2.2.2_py27/bin:/share/software/user/open/py-numpy/1.14.3_py27/bin:/share/software/user/open/python/2.7.13/bin:/share/software/user/open/sqlite/3.18.0/bin:/share/software/user/open/tcltk/8.6.6/bin:/share/software/user/open/libressl/2.5.3/bin:/share/software/user/open/xz/5.2.3/bin:/share/software/user/open/py-tensorflow/1.5.0_py27/bin:/share/software/user/open/py-h5py/2.7.1_py27/bin:/share/software/user/open/hdf5/1.10.2/bin:/share/software/user/open/openmpi/3.1.2/bin:/share/software/user/open/libfabric/1.6.0/bin:/share/software/user/open/ucx/1.3.1/bin:/usr/lib64/nvidia:/share/software/user/open/cuda/9.0.176/bin:/share/software/user/open/cuda/9.0.176/nvvm/bin:/share/software/user/srcc/bin:/share/software/user/open/x11/7.7/bin:/share/software/user/open/llvm/4.0.0/bin:/share/software/user/open/libxml2/2.9.4/bin:/share/software/user/open/fontconfig/2.12.4/bin:/share/software/user/open/freetype/2.8/bin:/share/software/user/open/libpng/1.2.57/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/users/mfp2/.local/bin:/home/users/mfp2/bin:/oak/stanford/groups/tirin/data/RigE/singleTrialDynamics/toolboxes/models/research/lfads; export CUDA_VISIBLE_DEVICES=0; bash /oak/stanford/groups/tirin/data/RigE/singleTrialDynamics/analysis/ppLFADS/runs/lookAndMgs/memory_on/exampleSingleSession/param_bQPjV2/single_aq_20211015_spikes/lfads_train.sh

I'm getting more informative output now, although I still haven't been able to generate an informative error message:

/share/software/user/open/py-h5py/2.7.1_py27/lib/python2.7/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
2022-01-23 11:00:37.031413: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2022-01-23 11:00:37.160938: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties: 
name: NVIDIA GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.62
pciBusID: 0000:05:00.0
totalMemory: 10.76GiB freeMemory: 10.61GiB
2022-01-23 11:00:37.161017: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:05:00.0, compute capability: 7.5)
/oak/stanford/groups/tirin/data/RigE/singleTrialDynamics/toolboxes/models/research/lfads/lfads.py:323: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.
  datasets[hps.dataset_names[0]]['train_data'].dtype, int), \
WARNING:tensorflow:From /oak/stanford/groups/tirin/data/RigE/singleTrialDynamics/toolboxes/models/research/lfads/utils.py:140: calling l2_normalize (from tensorflow.python.ops.nn_impl) with dim is deprecated and will be removed in a future version.
Instructions for updating:
dim is deprecated, use axis instead
/oak/stanford/groups/tirin/data/RigE/singleTrialDynamics/analysis/ppLFADS/runs/lookAndMgs/memory_on/exampleSingleSession/param_bQPjV2/single_aq_20211015_spikes/lfads_train.sh: line 22: 54792 Killed                  DISPLAY=:0 python $(which run_lfads.py) --data_dir=/oak/stanford/groups/tirin/data/RigE/singleTrialDynamics/analysis/ppLFADS/runs/lookAndMgs/memory_on/exampleSingleSession/param_bQPjV2/single_aq_20211015_spikes/lfadsInput --data_filename_stem=lfads --lfads_save_dir=/oak/stanford/groups/tirin/data/RigE/singleTrialDynamics/analysis/ppLFADS/runs/lookAndMgs/memory_on/exampleSingleSession/param_bQPjV2/single_aq_20211015_spikes/lfadsOutput --allow_gpu_growth=true --max_ckpt_to_keep=5 --max_ckpt_to_keep_lve=5 --device=/gpu:0 --learning_rate_init=0.010000 --learning_rate_decay_factor=0.980000 --learning_rate_n_to_compare=6 --learning_rate_stop=0.001000 --max_grad_norm=200.000000 --batch_size=40 --cell_clip_value=5.000000 --temporal_spike_jitter_width=0 --keep_prob=0.950000 --l2_gen_scale=500.000000 --l2_con_scale=500.000000 --co_mean_corr_scale=0.000000 --kl_ic_weight=1.000000 --kl_co_weight=1.000000 --kl_start_step=0 --kl_increase_steps=900 --l2_start_step=0 --l2_increase_steps=900 --ext_input_dim=0 --inject_ext_input_to_gen=false --co_dim=0 --prior_ar_atau=10.000000 --do_train_prior_ar_atau=true --prior_ar_nvar=0.100000 --do_train_prior_ar_nvar=true --do_causal_controller=false --do_feed_factors_to_controller=true --feedback_factors_or_rates=factors --controller_input_lag=1 --ci_enc_dim=128 --con_dim=128 --co_prior_var_scale=0.100000 --num_steps_for_gen_ic=4294967295 --ic_dim=64 --ic_enc_dim=64 --ic_prior_var_min=0.100000 --ic_prior_var_scale=0.100000 --ic_prior_var_max=0.100000 --ic_post_var_min=0.000100 --cell_weight_scale=1.000000 --gen_dim=64 --gen_cell_input_weight_scale=1.000000 --gen_cell_rec_weight_scale=1.000000 --factors_dim=8 --output_dist=poisson --do_train_readin=true --tf_debug_cli=false --tf_debug_tensorboard=false 
--tf_debug_tensorboard_hostport=localhost:6064 --debug_verbose=true --debug_print_each_step=true

Any advice on how to troubleshoot would be appreciated!

Thanks,
Matt

Originally posted by @panichem in https://github.com/lfads/lfads-run-manager/issues/7#issuecomment-1019553480

djoshea commented 2 years ago

Hey Matt, as far as I can tell, you're getting a few warnings from h5py:

FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters

And some warnings from TensorFlow:

FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.

and:

WARNING:tensorflow .... calling l2_normalize (from tensorflow.python.ops.nn_impl) with dim is deprecated and will be removed in a future version. Instructions for updating: dim is deprecated, use axis instead

Both of these are probably fixable, but they just reflect the age of the TensorFlow code and don't actually cause the script to crash; I've gotten them too.

What actually happens is just that the process was killed:

.../single_aq_20211015_spikes/lfads_train.sh: line 22: 54792 Killed                  

On Sherlock, I'd bet you simply ran out of memory. If you're using sdev for this, try requesting more memory, e.g. sdev -m 64 -p shenoy. See https://www.sherlock.stanford.edu/docs/user-guide/running-jobs/#dedicated-nodes
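For anyone reading a bare "Killed" line like the one above, here is a minimal sketch of how an exit status can be interpreted. It relies on two conventions: bash reports 128 + N for a command terminated by signal N, and Python's subprocess module reports -N. The helper name describe_exit is hypothetical and is not part of lfads-run-manager.

```python
import signal


def describe_exit(returncode):
    """Explain an exit status as reported by a shell (128 + N) or subprocess (-N)."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # subprocess convention: negative value means "terminated by signal -returncode"
        signum = -returncode
    elif returncode > 128:
        # shell convention: 128 + signal number
        signum = returncode - 128
    else:
        return "exited with error code {}".format(returncode)
    if signum == signal.SIGKILL:
        # SIGKILL with no traceback is typical of the OOM killer or a scheduler memory limit
        return "killed by SIGKILL (often an out-of-memory kill)"
    return "terminated by signal {}".format(signum)
```

A status of 137 from lfads_train.sh, or -9 from subprocess, would both map to the SIGKILL case. That fits an out-of-memory kill but does not prove it.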

panichem commented 2 years ago

Hi Dan, thanks for catching this! Now, when I launch an interactive session with more system memory

salloc -p gpu --gpus 1 --mem=32GB

The shell script mentioned above runs to completion!
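To pick a memory request by measurement rather than by guessing, one option is to record the peak resident memory of a run that did complete. The sketch below uses the standard-library resource module. The helper name peak_child_rss_kb is hypothetical, and the unit assumption (kilobytes) holds on Linux but not on every platform.

```python
import resource
import subprocess
import sys


def peak_child_rss_kb(argv):
    """Run argv to completion and return the peak RSS of waited-for children.

    On Linux ru_maxrss is in kilobytes. RUSAGE_CHILDREN reports the maximum
    over all children this process has waited for, so call this from a fresh
    interpreter to measure a single command.
    """
    subprocess.call(argv)
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss


if __name__ == "__main__":
    # Example: measure a trivial Python child; substitute the training command.
    print(peak_child_rss_kb([sys.executable, "-c", "pass"]))
```

Replacing the example argv with ["bash", "/path/to/lfads_train.sh"] (path elided here) would give a rough lower bound for the --mem value.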

However, I'm still running into trouble running run_lfadsqueue.py directly.

When I run

export PYTHONPATH=$PYTHONPATH:/oak/stanford/groups/tirin/data/RigE/singleTrialDynamics/toolboxes/models/research/lfads
export PYTHONPATH=$PYTHONPATH:/oak/stanford/groups/tirin/data/RigE/singleTrialDynamics/toolboxes/lfads-run-manager/src
export PATH=$PATH:/oak/stanford/groups/tirin/data/RigE/singleTrialDynamics/toolboxes/models/research/lfads
ml python/2.7
ml py-h5py/2.7.1_py27
ml py-tensorflow/1.5.0_py27
ml viz
ml py-numpy
ml py-scipy/1.1.0_py27
ml py-matplotlib
ml matlab

python ./run_lfadsqueue.py

The script seems to be crashing because I haven't launched tmux:

Traceback (most recent call last):
  File "./run_lfadsqueue.py", line 10, in <module>
    tasks = lq.run_lfads_queue(queue_name, tensorboard_script, task_specs, gpu_list=gpu_list, one_task_per_gpu=True)
  File "/oak/stanford/groups/tirin/data/RigE/singleTrialDynamics/toolboxes/lfads-run-manager/src/lfadsqueue.py", line 465, in run_lfads_queue
    running_tensorboard_sessions = get_list_tmux_sessions_name_starts_with(tensorboard_session_prefix)
  File "/oak/stanford/groups/tirin/data/RigE/singleTrialDynamics/toolboxes/lfads-run-manager/src/lfadsqueue.py", line 269, in get_list_tmux_sessions_name_starts_with
    return filter(lambda sess: sess.startswith(prefix), get_list_tmux_sessions())
  File "/oak/stanford/groups/tirin/data/RigE/singleTrialDynamics/toolboxes/lfads-run-manager/src/lfadsqueue.py", line 263, in get_list_tmux_sessions
    raise(exc)
subprocess.CalledProcessError: Command '['tmux', 'list-sessions', '-F', '#{session_name}']' returned non-zero exit status 1
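The traceback shows tmux list-sessions returning exit status 1, which tmux does when no server is running yet. Below is a minimal sketch of a wrapper that treats that case as "no sessions" instead of raising. It is a hypothetical helper for illustration, not the lfadsqueue.py implementation, and the run parameter exists only so the behaviour can be exercised without tmux installed.

```python
import subprocess


def list_tmux_sessions(run=subprocess.check_output):
    """Return tmux session names, or [] when tmux reports an error or is missing."""
    cmd = ["tmux", "list-sessions", "-F", "#{session_name}"]
    try:
        out = run(cmd, stderr=subprocess.STDOUT)
    except subprocess.CalledProcessError:
        # tmux exits non-zero when no server is running: treat as an empty list
        return []
    except OSError:
        # the tmux binary could not be executed at all
        return []
    return [line for line in out.decode("utf-8", "replace").splitlines() if line]


def sessions_starting_with(prefix, run=subprocess.check_output):
    """Filter session names by prefix, mirroring the lookup in the traceback."""
    return [name for name in list_tmux_sessions(run) if name.startswith(prefix)]
```

With a wrapper like this, an empty session list on a machine with no tmux server would simply mean "no TensorBoard session is running yet".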

Launching tmux and running the same chunk of code:

tmux
export PYTHONPATH=$PYTHONPATH:/oak/stanford/groups/tirin/data/RigE/singleTrialDynamics/toolboxes/models/research/lfads
export PYTHONPATH=$PYTHONPATH:/oak/stanford/groups/tirin/data/RigE/singleTrialDynamics/toolboxes/lfads-run-manager/src
export PATH=$PATH:/oak/stanford/groups/tirin/data/RigE/singleTrialDynamics/toolboxes/models/research/lfads
ml python/2.7
ml py-h5py/2.7.1_py27
ml py-tensorflow/1.5.0_py27
ml viz
ml py-numpy
ml py-scipy/1.1.0_py27
ml py-matplotlib
ml matlab

python ./run_lfadsqueue.py

I get the following output, which warns about nested tmux sessions and gives the same generic error I first ran into (although I now know that, with enough memory, the single-quoted commands run without error when launched directly):

Warning: tmux sessions will be nested inside the current session
Queue: Launching TensorBoard on port 46431 in tmux session exampleSingleSession_tensorboard_port46431
bash /oak/stanford/groups/tirin/data/RigE/singleTrialDynamics/analysis/ppLFADS/runs/lookAndMgs/memory_on/exampleSingleSession/launch_tensorboard.sh --port=46431
Queue: Initializing with 1 GPUs and 32 CPUs, max 1 simultaneous tasks
Task lfads_param_bQPjV2_run001_single_aq_20211015_spikes: INTERNAL ERROR. Exception was:
Traceback (most recent call last):
  File "/oak/stanford/groups/tirin/data/RigE/singleTrialDynamics/toolboxes/lfads-run-manager/src/lfadsqueue.py", line 351, in process_launch_task_in_tmux
    raise ChildProcessError('Tmux session immediately terminated running "{}" '.format(tmux_command))
ChildProcessError: Tmux session immediately terminated running "tmux new-session -ds lfads_param_bQPjV2_run001_single_aq_20211015_spikes 'export PATH=/share/software/user/srcc/bin:/share/software/user/restricted/matlab/R2020a/bin:/share/software/user/open/cups/2.2.4/bin:/share/software/user/open/gconf/2.9.91/bin:/share/software/user/open/orbit/2.14.19/bin:/share/software/user/open/libidl/0.8.14/bin:/share/software/user/open/gtk+/2.24.30/bin:/share/software/user/open/gdk-pixbuf/2.36.8/bin:/share/software/user/open/gobject-introspection/1.52.1/bin:/share/software/user/open/libtiff/4.0.8/bin:/share/software/user/open/libjpeg-turbo/1.5.1/bin:/share/software/user/open/pango/1.40.10/bin:/share/software/user/open/harfbuzz/1.4.8/bin:/share/software/user/open/icu/59.1/bin:/share/software/user/open/cairo/1.14.10/bin:/share/software/user/open/glib/2.52.3/bin:/share/software/user/open/py-matplotlib/2.2.2_py27/bin:/share/software/user/open/py-numpy/1.14.3_py27/bin:/share/software/user/open/python/2.7.13/bin:/share/software/user/open/sqlite/3.18.0/bin:/share/software/user/open/tcltk/8.6.6/bin:/share/software/user/open/libressl/2.5.3/bin:/share/software/user/open/xz/5.2.3/bin:/share/software/user/open/py-tensorflow/1.5.0_py27/bin:/share/software/user/open/py-h5py/2.7.1_py27/bin:/share/software/user/open/hdf5/1.10.2/bin:/share/software/user/open/openmpi/3.1.2/bin:/share/software/user/open/libfabric/1.6.0/bin:/share/software/user/open/ucx/1.3.1/bin:/share/software/user/open/x11/7.7/bin:/share/software/user/open/llvm/4.0.0/bin:/share/software/user/open/libxml2/2.9.4/bin:/share/software/user/open/fontconfig/2.12.4/bin:/share/software/user/open/freetype/2.8/bin:/share/software/user/open/libpng/1.2.57/bin:/usr/lib64/nvidia:/share/software/user/open/cuda/9.0.176/bin:/share/software/user/open/cuda/9.0.176/nvvm/bin:/share/software/user/srcc/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/users/mfp2/.local/bin:/home/users/mfp2/bin:/opt/dell/srvadmin/bin:/opt/dell/srvadmin/iSM/bin:/oak/stanford/groups/tirin/data/RigE/singleTrialDynamics/toolboxes/models/research/lfads:/home/users/mfp2/.local/bin:/home/users/mfp2/bin; export CUDA_VISIBLE_DEVICES=0; bash /oak/stanford/groups/tirin/data/RigE/singleTrialDynamics/analysis/ppLFADS/runs/lookAndMgs/memory_on/exampleSingleSession/param_bQPjV2/single_aq_20211015_spikes/lfads_train.sh'" 

I suspect I need to be calling tmux differently but have hit a dead end. Any idea where I might be going wrong?
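One way to narrow this down is to check whether the current shell is already inside tmux, and to run the single-quoted command string outside tmux while capturing its exit code and output, so that whatever made the session terminate immediately is visible. The sketch below assumes bash is on PATH. Both helper names are hypothetical and are not part of lfads-run-manager.

```python
import os
import subprocess


def inside_tmux(environ=None):
    """tmux sets the TMUX variable in shells it starts, so its presence implies nesting."""
    environ = os.environ if environ is None else environ
    return bool(environ.get("TMUX"))


def run_and_capture(command):
    """Run a shell command string with bash; return (exit_code, combined stdout/stderr)."""
    proc = subprocess.Popen(
        ["bash", "-c", command],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so error text is not lost
    )
    output, _ = proc.communicate()
    return proc.returncode, output.decode("utf-8", "replace")


if __name__ == "__main__":
    if inside_tmux():
        print("Already inside tmux: new sessions will be nested.")
    # Paste the single-quoted part of the failing tmux command here.
    code, text = run_and_capture("echo placeholder command")
    print(code)
    print(text)
```

If the command exits cleanly here but the tmux session still terminates immediately, the difference is more likely in the environment tmux starts with than in the training script itself.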