lfads / lfads-run-manager

Matlab interface for Latent Factor Analysis via Dynamical Systems (LFADS)
https://lfads.github.io/lfads-run-manager
Apache License 2.0

Fix lfadsqueue issue when no tmux server running #7

Closed djoshea closed 5 years ago

djoshea commented 6 years ago

Reported by Liza: tmux list-sessions fails when no tmux server is running.

djoshea commented 6 years ago

Temporary workaround is to open a tmux session, then run lfadsqueue within that session.
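
A minimal sketch of how a queue script could tolerate a missing tmux server instead of failing. It assumes that tmux list-sessions exits with a non-zero status when no server is running, as the report above suggests. The function name list_tmux_sessions and its injectable runner argument are hypothetical and are not taken from lfadsqueue.py. Python 3.5+ is assumed.

```python
import subprocess


def list_tmux_sessions(runner=subprocess.run):
    """Return the names of running tmux sessions, or [] if no server is up.

    `runner` is a hypothetical injection point (defaults to subprocess.run)
    so the function can be exercised without tmux installed.
    """
    result = runner(
        ["tmux", "list-sessions", "-F", "#{session_name}"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    # Assumption: a non-zero status here means "no server running",
    # which is treated as "no sessions" rather than as an error.
    if result.returncode != 0:
        return []
    return [line for line in result.stdout.splitlines() if line]
```

With this shape, the caller sees an empty list whether a server is running with no sessions or no server exists at all, so no surrounding tmux session is needed just to make the listing succeed.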

charlesbmi commented 5 years ago

I get a similar error when running python run_lfadsqueue.py in a normal shell (without tmux). However, the workaround doesn't seem to work for me.

Below are logs (run within a tmux session)

(base) charles@Dell-G5-5587:~/Development$ cd lfads-run-manager/src/
(base) charles@Dell-G5-5587:~/Development/lfads-run-manager/src$ conda activate tensorflow-gpu-py2
(tensorflow-gpu-py2) charles@Dell-G5-5587:~/Development/lfads-run-manager/src$ python ~/lorenz_example/runs/exampleSingleSession/run_lfadsqueue.py 
Warning: tmux sessions will be nested inside the current session
Queue: Launching TensorBoard on port 44275 in tmux session exampleSingleSession_tensorboard_port44275
bash /home/charles/lorenz_example/runs/exampleSingleSession/launch_tensorboard.sh --port=44275
Queue: Initializing with 1 GPUs and 12 CPUs, max 1 simultaneous tasks
Task lfads_param_YOs74u_run001_single_dataset001: INTERNAL ERROR. Exception was:
Traceback (most recent call last):
  File "/home/charles/Development/lfads-run-manager/src/lfadsqueue.py", line 343, in process_launch_task_in_tmux
    raise ChildProcessError('Tmux session immediately terminated running "{}" '.format(tmux_command))
ChildProcessError: Tmux session immediately terminated running "tmux new-session -ds lfads_param_YOs74u_run001_single_dataset001 'export PATH=/home/charles/.local/bin:/usr/local/cuda/bin:/usr/local/cuda/bin:/home/charles/anaconda3/envs/tensorflow-gpu-py2/bin:/home/charles/anaconda3/condabin:/home/charles/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin; export CUDA_VISIBLE_DEVICES=0; bash /home/charles/lorenz_example/runs/exampleSingleSession/param_YOs74u/single_dataset001/lfads_train.sh'" 

Queue: All tasks completed.
Queue: 0 skipped, 0 finished, 1 failed, 0 running

If I open a tmux session, then run python run_lfadsqueue.py from a separate shell (without tmux), I get a slightly different error:

(tensorflow-gpu-py2) charles@Dell-G5-5587:~/Development/lfads-run-manager/src$ python ~/lorenz_example/runs/exampleSingleSession/run_lfadsqueue.py 
Queue: Launching TensorBoard on port 56885 in tmux session exampleSingleSession_tensorboard_port56885
bash /home/charles/lorenz_example/runs/exampleSingleSession/launch_tensorboard.sh --port=56885
Queue: Initializing with 1 GPUs and 12 CPUs, max 1 simultaneous tasks
Task lfads_param_YOs74u_run001_single_dataset001: INTERNAL ERROR. Exception was:
Traceback (most recent call last):
  File "/home/charles/Development/lfads-run-manager/src/lfadsqueue.py", line 343, in process_launch_task_in_tmux
    raise ChildProcessError('Tmux session immediately terminated running "{}" '.format(tmux_command))
ChildProcessError: Tmux session immediately terminated running "tmux new-session -ds lfads_param_YOs74u_run001_single_dataset001 'export PATH=/usr/local/cuda/bin:/home/charles/anaconda3/envs/tensorflow-gpu-py2/bin:/home/charles/anaconda3/condabin:/home/charles/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin; export CUDA_VISIBLE_DEVICES=0; bash /home/charles/lorenz_example/runs/exampleSingleSession/param_YOs74u/single_dataset001/lfads_train.sh'" 

Queue: All tasks completed.
Queue: 0 skipped, 0 finished, 1 failed, 0 running

I don't normally use tmux, so any help would be appreciated. Thanks!

djoshea commented 5 years ago

Hey @charlesincharge, sorry for not getting back to you; I somehow wasn't getting email notifications from GitHub for a bit. I hope you've been able to resolve the issue, but in case you haven't, or if someone else encounters this: I don't think this is an issue with tmux itself, but with the actual command quoted in the error message. Can you try running this command directly at the command line? It is exactly what was queued up to run, minus the tmux session encapsulation:

export PATH=/usr/local/cuda/bin:/home/charles/anaconda3/envs/tensorflow-gpu-py2/bin:/home/charles/anaconda3/condabin:/home/charles/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin; export CUDA_VISIBLE_DEVICES=0; bash /home/charles/lorenz_example/runs/exampleSingleSession/param_YOs74u/single_dataset001/lfads_train.sh

I'm guessing there is some kind of issue just running the LFADS code, and maybe for some reason the lfadsqueue.py code isn't capturing the error message.

Thanks and sorry again! Dan
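
To illustrate the suggestion above, here is a minimal sketch of running a queued shell command outside tmux while keeping its error output. The helper name run_and_capture is hypothetical and this is not how lfadsqueue.py is implemented. It assumes Python 3.5+ and a POSIX shell.

```python
import subprocess


def run_and_capture(command):
    """Run a shell command string and return (exit_status, combined_output)."""
    result = subprocess.run(
        command,
        shell=True,                # the queued command is one shell string using ';'
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout so error text is not lost
        universal_newlines=True,
    )
    return result.returncode, result.stdout


if __name__ == "__main__":
    # Placeholder command; substitute the command quoted in the traceback.
    status, output = run_and_capture("echo 'something went wrong' >&2; exit 3")
    if status != 0:
        print("Command failed with status", status)
        print(output)
```

Merging stderr into stdout is a deliberate choice here: a failure message from the training script ends up in the same captured text as its normal output.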

djoshea commented 5 years ago

I closed this because I fixed the underlying issue I originally described, but please open a new issue if you're still seeing the problem you described, @charlesincharge.

Thanks!

charlesbmi commented 5 years ago

Thanks! It turns out I had forgotten to add LFADS to my PATH. The error was "Error: run_lfads.py not found on PATH. Ensure you add LFADS to your system PATH." Adding LFADS to my PATH fixed it.
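
A quick pre-flight check for this failure mode can be sketched with the standard library. The helper name find_on_path is hypothetical. The check uses shutil.which, which only matches files that are executable, so it is an approximation of what the training script does when it looks up run_lfads.py.

```python
import shutil


def find_on_path(program, search_path=None):
    """Return the full path of an executable `program`, or None if not found.

    When `search_path` is None, shutil.which searches the PATH environment
    variable of the current process.
    """
    return shutil.which(program, path=search_path)


if __name__ == "__main__":
    location = find_on_path("run_lfads.py")
    if location is None:
        print("run_lfads.py not found on PATH; add the LFADS directory to PATH first")
    else:
        print("run_lfads.py found at", location)
```

Running this in the same shell and environment used to launch run_lfadsqueue.py shows whether the queue would find the LFADS entry point before any task is started.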

panichem commented 2 years ago

Hey Dan,

I ran into the same initial problem as charles (the tmux session encapsulation obscuring output while debugging python run_lfadsqueue.py). As you suggested, I tried running the relevant commands directly (in this case, as a shell script):

#!/bin/bash

export PYTHONPATH=$PYTHONPATH:/oak/stanford/groups/tirin/data/RigE/singleTrialDynamics/toolboxes/models/research/lfads
export PYTHONPATH=$PYTHONPATH:/oak/stanford/groups/tirin/data/RigE/singleTrialDynamics/toolboxes/lfads-run-manager/src
export PATH=$PATH:/oak/stanford/groups/tirin/data/RigE/singleTrialDynamics/toolboxes/models/research/lfads
ml python/2.7
ml py-h5py/2.7.1_py27
ml py-tensorflow/1.5.0_py27
ml viz
ml py-numpy
ml py-scipy/1.1.0_py27
ml py-matplotlib
ml matlab

PATH=/share/software/user/restricted/matlab/R2020a/bin:/share/software/user/open/cups/2.2.4/bin:/share/software/user/open/gconf/2.9.91/bin:/share/software/user/open/orbit/2.14.19/bin:/share/software/user/open/libidl/0.8.14/bin:/share/software/user/open/gtk+/2.24.30/bin:/share/software/user/open/gdk-pixbuf/2.36.8/bin:/share/software/user/open/gobject-introspection/1.52.1/bin:/share/software/user/open/libtiff/4.0.8/bin:/share/software/user/open/libjpeg-turbo/1.5.1/bin:/share/software/user/open/pango/1.40.10/bin:/share/software/user/open/harfbuzz/1.4.8/bin:/share/software/user/open/icu/59.1/bin:/share/software/user/open/cairo/1.14.10/bin:/share/software/user/open/glib/2.52.3/bin:/share/software/user/open/py-matplotlib/2.2.2_py27/bin:/share/software/user/open/py-numpy/1.14.3_py27/bin:/share/software/user/open/python/2.7.13/bin:/share/software/user/open/sqlite/3.18.0/bin:/share/software/user/open/tcltk/8.6.6/bin:/share/software/user/open/libressl/2.5.3/bin:/share/software/user/open/xz/5.2.3/bin:/share/software/user/open/py-tensorflow/1.5.0_py27/bin:/share/software/user/open/py-h5py/2.7.1_py27/bin:/share/software/user/open/hdf5/1.10.2/bin:/share/software/user/open/openmpi/3.1.2/bin:/share/software/user/open/libfabric/1.6.0/bin:/share/software/user/open/ucx/1.3.1/bin:/usr/lib64/nvidia:/share/software/user/open/cuda/9.0.176/bin:/share/software/user/open/cuda/9.0.176/nvvm/bin:/share/software/user/srcc/bin:/share/software/user/open/x11/7.7/bin:/share/software/user/open/llvm/4.0.0/bin:/share/software/user/open/libxml2/2.9.4/bin:/share/software/user/open/fontconfig/2.12.4/bin:/share/software/user/open/freetype/2.8/bin:/share/software/user/open/libpng/1.2.57/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/users/mfp2/.local/bin:/home/users/mfp2/bin:/oak/stanford/groups/tirin/data/RigE/singleTrialDynamics/toolboxes/models/research/lfads; export CUDA_VISIBLE_DEVICES=0; bash /oak/stanford/groups/tirin/data/RigE/singleTrialDynamics/analysis/ppLFADS/runs/lookAndMgs/memory_on/exampleSingleSession/param_bQPjV2/single_aq_20211015_spikes/lfads_train.sh

I'm getting more informative output now, although I still haven't been able to generate an informative error message:

/share/software/user/open/py-h5py/2.7.1_py27/lib/python2.7/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
2022-01-23 11:00:37.031413: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2022-01-23 11:00:37.160938: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties: 
name: NVIDIA GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.62
pciBusID: 0000:05:00.0
totalMemory: 10.76GiB freeMemory: 10.61GiB
2022-01-23 11:00:37.161017: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:05:00.0, compute capability: 7.5)
/oak/stanford/groups/tirin/data/RigE/singleTrialDynamics/toolboxes/models/research/lfads/lfads.py:323: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.
  datasets[hps.dataset_names[0]]['train_data'].dtype, int), \
WARNING:tensorflow:From /oak/stanford/groups/tirin/data/RigE/singleTrialDynamics/toolboxes/models/research/lfads/utils.py:140: calling l2_normalize (from tensorflow.python.ops.nn_impl) with dim is deprecated and will be removed in a future version.
Instructions for updating:
dim is deprecated, use axis instead
/oak/stanford/groups/tirin/data/RigE/singleTrialDynamics/analysis/ppLFADS/runs/lookAndMgs/memory_on/exampleSingleSession/param_bQPjV2/single_aq_20211015_spikes/lfads_train.sh: line 22: 54792 Killed                  DISPLAY=:0 python $(which run_lfads.py) --data_dir=/oak/stanford/groups/tirin/data/RigE/singleTrialDynamics/analysis/ppLFADS/runs/lookAndMgs/memory_on/exampleSingleSession/param_bQPjV2/single_aq_20211015_spikes/lfadsInput --data_filename_stem=lfads --lfads_save_dir=/oak/stanford/groups/tirin/data/RigE/singleTrialDynamics/analysis/ppLFADS/runs/lookAndMgs/memory_on/exampleSingleSession/param_bQPjV2/single_aq_20211015_spikes/lfadsOutput --allow_gpu_growth=true --max_ckpt_to_keep=5 --max_ckpt_to_keep_lve=5 --device=/gpu:0 --learning_rate_init=0.010000 --learning_rate_decay_factor=0.980000 --learning_rate_n_to_compare=6 --learning_rate_stop=0.001000 --max_grad_norm=200.000000 --batch_size=40 --cell_clip_value=5.000000 --temporal_spike_jitter_width=0 --keep_prob=0.950000 --l2_gen_scale=500.000000 --l2_con_scale=500.000000 --co_mean_corr_scale=0.000000 --kl_ic_weight=1.000000 --kl_co_weight=1.000000 --kl_start_step=0 --kl_increase_steps=900 --l2_start_step=0 --l2_increase_steps=900 --ext_input_dim=0 --inject_ext_input_to_gen=false --co_dim=0 --prior_ar_atau=10.000000 --do_train_prior_ar_atau=true --prior_ar_nvar=0.100000 --do_train_prior_ar_nvar=true --do_causal_controller=false --do_feed_factors_to_controller=true --feedback_factors_or_rates=factors --controller_input_lag=1 --ci_enc_dim=128 --con_dim=128 --co_prior_var_scale=0.100000 --num_steps_for_gen_ic=4294967295 --ic_dim=64 --ic_enc_dim=64 --ic_prior_var_min=0.100000 --ic_prior_var_scale=0.100000 --ic_prior_var_max=0.100000 --ic_post_var_min=0.000100 --cell_weight_scale=1.000000 --gen_dim=64 --gen_cell_input_weight_scale=1.000000 --gen_cell_rec_weight_scale=1.000000 --factors_dim=8 --output_dist=poisson --do_train_readin=true --tf_debug_cli=false --tf_debug_tensorboard=false 
--tf_debug_tensorboard_hostport=localhost:6064 --debug_verbose=true --debug_print_each_step=true

Any advice on how to troubleshoot would be appreciated!

Thanks Matt
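
A note on reading the last log line: bash prints "Killed" when a child process is terminated by SIGKILL, so no Python traceback is produced. One common cause, which is an assumption here and not confirmed anywhere in this thread, is the process exceeding a memory limit enforced by the kernel or by a cluster scheduler. The sketch below shows how a wrapper could turn an exit status into a readable description. The helper names signal_name and describe_exit are hypothetical, and Python 3.5+ is assumed.

```python
import signal


def signal_name(number):
    """Return a symbolic name such as 'SIGKILL', or 'signal N' if unknown."""
    try:
        return signal.Signals(number).name
    except ValueError:
        return "signal {}".format(number)


def describe_exit(returncode):
    """Describe an exit status as reported by subprocess or by a shell."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # subprocess reports death by signal N as the negative value -N
        return "killed by " + signal_name(-returncode)
    if returncode > 128:
        # shells conventionally report death by signal N as status 128 + N
        return "exited with status {} (likely killed by {})".format(
            returncode, signal_name(returncode - 128)
        )
    return "exited with status {}".format(returncode)
```

The "likely" in the 128 + N branch is intentional: a program can also choose to exit with such a status on its own, so the mapping is a convention and not a guarantee.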