Yangshell closed this issue 6 years ago
We have released a stable version of GQN which trains on the rooms_ring_camera dataset with the default parameters we provide. The training script is: https://github.com/ogroth/tf-gqn/blob/master/train_gqn_draw.py Please see the Readme for detailed instructions on how to set up and run the code.
I downloaded the rooms_ring_camera dataset, but after I run the script, the system kills it. My computer has one 1080Ti GPU. Is that not enough? Does it need a better one?
Hi wlred, could you please give a more detailed version of the error you get? Which script have you run (with which parameters) and what happened after it had been launched? Has it run into an out-of-memory error? A GTX 1080Ti is definitely sufficient to train the model.
Hi ogroth, my steps:
(1) Download the rooms_ring_camera dataset.
(2) Run the commands:
source venv/bin/activate
python3 train_gqn_draw.py --data_dir /tmp/data/gqn-dataset --dataset rooms_ring_camera --model_dir /tmp/models/gqn
The output log is:
Training a GQN.
FLAGS: Namespace(batch_size=36, chkpt_steps=10000, data_dir='/tmp/data/gqn-dataset', dataset='rooms_ring_camera', debug=False, initial_eval=False, log_steps=100, memcap=1.0, model_dir='/tmp/models/gqn', queue_buffer=64, queue_threads=4, train_epochs=40)
UNPARSED_ARGV: ['--mode_dir', '/tmp/models/gqn']
INFO:tensorflow:Using config: {'_save_checkpoints_secs': None, '_log_step_count_steps': 100, '_global_id_in_cluster': 0, '_task_id': 0, '_service': None, '_session_config': gpu_options { per_process_gpu_memory_fraction: 1.0 allow_growth: true } , '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_master': '', '_task_type': 'worker', '_tf_random_seed': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fec2a484f28>, '_model_dir': '/tmp/models/gqn', '_save_checkpoints_steps': 10000, '_keep_checkpoint_every_n_hours': 10000, '_num_worker_replicas': 1, '_save_summary_steps': 100, '_keep_checkpoint_max': 5, '_train_distribute': None}
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
kill
(3) About 20 seconds after I start the train_gqn_draw.py script, the system kills the process, and my computer becomes slow.
You seem to have a typo in your CLI parameters when calling the script:
UNPARSED_ARGV: ['--mode_dir', '/tmp/models/gqn']
That should read: --model_dir /tmp/models/gqn
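As an illustration of why the misspelled flag produced no error: the FLAGS / UNPARSED_ARGV pair in the log is what one would expect if the script uses argparse's parse_known_args, which returns unrecognised arguments instead of rejecting them. That is an assumption about the script, not something confirmed here. A minimal sketch under that assumption, with parse_flags and require_no_unparsed as hypothetical helper names:

```python
import argparse


def parse_flags(argv):
    """Parse the flags we know about and return (flags, unparsed).

    Only two flags are declared here for illustration; the defaults mirror
    the values visible in the log and are otherwise assumptions.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_dir", type=str, default="/tmp/data/gqn-dataset")
    parser.add_argument("--model_dir", type=str, default="/tmp/models/gqn")
    # parse_known_args does not fail on unknown flags; it hands them back
    # as a second return value, so a typo silently falls back to the default.
    return parser.parse_known_args(argv)


def require_no_unparsed(unparsed):
    """Stop early if any command-line argument was not recognised."""
    if unparsed:
        raise SystemExit("Unrecognised arguments: {}".format(unparsed))
```

Calling parse_flags(["--mode_dir", "/tmp/models/gqn"]) leaves both tokens in the unparsed list while model_dir keeps its default, which matches the log. A check like require_no_unparsed would turn such a typo into an immediate error.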
I ran the command: python3 train_gqn_draw.py --data_dir /tmp/data/gqn-dataset --dataset rooms_ring_camera
It is still killed.
The output log is:
Training a GQN.
FLAGS: Namespace(batch_size=36, chkpt_steps=10000, data_dir='/tmp/data/gqn-dataset', dataset='rooms_ring_camera', debug=False, initial_eval=False, log_steps=100, memcap=1.0, model_dir='/tmp/models/gqn', queue_buffer=64, queue_threads=4, train_epochs=40)
UNPARSED_ARGV: []
INFO:tensorflow:Using config: {'_save_checkpoints_secs': None, '_log_step_count_steps': 100, '_global_id_in_cluster': 0, '_task_id': 0, '_service': None, '_session_config': gpu_options { per_process_gpu_memory_fraction: 1.0 allow_growth: true } , '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_master': '', '_task_type': 'worker', '_tf_random_seed': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fec2a484f28>, '_model_dir': '/tmp/models/gqn', '_save_checkpoints_steps': 10000, '_keep_checkpoint_every_n_hours': 10000, '_num_worker_replicas': 1, '_save_summary_steps': 100, '_keep_checkpoint_max': 5, '_train_distribute': None}
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
kill
Have you tried monitoring your system with htop and nvidia-smi to check whether there is any unusual behaviour in terms of CPU / GPU usage or memory allocation? That's the only thing I can think of off the top of my head that could cause the OS to kill the process. Which OS are you using?
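One way to check whether the OS killed the process for memory reasons is to look for out-of-memory killer entries in the kernel log (for example the output of dmesg). The sketch below assumes a Linux system; the matched phrases are typical of kernel OOM messages, but the exact wording varies between kernel versions, so treat the pattern as an assumption. find_oom_lines is a hypothetical helper name.

```python
import re

# Phrases that commonly appear in the Linux kernel log when the OOM killer
# acts. Exact wording differs between kernel versions (assumption).
OOM_PATTERN = re.compile(r"out of memory|oom-killer|killed process", re.IGNORECASE)


def find_oom_lines(kernel_log_text):
    """Return the lines of a kernel log that look like OOM-killer activity."""
    return [
        line
        for line in kernel_log_text.splitlines()
        if OOM_PATTERN.search(line)
    ]
```

The text passed in could be captured from dmesg or read from a kernel log file. If the training process shows up in the returned lines, host RAM rather than the GPU is the likely bottleneck.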
Ubuntu 16.04
Hi ogroth, how much RAM does your computer have?
We've trained on machines with 32GB of RAM, but training never occupied more than 8GB at any time.
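To compare a run against the 8GB figure, the peak host memory of a Python process can be read from the standard library. This is a minimal sketch assuming a Unix-like system (the resource module is not available on Windows); peak_rss_gib is a hypothetical helper and not part of tf-gqn.

```python
import resource
import sys


def peak_rss_gib():
    """Return the peak resident set size of the current process in GiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    peak_bytes = peak if sys.platform == "darwin" else peak * 1024
    return peak_bytes / 1024 ** 3
```

Logging this value periodically during training would show how close the process gets to the machine's physical RAM before it is killed.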
Interesting. I have run your code on 3 computers, and none of them can run it; the process is killed on all of them. Maybe a lot of people have the same problem.
Hey Yangshell, there's no official description of how to run the code yet. We're still experimenting with the architecture to find a good training setup to replicate the paper's results on the rooms_ring_camera dataset. Once we have a stable version and working model snapshots, we will merge the dev branch into master and write a detailed Readme. In the meantime, you can check https://github.com/ogroth/tf-gqn/blob/rooms_ring_camera_training/train_gqn_draw.py This is the run script for training the GQN (provided you have downloaded the training data). However, we haven't managed to produce great visual results yet.