How can I run this code?

ogroth commented 6 years ago

Hey Yangshell, there's no official description of how to run the code yet. We're still experimenting with the architecture to find a good training setup that replicates the paper's results on the rooms_ring_camera dataset. Once we have a stable version and working model snapshots, we will merge the dev branch into master and write a detailed Readme. In the meantime, you can check the run script for training the GQN (provided you have downloaded the training data): https://github.com/ogroth/tf-gqn/blob/rooms_ring_camera_training/train_gqn_draw.py

However, we haven't managed to produce great visual results yet.

ogroth commented 6 years ago

We have released a stable version of GQN which trains on the rooms_ring_camera dataset with the default parameters we provide. Please see the Readme for detailed instructions on how to set up and run the code. The training script is: https://github.com/ogroth/tf-gqn/blob/master/train_gqn_draw.py

wlred commented 6 years ago

I downloaded the rooms_ring_camera dataset, but after I run the script, the system kills it. My computer has one 1080 Ti GPU. Is that not enough? Do I need a better one?

ogroth commented 6 years ago

Hi wlred, could you please give a more detailed description of the error you get? Which script did you run (with which parameters), and what happened after it was launched? Did it run into an out-of-memory error? A GTX 1080 Ti is definitely sufficient to train the model.

wlred commented 6 years ago

Hi ogroth, my steps:

(1) Download the rooms_ring_camera dataset.

(2) Run the commands:
    source venv/bin/activate
    python3 train_gqn_draw.py --data_dir /tmp/data/gqn-dataset --dataset rooms_ring_camera --model_dir /tmp/models/gqn

The output log is:

Training a GQN.
FLAGS: Namespace(batch_size=36, chkpt_steps=10000, data_dir='/tmp/data/gqn-dataset', dataset='rooms_ring_camera', debug=False, initial_eval=False, log_steps=100, memcap=1.0, model_dir='/tmp/models/gqn', queue_buffer=64, queue_threads=4, train_epochs=40)
UNPARSED_ARGV: ['--mode_dir', '/tmp/models/gqn']
INFO:tensorflow:Using config: {'_save_checkpoints_secs': None, '_log_step_count_steps': 100, '_global_id_in_cluster': 0, '_task_id': 0, '_service': None, '_session_config': gpu_options { per_process_gpu_memory_fraction: 1.0 allow_growth: true } , '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_master': '', '_task_type': 'worker', '_tf_random_seed': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fec2a484f28>, '_model_dir': '/tmp/models/gqn', '_save_checkpoints_steps': 10000, '_keep_checkpoint_every_n_hours': 10000, '_num_worker_replicas': 1, '_save_summary_steps': 100, '_keep_checkpoint_max': 5, '_train_distribute': None}
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
kill

(3) About 20 seconds after I launch the train_gqn_draw.py script, the system kills the process, and my computer becomes slow.

ogroth commented 6 years ago

You seem to have a typo in your CLI parameters when calling the script:

UNPARSED_ARGV: ['--mode_dir', '/tmp/models/gqn']

That should read: --model_dir /tmp/models/gqn
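As a minimal sketch of why the typo goes unnoticed: the FLAGS / UNPARSED_ARGV lines in the log look like the result of argparse's `parse_known_args`, which collects unrecognised arguments instead of failing. That the training script parses its flags this way is an assumption, and the helper names `build_parser` and `parse_strict` below are hypothetical. Only two of the flags from the log are reproduced, with the defaults shown there.

```python
import argparse


def build_parser():
    """Hypothetical parser with two of the flags that appear in the log."""
    parser = argparse.ArgumentParser(description="Flag parsing sketch.")
    parser.add_argument("--data_dir", type=str, default="/tmp/data/gqn-dataset")
    parser.add_argument("--model_dir", type=str, default="/tmp/models/gqn")
    return parser


def parse_strict(argv):
    """Parse argv and abort if any argument was not recognised."""
    flags, unparsed = build_parser().parse_known_args(argv)
    if unparsed:
        # A mistyped flag such as --mode_dir ends up here rather than in flags.
        raise SystemExit("Unrecognised arguments: %s" % unparsed)
    return flags


# Lenient parsing: the typo is silently set aside and the default is used.
flags, unparsed = build_parser().parse_known_args(["--mode_dir", "/tmp/models/gqn"])
print("FLAGS:", flags)
print("UNPARSED_ARGV:", unparsed)
```

With the lenient call, `model_dir` keeps its default value and the mistyped flag only shows up in the unparsed list, which matches the log above. A strict wrapper like `parse_strict` would stop the run immediately instead.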

wlred commented 6 years ago

I ran the command below, and the process is still killed:

python3 train_gqn_draw.py --data_dir /tmp/data/gqn-dataset --dataset rooms_ring_camera

wlred commented 6 years ago

The output log is:

Training a GQN.
FLAGS: Namespace(batch_size=36, chkpt_steps=10000, data_dir='/tmp/data/gqn-dataset', dataset='rooms_ring_camera', debug=False, initial_eval=False, log_steps=100, memcap=1.0, model_dir='/tmp/models/gqn', queue_buffer=64, queue_threads=4, train_epochs=40)
UNPARSED_ARGV: []
INFO:tensorflow:Using config: {'_save_checkpoints_secs': None, '_log_step_count_steps': 100, '_global_id_in_cluster': 0, '_task_id': 0, '_service': None, '_session_config': gpu_options { per_process_gpu_memory_fraction: 1.0 allow_growth: true } , '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_master': '', '_task_type': 'worker', '_tf_random_seed': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fec2a484f28>, '_model_dir': '/tmp/models/gqn', '_save_checkpoints_steps': 10000, '_keep_checkpoint_every_n_hours': 10000, '_num_worker_replicas': 1, '_save_summary_steps': 100, '_keep_checkpoint_max': 5, '_train_distribute': None}
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
kill

ogroth commented 6 years ago

Have you tried monitoring your system with htop and nvidia-smi to check whether there is any unusual behaviour in terms of CPU / GPU usage or memory allocation? That's the only thing I can think of off the top of my head that could cause the OS to kill the process. Which OS are you using?
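If watching htop by hand is awkward, a small script can log free host memory while training starts up. This is only a sketch, assuming a Linux host that exposes `/proc/meminfo` with a `MemAvailable` field; the helper names `parse_meminfo`, `available_gib`, and `watch` are hypothetical and not part of the repository.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: integer value}."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "name: value" shape
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])  # memory fields are in kB
    return info


def available_gib(path="/proc/meminfo"):
    """Return the memory the kernel reports as available, in GiB."""
    with open(path) as handle:
        info = parse_meminfo(handle.read())
    return info["MemAvailable"] / (1024 ** 2)


def watch(interval_s=1.0, samples=30):
    """Print available host memory once per interval."""
    for _ in range(samples):
        print("available RAM: %.2f GiB" % available_gib())
        time.sleep(interval_s)
```

Running `watch()` in a second terminal right before launching the training script would show whether available RAM drops towards zero in the seconds before the process is killed.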

wlred commented 6 years ago

Ubuntu 16.04

wlred commented 6 years ago

Hi ogroth, how much memory does your computer have?

ogroth commented 6 years ago

We've trained on machines with 32GB of RAM, but training never occupied more than 8GB at any time.
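To compare against that figure on another machine, the peak resident memory of a Python process can be read from the standard library. This is a rough sketch, assuming a Unix-like system where the `resource` module is available; the helper name `peak_rss_gib` is hypothetical, and the number covers host RAM only, not GPU memory.

```python
import resource
import sys


def peak_rss_gib():
    """Return the peak resident set size of the current process in GiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 ** 3 if sys.platform == "darwin" else 1024 ** 2
    return peak / divisor


print("peak host memory so far: %.2f GiB" % peak_rss_gib())
```

Calling this periodically from inside the training process (for example every few hundred steps) would show how close it gets to the installed RAM.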

wlred commented 6 years ago

Interesting. I have run your code on 3 computers, and none of them can run it; the process is killed every time. Maybe a lot of people have the same problem.

ogroth / tf-gqn

How can I run this code? #8