tkwong / parameter_server

Apache License 2.0

Hang at InitTable #32

Open tkwong opened 6 years ago

tkwong commented 6 years ago

Run parameters and attached logs:

proj10:/data/opt/tmp/1155101317/parameter_server/logs/1512996467/update_1_perm_False_tran_True_delay_500>ps -afe | grep ssh | grep Log
1155101+ 106642 171067  0 21:08 ?        00:00:00 ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null proj6 env GLOG_logtostderr=true GLOG_v=2 LIBHDFS3_CONF=/data/opt/course/hadoop/etc/hadoop/hdfs-site.xml /data/opt/tmp/1155101317/parameter_server/build/LogisticRegression --my_id=0 --n_iters=100 --config_file=/data/opt/tmp/1155101317/parameter_server/machinefiles/5node --with_injected_straggler_delay_percent=500 --batch_size=100 --get_updated_workload_rate=1 --activate_scheduler=True --prefetch_model_before_batch=False --n_workers_per_node=5 --activate_transient_straggler=True --activate_permanent_straggler=False --hdfs_master_port=33430 --input=hdfs://proj10:9000/datasets/classification/avazu-app-part/ --n_features=1000000 >> /data/opt/tmp/1155101317/parameter_server/logs/1512996467/update_1_perm_False_tran_True_delay_500/output.0.log 2>&1 
1155101+ 106643 171067  0 21:08 ?        00:00:00 ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null proj7 env GLOG_logtostderr=true GLOG_v=2 LIBHDFS3_CONF=/data/opt/course/hadoop/etc/hadoop/hdfs-site.xml /data/opt/tmp/1155101317/parameter_server/build/LogisticRegression --my_id=1 --n_iters=100 --config_file=/data/opt/tmp/1155101317/parameter_server/machinefiles/5node --with_injected_straggler_delay_percent=500 --batch_size=100 --get_updated_workload_rate=1 --activate_scheduler=True --prefetch_model_before_batch=False --n_workers_per_node=5 --activate_transient_straggler=True --activate_permanent_straggler=False --hdfs_master_port=33430 --input=hdfs://proj10:9000/datasets/classification/avazu-app-part/ --n_features=1000000 >> /data/opt/tmp/1155101317/parameter_server/logs/1512996467/update_1_perm_False_tran_True_delay_500/output.1.log 2>&1 
1155101+ 106644 171067  0 21:08 ?        00:00:00 ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null proj8 env GLOG_logtostderr=true GLOG_v=2 LIBHDFS3_CONF=/data/opt/course/hadoop/etc/hadoop/hdfs-site.xml /data/opt/tmp/1155101317/parameter_server/build/LogisticRegression --my_id=2 --n_iters=100 --config_file=/data/opt/tmp/1155101317/parameter_server/machinefiles/5node --with_injected_straggler_delay_percent=500 --batch_size=100 --get_updated_workload_rate=1 --activate_scheduler=True --prefetch_model_before_batch=False --n_workers_per_node=5 --activate_transient_straggler=True --activate_permanent_straggler=False --hdfs_master_port=33430 --input=hdfs://proj10:9000/datasets/classification/avazu-app-part/ --n_features=1000000 >> /data/opt/tmp/1155101317/parameter_server/logs/1512996467/update_1_perm_False_tran_True_delay_500/output.2.log 2>&1 
1155101+ 106645 171067  0 21:08 ?        00:00:00 ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null proj9 env GLOG_logtostderr=true GLOG_v=2 LIBHDFS3_CONF=/data/opt/course/hadoop/etc/hadoop/hdfs-site.xml /data/opt/tmp/1155101317/parameter_server/build/LogisticRegression --my_id=3 --n_iters=100 --config_file=/data/opt/tmp/1155101317/parameter_server/machinefiles/5node --with_injected_straggler_delay_percent=500 --batch_size=100 --get_updated_workload_rate=1 --activate_scheduler=True --prefetch_model_before_batch=False --n_workers_per_node=5 --activate_transient_straggler=True --activate_permanent_straggler=False --hdfs_master_port=33430 --input=hdfs://proj10:9000/datasets/classification/avazu-app-part/ --n_features=1000000 >> /data/opt/tmp/1155101317/parameter_server/logs/1512996467/update_1_perm_False_tran_True_delay_500/output.3.log 2>&1 
proj10:/data/opt/tmp/1155101317/parameter_server/logs/1512996467/update_1_perm_False_tran_True_delay_500>

output.3.log output.0.log output.2.log output.1.log

tkwong commented 6 years ago

output.3.log output.1.log output.2.log output.0.log

1155101+ 144564 171067  0 01:21 ?        00:00:00 ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null proj6 env GLOG_logtostderr=true GLOG_v=2 LIBHDFS3_CONF=/data/opt/course/hadoop/etc/hadoop/hdfs-site.xml /data/opt/tmp/1155101317/parameter_server/build/LogisticRegression --my_id=0 --n_iters=100 --config_file=/data/opt/tmp/1155101317/parameter_server/machinefiles/5node --with_injected_straggler_delay_percent=100 --batch_size=100 --get_updated_workload_rate=5 --activate_scheduler=True --prefetch_model_before_batch=False --n_workers_per_node=5 --activate_transient_straggler=False --activate_permanent_straggler=True --hdfs_master_port=21918 --input=hdfs://proj10:9000/datasets/classification/avazu-app-part/ --n_features=1000000 >> /data/opt/tmp/1155101317/parameter_server/logs/1512996467/update_5_perm_True_tran_False_delay_100/output.0.log 2>&1 
1155101+ 144565 171067  0 01:21 ?        00:00:00 ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null proj7 env GLOG_logtostderr=true GLOG_v=2 LIBHDFS3_CONF=/data/opt/course/hadoop/etc/hadoop/hdfs-site.xml /data/opt/tmp/1155101317/parameter_server/build/LogisticRegression --my_id=1 --n_iters=100 --config_file=/data/opt/tmp/1155101317/parameter_server/machinefiles/5node --with_injected_straggler_delay_percent=100 --batch_size=100 --get_updated_workload_rate=5 --activate_scheduler=True --prefetch_model_before_batch=False --n_workers_per_node=5 --activate_transient_straggler=False --activate_permanent_straggler=True --hdfs_master_port=21918 --input=hdfs://proj10:9000/datasets/classification/avazu-app-part/ --n_features=1000000 >> /data/opt/tmp/1155101317/parameter_server/logs/1512996467/update_5_perm_True_tran_False_delay_100/output.1.log 2>&1 
1155101+ 144568 171067  0 01:21 ?        00:00:00 ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null proj8 env GLOG_logtostderr=true GLOG_v=2 LIBHDFS3_CONF=/data/opt/course/hadoop/etc/hadoop/hdfs-site.xml /data/opt/tmp/1155101317/parameter_server/build/LogisticRegression --my_id=2 --n_iters=100 --config_file=/data/opt/tmp/1155101317/parameter_server/machinefiles/5node --with_injected_straggler_delay_percent=100 --batch_size=100 --get_updated_workload_rate=5 --activate_scheduler=True --prefetch_model_before_batch=False --n_workers_per_node=5 --activate_transient_straggler=False --activate_permanent_straggler=True --hdfs_master_port=21918 --input=hdfs://proj10:9000/datasets/classification/avazu-app-part/ --n_features=1000000 >> /data/opt/tmp/1155101317/parameter_server/logs/1512996467/update_5_perm_True_tran_False_delay_100/output.2.log 2>&1 
1155101+ 144569 171067  0 01:21 ?        00:00:00 ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null proj9 env GLOG_logtostderr=true GLOG_v=2 LIBHDFS3_CONF=/data/opt/course/hadoop/etc/hadoop/hdfs-site.xml /data/opt/tmp/1155101317/parameter_server/build/LogisticRegression --my_id=3 --n_iters=100 --config_file=/data/opt/tmp/1155101317/parameter_server/machinefiles/5node --with_injected_straggler_delay_percent=100 --batch_size=100 --get_updated_workload_rate=5 --activate_scheduler=True --prefetch_model_before_batch=False --n_workers_per_node=5 --activate_transient_straggler=False --activate_permanent_straggler=True --hdfs_master_port=21918 --input=hdfs://proj10:9000/datasets/classification/avazu-app-part/ --n_features=1000000 >> /data/opt/tmp/1155101317/parameter_server/logs/1512996467/update_5_perm_True_tran_False_delay_100/output.3.log 2>&1 
tkwong commented 6 years ago

output.3.log output.1.log output.2.log output.0.log

1155101+  40946 171067  0 18:23 ?        00:00:00 ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null proj6 env GLOG_logtostderr=true GLOG_v=2 LIBHDFS3_CONF=/data/opt/course/hadoop/etc/hadoop/hdfs-site.xml /data/opt/tmp/1155101317/parameter_server/build/LogisticRegression --my_id=0 --n_iters=100 --config_file=/data/opt/tmp/1155101317/parameter_server/machinefiles/5node --with_injected_straggler_delay_percent=100 --batch_size=100 --get_updated_workload_rate=10 --activate_scheduler=True --prefetch_model_before_batch=False --n_workers_per_node=5 --activate_transient_straggler=True --activate_permanent_straggler=False --hdfs_master_port=25508 --input=hdfs://proj10:9000/datasets/classification/avazu-app-part/ --n_features=1000000 >> /data/opt/tmp/1155101317/parameter_server/logs/1512996467/update_10_perm_False_tran_True_delay_100/output.0.log 2>&1 
1155101+  40947 171067  0 18:23 ?        00:00:00 ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null proj7 env GLOG_logtostderr=true GLOG_v=2 LIBHDFS3_CONF=/data/opt/course/hadoop/etc/hadoop/hdfs-site.xml /data/opt/tmp/1155101317/parameter_server/build/LogisticRegression --my_id=1 --n_iters=100 --config_file=/data/opt/tmp/1155101317/parameter_server/machinefiles/5node --with_injected_straggler_delay_percent=100 --batch_size=100 --get_updated_workload_rate=10 --activate_scheduler=True --prefetch_model_before_batch=False --n_workers_per_node=5 --activate_transient_straggler=True --activate_permanent_straggler=False --hdfs_master_port=25508 --input=hdfs://proj10:9000/datasets/classification/avazu-app-part/ --n_features=1000000 >> /data/opt/tmp/1155101317/parameter_server/logs/1512996467/update_10_perm_False_tran_True_delay_100/output.1.log 2>&1 
1155101+  40948 171067  0 18:23 ?        00:00:00 ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null proj8 env GLOG_logtostderr=true GLOG_v=2 LIBHDFS3_CONF=/data/opt/course/hadoop/etc/hadoop/hdfs-site.xml /data/opt/tmp/1155101317/parameter_server/build/LogisticRegression --my_id=2 --n_iters=100 --config_file=/data/opt/tmp/1155101317/parameter_server/machinefiles/5node --with_injected_straggler_delay_percent=100 --batch_size=100 --get_updated_workload_rate=10 --activate_scheduler=True --prefetch_model_before_batch=False --n_workers_per_node=5 --activate_transient_straggler=True --activate_permanent_straggler=False --hdfs_master_port=25508 --input=hdfs://proj10:9000/datasets/classification/avazu-app-part/ --n_features=1000000 >> /data/opt/tmp/1155101317/parameter_server/logs/1512996467/update_10_perm_False_tran_True_delay_100/output.2.log 2>&1 
1155101+  40949 171067  0 18:23 ?        00:00:00 ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null proj9 env GLOG_logtostderr=true GLOG_v=2 LIBHDFS3_CONF=/data/opt/course/hadoop/etc/hadoop/hdfs-site.xml /data/opt/tmp/1155101317/parameter_server/build/LogisticRegression --my_id=3 --n_iters=100 --config_file=/data/opt/tmp/1155101317/parameter_server/machinefiles/5node --with_injected_straggler_delay_percent=100 --batch_size=100 --get_updated_workload_rate=10 --activate_scheduler=True --prefetch_model_before_batch=False --n_workers_per_node=5 --activate_transient_straggler=True --activate_permanent_straggler=False --hdfs_master_port=25508 --input=hdfs://proj10:9000/datasets/classification/avazu-app-part/ --n_features=1000000 >> /data/opt/tmp/1155101317/parameter_server/logs/1512996467/update_10_perm_False_tran_True_delay_100/output.3.log 2>&1 
tkwong commented 6 years ago

output.1-1.log output.3-1.log output.2-1.log output.0-1.log

diagnostic.patch.txt

This is another set of logs from a new hang. I have added some diagnostic logging (see the attached patch), and it seems that nodes 1, 2, and 3 send out the "kBarrier" message, but node 0 never reaches the Barrier and hangs somewhere before it (see the "Barrier Called" entries in the output logs of nodes 1, 2, and 3).
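To make the "which node never reached the Barrier" check repeatable across runs, something like the following could be used. It is only a sketch: the marker string "Barrier Called" is taken from the diagnostic log described above, but the exact log line format is an assumption, and the function name `logs_missing_marker` is hypothetical.

```python
from typing import Dict, List

# Marker text from the diagnostic logging described above.
# Assumption: it appears verbatim somewhere in each node's log once Barrier is called.
BARRIER_MARKER = "Barrier Called"


def logs_missing_marker(logs: Dict[str, str], marker: str = BARRIER_MARKER) -> List[str]:
    """Return the names of the logs whose text never contains the marker.

    `logs` maps a log name (for example a file name) to the full log text.
    """
    return sorted(name for name, text in logs.items() if marker not in text)
```

The log texts can be loaded with `pathlib.Path(name).read_text()` for each of the attached output logs. If the observation above holds, only the log of node 0 should be reported.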