Closed xzhangxa closed 6 years ago
Hello, could you please try ./scripts/run_intelcaffe.sh --hostfile hosts --solver examples/mnist/lenet_solver_mlsl.prototxt --network tcp --netmask enp134s0f0? And what's your CPU model? Could you please run lscpu too?
@chuanqi129 The result is the same with lenet_solver_mlsl.prototxt. The CPU model is Intel Xeon Gold 6140; the full lscpu output is:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 72
On-line CPU(s) list: 0-71
Thread(s) per core: 2
Core(s) per socket: 18
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
Stepping: 4
CPU MHz: 1499.941
CPU max MHz: 3700.0000
CPU min MHz: 1000.0000
BogoMIPS: 4600.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 25344K
NUMA node0 CPU(s): 0-17,36-53
NUMA node1 CPU(s): 18-35,54-71
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req
@zhang-xin Thanks for the quick reply. When you created the container, did you add --shm-size=40G? And could you run docker info and docker inspect test? Here, test is the container name.
@chuanqi129 I created the container exactly as the wiki describes: docker run -tid --net host --name test --privileged --shm-size=40G bvlc/caffe:intel_multinode
I managed to make it work by adding --num_mlsl_servers 0 and keeping everything else the same; with that, multinode training seems to work. The default is -1, which automatically chooses 4 (KNL) or 2 (BDW/SKX), and neither of those values works. Is this just a workaround, or is it the intended behavior?
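The default selection described above can be sketched as follows. This is only an illustration, not the actual logic in scripts/run_intelcaffe.sh: the function name and the lowercase CPU labels are hypothetical, and only the values themselves (-1 meaning automatic, 2 for BDW/SKX, 4 for KNL) come from the script's printed settings.

```python
# Hypothetical sketch of how the effective MLSL server count could be resolved.
# Only the numbers (-1 = auto, BDW/SKX = 2, KNL = 4) are taken from the thread.
AUTO_DEFAULTS = {"bdw": 2, "skx": 2, "knl": 4}


def resolve_num_mlsl_servers(requested: int, cpu: str) -> int:
    """Return the number of MLSL servers to use for a given CPU label."""
    if requested == -1:
        # Automatic mode: pick the per-CPU default.
        if cpu not in AUTO_DEFAULTS:
            raise ValueError(f"no automatic default for CPU label {cpu!r}")
        return AUTO_DEFAULTS[cpu]
    if requested < 0:
        raise ValueError("server count must be -1 (auto) or non-negative")
    # An explicit value, including 0, overrides the automatic choice.
    return requested
```

Under these assumptions, passing 0 explicitly bypasses the automatic choice, which matches the workaround reported here.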
@zhang-xin I don't think that makes sense. Could you also send part of the log from your workaround run? Can I also see the output of docker info and docker inspect test? By the way, I can't reproduce this issue on skx-8180 or bdw-2699.
@chuanqi129 Below are the docker and container info, and the logs with and without num_mlsl_servers=0. With it, training runs fine; without it, the process hangs, and on each client there is one caffe process using exactly 100% CPU.
Actually, I found this solution in Intel MLSL issue https://github.com/intel/MLSL/issues/9. The problem looked similar, so I tried it.
BTW, I need to comment out the test_ssh_connection function in scripts/run_intelcaffe.sh; otherwise the script asks me for a password and the default 123456 doesn't work, even though public key access is already set up.
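To confirm independently that key-based login works without any password prompt, a check along these lines may help. It is a minimal sketch and is not what test_ssh_connection does internally; the helper names are hypothetical, and it relies only on the standard OpenSSH options BatchMode and ConnectTimeout.

```python
import subprocess
from typing import List


def ssh_check_command(host: str, timeout_s: int = 5) -> List[str]:
    """Build an ssh command that fails instead of prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                # never fall back to interactive prompts
        "-o", f"ConnectTimeout={timeout_s}",  # give up quickly on unreachable hosts
        host,
        "true",                               # trivial remote command
    ]


def key_login_works(host: str) -> bool:
    """Return True if a non-interactive ssh login to host succeeds."""
    result = subprocess.run(
        ssh_check_command(host),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

Calling key_login_works for each host in the hostfile (for example jfz1r04h18 and jfz1r04h19) would show whether the password prompt comes from the SSH setup or from the script itself.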
docker info output:
[z1r04h17@jfz1r04h17 ~]$ sudo docker info
Containers: 9
Running: 2
Paused: 0
Stopped: 7
Images: 32
Server Version: 1.13.1
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: journald
Cgroup Driver: systemd
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Swarm: inactive
Runtimes: docker-runc runc
Default Runtime: docker-runc
Init Binary: docker-init
containerd version: (expected: aa8187dbd3b7ad67d8e5e3a15115d3eef43a7ed1)
runc version: N/A (expected: 9df8b306d01f59d3a8029be411de015b7304dd8f)
init version: N/A (expected: 949e6facb77383876aeff8a6944dde66b3089574)
Security Options:
seccomp
WARNING: You're not using the default seccomp profile
Profile: /etc/docker/seccomp.json
selinux
Kernel Version: 3.10.0-693.21.1.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
Number of Docker Hooks: 3
CPUs: 72
Total Memory: 125.4 GiB
Name: jfz1r04h17
ID: 67OS:IZJP:MT66:56VW:QID5:QY6D:5S3D:EOMO:SS45:WQWA:QFRO:YR4R
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Registries: docker.io (secure)
docker inspect test output:
[z1r04h17@jfz1r04h17 ~]$ sudo docker inspect test
[
{
"Id": "9461ce4b05bfe15314ba4addd161c13f3f9dbebc9164773c4b61a75ab306d404",
"Created": "2018-04-27T10:56:54.827011974Z",
"Path": "/usr/sbin/sshd",
"Args": [
"-D"
],
"State": {
"Status": "running",
"Running": true,
"Paused": false,
"Restarting": false,
"OOMKilled": false,
"Dead": false,
"Pid": 234456,
"ExitCode": 0,
"Error": "",
"StartedAt": "2018-04-27T10:56:55.024590949Z",
"FinishedAt": "0001-01-01T00:00:00Z"
},
"Image": "sha256:55c45a63f8f640e694ec59ce4fd288ea2fc432432b737abddbecd4b7f17783a2",
"ResolvConfPath": "/var/lib/docker/containers/9461ce4b05bfe15314ba4addd161c13f3f9dbebc9164773c4b61a75ab306d404/resolv.conf",
"HostnamePath": "/var/lib/docker/containers/9461ce4b05bfe15314ba4addd161c13f3f9dbebc9164773c4b61a75ab306d404/hostname",
"HostsPath": "/var/lib/docker/containers/9461ce4b05bfe15314ba4addd161c13f3f9dbebc9164773c4b61a75ab306d404/hosts",
"LogPath": "",
"Name": "/test",
"RestartCount": 0,
"Driver": "overlay2",
"MountLabel": "",
"ProcessLabel": "",
"AppArmorProfile": "",
"ExecIDs": null,
"HostConfig": {
"Binds": null,
"ContainerIDFile": "",
"LogConfig": {
"Type": "journald",
"Config": {}
},
"NetworkMode": "host",
"PortBindings": {},
"RestartPolicy": {
"Name": "no",
"MaximumRetryCount": 0
},
"AutoRemove": false,
"VolumeDriver": "",
"VolumesFrom": null,
"CapAdd": null,
"CapDrop": null,
"Dns": [],
"DnsOptions": [],
"DnsSearch": [],
"ExtraHosts": null,
"GroupAdd": null,
"IpcMode": "",
"Cgroup": "",
"Links": null,
"OomScoreAdj": 0,
"PidMode": "",
"Privileged": true,
"PublishAllPorts": false,
"ReadonlyRootfs": false,
"SecurityOpt": [
"label=disable"
],
"UTSMode": "",
"UsernsMode": "",
"ShmSize": 42949672960,
"Runtime": "docker-runc",
"ConsoleSize": [
0,
0
],
"Isolation": "",
"CpuShares": 0,
"Memory": 0,
"NanoCpus": 0,
"CgroupParent": "",
"BlkioWeight": 0,
"BlkioWeightDevice": null,
"BlkioDeviceReadBps": null,
"BlkioDeviceWriteBps": null,
"BlkioDeviceReadIOps": null,
"BlkioDeviceWriteIOps": null,
"CpuPeriod": 0,
"CpuQuota": 0,
"CpuRealtimePeriod": 0,
"CpuRealtimeRuntime": 0,
"CpusetCpus": "",
"CpusetMems": "",
"Devices": [],
"DiskQuota": 0,
"KernelMemory": 0,
"MemoryReservation": 0,
"MemorySwap": 0,
"MemorySwappiness": -1,
"OomKillDisable": false,
"PidsLimit": 0,
"Ulimits": null,
"CpuCount": 0,
"CpuPercent": 0,
"IOMaximumIOps": 0,
"IOMaximumBandwidth": 0
},
"GraphDriver": {
"Name": "overlay2",
"Data": {
"LowerDir": "/var/lib/docker/overlay2/e58d7b390c291907f8d044884e4d3624bb8366941798d4d880bb0f7d783da35b-init/diff:/var/lib/docker/overlay2/374a56a948bde624f8bc0bf1de9c02452b457b62df8ba7779e5a8b9f4b48bdf0/diff:/var/lib/docker/overlay2/0e3b50fe1d4074f11802f471bfb87e7977ad164143d3d8ee4a43c999b0192e49/diff:/var/lib/docker/overlay2/cf77251f189611522a2d5ac353a62ab11d18d3b9dcd5cda28d618322dac29d6b/diff:/var/lib/docker/overlay2/e56c32c6aafc2f9d6e4e7901ba5b8922fc6e2a119e63df1199100739ea38147a/diff:/var/lib/docker/overlay2/d29aafdf93edccbe9b0910da13de665169f631f5d742bda781459670f2d80839/diff:/var/lib/docker/overlay2/2e464ff8386bccd4119a540dd46acac3af5a986286fec84328ba836254385ef0/diff:/var/lib/docker/overlay2/b1400afe7e3eed6e08211dbfab3f81fd3121c7a2602080110ded7839b1bdb4f6/diff:/var/lib/docker/overlay2/22ac113d7d3c52e7f146fc005358716a426f5049728fda48c87e7af595607e09/diff:/var/lib/docker/overlay2/aad9061b290341688002d72aa678689dcc34f9ed9e42abbc22dfa9a0c9d436c1/diff:/var/lib/docker/overlay2/533f872472f56be375a7bc91bfdc74ba80cb3324125c65e95d2099c548e3b5af/diff:/var/lib/docker/overlay2/b135baa56c5d84a2e39dc2015afae3a542c0f9ad3f58ecefc085817a2d940d29/diff:/var/lib/docker/overlay2/15d9d74d18f920077341c56011beb6e884187c6f191d8f93114c6ca0a91a583e/diff:/var/lib/docker/overlay2/1f0031f1652a52cb3fe5e0dc4256a2ae4ba6e61d5e18780b8e2f8502063e0bf5/diff:/var/lib/docker/overlay2/82718a5e218f6bc1f52a647846acfadfdb557de56786298a2058fefa3e750453/diff:/var/lib/docker/overlay2/5fd590516dc3eb64796753a0d0a83c89b22ffc789d97581229bf7a021ef38ed2/diff:/var/lib/docker/overlay2/9bcbd1a28944867e1b96188d20f5aab382d4e4e723c0c98fbd3ce2c249ffd498/diff",
"MergedDir": "/var/lib/docker/overlay2/e58d7b390c291907f8d044884e4d3624bb8366941798d4d880bb0f7d783da35b/merged",
"UpperDir": "/var/lib/docker/overlay2/e58d7b390c291907f8d044884e4d3624bb8366941798d4d880bb0f7d783da35b/diff",
"WorkDir": "/var/lib/docker/overlay2/e58d7b390c291907f8d044884e4d3624bb8366941798d4d880bb0f7d783da35b/work"
}
},
"Mounts": [],
"Config": {
"Hostname": "jfz1r04h17",
"Domainname": "",
"User": "",
"AttachStdin": false,
"AttachStdout": false,
"AttachStderr": false,
"ExposedPorts": {
"10010/tcp": {}
},
"Tty": true,
"OpenStdin": true,
"StdinOnce": false,
"Env": [
"PATH=/opt/caffe/build/tools:/opt/caffe/python:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
"CAFFE_ROOT=/opt/caffe",
"CLONE_TAG=master",
"PYCAFFE_ROOT=/opt/caffe/python",
"PYTHONPATH=/opt/caffe/python:",
"NOTVISIBLE=in users profile"
],
"Cmd": [
"/usr/sbin/sshd",
"-D"
],
"ArgsEscaped": true,
"Image": "bvlc/caffe:intel_multinode",
"Volumes": null,
"WorkingDir": "/opt/caffe",
"Entrypoint": null,
"OnBuild": null,
"Labels": {}
},
"NetworkSettings": {
"Bridge": "",
"SandboxID": "49b1e71478d81172f6661e7980be22da6d5f62618536cbd896d990cbcfd96329",
"HairpinMode": false,
"LinkLocalIPv6Address": "",
"LinkLocalIPv6PrefixLen": 0,
"Ports": {},
"SandboxKey": "/var/run/docker/netns/default",
"SecondaryIPAddresses": null,
"SecondaryIPv6Addresses": null,
"EndpointID": "",
"Gateway": "",
"GlobalIPv6Address": "",
"GlobalIPv6PrefixLen": 0,
"IPAddress": "",
"IPPrefixLen": 0,
"IPv6Gateway": "",
"MacAddress": "",
"Networks": {
"host": {
"IPAMConfig": null,
"Links": null,
"Aliases": null,
"NetworkID": "da2b061e29f0032925da63e696ded0a04895184839683d6afe4175ca37b188c6",
"EndpointID": "368d2decbbb98422297af63f3f01acb4c02abbbc06131174fa0119f8c47f9800",
"Gateway": "",
"IPAddress": "",
"IPPrefixLen": 0,
"IPv6Gateway": "",
"GlobalIPv6Address": "",
"GlobalIPv6PrefixLen": 0,
"MacAddress": ""
}
}
}
}
]
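The ShmSize field in the inspect output above is given in bytes. As a quick sanity check that --shm-size=40G took effect, it can be converted to GiB with a short sketch like the one below; the function name is hypothetical, and the sample string reproduces only the relevant field from the output.

```python
import json


def shm_size_gib(inspect_json: str) -> float:
    """Return HostConfig.ShmSize of the first container in GiB.

    Expects the JSON array printed by docker inspect.
    """
    containers = json.loads(inspect_json)
    shm_bytes = containers[0]["HostConfig"]["ShmSize"]
    return shm_bytes / 2**30


# Trimmed-down sample containing only the field of interest.
sample = '[{"HostConfig": {"ShmSize": 42949672960}}]'
print(shm_size_gib(sample))  # 40.0
```

A value of 40.0 here indicates that the container was created with the shared memory size requested in the docker run command.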
Log when setting --num_mlsl_servers 0
root@jfz1r04h17:/opt/caffe# ./scripts/run_intelcaffe.sh --hostfile hosts --solver examples/mnist/lenet_solver_mlsl.prototxt --network tcp --netmask enp134s0f0 --num_mlsl_servers 0
CPUs with optimal settings:
Intel Xeon E7-88/48xx, E5-46/26/16xx, E3-12xx, D15/D-15 (Broadwell)
Intel Xeon Phi 7210/30/50/90 (Knights Landing)
Intel Xeon Platinum 81/61/51/41/31xx (Skylake)
Settings:
CPU: skx
Host file: hosts
Running mode: train
Benchmark: none
Debug option: off
Engine:
Number of MLSL servers: 0
-1: selected automatically according to CPU model.
BDW/SKX: 2, KNL: 4
Solver file: examples/mnist/lenet_solver_mlsl.prototxt
LMDB data source: examples/mnist/mnist_train_lmdb
LMDB data source: examples/mnist/mnist_test_lmdb
Network: tcp
Netmask for TCP network: enp134s0f0
NUMA configuration: Flat mode.
Create result directory: /opt/caffe/result-20180427110708
Number of nodes: 2
MLSL_NUM_SERVERS: 0
Pin internal threads to: 70,71
Number of OpenMP threads: 36
Run caffe with 2 nodes...
Warning: cannot find sensors
[0] [0] MPI startup(): Intel(R) MPI Library, Version 2018 Update 1 Build 20171011 (id: 17941)
[0] [0] MPI startup(): Copyright (C) 2003-2017 Intel Corporation. All rights reserved.
[0] [0] MPI startup(): Multi-threaded optimized library
[0] [0] ckpt_restart(): The real interface being used for tcp is enp134s0f0 and interface hostname is jfz1r04h18
[0] [0] MPI startup(): tcp data transfer mode
[1] [1] ckpt_restart(): The real interface being used for tcp is enp134s0f0 and interface hostname is jfz1r04h19
[1] [1] MPI startup(): tcp data transfer mode
[0] [0] MPI startup(): Device_reset_idx=5
[0] [0] MPI startup(): Allgather: 4: 27306-38912 & 0-2
[0] [0] MPI startup(): Allgather: 4: 78064-294912 & 0-2
[0] [0] MPI startup(): Allgather: 3: 0-27306 & 0-2
[0] [0] MPI startup(): Allgather: 3: 38912-78064 & 0-2
[0] [0] MPI startup(): Allgather: 3: 0-2147483647 & 0-2
[0] [0] MPI startup(): Allgather: 1: 0-7 & 3-4
[0] [0] MPI startup(): Allgather: 1: 9-4607 & 3-4
[0] [0] MPI startup(): Allgather: 1: 66622-461338 & 3-4
[0] [0] MPI startup(): Allgather: 3: 9081-26350 & 3-4
[0] [0] MPI startup(): Allgather: 3: 461338-2692119 & 3-4
[0] [0] MPI startup(): Allgather: 4: 7-9 & 3-4
[0] [0] MPI startup(): Allgather: 4: 4607-9081 & 3-4
[0] [0] MPI startup(): Allgather: 4: 26350-66622 & 3-4
[0] [0] MPI startup(): Allgather: 4: 0-2147483647 & 3-4
[0] [0] MPI startup(): Allgather: 2: 1-1 & 5-2147483647
[0] [0] MPI startup(): Allgather: 4: 2-3 & 5-2147483647
[0] [0] MPI startup(): Allgather: 1: 4-5 & 5-2147483647
[0] [0] MPI startup(): Allgather: 4: 6-26 & 5-2147483647
[0] [0] MPI startup(): Allgather: 1: 27-98 & 5-2147483647
[0] [0] MPI startup(): Allgather: 3: 99-1029 & 5-2147483647
[0] [0] MPI startup(): Allgather: 4: 1030-5572 & 5-2147483647
[0] [0] MPI startup(): Allgather: 1: 5573-15186 & 5-2147483647
[0] [0] MPI startup(): Allgather: 2: 15187-33976 & 5-2147483647
[0] [0] MPI startup(): Allgather: 1: 33977-74391 & 5-2147483647
[0] [0] MPI startup(): Allgather: 3: 74392-131842 & 5-2147483647
[0] [0] MPI startup(): Allgather: 4: 0-2147483647 & 5-2147483647
[0] [0] MPI startup(): Allgatherv: 3: 0-2147483647 & 0-2
[0] [0] MPI startup(): Allgatherv: 1: 0-2 & 3-4
[0] [0] MPI startup(): Allgatherv: 2: 2-7 & 3-4
[0] [0] MPI startup(): Allgatherv: 1: 7-49 & 3-4
[0] [0] MPI startup(): Allgatherv: 2: 49-113 & 3-4
[0] [0] MPI startup(): Allgatherv: 4: 113-149 & 3-4
[0] [0] MPI startup(): Allgatherv: 3: 149-915 & 3-4
[0] [0] MPI startup(): Allgatherv: 1: 915-1614 & 3-4
[0] [0] MPI startup(): Allgatherv: 4: 1614-3296 & 3-4
[0] [0] MPI startup(): Allgatherv: 2: 3296-5670 & 3-4
[0] [0] MPI startup(): Allgatherv: 1: 5670-10998 & 3-4
[0] [0] MPI startup(): Allgatherv: 4: 10998-185966 & 3-4
[0] [0] MPI startup(): Allgatherv: 3: 185966-381166 & 3-4
[0] [0] MPI startup(): Allgatherv: 4: 381166-1597083 & 3-4
[0] [0] MPI startup(): Allgatherv: 3: 1597083-2998114 & 3-4
[0] [0] MPI startup(): Allgatherv: 4: 0-2147483647 & 3-4
[0] [0] MPI startup(): Allgatherv: 2: 0-47 & 5-2147483647
[0] [0] MPI startup(): Allgatherv: 1: 47-103 & 5-2147483647
[0] [0] MPI startup(): Allgatherv: 3: 103-438 & 5-2147483647
[0] [0] MPI startup(): Allgatherv: 2: 438-757 & 5-2147483647
[0] [0] MPI startup(): Allgatherv: 4: 757-1453 & 5-2147483647
[0] [0] MPI startup(): Allgatherv: 2: 1453-3133 & 5-2147483647
[0] [0] MPI startup(): Allgatherv: 4: 3133-6762 & 5-2147483647
[0] [0] MPI startup(): Allgatherv: 2: 6762-10802 & 5-2147483647
[0] [0] MPI startup(): Allgatherv: 4: 10802-49917 & 5-2147483647
[0] [0] MPI startup(): Allgatherv: 3: 49917-309996 & 5-2147483647
[0] [0] MPI startup(): Allgatherv: 4: 309996-3739157 & 5-2147483647
[0] [0] MPI startup(): Allgatherv: 3: 0-2147483647 & 5-2147483647
[0] [0] MPI startup(): Allreduce: 1: 804-1535 & 0-2
[0] [0] MPI startup(): Allreduce: 1: 2061-17116 & 0-2
[0] [0] MPI startup(): Allreduce: 2: 17116-37171 & 0-2
[0] [0] MPI startup(): Allreduce: 2: 344562-1048576 & 0-2
[0] [0] MPI startup(): Allreduce: 3: 37171-344562 & 0-2
[0] [0] MPI startup(): Allreduce: 7: 0-804 & 0-2
[0] [0] MPI startup(): Allreduce: 7: 1535-2061 & 0-2
[0] [0] MPI startup(): Allreduce: 7: 1048576-3026207 & 0-2
[0] [0] MPI startup(): Allreduce: 4: 3026207-8388608 & 0-2
[0] [0] MPI startup(): Allreduce: 7: 8388609-8635416 & 0-2
[0] [0] MPI startup(): Allreduce: 2: 0-2147483647 & 0-2
[0] [0] MPI startup(): Allreduce: 7: 0-6 & 3-4
[0] [0] MPI startup(): Allreduce: 4: 6-11 & 3-4
[0] [0] MPI startup(): Allreduce: 7: 11-49 & 3-4
[0] [0] MPI startup(): Allreduce: 6: 49-321 & 3-4
[0] [0] MPI startup(): Allreduce: 2: 321-720 & 3-4
[0] [0] MPI startup(): Allreduce: 4: 720-1375 & 3-4
[0] [0] MPI startup(): Allreduce: 1: 1375-173904 & 3-4
[0] [0] MPI startup(): Allreduce: 2: 173904-318383 & 3-4
[0] [0] MPI startup(): Allreduce: 7: 318383-1512039 & 3-4
[0] [0] MPI startup(): Allreduce: 6: 1512039-2561761 & 3-4
[0] [0] MPI startup(): Allreduce: 4: 2561762-8388608 & 3-4
[0] [0] MPI startup(): Allreduce: 7: 8388609-10618873 & 3-4
[0] [0] MPI startup(): Allreduce: 8: 0-2147483647 & 3-4
[0] [0] MPI startup(): Allreduce: 1: 0-11 & 5-8
[0] [0] MPI startup(): Allreduce: 4: 11-24 & 5-8
[0] [0] MPI startup(): Allreduce: 6: 24-42 & 5-8
[0] [0] MPI startup(): Allreduce: 1: 42-107 & 5-8
[0] [0] MPI startup(): Allreduce: 4: 107-178 & 5-8
[0] [0] MPI startup(): Allreduce: 1: 178-310 & 5-8
[0] [0] MPI startup(): Allreduce: 2: 310-594 & 5-8
[0] [0] MPI startup(): Allreduce: 5: 594-4431 & 5-8
[0] [0] MPI startup(): Allreduce: 1: 4431-54874 & 5-8
[0] [0] MPI startup(): Allreduce: 4: 54874-91696 & 5-8
[0] [0] MPI startup(): Allreduce: 6: 91696-175538 & 5-8
[0] [0] MPI startup(): Allreduce: 4: 175538-383770 & 5-8
[0] [0] MPI startup(): Allreduce: 2: 383770-684262 & 5-8
[0] [0] MPI startup(): Allreduce: 3: 0-2147483647 & 5-8
[0] [0] MPI startup(): Allreduce: 1: 0-11 & 9-2147483647
[0] [0] MPI startup(): Allreduce: 4: 11-24 & 9-2147483647
[0] [0] MPI startup(): Allreduce: 6: 24-42 & 9-2147483647
[0] [0] MPI startup(): Allreduce: 1: 42-107 & 9-2147483647
[0] [0] MPI startup(): Allreduce: 4: 107-178 & 9-2147483647
[0] [0] MPI startup(): Allreduce: 1: 178-310 & 9-2147483647
[0] [0] MPI startup(): Allreduce: 2: 310-594 & 9-2147483647
[0] [0] MPI startup(): Allreduce: 5: 594-4431 & 9-2147483647
[0] [0] MPI startup(): Allreduce: 1: 4431-54874 & 9-2147483647
[0] [0] MPI startup(): Allreduce: 4: 54874-91696 & 9-2147483647
[0] [0] MPI startup(): Allreduce: 6: 91696-175538 & 9-2147483647
[0] [0] MPI startup(): Allreduce: 4: 175538-383770 & 9-2147483647
[0] [0] MPI startup(): Allreduce: 2: 383770-32006608 & 9-2147483647
[0] [0] MPI startup(): Allreduce: 3: 0-2147483647 & 9-2147483647
[0] [0] MPI startup(): Alltoall: 3: 0-129493 & 0-2
[0] [0] MPI startup(): Alltoall: 3: 1080889-3453431 & 0-2
[0] [0] MPI startup(): Alltoall: 2: 129493-1080889 & 0-2
[0] [0] MPI startup(): Alltoall: 2: 0-2147483647 & 0-2
[0] [0] MPI startup(): Alltoall: 2: 0-2147483647 & 3-4
[0] [0] MPI startup(): Alltoall: 1: 1-64 & 5-2147483647
[0] [0] MPI startup(): Alltoall: 2: 65-572235 & 5-2147483647
[0] [0] MPI startup(): Alltoall: 4: 572236-1736997 & 5-2147483647
[0] [0] MPI startup(): Alltoall: 3: 0-2147483647 & 5-2147483647
[0] [0] MPI startup(): Alltoallv: 1: 0-2147483647 & 0-2
[0] [0] MPI startup(): Alltoallv: 2: 0-2147483647 & 3-4
[0] [0] MPI startup(): Alltoallv: 2: 0-2147483647 & 5-2147483647
[0] [0] MPI startup(): Alltoallw: 0: 0-2147483647 & 0-2147483647
[0] [0] MPI startup(): Barrier: 1: 0-2147483647 & 0-2
[0] [0] MPI startup(): Barrier: 6: 0-2147483647 & 3-4
[0] [0] MPI startup(): Barrier: 1: 0-2147483647 & 5-2147483647
[0] [0] MPI startup(): Bcast: 7: 0-8 & 0-2
[0] [0] MPI startup(): Bcast: 7: 24-64 & 0-2
[0] [0] MPI startup(): Bcast: 7: 11264-52186 & 0-2
[0] [0] MPI startup(): Bcast: 7: 112045-131072 & 0-2
[1] [1] MPI startup(): Recognition=2 Platform(code=512 ippn=0 dev=4) Fabric(intra=6 inter=6 flags=0x0)
[0] [0] MPI startup(): Bcast: 7: 1048576-2097152 & 0-2
[0] [0] MPI startup(): Bcast: 1: 8-24 & 0-2
[0] [0] MPI startup(): Bcast: 1: 64-11264 & 0-2
[0] [0] MPI startup(): Bcast: 1: 52186-112045 & 0-2
[0] [0] MPI startup(): Bcast: 1: 131072-1048576 & 0-2
[0] [0] MPI startup(): Bcast: 1: 0-2147483647 & 0-2
[0] [0] MPI startup(): Bcast: 1: 1-1 & 3-4
[0] [0] MPI startup(): Bcast: 5: 2-3 & 3-4
[0] [0] MPI startup(): Bcast: 1: 4-5 & 3-4
[0] [0] MPI startup(): Bcast: 6: 6-11 & 3-4
[0] [0] MPI startup(): Bcast: 5: 12-24 & 3-4
[0] [0] MPI startup(): Bcast: 4: 25-141 & 3-4
[0] [0] MPI startup(): Bcast: 7: 142-370 & 3-4
[0] [0] MPI startup(): Bcast: 3: 371-680 & 3-4
[0] [0] MPI startup(): Bcast: 4: 681-3894 & 3-4
[0] [0] MPI startup(): Bcast: 1: 3895-4494 & 3-4
[0] [0] MPI startup(): Bcast: 7: 4495-14778 & 3-4
[0] [0] MPI startup(): Bcast: 4: 14779-18223 & 3-4
[0] [0] MPI startup(): Bcast: 7: 18224-36738 & 3-4
[0] [0] MPI startup(): Bcast: 3: 0-2147483647 & 3-4
[0] [0] MPI startup(): Bcast: 1: 0-10 & 5-2147483647
[0] [0] MPI startup(): Bcast: 1: 175-16799 & 5-2147483647
[0] [0] MPI startup(): Bcast: 6: 10-32 & 5-2147483647
[0] [0] MPI startup(): Bcast: 6: 32-175 & 5-2147483647
[0] [0] MPI startup(): Bcast: 7: 0-2147483647 & 5-2147483647
[0] [0] MPI startup(): Exscan: 0: 0-2147483647 & 0-2147483647
[0] [0] MPI startup(): Gather: 2: 73643-172031 & 0-2
[0] [0] MPI startup(): Gather: 3: 0-853 & 0-2
[0] [0] MPI startup(): Gather: 3: 54613-73643 & 0-2
[0] [0] MPI startup(): Gather: 3: 262144-524288 & 0-2
[0] [0] MPI startup(): Gather: 1: 853-54613 & 0-2
[0] [0] MPI startup(): Gather: 1: 172031-262144 & 0-2
[0] [0] MPI startup(): Gather: 1: 0-2147483647 & 0-2
[0] [0] MPI startup(): Gather: 2: 34148-129691 & 3-2147483647
[0] [0] MPI startup(): Gather: 2: 503316-2506634 & 3-2147483647
[0] [0] MPI startup(): Gather: 3: 0-34148 & 3-2147483647
[0] [0] MPI startup(): Gather: 3: 129691-503316 & 3-2147483647
[0] [0] MPI startup(): Gather: 3: 0-2147483647 & 3-2147483647
[0] [0] MPI startup(): Gatherv: 1: 0-2147483647 & 0-2
[0] [0] MPI startup(): Gatherv: 1: 0-2147483647 & 3-4
[0] [0] MPI startup(): Gatherv: 1: 0-2147483647 & 5-2147483647
[0] [0] MPI startup(): Reduce_scatter: 4: 0-5 & 0-2
[0] [0] MPI startup(): Reduce_scatter: 1: 5-26 & 0-2
[0] [0] MPI startup(): Reduce_scatter: 3: 26-47 & 0-2
[0] [0] MPI startup(): Reduce_scatter: 5: 47-98 & 0-2
[0] [0] MPI startup(): Reduce_scatter: 3: 98-188 & 0-2
[0] [0] MPI startup(): Reduce_scatter: 5: 188-362 & 0-2
[0] [0] MPI startup(): Reduce_scatter: 2: 362-588 & 0-2
[0] [0] MPI startup(): Reduce_scatter: 1: 588-1951 & 0-2
[0] [0] MPI startup(): Reduce_scatter: 3: 1951-11702 & 0-2
[0] [0] MPI startup(): Reduce_scatter: 1: 11702-23138 & 0-2
[0] [0] MPI startup(): Reduce_scatter: 5: 23138-58229 & 0-2
[0] [0] MPI startup(): Reduce_scatter: 1: 58229-191964 & 0-2
[0] [0] MPI startup(): Reduce_scatter: 2: 191964-2656092 & 0-2
[0] [0] MPI startup(): Reduce_scatter: 5: 0-2147483647 & 0-2
[0] [0] MPI startup(): Reduce_scatter: 4: 0-4 & 3-4
[0] [0] MPI startup(): Reduce_scatter: 5: 4-12 & 3-4
[0] [0] MPI startup(): Reduce_scatter: 3: 12-45 & 3-4
[0] [0] MPI startup(): Reduce_scatter: 1: 45-85 & 3-4
[0] [0] MPI startup(): Reduce_scatter: 3: 85-391 & 3-4
[0] [0] MPI startup(): Reduce_scatter: 1: 391-596 & 3-4
[0] [0] MPI startup(): Reduce_scatter: 2: 596-1927 & 3-4
[0] [0] MPI startup(): Reduce_scatter: 5: 1927-2286 & 3-4
[0] [0] MPI startup(): Reduce_scatter: 3: 2286-7442 & 3-4
[0] [0] MPI startup(): Reduce_scatter: 1: 7442-10726 & 3-4
[0] [0] MPI startup(): Reduce_scatter: 3: 10726-45950 & 3-4
[0] [0] MPI startup(): Reduce_scatter: 5: 45950-101084 & 3-4
[0] [0] MPI startup(): Reduce_scatter: 1: 101084-159597 & 3-4
[0] [0] MPI startup(): Reduce_scatter: 3: 159597-423110 & 3-4
[0] [0] MPI startup(): Reduce_scatter: 2: 423110-578734 & 3-4
[0] [0] MPI startup(): Reduce_scatter: 5: 578734-1329975 & 3-4
[0] [0] MPI startup(): Reduce_scatter: 1: 1329975-4146461 & 3-4
[0] [0] MPI startup(): Reduce_scatter: 3: 0-2147483647 & 3-4
[0] [0] MPI startup(): Reduce_scatter: 5: 0-5 & 5-2147483647
[0] [0] MPI startup(): Reduce_scatter: 1: 5-28 & 5-2147483647
[0] [0] MPI startup(): Reduce_scatter: 5: 28-50 & 5-2147483647
[0] [0] MPI startup(): Reduce_scatter: 3: 50-197 & 5-2147483647
[0] [0] MPI startup(): Reduce_scatter: 1: 197-721 & 5-2147483647
[0] [0] MPI startup(): Reduce_scatter: 2: 721-3207 & 5-2147483647
[0] [0] MPI startup(): Reduce_scatter: 1: 3207-5980 & 5-2147483647
[0] [0] MPI startup(): Reduce_scatter: 5: 5980-11416 & 5-2147483647
[0] [0] MPI startup(): Reduce_scatter: 3: 11416-104215 & 5-2147483647
[0] [0] MPI startup(): Reduce_scatter: 5: 104215-277330 & 5-2147483647
[0] [0] MPI startup(): Reduce_scatter: 3: 277330-630522 & 5-2147483647
[0] [0] MPI startup(): Reduce_scatter: 1: 630522-2659184 & 5-2147483647
[0] [0] MPI startup(): Reduce_scatter: 5: 0-2147483647 & 5-2147483647
[0] [0] MPI startup(): Reduce: 4: 4-8 & 0-2
[0] [0] MPI startup(): Reduce: 3: 9-29 & 0-2
[0] [0] MPI startup(): Reduce: 2: 30-37 & 0-2
[0] [0] MPI startup(): Reduce: 3: 38-215 & 0-2
[0] [0] MPI startup(): Reduce: 2: 216-315 & 0-2
[0] [0] MPI startup(): Reduce: 5: 316-775 & 0-2
[0] [0] MPI startup(): Reduce: 2: 776-4045 & 0-2
[0] [0] MPI startup(): Reduce: 4: 4-6 & 3-4
[0] [0] MPI startup(): Reduce: 3: 7-11 & 3-4
[0] [0] MPI startup(): Reduce: 6: 12-16 & 3-4
[0] [0] MPI startup(): Reduce: 4: 17-34 & 3-4
[0] [0] MPI startup(): Reduce: 2: 35-99 & 3-4
[0] [0] MPI startup(): Reduce: 4: 100-230 & 3-4
[0] [0] MPI startup(): Reduce: 6: 231-275 & 3-4
[0] [0] MPI startup(): Reduce: 1: 276-1040 & 3-4
[0] [0] MPI startup(): Reduce: 3: 1041-3895 & 3-4
[0] [0] MPI startup(): Reduce: 6: 3896-4326 & 3-4
[0] [0] MPI startup(): Reduce: 3: 4327-10163 & 3-4
[0] [0] MPI startup(): Reduce: 1: 0-2147483647 & 3-4
[0] [0] MPI startup(): Reduce: 2: 4-26 & 5-2147483647
[0] [0] MPI startup(): Reduce: 4: 27-39 & 5-2147483647
[0] [0] MPI startup(): Reduce: 2: 40-230 & 5-2147483647
[0] [0] MPI startup(): Reduce: 3: 231-257 & 5-2147483647
[0] [0] MPI startup(): Reduce: 2: 258-718 & 5-2147483647
[0] [0] MPI startup(): Reduce: 3: 719-2436 & 5-2147483647
[0] [0] MPI startup(): Reduce: 4: 2437-6344 & 5-2147483647
[0] [0] MPI startup(): Reduce: 1: 0-2147483647 & 5-2147483647
[0] [0] MPI startup(): Scan: 0: 0-2147483647 & 0-2147483647
[0] [0] MPI startup(): Scatter: 1: 0-1 & 0-2
[0] [0] MPI startup(): Scatter: 1: 4-12 & 0-2
[0] [0] MPI startup(): Scatter: 1: 19-2048 & 0-2
[0] [0] MPI startup(): Scatter: 3: 2048-85701 & 0-2
[0] [0] MPI startup(): Scatter: 3: 165767-466939 & 0-2
[0] [0] MPI startup(): Scatter: 3: 524288-2336552 & 0-2
[0] [0] MPI startup(): Scatter: 2: 1-4 & 0-2
[0] [0] MPI startup(): Scatter: 2: 12-19 & 0-2
[0] [0] MPI startup(): Scatter: 2: 85701-165767 & 0-2
[0] [0] MPI startup(): Scatter: 2: 466939-524288 & 0-2
[0] [0] MPI startup(): Scatter: 2: 0-2147483647 & 0-2
[0] [0] MPI startup(): Scatter: 3: 0-1909200 & 3-2147483647
[0] [0] MPI startup(): Scatter: 2: 0-2147483647 & 3-2147483647
[0] [0] MPI startup(): Scatterv: 1: 0-2147483647 & 0-2
[0] [0] MPI startup(): Scatterv: 1: 0-2147483647 & 3-4
[0] [0] MPI startup(): Scatterv: 1: 0-2147483647 & 5-2147483647
[0] [0] MPI startup(): Rank Pid Node name Pin cpu
[0] [0] MPI startup(): 0 125 jfz1r04h18 {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,
[0] 30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56
[0] ,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71}
[0] [0] MPI startup(): 1 97 jfz1r04h19 {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,
[0] 30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56
[0] ,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71}
[0] [0] MPI startup(): Recognition=2 Platform(code=512 ippn=0 dev=4) Fabric(intra=6 inter=6 flags=0x0)
[0] [0] MPI startup(): I_MPI_COLL_INTRANODE=pt2pt
[0] [0] MPI startup(): I_MPI_DEBUG=6
[0] [0] MPI startup(): I_MPI_FABRICS=tcp
[0] [0] MPI startup(): I_MPI_FALLBACK=0
[0] [0] MPI startup(): I_MPI_INFO_NUMA_NODE_MAP=hfi1_0:0,i40iw0:0,i40iw1:0
[0] [0] MPI startup(): I_MPI_INFO_NUMA_NODE_NUM=2
[0] [0] MPI startup(): I_MPI_PIN_MAPPING=1:0 0
[0] [0] MPI startup(): I_MPI_TCP_NETMASK=enp134s0f0
[0] I0427 11:07:09.561863 125 caffe.cpp:742] Number of groups: 1, group size: 2, number of parameter servers: 0
[1] I0427 11:07:09.567484 97 caffe.cpp:285] Use CPU.
[1] I0427 11:07:09.567849 97 solver.cpp:107] Initializing solver from parameters:
[1] test_iter: 100
[1] test_interval: 10000
[1] base_lr: 0.01
[1] display: 100
[1] max_iter: 50
[1] lr_policy: "inv"
[1] gamma: 0.0001
[1] power: 0.75
[1] momentum: 0.9
[1] weight_decay: 0.0005
[1] snapshot: 10000
[1] snapshot_prefix: "examples/mnist/lenet_mlsl"
[1] solver_mode: CPU
[1] net: "examples/mnist/lenet_train_test_mlsl.prototxt"
[1] train_state {
[1] level: 0
[1] stage: ""
[1] }
[1] I0427 11:07:09.567863 97 solver.cpp:153] Creating training net from net file: examples/mnist/lenet_train_test_mlsl.prototxt
[1] I0427 11:07:09.569360 97 cpu_info.cpp:453] Processor speed [MHz]: 2300
[1] I0427 11:07:09.569368 97 cpu_info.cpp:456] Total number of sockets: 2
[1] I0427 11:07:09.569372 97 cpu_info.cpp:459] Total number of CPU cores: 36
[1] I0427 11:07:09.569375 97 cpu_info.cpp:462] Total number of processors: 72
[1] I0427 11:07:09.569376 97 cpu_info.cpp:465] GPU is used: no
[1] I0427 11:07:09.569380 97 cpu_info.cpp:468] OpenMP environmental variables are specified: yes
[1] I0427 11:07:09.569381 97 cpu_info.cpp:471] OpenMP thread bind allowed: no
[0] I0427 11:07:09.565732 125 caffe.cpp:285] Use CPU.
[0] I0427 11:07:09.565878 125 solver.cpp:107] Initializing solver from parameters:
[0] test_iter: 100
[0] test_interval: 10000
[0] base_lr: 0.01
[0] display: 100
[0] max_iter: 50
[0] lr_policy: "inv"
[0] gamma: 0.0001
[0] power: 0.75
[0] momentum: 0.9
[0] weight_decay: 0.0005
[0] snapshot: 10000
[0] snapshot_prefix: "examples/mnist/lenet_mlsl"
[0] solver_mode: CPU
[0] net: "examples/mnist/lenet_train_test_mlsl.prototxt"
[0] train_state {
[0] level: 0
[0] stage: ""
[0] }
[0] I0427 11:07:09.565899 125 solver.cpp:153] Creating training net from net file: examples/mnist/lenet_train_test_mlsl.prototxt
[0] I0427 11:07:09.569108 125 cpu_info.cpp:453] Processor speed [MHz]: 2300
[0] I0427 11:07:09.569123 125 cpu_info.cpp:456] Total number of sockets: 2
[0] I0427 11:07:09.569128 125 cpu_info.cpp:459] Total number of CPU cores: 36
[0] I0427 11:07:09.569133 125 cpu_info.cpp:462] Total number of processors: 72
[0] I0427 11:07:09.569135 125 cpu_info.cpp:465] GPU is used: no
[0] I0427 11:07:09.569140 125 cpu_info.cpp:468] OpenMP environmental variables are specified: yes
[0] I0427 11:07:09.569144 125 cpu_info.cpp:471] OpenMP thread bind allowed: no
[1] I0427 11:07:09.576323 97 cpu_info.cpp:474] Number of OpenMP threads: 36
[1] I0427 11:07:09.576455 97 net.cpp:1052] The NetState phase (0) differed from the phase (1) specified by a rule in layer mnist
[1] I0427 11:07:09.576479 97 net.cpp:1052] The NetState phase (0) differed from the phase (1) specified by a rule in layer accuracy
[1] I0427 11:07:09.576784 97 net.cpp:207] Initializing net from parameters:
[1] I0427 11:07:09.576802 97 net.cpp:208]
[1] name: "LeNet"
[1] state {
[1] phase: TRAIN
[1] level: 0
[1] stage: ""
[1] }
[1] engine: "MKLDNN"
[1] compile_net_state {
[1] bn_scale_remove: false
[1] bn_scale_merge: false
[1] }
[1] layer {
[1] name: "mnist"
[1] type: "Data"
[1] top: "data"
[1] top: "label"
[1] include {
[1] phase: TRAIN
[1] }
[1] transform_param {
[1] scale: 0.00390625
[1] }
[1] data_param {
[1] source: "examples/mnist/mnist_train_lmdb"
[1] batch_size: 64
[1] backend: LMDB
[1] }
[1] }
[1] layer {
[1] name: "conv1"
[1] type: "Convolution"
[1] bottom: "data"
[1] top: "conv1"
[1] param {
[1] lr_mult: 1
[1] }
[1] convolution_param {
[1] num_output: 20
[1] bias_term: false
[1] kernel_size: 5
[1] stride: 1
[1] weight_filler {
[1] type: "xavier"
[1] }
[1] engine: MKL2017
[1] }
[1] }
[1] layer {
[1] name: "pool1"
[1] type: "Pooling"
[1] bottom: "conv1"
[1] top: "pool1"
[1] pooling_param {
[1] pool: MAX
[1] kernel_size: 2
[1] stride: 2
[1] engine: MKL2017
[1] }
[1] }
[1] layer {
[1] name: "conv2"
[1] type: "Convolution"
[1] bottom: "pool1"
[1] top: "conv2"
[1] param {
[1] lr_mult: 1
[1] }
[1] convolution_param {
[1] num_output: 50
[1] bias_term: false
[1] kernel_size: 5
[1] stride: 1
[1] weight_filler {
[1] type: "xavier"
[1] }
[1] engine: MKL2017
[1] }
[1] }
[1] layer {
[1] name: "pool2"
[1] type: "Pooling"
[1] bottom: "conv2"
[1] top: "pool2"
[1] pooling_param {
[1] pool: MAX
[1] kernel_size: 2
[1] stride: 2
[1] engine: MKL2017
[1] }
[1] }
[1] layer {
[1] name: "ip1"
[1] type: "InnerProduct"
[1] bottom: "pool2"
[1] top: "ip1"
[1] param {
[1] lr_mult: 1
[1] }
[1] inner_product_param {
[1] num_output: 500
[1] bias_term: false
[1] weight_filler {
[1] type: "xavier"
[1] }
[1] }
[1] }
[1] layer {
[1] name: "relu1"
[1] type: "ReLU"
[1] bottom: "ip1"
[1] top: "ip1"
[1] relu_param {
[1] engine: MKL2017
[1] }
[1] }
[1] layer {
[1] name: "ip2"
[1] type: "InnerProduct"
[1] bottom: "ip1"
[1] top: "ip2"
[1] param {
[1] lr_mult: 1
[1] }
[1] inner_product_param {
[1] num_output: 10
[1] bias_term: false
[1] weight_filler {
[1] type: "xavier"
[1] }
[1] }
[1] }
[1] layer {
[1] name: "loss"
[1] type: "SoftmaxWithLoss"
[1] bottom: "ip2"
[1] bottom: "label"
[1] top: "loss"
[1] }
[1] I0427 11:07:09.576978 97 layer_factory.hpp:114] Creating layer mnist
[1] I0427 11:07:09.577211 97 net.cpp:265] Creating Layer mnist
[1] I0427 11:07:09.577231 97 net.cpp:1238] mnist -> data
[1] I0427 11:07:09.577255 97 net.cpp:1238] mnist -> label
[1] W0427 11:07:09.577289 97 net.cpp:335] SetMinibatchSize 64
[1] I0427 11:07:09.577648 99 internal_thread.cpp:135] Internal thread is affinitized to core 70
[1] I0427 11:07:09.577898 99 db_lmdb.cpp:72] Opened lmdb examples/mnist/mnist_train_lmdb
[1] I0427 11:07:09.578017 97 data_layer.cpp:80] output data size: 64,1,28,28
[0] I0427 11:07:09.576759 125 cpu_info.cpp:474] Number of OpenMP threads: 36
[0] I0427 11:07:09.576941 125 net.cpp:1052] The NetState phase (0) differed from the phase (1) specified by a rule in layer mnist
[0] I0427 11:07:09.576974 125 net.cpp:1052] The NetState phase (0) differed from the phase (1) specified by a rule in layer accuracy
[0] I0427 11:07:09.577446 125 net.cpp:207] Initializing net from parameters:
[0] I0427 11:07:09.577476 125 net.cpp:208]
[0] name: "LeNet"
[0] state {
[0] phase: TRAIN
[0] level: 0
[0] stage: ""
[0] }
[0] engine: "MKLDNN"
[0] compile_net_state {
[0] bn_scale_remove: false
[0] bn_scale_merge: false
[0] }
[0] layer {
[0] name: "mnist"
[0] type: "Data"
[0] top: "data"
[0] top: "label"
[0] include {
[0] phase: TRAIN
[0] }
[0] transform_param {
[0] scale: 0.00390625
[0] }
[0] data_param {
[0] source: "examples/mnist/mnist_train_lmdb"
[0] batch_size: 64
[0] backend: LMDB
[0] }
[0] }
[0] layer {
[0] name: "conv1"
[0] type: "Convolution"
[0] bottom: "data"
[0] top: "conv1"
[0] param {
[0] lr_mult: 1
[0] }
[0] convolution_param {
[0] num_output: 20
[0] bias_term: false
[0] kernel_size: 5
[0] stride: 1
[0] weight_filler {
[0] type: "xavier"
[0] }
[0] engine: MKL2017
[0] }
[0] }
[0] layer {
[0] name: "pool1"
[0] type: "Pooling"
[0] bottom: "conv1"
[0] top: "pool1"
[0] pooling_param {
[0] pool: MAX
[0] kernel_size: 2
[0] stride: 2
[0] engine: MKL2017
[0] }
[0] }
[0] layer {
[0] name: "conv2"
[0] type: "Convolution"
[0] bottom: "pool1"
[0] top: "conv2"
[0] param {
[0] lr_mult: 1
[0] }
[0] convolution_param {
[0] num_output: 50
[0] bias_term: false
[0] kernel_size: 5
[0] stride: 1
[0] weight_filler {
[0] type: "xavier"
[0] }
[0] engine: MKL2017
[0] }
[0] }
[0] layer {
[0] name: "pool2"
[0] type: "Pooling"
[0] bottom: "conv2"
[0] top: "pool2"
[0] pooling_param {
[0] pool: MAX
[0] kernel_size: 2
[0] stride: 2
[0] engine: MKL2017
[0] }
[0] }
[0] layer {
[0] name: "ip1"
[0] type: "InnerProduct"
[0] bottom: "pool2"
[0] top: "ip1"
[0] param {
[0] lr_mult: 1
[0] }
[0] inner_product_param {
[0] num_output: 500
[0] bias_term: false
[0] weight_filler {
[0] type: "xavier"
[0] }
[0] }
[0] }
[0] layer {
[0] name: "relu1"
[0] type: "ReLU"
[0] bottom: "ip1"
[0] top: "ip1"
[0] relu_param {
[0] engine: MKL2017
[0] }
[0] }
[0] layer {
[0] name: "ip2"
[0] type: "InnerProduct"
[0] bottom: "ip1"
[0] top: "ip2"
[0] param {
[0] lr_mult: 1
[0] }
[0] inner_product_param {
[0] num_output: 10
[0] bias_term: false
[0] weight_filler {
[0] type: "xavier"
[0] }
[0] }
[0] }
[0] layer {
[0] name: "loss"
[0] type: "SoftmaxWithLoss"
[0] bottom: "ip2"
[0] bottom: "label"
[0] top: "loss"
[0] }
[0] I0427 11:07:09.577767 125 layer_factory.hpp:114] Creating layer mnist
[0] I0427 11:07:09.578042 125 net.cpp:265] Creating Layer mnist
[0] I0427 11:07:09.578061 125 net.cpp:1238] mnist -> data
[0] I0427 11:07:09.578085 125 net.cpp:1238] mnist -> label
[0] W0427 11:07:09.578212 125 net.cpp:335] SetMinibatchSize 64
[0] I0427 11:07:09.578627 127 internal_thread.cpp:135] Internal thread is affinitized to core 70
[0] I0427 11:07:09.578936 127 db_lmdb.cpp:72] Opened lmdb examples/mnist/mnist_train_lmdb
[0] I0427 11:07:09.579085 125 data_layer.cpp:80] output data size: 64,1,28,28
[0] I0427 11:07:09.580693 125 net.cpp:360] Setting up mnist
[0] I0427 11:07:09.580719 125 net.cpp:367] Top shape: 64 1 28 28 (50176)
[0] I0427 11:07:09.580727 125 net.cpp:367] Top shape: 64 (64)
[0] I0427 11:07:09.580734 125 net.cpp:375] Memory required for data: 200960
[0] I0427 11:07:09.580742 125 layer_factory.hpp:114] Creating layer conv1
[1] I0427 11:07:09.584623 97 net.cpp:360] Setting up mnist
[0] I0427 11:07:09.580778 125 net.cpp:265] Creating Layer conv1
[0] I0427 11:07:09.580785 125 net.cpp:1264] conv1 <- data
[1] I0427 11:07:09.584648 97 net.cpp:367] Top shape: 64 1 28 28 (50176)
[1] I0427 11:07:09.584658 97 net.cpp:367] Top shape: 64 (64)
[1] I0427 11:07:09.584663 97 net.cpp:375] Memory required for data: 200960
[1] I0427 11:07:09.584671 97 layer_factory.hpp:114] Creating layer conv1
[0] I0427 11:07:09.580799 125 net.cpp:1238] conv1 -> conv1
[1] I0427 11:07:09.584699 97 net.cpp:265] Creating Layer conv1
[1] I0427 11:07:09.584709 97 net.cpp:1264] conv1 <- data
[1] I0427 11:07:09.584722 97 net.cpp:1238] conv1 -> conv1
[1] I0427 11:07:09.593472 97 net.cpp:360] Setting up conv1
[1] I0427 11:07:09.593502 97 net.cpp:367] Top shape: 64 20 24 24 (737280)
[1] I0427 11:07:09.593509 97 net.cpp:375] Memory required for data: 3150080
[1] I0427 11:07:09.593533 97 layer_factory.hpp:114] Creating layer pool1
[1] I0427 11:07:09.593564 97 net.cpp:265] Creating Layer pool1
[1] I0427 11:07:09.593571 97 net.cpp:1264] pool1 <- conv1
[1] I0427 11:07:09.593588 97 net.cpp:1238] pool1 -> pool1
[1] I0427 11:07:09.593619 97 net.cpp:360] Setting up pool1
[1] I0427 11:07:09.593629 97 net.cpp:367] Top shape: 64 20 12 12 (184320)
[1] I0427 11:07:09.593634 97 net.cpp:375] Memory required for data: 3887360
[1] I0427 11:07:09.593641 97 layer_factory.hpp:114] Creating layer conv2
[1] I0427 11:07:09.593660 97 net.cpp:265] Creating Layer conv2
[1] I0427 11:07:09.593668 97 net.cpp:1264] conv2 <- pool1
[1] I0427 11:07:09.593679 97 net.cpp:1238] conv2 -> conv2
[0] I0427 11:07:09.589804 125 net.cpp:360] Setting up conv1
[0] I0427 11:07:09.589835 125 net.cpp:367] Top shape: 64 20 24 24 (737280)
[0] I0427 11:07:09.589841 125 net.cpp:375] Memory required for data: 3150080
[0] I0427 11:07:09.589869 125 layer_factory.hpp:114] Creating layer pool1
[0] I0427 11:07:09.589892 125 net.cpp:265] Creating Layer pool1
[0] I0427 11:07:09.589900 125 net.cpp:1264] pool1 <- conv1
[0] I0427 11:07:09.589924 125 net.cpp:1238] pool1 -> pool1
[0] I0427 11:07:09.589958 125 net.cpp:360] Setting up pool1
[0] I0427 11:07:09.589969 125 net.cpp:367] Top shape: 64 20 12 12 (184320)
[0] I0427 11:07:09.589975 125 net.cpp:375] Memory required for data: 3887360
[0] I0427 11:07:09.589982 125 layer_factory.hpp:114] Creating layer conv2
[0] I0427 11:07:09.590003 125 net.cpp:265] Creating Layer conv2
[0] I0427 11:07:09.590011 125 net.cpp:1264] conv2 <- pool1
[0] I0427 11:07:09.590021 125 net.cpp:1238] conv2 -> conv2
[1] I0427 11:07:09.596592 97 net.cpp:360] Setting up conv2
[1] I0427 11:07:09.596607 97 net.cpp:367] Top shape: 64 50 8 8 (204800)
[1] I0427 11:07:09.596612 97 net.cpp:375] Memory required for data: 4706560
[1] I0427 11:07:09.596623 97 layer_factory.hpp:114] Creating layer pool2
[1] I0427 11:07:09.596637 97 net.cpp:265] Creating Layer pool2
[1] I0427 11:07:09.596644 97 net.cpp:1264] pool2 <- conv2
[1] I0427 11:07:09.596655 97 net.cpp:1238] pool2 -> pool2
[1] I0427 11:07:09.596678 97 net.cpp:360] Setting up pool2
[1] I0427 11:07:09.596689 97 net.cpp:367] Top shape: 64 50 4 4 (51200)
[1] I0427 11:07:09.596695 97 net.cpp:375] Memory required for data: 4911360
[1] I0427 11:07:09.596700 97 layer_factory.hpp:114] Creating layer ip1
[1] I0427 11:07:09.596725 97 net.cpp:265] Creating Layer ip1
[1] I0427 11:07:09.596734 97 net.cpp:1264] ip1 <- pool2
[1] I0427 11:07:09.596745 97 net.cpp:1238] ip1 -> ip1
[0] I0427 11:07:09.593422 125 net.cpp:360] Setting up conv2
[0] I0427 11:07:09.593438 125 net.cpp:367] Top shape: 64 50 8 8 (204800)
[0] I0427 11:07:09.593443 125 net.cpp:375] Memory required for data: 4706560
[0] I0427 11:07:09.593453 125 layer_factory.hpp:114] Creating layer pool2
[0] I0427 11:07:09.593467 125 net.cpp:265] Creating Layer pool2
[0] I0427 11:07:09.593474 125 net.cpp:1264] pool2 <- conv2
[0] I0427 11:07:09.593487 125 net.cpp:1238] pool2 -> pool2
[0] I0427 11:07:09.593509 125 net.cpp:360] Setting up pool2
[0] I0427 11:07:09.593516 125 net.cpp:367] Top shape: 64 50 4 4 (51200)
[0] I0427 11:07:09.593523 125 net.cpp:375] Memory required for data: 4911360
[0] I0427 11:07:09.593528 125 layer_factory.hpp:114] Creating layer ip1
[0] I0427 11:07:09.593554 125 net.cpp:265] Creating Layer ip1
[0] I0427 11:07:09.593562 125 net.cpp:1264] ip1 <- pool2
[0] I0427 11:07:09.593575 125 net.cpp:1238] ip1 -> ip1
[1] I0427 11:07:09.601646 97 net.cpp:360] Setting up ip1
[1] I0427 11:07:09.601660 97 net.cpp:367] Top shape: 64 500 (32000)
[1] I0427 11:07:09.601665 97 net.cpp:375] Memory required for data: 5039360
[1] I0427 11:07:09.601673 97 layer_factory.hpp:114] Creating layer relu1
[1] I0427 11:07:09.601687 97 net.cpp:265] Creating Layer relu1
[1] I0427 11:07:09.601693 97 net.cpp:1264] relu1 <- ip1
[1] I0427 11:07:09.601701 97 net.cpp:1225] relu1 -> ip1 (in-place)
[1] I0427 11:07:09.601728 97 net.cpp:360] Setting up relu1
[1] I0427 11:07:09.601734 97 net.cpp:367] Top shape: 64 500 (32000)
[1] I0427 11:07:09.601737 97 net.cpp:375] Memory required for data: 5167360
[1] I0427 11:07:09.601742 97 layer_factory.hpp:114] Creating layer ip2
[1] I0427 11:07:09.601764 97 net.cpp:265] Creating Layer ip2
[1] I0427 11:07:09.601769 97 net.cpp:1264] ip2 <- ip1
[1] I0427 11:07:09.601778 97 net.cpp:1238] ip2 -> ip2
[1] I0427 11:07:09.601825 97 net.cpp:360] Setting up ip2
[1] I0427 11:07:09.601832 97 net.cpp:367] Top shape: 64 10 (640)
[1] I0427 11:07:09.601836 97 net.cpp:375] Memory required for data: 5169920
[1] I0427 11:07:09.601841 97 layer_factory.hpp:114] Creating layer loss
[1] I0427 11:07:09.601852 97 net.cpp:265] Creating Layer loss
[1] I0427 11:07:09.601857 97 net.cpp:1264] loss <- ip2
[1] I0427 11:07:09.601882 97 net.cpp:1264] loss <- label
[0] I0427 11:07:09.598147 125 net.cpp:360] Setting up ip1
[0] I0427 11:07:09.598165 125 net.cpp:367] Top shape: 64 500 (32000)
[0] I0427 11:07:09.598167 125 net.cpp:375] Memory required for data: 5039360
[0] I0427 11:07:09.598176 125 layer_factory.hpp:114] Creating layer relu1
[0] I0427 11:07:09.598192 125 net.cpp:265] Creating Layer relu1
[0] I0427 11:07:09.598196 125 net.cpp:1264] relu1 <- ip1
[0] I0427 11:07:09.598202 125 net.cpp:1225] relu1 -> ip1 (in-place)
[1] I0427 11:07:09.601887 97 net.cpp:1238] loss -> loss
[0] I0427 11:07:09.598230 125 net.cpp:360] Setting up relu1
[0] I0427 11:07:09.598235 125 net.cpp:367] Top shape: 64 500 (32000)
[0] I0427 11:07:09.598253 125 net.cpp:375] Memory required for data: 5167360
[0] I0427 11:07:09.598256 125 layer_factory.hpp:114] Creating layer ip2
[1] I0427 11:07:09.601902 97 layer_factory.hpp:114] Creating layer loss
[0] I0427 11:07:09.598268 125 net.cpp:265] Creating Layer ip2
[0] I0427 11:07:09.598270 125 net.cpp:1264] ip2 <- ip1
[1] I0427 11:07:09.601929 97 net.cpp:360] Setting up loss
[1] I0427 11:07:09.601938 97 net.cpp:367] Top shape: (1)
[0] I0427 11:07:09.598278 125 net.cpp:1238] ip2 -> ip2
[1] I0427 11:07:09.601940 97 net.cpp:370] with loss weight 0.5
[1] I0427 11:07:09.601959 97 net.cpp:375] Memory required for data: 5169924
[0] I0427 11:07:09.598335 125 net.cpp:360] Setting up ip2
[0] I0427 11:07:09.598343 125 net.cpp:367] Top shape: 64 10 (640)
[0] I0427 11:07:09.598346 125 net.cpp:375] Memory required for data: 5169920
[1] I0427 11:07:09.601963 97 net.cpp:437] loss needs backward computation.
[1] I0427 11:07:09.601968 97 net.cpp:437] ip2 needs backward computation.
[0] I0427 11:07:09.598352 125 layer_factory.hpp:114] Creating layer loss
[1] I0427 11:07:09.601971 97 net.cpp:437] relu1 needs backward computation.
[0] I0427 11:07:09.598363 125 net.cpp:265] Creating Layer loss
[1] I0427 11:07:09.601975 97 net.cpp:437] ip1 needs backward computation.
[0] I0427 11:07:09.598367 125 net.cpp:1264] loss <- ip2
[0] I0427 11:07:09.598397 125 net.cpp:1264] loss <- label
[1] I0427 11:07:09.601984 97 net.cpp:437] pool2 needs backward computation.
[0] I0427 11:07:09.598403 125 net.cpp:1238] loss -> loss
[1] I0427 11:07:09.601986 97 net.cpp:437] conv2 needs backward computation.
[1] I0427 11:07:09.601992 97 net.cpp:437] pool1 needs backward computation.
[0] I0427 11:07:09.598418 125 layer_factory.hpp:114] Creating layer loss
[1] I0427 11:07:09.601997 97 net.cpp:437] conv1 needs backward computation.
[0] I0427 11:07:09.598453 125 net.cpp:360] Setting up loss
[0] I0427 11:07:09.598461 125 net.cpp:367] Top shape: (1)
[1] I0427 11:07:09.602001 97 net.cpp:439] mnist does not need backward computation.
[0] I0427 11:07:09.598465 125 net.cpp:370] with loss weight 0.5
[0] I0427 11:07:09.598489 125 net.cpp:375] Memory required for data: 5169924
[0] I0427 11:07:09.598493 125 net.cpp:437] loss needs backward computation.
[1] I0427 11:07:09.602005 97 net.cpp:481] This network produces output loss
[0] I0427 11:07:09.598498 125 net.cpp:437] ip2 needs backward computation.
[1] I0427 11:07:09.602017 97 net.cpp:521] Network initialization done.
[0] I0427 11:07:09.598502 125 net.cpp:437] relu1 needs backward computation.
[0] I0427 11:07:09.598505 125 net.cpp:437] ip1 needs backward computation.
[0] I0427 11:07:09.598510 125 net.cpp:437] pool2 needs backward computation.
[1] I0427 11:07:09.602191 97 solver.cpp:249] Creating test net (#0) specified by net file: examples/mnist/lenet_train_test_mlsl.prototxt
[1] I0427 11:07:09.602200 97 cpu_info.cpp:453] Processor speed [MHz]: 2300
[0] I0427 11:07:09.598515 125 net.cpp:437] conv2 needs backward computation.
[1] I0427 11:07:09.602202 97 cpu_info.cpp:456] Total number of sockets: 2
[1] I0427 11:07:09.602205 97 cpu_info.cpp:459] Total number of CPU cores: 36
[1] I0427 11:07:09.602210 97 cpu_info.cpp:462] Total number of processors: 72
[0] I0427 11:07:09.598520 125 net.cpp:437] pool1 needs backward computation.
[1] I0427 11:07:09.602213 97 cpu_info.cpp:465] GPU is used: no
[1] I0427 11:07:09.602217 97 cpu_info.cpp:468] OpenMP environmental variables are specified: yes
[0] I0427 11:07:09.598523 125 net.cpp:437] conv1 needs backward computation.
[0] I0427 11:07:09.598528 125 net.cpp:439] mnist does not need backward computation.
[0] I0427 11:07:09.598531 125 net.cpp:481] This network produces output loss
[1] I0427 11:07:09.602221 97 cpu_info.cpp:471] OpenMP thread bind allowed: no
[0] I0427 11:07:09.598546 125 net.cpp:521] Network initialization done.
[1] I0427 11:07:09.602226 97 cpu_info.cpp:474] Number of OpenMP threads: 36
[0] I0427 11:07:09.598822 125 solver.cpp:249] Creating test net (#0) specified by net file: examples/mnist/lenet_train_test_mlsl.prototxt
[0] I0427 11:07:09.598834 125 cpu_info.cpp:453] Processor speed [MHz]: 2300
[1] I0427 11:07:09.602244 97 net.cpp:1052] The NetState phase (1) differed from the phase (0) specified by a rule in layer mnist
[0] I0427 11:07:09.598839 125 cpu_info.cpp:456] Total number of sockets: 2
[0] I0427 11:07:09.598841 125 cpu_info.cpp:459] Total number of CPU cores: 36
[0] I0427 11:07:09.598845 125 cpu_info.cpp:462] Total number of processors: 72
[0] I0427 11:07:09.598848 125 cpu_info.cpp:465] GPU is used: no
[1] I0427 11:07:09.602470 97 net.cpp:207] Initializing net from parameters:
[0] I0427 11:07:09.598852 125 cpu_info.cpp:468] OpenMP environmental variables are specified: yes
[0] I0427 11:07:09.598855 125 cpu_info.cpp:471] OpenMP thread bind allowed: no
[1] I0427 11:07:09.602483 97 net.cpp:208]
[0] I0427 11:07:09.598858 125 cpu_info.cpp:474] Number of OpenMP threads: 36
[1] name: "LeNet"
[1] state {
[1] phase: TEST
[1] }
[1] engine: "MKLDNN"
[1] compile_net_state {
[1] bn_scale_remove: false
[1] bn_scale_merge: false
[1] }
[1] layer {
[1] name: "mnist"
[1] type: "Data"
[1] top: "data"
[1] top: "label"
[1] include {
[1] phase: TEST
[1] }
[1] transform_param {
[1] scale: 0.00390625
[1] }
[1] data_param {
[1] source: "examples/mnist/mnist_test_lmdb"
[1] batch_size: 100
[1] backend: LMDB
[1] }
[1] }
[1] layer {
[1] name: "label_mnist_1_split"
[1] type: "Split"
[1] bottom: "label"
[1] top: "label_mnist_1_split_0"
[1] top: "label_mnist_1_split_1"
[1] }
[1] layer {
[1] name: "conv1"
[1] type: "Convolution"
[1] bottom: "data"
[1] top: "conv1"
[1] param {
[1] lr_mult: 1
[1] }
[1] convolution_param {
[1] num_output: 20
[1] bias_term: false
[1] kernel_size: 5
[1] stride: 1
[1] weight_filler {
[1] type: "xavier"
[1] }
[1] engine: MKL2017
[1] }
[1] }
[1] layer {
[1] name: "pool1"
[1] type: "Pooling"
[1] bottom: "conv1"
[1] top: "pool1"
[1] pooling_param {
[1] pool: MAX
[1] kernel_size: 2
[1] stride: 2
[1] engine: MKL2017
[1] }
[1] }
[1] layer {
[1] name: "conv2"
[1] type: "Convolution"
[1] bottom: "pool1"
[1] top: "conv2"
[1] param {
[1] lr_mult: 1
[1] }
[1] convolution_param {
[1] num_output: 50
[1] bias_term: false
[1] kernel_size: 5
[1] stride: 1
[1] weight_filler {
[1] type: "xavier"
[1] }
[1] engine: MKL2017
[1] }
[1] }
[1] layer {
[1] name: "pool2"
[1] type: "Pooling"
[1] bottom: "conv2"
[1] top: "pool2"
[1] pooling_param {
[1] pool: MAX
[1] kernel_size: 2
[1] stride: 2
[1] engine: MKL2017
[1] }
[1] }
[1] layer {
[1] name: "ip1"
[1] type: "InnerProduct"
[1] bottom: "pool2"
[1] top: "ip1"
[1] param {
[1] lr_mult: 1
[1] }
[1] inner_product_param {
[1] num_output: 500
[1] bias_term: false
[1] weight_filler {
[1] type: "xavier"
[1] }
[1] }
[1] }
[1] layer {
[1] name: "relu1"
[1] type: "ReLU"
[1] bottom: "ip1"
[1] top: "ip1"
[1] relu_param {
[1] engine: MKL2017
[1] }
[1] }
[1] layer {
[1] name: "ip2"
[1] type: "InnerProduct"
[1] bottom: "ip1"
[1] top: "ip2"
[1] param {
[1] lr_mult: 1
[1] }
[1] inner_product_param {
[1] num_output: 10
[1] bias_term: false
[1] weight_filler {
[1] type: "xavier"
[1] }
[1] }
[1] }
[1] layer {
[1] name: "ip2_ip2_0_split"
[1] type: "Split"
[1] bottom: "ip2"
[1] top: "ip2_ip2_0_split_0"
[1] top: "ip2_ip2_0_split_1"
[1] }
[1] layer {
[1] name: "accuracy"
[1] type: "Accuracy"
[1] bottom: "ip2_ip2_0_split_0"
[1] bottom: "label_mnist_1_split_0"
[1] top: "accuracy"
[1] include {
[1] phase: TEST
[1] }
[1] }
[1] layer {
[1] name: "loss"
[1] type: "SoftmaxWithLoss"
[1] bottom: "ip2_ip2_0_split_1"
[1] bottom: "label_mnist_1_split_1"
[1] top: "loss"
[1] }
[0] I0427 11:07:09.598881 125 net.cpp:1052] The NetState phase (1) differed from the phase (0) specified by a rule in layer mnist
[1] I0427 11:07:09.602598 97 layer_factory.hpp:114] Creating layer mnist
[0] I0427 11:07:09.599122 125 net.cpp:207] Initializing net from parameters:
[1] I0427 11:07:09.602694 97 net.cpp:265] Creating Layer mnist
[1] I0427 11:07:09.602701 97 net.cpp:1238] mnist -> data
[0] I0427 11:07:09.599133 125 net.cpp:208]
[1] I0427 11:07:09.602710 97 net.cpp:1238] mnist -> label
[0] name: "LeNet"
[0] state {
[0] phase: TEST
[0] }
[0] engine: "MKLDNN"
[0] compile_net_state {
[0] bn_scale_remove: false
[0] bn_scale_merge: false
[0] }
[0] layer {
[0] name: "mnist"
[0] type: "Data"
[0] top: "data"
[0] top: "label"
[0] include {
[0] phase: TEST
[0] }
[0] transform_param {
[0] scale: 0.00390625
[0] }
[0] data_param {
[0] source: "examples/mnist/mnist_test_lmdb"
[0] batch_size: 100
[0] backend: LMDB
[0] }
[0] }
[0] layer {
[0] name: "label_mnist_1_split"
[0] type: "Split"
[0] bottom: "label"
[0] top: "label_mnist_1_split_0"
[0] top: "label_mnist_1_split_1"
[0] }
[0] layer {
[0] name: "conv1"
[0] type: "Convolution"
[0] bottom: "data"
[0] top: "conv1"
[0] param {
[0] lr_mult: 1
[0] }
[0] convolution_param {
[0] num_output: 20
[0] bias_term: false
[0] kernel_size: 5
[0] stride: 1
[0] weight_filler {
[0] type: "xavier"
[0] }
[0] engine: MKL2017
[0] }
[0] }
[0] layer {
[0] name: "pool1"
[0] type: "Pooling"
[0] bottom: "conv1"
[0] top: "pool1"
[0] pooling_param {
[0] pool: MAX
[0] kernel_size: 2
[0] stride: 2
[0] engine: MKL2017
[0] }
[0] }
[0] layer {
[0] name: "conv2"
[0] type: "Convolution"
[0] bottom: "pool1"
[0] top: "conv2"
[0] param {
[0] lr_mult: 1
[0] }
[0] convolution_param {
[0] num_output: 50
[0] bias_term: false
[0] kernel_size: 5
[0] stride: 1
[0] weight_filler {
[0] type: "xavier"
[0] }
[0] engine: MKL2017
[0] }
[0] }
[0] layer {
[0] name: "pool2"
[0] type: "Pooling"
[0] bottom: "conv2"
[0] top: "pool2"
[0] pooling_param {
[0] pool: MAX
[0] kernel_size: 2
[0] stride: 2
[0] engine: MKL2017
[0] }
[0] }
[0] layer {
[0] name: "ip1"
[0] type: "InnerProduct"
[0] bottom: "pool2"
[0] top: "ip1"
[0] param {
[0] lr_mult: 1
[0] }
[0] inner_product_param {
[0] num_output: 500
[0] bias_term: false
[0] weight_filler {
[0] type: "xavier"
[0] }
[0] }
[0] }
[0] layer {
[0] name: "relu1"
[0] type: "ReLU"
[0] bottom: "ip1"
[0] top: "ip1"
[0] relu_param {
[0] engine: MKL2017
[0] }
[0] }
[0] layer {
[0] name: "ip2"
[0] type: "InnerProduct"
[0] bottom: "ip1"
[0] top: "ip2"
[0] param {
[0] lr_mult: 1
[0] }
[0] inner_product_param {
[0] num_output: 10
[0] bias_term: false
[0] weight_filler {
[0] type: "xavier"
[0] }
[0] }
[0] }
[0] layer {
[0] name: "ip2_ip2_0_split"
[0] type: "Split"
[0] bottom: "ip2"
[0] top: "ip2_ip2_0_split_0"
[0] top: "ip2_ip2_0_split_1"
[0] }
[0] layer {
[0] name: "accuracy"
[0] type: "Accuracy"
[0] bottom: "ip2_ip2_0_split_0"
[0] bottom: "label_mnist_1_split_0"
[0] top: "accuracy"
[0] include {
[0] phase: TEST
[0] }
[0] }
[0] layer {
[0] name: "loss"
[0] type: "SoftmaxWithLoss"
[0] bottom: "ip2_ip2_0_split_1"
[0] bottom: "label_mnist_1_split_1"
[0] top: "loss"
[0] }
[1] I0427 11:07:09.602819 100 internal_thread.cpp:135] Internal thread is affinitized to core 71
[0] I0427 11:07:09.599277 125 layer_factory.hpp:114] Creating layer mnist
[1] I0427 11:07:09.602931 100 db_lmdb.cpp:72] Opened lmdb examples/mnist/mnist_test_lmdb
[0] I0427 11:07:09.599380 125 net.cpp:265] Creating Layer mnist
[0] I0427 11:07:09.599388 125 net.cpp:1238] mnist -> data
[1] I0427 11:07:09.602979 97 data_layer.cpp:80] output data size: 100,1,28,28
[0] I0427 11:07:09.599397 125 net.cpp:1238] mnist -> label
[1] I0427 11:07:09.603960 97 net.cpp:360] Setting up mnist
[1] I0427 11:07:09.603971 97 net.cpp:367] Top shape: 100 1 28 28 (78400)
[0] I0427 11:07:09.599514 128 internal_thread.cpp:135] Internal thread is affinitized to core 71
[1] I0427 11:07:09.603976 97 net.cpp:367] Top shape: 100 (100)
[1] I0427 11:07:09.603979 97 net.cpp:375] Memory required for data: 314000
[0] I0427 11:07:09.599640 128 db_lmdb.cpp:72] Opened lmdb examples/mnist/mnist_test_lmdb
[1] I0427 11:07:09.603984 97 layer_factory.hpp:114] Creating layer label_mnist_1_split
[0] I0427 11:07:09.599706 125 data_layer.cpp:80] output data size: 100,1,28,28
[1] I0427 11:07:09.603998 97 net.cpp:265] Creating Layer label_mnist_1_split
[1] I0427 11:07:09.604004 97 net.cpp:1264] label_mnist_1_split <- label
[1] I0427 11:07:09.604012 97 net.cpp:1238] label_mnist_1_split -> label_mnist_1_split_0
[1] I0427 11:07:09.604019 97 net.cpp:1238] label_mnist_1_split -> label_mnist_1_split_1
[1] I0427 11:07:09.604041 97 net.cpp:360] Setting up label_mnist_1_split
[1] I0427 11:07:09.604048 97 net.cpp:367] Top shape: 100 (100)
[1] I0427 11:07:09.604053 97 net.cpp:367] Top shape: 100 (100)
[1] I0427 11:07:09.604055 97 net.cpp:375] Memory required for data: 314800
[1] I0427 11:07:09.604060 97 layer_factory.hpp:114] Creating layer conv1
[1] I0427 11:07:09.604071 97 net.cpp:265] Creating Layer conv1
[1] I0427 11:07:09.604077 97 net.cpp:1264] conv1 <- data
[1] I0427 11:07:09.604084 97 net.cpp:1238] conv1 -> conv1
[0] I0427 11:07:09.600899 125 net.cpp:360] Setting up mnist
[0] I0427 11:07:09.600913 125 net.cpp:367] Top shape: 100 1 28 28 (78400)
[0] I0427 11:07:09.600919 125 net.cpp:367] Top shape: 100 (100)
[0] I0427 11:07:09.600924 125 net.cpp:375] Memory required for data: 314000
[0] I0427 11:07:09.600927 125 layer_factory.hpp:114] Creating layer label_mnist_1_split
[0] I0427 11:07:09.600944 125 net.cpp:265] Creating Layer label_mnist_1_split
[0] I0427 11:07:09.600950 125 net.cpp:1264] label_mnist_1_split <- label
[0] I0427 11:07:09.600958 125 net.cpp:1238] label_mnist_1_split -> label_mnist_1_split_0
[0] I0427 11:07:09.600972 125 net.cpp:1238] label_mnist_1_split -> label_mnist_1_split_1
[0] I0427 11:07:09.600991 125 net.cpp:360] Setting up label_mnist_1_split
[0] I0427 11:07:09.600998 125 net.cpp:367] Top shape: 100 (100)
[0] I0427 11:07:09.601004 125 net.cpp:367] Top shape: 100 (100)
[0] I0427 11:07:09.601007 125 net.cpp:375] Memory required for data: 314800
[0] I0427 11:07:09.601011 125 layer_factory.hpp:114] Creating layer conv1
[0] I0427 11:07:09.601023 125 net.cpp:265] Creating Layer conv1
[0] I0427 11:07:09.601032 125 net.cpp:1264] conv1 <- data
[0] I0427 11:07:09.601037 125 net.cpp:1238] conv1 -> conv1
[1] I0427 11:07:09.605427 97 net.cpp:360] Setting up conv1
[1] I0427 11:07:09.605438 97 net.cpp:367] Top shape: 100 20 24 24 (1152000)
[1] I0427 11:07:09.605442 97 net.cpp:375] Memory required for data: 4922800
[1] I0427 11:07:09.605449 97 layer_factory.hpp:114] Creating layer pool1
[1] I0427 11:07:09.605459 97 net.cpp:265] Creating Layer pool1
[1] I0427 11:07:09.605468 97 net.cpp:1264] pool1 <- conv1
[1] I0427 11:07:09.605473 97 net.cpp:1238] pool1 -> pool1
[1] I0427 11:07:09.605485 97 net.cpp:360] Setting up pool1
[1] I0427 11:07:09.605494 97 net.cpp:367] Top shape: 100 20 12 12 (288000)
[1] I0427 11:07:09.605496 97 net.cpp:375] Memory required for data: 6074800
[1] I0427 11:07:09.605500 97 layer_factory.hpp:114] Creating layer conv2
[1] I0427 11:07:09.605515 97 net.cpp:265] Creating Layer conv2
[1] I0427 11:07:09.605520 97 net.cpp:1264] conv2 <- pool1
[1] I0427 11:07:09.605531 97 net.cpp:1238] conv2 -> conv2
[0] I0427 11:07:09.602587 125 net.cpp:360] Setting up conv1
[0] I0427 11:07:09.602598 125 net.cpp:367] Top shape: 100 20 24 24 (1152000)
[0] I0427 11:07:09.602602 125 net.cpp:375] Memory required for data: 4922800
[0] I0427 11:07:09.602609 125 layer_factory.hpp:114] Creating layer pool1
[0] I0427 11:07:09.602619 125 net.cpp:265] Creating Layer pool1
[0] I0427 11:07:09.602623 125 net.cpp:1264] pool1 <- conv1
[0] I0427 11:07:09.602630 125 net.cpp:1238] pool1 -> pool1
[0] I0427 11:07:09.602644 125 net.cpp:360] Setting up pool1
[0] I0427 11:07:09.602649 125 net.cpp:367] Top shape: 100 20 12 12 (288000)
[0] I0427 11:07:09.602654 125 net.cpp:375] Memory required for data: 6074800
[0] I0427 11:07:09.602659 125 layer_factory.hpp:114] Creating layer conv2
[0] I0427 11:07:09.602682 125 net.cpp:265] Creating Layer conv2
[0] I0427 11:07:09.602685 125 net.cpp:1264] conv2 <- pool1
[0] I0427 11:07:09.602691 125 net.cpp:1238] conv2 -> conv2
[1] I0427 11:07:09.607254 97 net.cpp:360] Setting up conv2
[1] I0427 11:07:09.607264 97 net.cpp:367] Top shape: 100 50 8 8 (320000)
[1] I0427 11:07:09.607269 97 net.cpp:375] Memory required for data: 7354800
[1] I0427 11:07:09.607275 97 layer_factory.hpp:114] Creating layer pool2
[1] I0427 11:07:09.607285 97 net.cpp:265] Creating Layer pool2
[1] I0427 11:07:09.607290 97 net.cpp:1264] pool2 <- conv2
[1] I0427 11:07:09.607298 97 net.cpp:1238] pool2 -> pool2
[1] I0427 11:07:09.607311 97 net.cpp:360] Setting up pool2
[1] I0427 11:07:09.607318 97 net.cpp:367] Top shape: 100 50 4 4 (80000)
[1] I0427 11:07:09.607322 97 net.cpp:375] Memory required for data: 7674800
[1] I0427 11:07:09.607336 97 layer_factory.hpp:114] Creating layer ip1
[1] I0427 11:07:09.607347 97 net.cpp:265] Creating Layer ip1
[1] I0427 11:07:09.607355 97 net.cpp:1264] ip1 <- pool2
[1] I0427 11:07:09.607362 97 net.cpp:1238] ip1 -> ip1
[0] I0427 11:07:09.604876 125 net.cpp:360] Setting up conv2
[0] I0427 11:07:09.604885 125 net.cpp:367] Top shape: 100 50 8 8 (320000)
[0] I0427 11:07:09.604889 125 net.cpp:375] Memory required for data: 7354800
[0] I0427 11:07:09.604895 125 layer_factory.hpp:114] Creating layer pool2
[0] I0427 11:07:09.604905 125 net.cpp:265] Creating Layer pool2
[0] I0427 11:07:09.604909 125 net.cpp:1264] pool2 <- conv2
[0] I0427 11:07:09.604917 125 net.cpp:1238] pool2 -> pool2
[0] I0427 11:07:09.604931 125 net.cpp:360] Setting up pool2
[0] I0427 11:07:09.604939 125 net.cpp:367] Top shape: 100 50 4 4 (80000)
[0] I0427 11:07:09.604943 125 net.cpp:375] Memory required for data: 7674800
[0] I0427 11:07:09.604948 125 layer_factory.hpp:114] Creating layer ip1
[0] I0427 11:07:09.604955 125 net.cpp:265] Creating Layer ip1
[0] I0427 11:07:09.604960 125 net.cpp:1264] ip1 <- pool2
[0] I0427 11:07:09.604975 125 net.cpp:1238] ip1 -> ip1
[1] I0427 11:07:09.610730 97 net.cpp:360] Setting up ip1
[1] I0427 11:07:09.610744 97 net.cpp:367] Top shape: 100 500 (50000)
[1] I0427 11:07:09.610749 97 net.cpp:375] Memory required for data: 7874800
[1] I0427 11:07:09.610756 97 layer_factory.hpp:114] Creating layer relu1
[1] I0427 11:07:09.610769 97 net.cpp:265] Creating Layer relu1
[1] I0427 11:07:09.610775 97 net.cpp:1264] relu1 <- ip1
[1] I0427 11:07:09.610783 97 net.cpp:1225] relu1 -> ip1 (in-place)
[1] I0427 11:07:09.610796 97 net.cpp:360] Setting up relu1
[1] I0427 11:07:09.610803 97 net.cpp:367] Top shape: 100 500 (50000)
[1] I0427 11:07:09.610807 97 net.cpp:375] Memory required for data: 8074800
[1] I0427 11:07:09.610828 97 layer_factory.hpp:114] Creating layer ip2
[1] I0427 11:07:09.610841 97 net.cpp:265] Creating Layer ip2
[1] I0427 11:07:09.610847 97 net.cpp:1264] ip2 <- ip1
[1] I0427 11:07:09.610855 97 net.cpp:1238] ip2 -> ip2
[1] I0427 11:07:09.610903 97 net.cpp:360] Setting up ip2
[1] I0427 11:07:09.610909 97 net.cpp:367] Top shape: 100 10 (1000)
[1] I0427 11:07:09.610913 97 net.cpp:375] Memory required for data: 8078800
[1] I0427 11:07:09.610920 97 layer_factory.hpp:114] Creating layer ip2_ip2_0_split
[1] I0427 11:07:09.610929 97 net.cpp:265] Creating Layer ip2_ip2_0_split
[1] I0427 11:07:09.610931 97 net.cpp:1264] ip2_ip2_0_split <- ip2
[1] I0427 11:07:09.610939 97 net.cpp:1238] ip2_ip2_0_split -> ip2_ip2_0_split_0
[1] I0427 11:07:09.610947 97 net.cpp:1238] ip2_ip2_0_split -> ip2_ip2_0_split_1
[1] I0427 11:07:09.610955 97 net.cpp:360] Setting up ip2_ip2_0_split
[1] I0427 11:07:09.610960 97 net.cpp:367] Top shape: 100 10 (1000)
[1] I0427 11:07:09.610965 97 net.cpp:367] Top shape: 100 10 (1000)
[1] I0427 11:07:09.610967 97 net.cpp:375] Memory required for data: 8086800
[1] I0427 11:07:09.610971 97 layer_factory.hpp:114] Creating layer accuracy
[1] I0427 11:07:09.610980 97 net.cpp:265] Creating Layer accuracy
[1] I0427 11:07:09.610987 97 net.cpp:1264] accuracy <- ip2_ip2_0_split_0
[1] I0427 11:07:09.610991 97 net.cpp:1264] accuracy <- label_mnist_1_split_0
[1] I0427 11:07:09.610999 97 net.cpp:1238] accuracy -> accuracy
[1] I0427 11:07:09.611009 97 net.cpp:360] Setting up accuracy
[1] I0427 11:07:09.611014 97 net.cpp:367] Top shape: (1)
[1] I0427 11:07:09.611018 97 net.cpp:375] Memory required for data: 8086804
[1] I0427 11:07:09.611021 97 layer_factory.hpp:114] Creating layer loss
[1] I0427 11:07:09.611028 97 net.cpp:265] Creating Layer loss
[1] I0427 11:07:09.611032 97 net.cpp:1264] loss <- ip2_ip2_0_split_1
[1] I0427 11:07:09.611037 97 net.cpp:1264] loss <- label_mnist_1_split_1
[1] I0427 11:07:09.611042 97 net.cpp:1238] loss -> loss
[1] I0427 11:07:09.611052 97 layer_factory.hpp:114] Creating layer loss
[1] I0427 11:07:09.611073 97 net.cpp:360] Setting up loss
[1] I0427 11:07:09.611081 97 net.cpp:367] Top shape: (1)
[1] I0427 11:07:09.611084 97 net.cpp:370] with loss weight 0.5
[1] I0427 11:07:09.611091 97 net.cpp:375] Memory required for data: 8086808
[1] I0427 11:07:09.611095 97 net.cpp:437] loss needs backward computation.
[1] I0427 11:07:09.611100 97 net.cpp:439] accuracy does not need backward computation.
[1] I0427 11:07:09.611105 97 net.cpp:437] ip2_ip2_0_split needs backward computation.
[1] I0427 11:07:09.611109 97 net.cpp:437] ip2 needs backward computation.
[1] I0427 11:07:09.611114 97 net.cpp:437] relu1 needs backward computation.
[1] I0427 11:07:09.611116 97 net.cpp:437] ip1 needs backward computation.
[1] I0427 11:07:09.611120 97 net.cpp:437] pool2 needs backward computation.
[1] I0427 11:07:09.611124 97 net.cpp:437] conv2 needs backward computation.
[1] I0427 11:07:09.611129 97 net.cpp:437] pool1 needs backward computation.
[1] I0427 11:07:09.611132 97 net.cpp:437] conv1 needs backward computation.
[1] I0427 11:07:09.611136 97 net.cpp:439] label_mnist_1_split does not need backward computation.
[1] I0427 11:07:09.611141 97 net.cpp:439] mnist does not need backward computation.
[1] I0427 11:07:09.611147 97 net.cpp:481] This network produces output accuracy
[1] I0427 11:07:09.611150 97 net.cpp:481] This network produces output loss
[1] I0427 11:07:09.611165 97 net.cpp:521] Network initialization done.
[1] I0427 11:07:09.611233 97 solver.cpp:121] Solver scaffolding done.
[0] I0427 11:07:09.608055 125 net.cpp:360] Setting up ip1
[0] I0427 11:07:09.608070 125 net.cpp:367] Top shape: 100 500 (50000)
[0] I0427 11:07:09.608073 125 net.cpp:375] Memory required for data: 7874800
[0] I0427 11:07:09.608079 125 layer_factory.hpp:114] Creating layer relu1
[0] I0427 11:07:09.608090 125 net.cpp:265] Creating Layer relu1
[0] I0427 11:07:09.608093 125 net.cpp:1264] relu1 <- ip1
[0] I0427 11:07:09.608098 125 net.cpp:1225] relu1 -> ip1 (in-place)
[0] I0427 11:07:09.608114 125 net.cpp:360] Setting up relu1
[0] I0427 11:07:09.608117 125 net.cpp:367] Top shape: 100 500 (50000)
[0] I0427 11:07:09.608121 125 net.cpp:375] Memory required for data: 8074800
[0] I0427 11:07:09.608141 125 layer_factory.hpp:114] Creating layer ip2
[0] I0427 11:07:09.608151 125 net.cpp:265] Creating Layer ip2
[0] I0427 11:07:09.608155 125 net.cpp:1264] ip2 <- ip1
[0] I0427 11:07:09.608160 125 net.cpp:1238] ip2 -> ip2
[0] I0427 11:07:09.608194 125 net.cpp:360] Setting up ip2
[0] I0427 11:07:09.608201 125 net.cpp:367] Top shape: 100 10 (1000)
[0] I0427 11:07:09.608203 125 net.cpp:375] Memory required for data: 8078800
[0] I0427 11:07:09.608207 125 layer_factory.hpp:114] Creating layer ip2_ip2_0_split
[0] I0427 11:07:09.608213 125 net.cpp:265] Creating Layer ip2_ip2_0_split
[0] I0427 11:07:09.608216 125 net.cpp:1264] ip2_ip2_0_split <- ip2
[0] I0427 11:07:09.608222 125 net.cpp:1238] ip2_ip2_0_split -> ip2_ip2_0_split_0
[0] I0427 11:07:09.608227 125 net.cpp:1238] ip2_ip2_0_split -> ip2_ip2_0_split_1
[0] I0427 11:07:09.608235 125 net.cpp:360] Setting up ip2_ip2_0_split
[0] I0427 11:07:09.608238 125 net.cpp:367] Top shape: 100 10 (1000)
[0] I0427 11:07:09.608242 125 net.cpp:367] Top shape: 100 10 (1000)
[0] I0427 11:07:09.608245 125 net.cpp:375] Memory required for data: 8086800
[0] I0427 11:07:09.608248 125 layer_factory.hpp:114] Creating layer accuracy
[0] I0427 11:07:09.608258 125 net.cpp:265] Creating Layer accuracy
[0] I0427 11:07:09.608261 125 net.cpp:1264] accuracy <- ip2_ip2_0_split_0
[0] I0427 11:07:09.608266 125 net.cpp:1264] accuracy <- label_mnist_1_split_0
[0] I0427 11:07:09.608273 125 net.cpp:1238] accuracy -> accuracy
[0] I0427 11:07:09.608281 125 net.cpp:360] Setting up accuracy
[0] I0427 11:07:09.608286 125 net.cpp:367] Top shape: (1)
[0] I0427 11:07:09.608289 125 net.cpp:375] Memory required for data: 8086804
[0] I0427 11:07:09.608292 125 layer_factory.hpp:114] Creating layer loss
[0] I0427 11:07:09.608299 125 net.cpp:265] Creating Layer loss
[0] I0427 11:07:09.608304 125 net.cpp:1264] loss <- ip2_ip2_0_split_1
[0] I0427 11:07:09.608307 125 net.cpp:1264] loss <- label_mnist_1_split_1
[0] I0427 11:07:09.608312 125 net.cpp:1238] loss -> loss
[0] I0427 11:07:09.608317 125 layer_factory.hpp:114] Creating layer loss
[0] I0427 11:07:09.608337 125 net.cpp:360] Setting up loss
[0] I0427 11:07:09.608345 125 net.cpp:367] Top shape: (1)
[0] I0427 11:07:09.608347 125 net.cpp:370] with loss weight 0.5
[0] I0427 11:07:09.608353 125 net.cpp:375] Memory required for data: 8086808
[0] I0427 11:07:09.608356 125 net.cpp:437] loss needs backward computation.
[0] I0427 11:07:09.608361 125 net.cpp:439] accuracy does not need backward computation.
[0] I0427 11:07:09.608368 125 net.cpp:437] ip2_ip2_0_split needs backward computation.
[0] I0427 11:07:09.608371 125 net.cpp:437] ip2 needs backward computation.
[0] I0427 11:07:09.608373 125 net.cpp:437] relu1 needs backward computation.
[0] I0427 11:07:09.608376 125 net.cpp:437] ip1 needs backward computation.
[0] I0427 11:07:09.608379 125 net.cpp:437] pool2 needs backward computation.
[0] I0427 11:07:09.608381 125 net.cpp:437] conv2 needs backward computation.
[0] I0427 11:07:09.608384 125 net.cpp:437] pool1 needs backward computation.
[0] I0427 11:07:09.608388 125 net.cpp:437] conv1 needs backward computation.
[0] I0427 11:07:09.608392 125 net.cpp:439] label_mnist_1_split does not need backward computation.
[0] I0427 11:07:09.608395 125 net.cpp:439] mnist does not need backward computation.
[0] I0427 11:07:09.608397 125 net.cpp:481] This network produces output accuracy
[0] I0427 11:07:09.608399 125 net.cpp:481] This network produces output loss
[0] I0427 11:07:09.608410 125 net.cpp:521] Network initialization done.
[1] I0427 11:07:09.612504 97 caffe.cpp:325] Configuring multinode setup
[0] I0427 11:07:09.608490 125 solver.cpp:121] Solver scaffolding done.
[1] I0427 11:07:09.612534 97 caffe.cpp:328] Starting Multi-node Optimization in MLSL environment
[0] I0427 11:07:09.608570 125 caffe.cpp:325] Configuring multinode setup
[1] W0427 11:07:09.612536 97 multi_sync.hpp:191] RUN: PER LAYER TIMINGS ARE DISABLED, FORWARD OVERLAP OPTIMIZATION IS ENABLED, WEIGHT GRADIENT COMPRESSION IS DISABLED, SINGLE DB SPLITTING IS DISABLED
[0] I0427 11:07:09.608588 125 caffe.cpp:328] Starting Multi-node Optimization in MLSL environment
[1] I0427 11:07:09.612552 97 multi_sync.hpp:134] synchronize_params: bcast
[0] W0427 11:07:09.608592 125 multi_sync.hpp:191] RUN: PER LAYER TIMINGS ARE DISABLED, FORWARD OVERLAP OPTIMIZATION IS ENABLED, WEIGHT GRADIENT COMPRESSION IS DISABLED, SINGLE DB SPLITTING IS DISABLED
[0] I0427 11:07:09.608605 125 multi_sync.hpp:134] synchronize_params: bcast
[0] I0427 11:07:09.609779 125 solver.cpp:397] Solving LeNet
[0] I0427 11:07:09.609793 125 solver.cpp:398] Learning Rate Policy: inv
[0] I0427 11:07:09.609819 125 multi_sync.hpp:134] synchronize_params: bcast
[1] I0427 11:07:09.614929 97 solver.cpp:397] Solving LeNet
[1] I0427 11:07:09.614940 97 solver.cpp:398] Learning Rate Policy: inv
[1] I0427 11:07:09.614959 97 multi_sync.hpp:134] synchronize_params: bcast
[0] I0427 11:07:09.611517 125 solver.cpp:474] Iteration 0, Testing net (#0)
[1] I0427 11:07:09.617259 97 solver.cpp:474] Iteration 0, Testing net (#0)
[0] I0427 11:07:09.798406 125 solver.cpp:563] Test net output #0: accuracy = 0.1318
[0] I0427 11:07:09.798449 125 solver.cpp:563] Test net output #1: loss = 2.41961 (* 1 = 2.41961 loss)
[0] I0427 11:07:09.814437 125 solver.cpp:312] Iteration 0, loss = 2.40042
[0] I0427 11:07:09.814477 125 solver.cpp:333] Train net output #0: loss = 2.40042 (* 1 = 2.40042 loss)
[0] I0427 11:07:09.814501 125 sgd_solver.cpp:215] Iteration 0, lr = 0.01
[1] I0427 11:07:09.820175 97 solver.cpp:312] Iteration 0, loss = 2.40914
[1] I0427 11:07:09.820214 97 solver.cpp:333] Train net output #0: loss = 2.40914 (* 1 = 2.40914 loss)
[1] I0427 11:07:09.820233 97 sgd_solver.cpp:215] Iteration 0, lr = 0.01
[1] I0427 11:07:10.099892 97 solver.cpp:707] Snapshot begin
[0] I0427 11:07:10.096343 125 solver.cpp:707] Snapshot begin
[1] I0427 11:07:10.102345 97 solver.cpp:734] Snapshot end
[1] I0427 11:07:10.102360 97 solver.cpp:443] Optimization Done.
[1] I0427 11:07:10.102368 97 caffe.cpp:345] Optimization Done.
[0] I0427 11:07:10.098636 125 solver.cpp:769] Snapshotting to binary proto file examples/mnist/lenet_mlsl_iter_50.caffemodel
[0] I0427 11:07:10.102581 125 sgd_solver.cpp:754] Snapshotting solver state to binary proto file examples/mnist/lenet_mlsl_iter_50.solverstate
[0] I0427 11:07:10.105306 125 solver.cpp:734] Snapshot end
[0] I0427 11:07:10.105319 125 solver.cpp:443] Optimization Done.
[0] I0427 11:07:10.105329 125 caffe.cpp:345] Optimization Done.
real 0m0.856s
user 0m0.043s
sys 0m0.033s
Result folder: /opt/caffe/result-20180427110708
Log without setting it:
root@jfz1r04h17:/opt/caffe# ./scripts/run_intelcaffe.sh --hostfile hosts --solver examples/mnist/lenet_solver_mlsl.prototxt --network tcp --netmask enp134s0f0
CPUs with optimal settings:
Intel Xeon E7-88/48xx, E5-46/26/16xx, E3-12xx, D15/D-15 (Broadwell)
Intel Xeon Phi 7210/30/50/90 (Knights Landing)
Intel Xeon Platinum 81/61/51/41/31xx (Skylake)
Settings:
CPU: skx
Host file: hosts
Running mode: train
Benchmark: none
Debug option: off
Engine:
Number of MLSL servers: -1
-1: selected automatically according to CPU model.
BDW/SKX: 2, KNL: 4
Solver file: examples/mnist/lenet_solver_mlsl.prototxt
LMDB data source: examples/mnist/mnist_train_lmdb
LMDB data source: examples/mnist/mnist_test_lmdb
Network: tcp
Netmask for TCP network: enp134s0f0
NUMA configuration: Flat mode.
Create result directory: /opt/caffe/result-20180427111108
Number of nodes: 2
MLSL_NUM_SERVERS: 2
MLSL_SERVER_AFFINITY: 6,7
Pin internal threads to: 70,71
Number of OpenMP threads: 34
Run caffe with 2 nodes...
Warning: cannot find sensors
[0] [0] MPI startup(): Intel(R) MPI Library, Version 2018 Update 1 Build 20171011 (id: 17941)
[0] [0] MPI startup(): Copyright (C) 2003-2017 Intel Corporation. All rights reserved.
[0] [0] MPI startup(): Multi-threaded optimized library
[1] [1] ckpt_restart(): The real interface being used for tcp is enp134s0f0 and interface hostname is jfz1r04h19
[1] [1] MPI startup(): tcp data transfer mode
[0] [0] ckpt_restart(): The real interface being used for tcp is enp134s0f0 and interface hostname is jfz1r04h18
[0] [0] MPI startup(): tcp data transfer mode
[0] [0] MPI startup(): Device_reset_idx=5
[0] [0] MPI startup(): Allgather: 4: 27306-38912 & 0-2
[0] [0] MPI startup(): Allgather: 4: 78064-294912 & 0-2
[0] [0] MPI startup(): Allgather: 3: 0-27306 & 0-2
[0] [0] MPI startup(): Allgather: 3: 38912-78064 & 0-2
[0] [0] MPI startup(): Allgather: 3: 0-2147483647 & 0-2
[0] [0] MPI startup(): Allgather: 1: 0-7 & 3-4
[0] [0] MPI startup(): Allgather: 1: 9-4607 & 3-4
[0] [0] MPI startup(): Allgather: 1: 66622-461338 & 3-4
[0] [0] MPI startup(): Allgather: 3: 9081-26350 & 3-4
[0] [0] MPI startup(): Allgather: 3: 461338-2692119 & 3-4
[0] [0] MPI startup(): Allgather: 4: 7-9 & 3-4
[0] [0] MPI startup(): Allgather: 4: 4607-9081 & 3-4
[0] [0] MPI startup(): Allgather: 4: 26350-66622 & 3-4
[0] [0] MPI startup(): Allgather: 4: 0-2147483647 & 3-4
[0] [0] MPI startup(): Allgather: 2: 1-1 & 5-2147483647
[0] [0] MPI startup(): Allgather: 4: 2-3 & 5-2147483647
[0] [0] MPI startup(): Allgather: 1: 4-5 & 5-2147483647
[0] [0] MPI startup(): Allgather: 4: 6-26 & 5-2147483647
[0] [0] MPI startup(): Allgather: 1: 27-98 & 5-2147483647
[0] [0] MPI startup(): Allgather: 3: 99-1029 & 5-2147483647
[0] [0] MPI startup(): Allgather: 4: 1030-5572 & 5-2147483647
[0] [0] MPI startup(): Allgather: 1: 5573-15186 & 5-2147483647
[0] [0] MPI startup(): Allgather: 2: 15187-33976 & 5-2147483647
[0] [0] MPI startup(): Allgather: 1: 33977-74391 & 5-2147483647
[0] [0] MPI startup(): Allgather: 3: 74392-131842 & 5-2147483647
[0] [0] MPI startup(): Allgather: 4: 0-2147483647 & 5-2147483647
[0] [0] MPI startup(): Allgatherv: 3: 0-2147483647 & 0-2
[0] [0] MPI startup(): Allgatherv: 1: 0-2 & 3-4
[0] [0] MPI startup(): Allgatherv: 2: 2-7 & 3-4
[0] [0] MPI startup(): Allgatherv: 1: 7-49 & 3-4
[0] [0] MPI startup(): Allgatherv: 2: 49-113 & 3-4
[0] [0] MPI startup(): Allgatherv: 4: 113-149 & 3-4
[0] [0] MPI startup(): Allgatherv: 3: 149-915 & 3-4
[0] [0] MPI startup(): Allgatherv: 1: 915-1614 & 3-4
[0] [0] MPI startup(): Allgatherv: 4: 1614-3296 & 3-4
[0] [0] MPI startup(): Allgatherv: 2: 3296-5670 & 3-4
[0] [0] MPI startup(): Allgatherv: 1: 5670-10998 & 3-4
[0] [0] MPI startup(): Allgatherv: 4: 10998-185966 & 3-4
[0] [0] MPI startup(): Allgatherv: 3: 185966-381166 & 3-4
[0] [0] MPI startup(): Allgatherv: 4: 381166-1597083 & 3-4
[0] [0] MPI startup(): Allgatherv: 3: 1597083-2998114 & 3-4
[0] [0] MPI startup(): Allgatherv: 4: 0-2147483647 & 3-4
[0] [0] MPI startup(): Allgatherv: 2: 0-47 & 5-2147483647
[0] [0] MPI startup(): Allgatherv: 1: 47-103 & 5-2147483647
[0] [0] MPI startup(): Allgatherv: 3: 103-438 & 5-2147483647
[0] [0] MPI startup(): Allgatherv: 2: 438-757 & 5-2147483647
[0] [0] MPI startup(): Allgatherv: 4: 757-1453 & 5-2147483647
[0] [0] MPI startup(): Allgatherv: 2: 1453-3133 & 5-2147483647
[0] [0] MPI startup(): Allgatherv: 4: 3133-6762 & 5-2147483647
[0] [0] MPI startup(): Allgatherv: 2: 6762-10802 & 5-2147483647
[0] [0] MPI startup(): Allgatherv: 4: 10802-49917 & 5-2147483647
[0] [0] MPI startup(): Allgatherv: 3: 49917-309996 & 5-2147483647
[0] [0] MPI startup(): Allgatherv: 4: 309996-3739157 & 5-2147483647
[0] [0] MPI startup(): Allgatherv: 3: 0-2147483647 & 5-2147483647
[0] [0] MPI startup(): Allreduce: 1: 804-1535 & 0-2
[0] [0] MPI startup(): Allreduce: 1: 2061-17116 & 0-2
[0] [0] MPI startup(): Allreduce: 2: 17116-37171 & 0-2
[0] [0] MPI startup(): Allreduce: 2: 344562-1048576 & 0-2
[0] [0] MPI startup(): Allreduce: 3: 37171-344562 & 0-2
[0] [0] MPI startup(): Allreduce: 7: 0-804 & 0-2
[0] [0] MPI startup(): Allreduce: 7: 1535-2061 & 0-2
[0] [0] MPI startup(): Allreduce: 7: 1048576-3026207 & 0-2
[0] [0] MPI startup(): Allreduce: 4: 3026207-8388608 & 0-2
[0] [0] MPI startup(): Allreduce: 7: 8388609-8635416 & 0-2
[0] [0] MPI startup(): Allreduce: 2: 0-2147483647 & 0-2
[0] [0] MPI startup(): Allreduce: 7: 0-6 & 3-4
[0] [0] MPI startup(): Allreduce: 4: 6-11 & 3-4
[0] [0] MPI startup(): Allreduce: 7: 11-49 & 3-4
[0] [0] MPI startup(): Allreduce: 6: 49-321 & 3-4
[0] [0] MPI startup(): Allreduce: 2: 321-720 & 3-4
[0] [0] MPI startup(): Allreduce: 4: 720-1375 & 3-4
[0] [0] MPI startup(): Allreduce: 1: 1375-173904 & 3-4
[0] [0] MPI startup(): Allreduce: 2: 173904-318383 & 3-4
[0] [0] MPI startup(): Allreduce: 7: 318383-1512039 & 3-4
[0] [0] MPI startup(): Allreduce: 6: 1512039-2561761 & 3-4
[0] [0] MPI startup(): Allreduce: 4: 2561762-8388608 & 3-4
[0] [0] MPI startup(): Allreduce: 7: 8388609-10618873 & 3-4
[0] [0] MPI startup(): Allreduce: 8: 0-2147483647 & 3-4
[0] [0] MPI startup(): Allreduce: 1: 0-11 & 5-8
[0] [0] MPI startup(): Allreduce: 4: 11-24 & 5-8
[0] [0] MPI startup(): Allreduce: 6: 24-42 & 5-8
[0] [0] MPI startup(): Allreduce: 1: 42-107 & 5-8
[0] [0] MPI startup(): Allreduce: 4: 107-178 & 5-8
[0] [0] MPI startup(): Allreduce: 1: 178-310 & 5-8
[0] [0] MPI startup(): Allreduce: 2: 310-594 & 5-8
[0] [0] MPI startup(): Allreduce: 5: 594-4431 & 5-8
[0] [0] MPI startup(): Allreduce: 1: 4431-54874 & 5-8
[0] [0] MPI startup(): Allreduce: 4: 54874-91696 & 5-8
[0] [0] MPI startup(): Allreduce: 6: 91696-175538 & 5-8
[0] [0] MPI startup(): Allreduce: 4: 175538-383770 & 5-8
[0] [0] MPI startup(): Allreduce: 2: 383770-684262 & 5-8
[0] [0] MPI startup(): Allreduce: 3: 0-2147483647 & 5-8
[0] [0] MPI startup(): Allreduce: 1: 0-11 & 9-2147483647
[0] [0] MPI startup(): Allreduce: 4: 11-24 & 9-2147483647
[0] [0] MPI startup(): Allreduce: 6: 24-42 & 9-2147483647
[0] [0] MPI startup(): Allreduce: 1: 42-107 & 9-2147483647
[0] [0] MPI startup(): Allreduce: 4: 107-178 & 9-2147483647
[0] [0] MPI startup(): Allreduce: 1: 178-310 & 9-2147483647
[0] [0] MPI startup(): Allreduce: 2: 310-594 & 9-2147483647
[0] [0] MPI startup(): Allreduce: 5: 594-4431 & 9-2147483647
[0] [0] MPI startup(): Allreduce: 1: 4431-54874 & 9-2147483647
[0] [0] MPI startup(): Allreduce: 4: 54874-91696 & 9-2147483647
[0] [0] MPI startup(): Allreduce: 6: 91696-175538 & 9-2147483647
[0] [0] MPI startup(): Allreduce: 4: 175538-383770 & 9-2147483647
[0] [0] MPI startup(): Allreduce: 2: 383770-32006608 & 9-2147483647
[0] [0] MPI startup(): Allreduce: 3: 0-2147483647 & 9-2147483647
[0] [0] MPI startup(): Alltoall: 3: 0-129493 & 0-2
[0] [0] MPI startup(): Alltoall: 3: 1080889-3453431 & 0-2
[0] [0] MPI startup(): Alltoall: 2: 129493-1080889 & 0-2
[0] [0] MPI startup(): Alltoall: 2: 0-2147483647 & 0-2
[0] [0] MPI startup(): Alltoall: 2: 0-2147483647 & 3-4
[0] [0] MPI startup(): Alltoall: 1: 1-64 & 5-2147483647
[0] [0] MPI startup(): Alltoall: 2: 65-572235 & 5-2147483647
[0] [0] MPI startup(): Alltoall: 4: 572236-1736997 & 5-2147483647
[0] [0] MPI startup(): Alltoall: 3: 0-2147483647 & 5-2147483647
[0] [0] MPI startup(): Alltoallv: 1: 0-2147483647 & 0-2
[0] [0] MPI startup(): Alltoallv: 2: 0-2147483647 & 3-4
[0] [0] MPI startup(): Alltoallv: 2: 0-2147483647 & 5-2147483647
[0] [0] MPI startup(): Alltoallw: 0: 0-2147483647 & 0-2147483647
[0] [0] MPI startup(): Barrier: 1: 0-2147483647 & 0-2
[0] [0] MPI startup(): Barrier: 6: 0-2147483647 & 3-4
[0] [0] MPI startup(): Barrier: 1: 0-2147483647 & 5-2147483647
[0] [0] MPI startup(): Bcast: 7: 0-8 & 0-2
[0] [0] MPI startup(): Bcast: 7: 24-64 & 0-2
[0] [0] MPI startup(): Bcast: 7: 11264-52186 & 0-2
[0] [0] MPI startup(): Bcast: 7: 112045-131072 & 0-2
[0] [0] MPI startup(): Bcast: 7: 1048576-2097152 & 0-2
[0] [0] MPI startup(): Bcast: 1: 8-24 & 0-2
[0] [0] MPI startup(): Bcast: 1: 64-11264 & 0-2
[0] [0] MPI startup(): Bcast: 1: 52186-112045 & 0-2
[0] [0] MPI startup(): Bcast: 1: 131072-1048576 & 0-2
[0] [0] MPI startup(): Bcast: 1: 0-2147483647 & 0-2
[0] [0] MPI startup(): Bcast: 1: 1-1 & 3-4
[0] [0] MPI startup(): Bcast: 5: 2-3 & 3-4
[0] [0] MPI startup(): Bcast: 1: 4-5 & 3-4
[0] [0] MPI startup(): Bcast: 6: 6-11 & 3-4
[0] [0] MPI startup(): Bcast: 5: 12-24 & 3-4
[0] [0] MPI startup(): Bcast: 4: 25-141 & 3-4
[0] [0] MPI startup(): Bcast: 7: 142-370 & 3-4
[0] [0] MPI startup(): Bcast: 3: 371-680 & 3-4
[0] [0] MPI startup(): Bcast: 4: 681-3894 & 3-4
[0] [0] MPI startup(): Bcast: 1: 3895-4494 & 3-4
[0] [0] MPI startup(): Bcast: 7: 4495-14778 & 3-4
[0] [0] MPI startup(): Bcast: 4: 14779-18223 & 3-4
[0] [0] MPI startup(): Bcast: 7: 18224-36738 & 3-4
[0] [0] MPI startup(): Bcast: 3: 0-2147483647 & 3-4
[0] [0] MPI startup(): Bcast: 1: 0-10 & 5-2147483647
[0] [0] MPI startup(): Bcast: 1: 175-16799 & 5-2147483647
[0] [0] MPI startup(): Bcast: 6: 10-32 & 5-2147483647
[0] [0] MPI startup(): Bcast: 6: 32-175 & 5-2147483647
[0] [0] MPI startup(): Bcast: 7: 0-2147483647 & 5-2147483647
[0] [0] MPI startup(): Exscan: 0: 0-2147483647 & 0-2147483647
[0] [0] MPI startup(): Gather: 2: 73643-172031 & 0-2
[0] [0] MPI startup(): Gather: 3: 0-853 & 0-2
[0] [0] MPI startup(): Gather: 3: 54613-73643 & 0-2
[0] [0] MPI startup(): Gather: 3: 262144-524288 & 0-2
[0] [0] MPI startup(): Gather: 1: 853-54613 & 0-2
[0] [0] MPI startup(): Gather: 1: 172031-262144 & 0-2
[0] [0] MPI startup(): Gather: 1: 0-2147483647 & 0-2
[0] [0] MPI startup(): Gather: 2: 34148-129691 & 3-2147483647
[0] [0] MPI startup(): Gather: 2: 503316-2506634 & 3-2147483647
[0] [0] MPI startup(): Gather: 3: 0-34148 & 3-2147483647
[0] [0] MPI startup(): Gather: 3: 129691-503316 & 3-2147483647
[0] [0] MPI startup(): Gather: 3: 0-2147483647 & 3-2147483647
[0] [0] MPI startup(): Gatherv: 1: 0-2147483647 & 0-2
[0] [0] MPI startup(): Gatherv: 1: 0-2147483647 & 3-4
[0] [0] MPI startup(): Gatherv: 1: 0-2147483647 & 5-2147483647
[0] [0] MPI startup(): Reduce_scatter: 4: 0-5 & 0-2
[0] [0] MPI startup(): Reduce_scatter: 1: 5-26 & 0-2
[0] [0] MPI startup(): Reduce_scatter: 3: 26-47 & 0-2
[0] [0] MPI startup(): Reduce_scatter: 5: 47-98 & 0-2
[0] [0] MPI startup(): Reduce_scatter: 3: 98-188 & 0-2
[0] [0] MPI startup(): Reduce_scatter: 5: 188-362 & 0-2
[0] [0] MPI startup(): Reduce_scatter: 2: 362-588 & 0-2
[0] [0] MPI startup(): Reduce_scatter: 1: 588-1951 & 0-2
[0] [0] MPI startup(): Reduce_scatter: 3: 1951-11702 & 0-2
[0] [0] MPI startup(): Reduce_scatter: 1: 11702-23138 & 0-2
[0] [0] MPI startup(): Reduce_scatter: 5: 23138-58229 & 0-2
[0] [0] MPI startup(): Reduce_scatter: 1: 58229-191964 & 0-2
[0] [0] MPI startup(): Reduce_scatter: 2: 191964-2656092 & 0-2
[0] [0] MPI startup(): Reduce_scatter: 5: 0-2147483647 & 0-2
[0] [0] MPI startup(): Reduce_scatter: 4: 0-4 & 3-4
[0] [0] MPI startup(): Reduce_scatter: 5: 4-12 & 3-4
[0] [0] MPI startup(): Reduce_scatter: 3: 12-45 & 3-4
[1] [1] MPI startup(): Recognition=2 Platform(code=512 ippn=0 dev=4) Fabric(intra=6 inter=6 flags=0x0)
[0] [0] MPI startup(): Reduce_scatter: 1: 45-85 & 3-4
[0] [0] MPI startup(): Reduce_scatter: 3: 85-391 & 3-4
[0] [0] MPI startup(): Reduce_scatter: 1: 391-596 & 3-4
[0] [0] MPI startup(): Reduce_scatter: 2: 596-1927 & 3-4
[0] [0] MPI startup(): Reduce_scatter: 5: 1927-2286 & 3-4
[0] [0] MPI startup(): Reduce_scatter: 3: 2286-7442 & 3-4
[0] [0] MPI startup(): Reduce_scatter: 1: 7442-10726 & 3-4
[0] [0] MPI startup(): Reduce_scatter: 3: 10726-45950 & 3-4
[0] [0] MPI startup(): Reduce_scatter: 5: 45950-101084 & 3-4
[0] [0] MPI startup(): Reduce_scatter: 1: 101084-159597 & 3-4
[0] [0] MPI startup(): Reduce_scatter: 3: 159597-423110 & 3-4
[0] [0] MPI startup(): Reduce_scatter: 2: 423110-578734 & 3-4
[0] [0] MPI startup(): Reduce_scatter: 5: 578734-1329975 & 3-4
[0] [0] MPI startup(): Reduce_scatter: 1: 1329975-4146461 & 3-4
[0] [0] MPI startup(): Reduce_scatter: 3: 0-2147483647 & 3-4
[0] [0] MPI startup(): Reduce_scatter: 5: 0-5 & 5-2147483647
[0] [0] MPI startup(): Reduce_scatter: 1: 5-28 & 5-2147483647
[0] [0] MPI startup(): Reduce_scatter: 5: 28-50 & 5-2147483647
[0] [0] MPI startup(): Reduce_scatter: 3: 50-197 & 5-2147483647
[0] [0] MPI startup(): Reduce_scatter: 1: 197-721 & 5-2147483647
[0] [0] MPI startup(): Reduce_scatter: 2: 721-3207 & 5-2147483647
[0] [0] MPI startup(): Reduce_scatter: 1: 3207-5980 & 5-2147483647
[0] [0] MPI startup(): Reduce_scatter: 5: 5980-11416 & 5-2147483647
[0] [0] MPI startup(): Reduce_scatter: 3: 11416-104215 & 5-2147483647
[0] [0] MPI startup(): Reduce_scatter: 5: 104215-277330 & 5-2147483647
[0] [0] MPI startup(): Reduce_scatter: 3: 277330-630522 & 5-2147483647
[0] [0] MPI startup(): Reduce_scatter: 1: 630522-2659184 & 5-2147483647
[0] [0] MPI startup(): Reduce_scatter: 5: 0-2147483647 & 5-2147483647
[0] [0] MPI startup(): Reduce: 4: 4-8 & 0-2
[0] [0] MPI startup(): Reduce: 3: 9-29 & 0-2
[0] [0] MPI startup(): Reduce: 2: 30-37 & 0-2
[0] [0] MPI startup(): Reduce: 3: 38-215 & 0-2
[0] [0] MPI startup(): Reduce: 2: 216-315 & 0-2
[0] [0] MPI startup(): Reduce: 5: 316-775 & 0-2
[0] [0] MPI startup(): Reduce: 2: 776-4045 & 0-2
[0] [0] MPI startup(): Reduce: 4: 4-6 & 3-4
[0] [0] MPI startup(): Reduce: 3: 7-11 & 3-4
[0] [0] MPI startup(): Reduce: 6: 12-16 & 3-4
[0] [0] MPI startup(): Reduce: 4: 17-34 & 3-4
[0] [0] MPI startup(): Reduce: 2: 35-99 & 3-4
[0] [0] MPI startup(): Reduce: 4: 100-230 & 3-4
[0] [0] MPI startup(): Reduce: 6: 231-275 & 3-4
[0] [0] MPI startup(): Reduce: 1: 276-1040 & 3-4
[0] [0] MPI startup(): Reduce: 3: 1041-3895 & 3-4
[0] [0] MPI startup(): Reduce: 6: 3896-4326 & 3-4
[0] [0] MPI startup(): Reduce: 3: 4327-10163 & 3-4
[0] [0] MPI startup(): Reduce: 1: 0-2147483647 & 3-4
[0] [0] MPI startup(): Reduce: 2: 4-26 & 5-2147483647
[0] [0] MPI startup(): Reduce: 4: 27-39 & 5-2147483647
[0] [0] MPI startup(): Reduce: 2: 40-230 & 5-2147483647
[0] [0] MPI startup(): Reduce: 3: 231-257 & 5-2147483647
[0] [0] MPI startup(): Reduce: 2: 258-718 & 5-2147483647
[0] [0] MPI startup(): Reduce: 3: 719-2436 & 5-2147483647
[0] [0] MPI startup(): Reduce: 4: 2437-6344 & 5-2147483647
[0] [0] MPI startup(): Reduce: 1: 0-2147483647 & 5-2147483647
[0] [0] MPI startup(): Scan: 0: 0-2147483647 & 0-2147483647
[0] [0] MPI startup(): Scatter: 1: 0-1 & 0-2
[0] [0] MPI startup(): Scatter: 1: 4-12 & 0-2
[0] [0] MPI startup(): Scatter: 1: 19-2048 & 0-2
[0] [0] MPI startup(): Scatter: 3: 2048-85701 & 0-2
[0] [0] MPI startup(): Scatter: 3: 165767-466939 & 0-2
[0] [0] MPI startup(): Scatter: 3: 524288-2336552 & 0-2
[0] [0] MPI startup(): Scatter: 2: 1-4 & 0-2
[0] [0] MPI startup(): Scatter: 2: 12-19 & 0-2
[0] [0] MPI startup(): Scatter: 2: 85701-165767 & 0-2
[0] [0] MPI startup(): Scatter: 2: 466939-524288 & 0-2
[0] [0] MPI startup(): Scatter: 2: 0-2147483647 & 0-2
[0] [0] MPI startup(): Scatter: 3: 0-1909200 & 3-2147483647
[0] [0] MPI startup(): Scatter: 2: 0-2147483647 & 3-2147483647
[0] [0] MPI startup(): Scatterv: 1: 0-2147483647 & 0-2
[0] [0] MPI startup(): Scatterv: 1: 0-2147483647 & 3-4
[0] [0] MPI startup(): Scatterv: 1: 0-2147483647 & 5-2147483647
[0] [0] MPI startup(): Rank Pid Node name Pin cpu
[0] [0] MPI startup(): 0 221 jfz1r04h18 {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,
[0] 30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56
[0] ,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71}
[0] [0] MPI startup(): 1 193 jfz1r04h19 {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,
[0] 30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56
[0] ,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71}
[0] [0] MPI startup(): Recognition=2 Platform(code=512 ippn=0 dev=4) Fabric(intra=6 inter=6 flags=0x0)
[0] [0] MPI startup(): I_MPI_COLL_INTRANODE=pt2pt
[0] [0] MPI startup(): I_MPI_DEBUG=6
[0] [0] MPI startup(): I_MPI_FABRICS=tcp
[0] [0] MPI startup(): I_MPI_FALLBACK=0
[0] [0] MPI startup(): I_MPI_INFO_NUMA_NODE_MAP=hfi1_0:0,i40iw0:0,i40iw1:0
[0] [0] MPI startup(): I_MPI_INFO_NUMA_NODE_NUM=2
[0] [0] MPI startup(): I_MPI_PIN_MAPPING=1:0 0
[0] [0] MPI startup(): I_MPI_TCP_NETMASK=enp134s0f0
[0] [0] ckpt_restart(): The real interface being used for tcp is enp134s0f0 and interface hostname is jfz1r04h18
@zhang-xin Regarding the ssh issue, did you modify the file ~/.ssh/config as follows?
Host *
Port 10010
I mean that ~/.ssh/config should contain just those two lines and needs no further changes. If not, could you try again without commenting out the test_ssh_config function?
@chuanqi129 Thanks, it turns out to be an MLSL issue: MLSL doesn't work well with an IP address in ~/.ssh/config, while using only the hostname works.
This looks like an MLSL issue rather than an Intel Caffe one, so I'll close this issue. Thanks for your help!
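For anyone hitting the same hang, a quick way to spot IP literals in an ssh config is sketched below. This is only an illustration: the thread does not say which directive carried the IP address, so the sketch assumes it appears as an IPv4 literal on a `Host` or `HostName` line. The function name `find_ip_hosts` is hypothetical and is not part of Intel Caffe or MLSL.

```shell
# Hypothetical helper (not part of Intel Caffe or MLSL): print every
# Host/HostName entry in an ssh config file that is an IPv4 literal.
# Assumption: whitespace-separated "Keyword value" lines; the "Keyword=value"
# form and IPv6 addresses are not handled.
find_ip_hosts() {
  # $1: path to the ssh config file to inspect
  awk '{
         kw = tolower($1)
         if (kw == "host" || kw == "hostname")
           for (i = 2; i <= NF; i++)
             if ($i ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/)
               print NR ": " $1 " " $i   # line number, keyword, offending value
       }' "$1"
}
```

Running `find_ip_hosts ~/.ssh/config` prints nothing when every entry uses a hostname or a wildcard. Any line it does print is a candidate to replace with the node's hostname (for example jfz1r04h18 or jfz1r04h19 from the log above).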
Hi, I tried to run multinode training on multiple machines in docker following this wiki. On the master container, the run_intelcaffe.sh script gets stuck after the initial output. On the other machines, each docker client container has a single caffe process at 100% CPU usage and nothing else. Training hangs there: no error is reported and there is no further output on the master. With only 1 client, training works fine.
Following the wiki, I was using the latest bvlc/caffe:intel_multinode image, which is Intel Caffe 1.1.0. bvlc/caffe:intel, which is 1.1.1a, doesn't work either. Passwordless ssh access is set up and there is no firewall.
Below is all the output on the master container. The hosts file contains 2 clients. The script then got stuck with no error and no more output, while every client had a caffe process running at 100% CPU.
Could you help point out what may be wrong, or how I could debug this? Thanks in advance.