mlcommons / power-dev

Dev repo for power measurement for the MLPerf™ benchmarks
https://mlcommons.org/en/groups/best-practices-power
Apache License 2.0
16 stars 24 forks source link

ptd is asked to stop measurement twice #287

Open psyhtest opened 1 year ago

psyhtest commented 1 year ago

I'm running into an error with both the server and the client on 1a16663 (i.e. the very last change before updating the supported PTD versions).

Server

After the ranging run, everything is fine - we change the Amp & Volt ranges accordingly.
ptd-server 2023-02-10 06:31:52,246 [INFO] Got command from the client 'session,2023-02-10_06-18-14,start,testing'
ptd-server 2023-02-10 06:31:52,246 [INFO] Sending to ptd: 'SR,V,120.190000'
02-10-2023 12:31:52.350: Completed test
02-10-2023 12:31:52.350: Avg watts 583.590709, min watts 172.960000, max watts 713.140000, samples 790, errors 0, valid 790
02-10-2023 12:31:52.350: Volt range set to 150.000000V
02-10-2023 12:31:53.851: Response to client sent: Range V changed
ptd-server 2023-02-10 06:31:53,852 [INFO] Reply from ptd: 'Range V changed'
ptd-server 2023-02-10 06:31:53,852 [INFO] Sending to ptd: 'SR,A,6.065300'
02-10-2023 12:31:53.852: Amp range set to 10.000000A
02-10-2023 12:31:55.354: Response to client sent: Range A changed
ptd-server 2023-02-10 06:31:55,354 [INFO] Reply from ptd: 'Range A changed'
Then we start testing.
ptd-server 2023-02-10 06:32:05,365 [INFO] Starting testing mode
ptd-server 2023-02-10 06:32:05,365 [INFO] Sending to ptd: 'Go,1000,0,2023-02-10_06-18-14_testing'
02-10-2023 12:32:05.365: Go with mark '2023-02-10_06-18-14_testing'
02-10-2023 12:32:05.365: Response to client sent: Starting untimed measurement, maximum 500000 samples at 1000ms with 0 rampup samples
ptd-server 2023-02-10 06:32:05,365 [INFO] Reply from ptd: 'Starting untimed measurement, maximum 500000 samples at 1000ms with 0 rampup samples'
ptd-server 2023-02-10 06:32:05,366 [INFO] Sending reply to client 'OK'

After 13 minutes, a 'Stop' command is sent to PTD.
ptd-server 2023-02-10 06:45:05,436 [INFO] Got command from the client 'session,2023-02-10_06-18-14,stop,testing'
ptd-server 2023-02-10 06:45:15,446 [INFO] Sending to ptd: 'Stop'
02-10-2023 12:45:15.447: Response to client sent: Stopping untimed measurement
ptd-server 2023-02-10 06:45:15,447 [INFO] Reply from ptd: 'Stopping untimed measurement'
ptd-server 2023-02-10 06:45:15,488 [INFO] Sending reply to client 'OK'
ptd-server 2023-02-10 06:45:15,493 [INFO] Got command from the client 'session,2023-02-10_06-18-14,done'
And then another 'Stop' command is sent to PTD, whose response 'Error: no measurement to stop' is sent to the client.
ptd-server 2023-02-10 06:45:15,493 [INFO] Sending to ptd: 'Stop'
02-10-2023 12:45:16.365: Completed test
02-10-2023 12:45:16.365: Avg watts 595.281669, min watts 188.000000, max watts 714.600000, samples 791, errors 0, valid 791
02-10-2023 12:45:16.365: Response to client sent: Error: no measurement to stop
ptd-server 2023-02-10 06:45:16,366 [INFO] Reply from ptd: 'Error: no measurement to stop'
I believe it is the above error that the client receives, while it expects an integer (see below).
We set the Volt & Amp ranges back to Auto.
ptd-server 2023-02-10 06:45:16,366 [INFO] Sending to ptd: 'SR,V,Auto'
02-10-2023 12:45:16.366: Volt range set to Auto
02-10-2023 12:45:16.869: Response to client sent: Range V changed
ptd-server 2023-02-10 06:45:16,869 [INFO] Reply from ptd: 'Range V changed'
ptd-server 2023-02-10 06:45:16,869 [INFO] Sending to ptd: 'SR,A,Auto'
02-10-2023 12:45:16.870: Ampere range set to Auto
02-10-2023 12:45:17.372: Response to client sent: Range A changed
ptd-server 2023-02-10 06:45:17,372 [INFO] Reply from ptd: 'Range A changed'
ptd-server 2023-02-10 06:45:17,373 [INFO] Set initial values for Amps Auto and Volts Auto
Why stopping PTD again and shutting connection down?
ptd-server 2023-02-10 06:45:17,373 [INFO] Stopping ptd...
02-10-2023 12:45:17.373: No data returned by socket read.
02-10-2023 12:45:17.373: Shutting connection down...
02-10-2023 12:45:17.373: Connection is down.
02-10-2023 12:45:17.373: Waiting for a connection...

Client

ValueError: invalid literal for int() with base 10: 'Error'
client 2023-02-10 06:45:15,488 [INFO] Sending command to the server: 'session,2023-02-10_06-18-14,done'
client 2023-02-10 06:45:17,375 [INFO] Got response: 'OK'
client 2023-02-10 06:45:17,375 [INFO] Fetching file '/home/krai/CK_REPOS/ck-mlperf/program/mlperf-power-client/tmp/2023-02-10_06-18-14/power/ptd_logs.txt'
client 2023-02-10 06:45:17,379 [INFO] Received '/home/krai/CK_REPOS/ck-mlperf/program/mlperf-power-client/tmp/2023-02-10_06-18-14/power/ptd_logs.txt'
client 2023-02-10 06:45:17,379 [INFO] Fetching file '/home/krai/CK_REPOS/ck-mlperf/program/mlperf-power-client/tmp/2023-02-10_06-18-14/ranging/spl.txt'
client 2023-02-10 06:45:17,381 [INFO] Received '/home/krai/CK_REPOS/ck-mlperf/program/mlperf-power-client/tmp/2023-02-10_06-18-14/ranging/spl.txt'
client 2023-02-10 06:45:17,381 [INFO] Fetching file '/home/krai/CK_REPOS/ck-mlperf/program/mlperf-power-client/tmp/2023-02-10_06-18-14/power/server.json'
client 2023-02-10 06:45:17,382 [INFO] Received '/home/krai/CK_REPOS/ck-mlperf/program/mlperf-power-client/tmp/2023-02-10_06-18-14/power/server.json'
client 2023-02-10 06:45:17,382 [INFO] Fetching file '/home/krai/CK_REPOS/ck-mlperf/program/mlperf-power-client/tmp/2023-02-10_06-18-14/power/server.log'
client 2023-02-10 06:45:17,384 [INFO] Received '/home/krai/CK_REPOS/ck-mlperf/program/mlperf-power-client/tmp/2023-02-10_06-18-14/power/server.log'
client 2023-02-10 06:45:17,384 [INFO] Fetching file '/home/krai/CK_REPOS/ck-mlperf/program/mlperf-power-client/tmp/2023-02-10_06-18-14/run_1/spl.txt'
client 2023-02-10 06:45:17,386 [INFO] Received '/home/krai/CK_REPOS/ck-mlperf/program/mlperf-power-client/tmp/2023-02-10_06-18-14/run_1/spl.txt'
client 2023-02-10 06:45:17,386 [INFO] Fetching file '/home/krai/CK_REPOS/ck-mlperf/program/mlperf-power-client/tmp/2023-02-10_06-18-14/run_1/ptd_out.txt'
Traceback (most recent call last):
  File "/home/krai/CK_TOOLS/mlperf-power-r3.0/power/ptd_client_server/client.py", line 27, in 
    client.main()
  File "/home/krai/CK_TOOLS/mlperf-power-r3.0/power/ptd_client_server/../ptd_client_server/lib/client.py", line 357, in main
    command.download(f"download,{session},{fname}", os.path.join(out_dir, fname))
  File "/home/krai/CK_TOOLS/mlperf-power-r3.0/power/ptd_client_server/../ptd_client_server/lib/client.py", line 63, in download
    self._server.recv_file(fname)
  File "/home/krai/CK_TOOLS/mlperf-power-r3.0/power/ptd_client_server/../ptd_client_server/lib/common.py", line 110, in recv_file
    chunk_len = int(line, 10)
ValueError: invalid literal for int() with base 10: 'Error'
psyhtest commented 1 year ago

I see similarly spurious requests to stop PTD when starting a new server:

Redirecting data output to file /tmp/tmp8c2gsf32/ptd_logfile.txt
Calculated PTD CRC: 0xed9a21d2, 7188608
02-10-2023 15:21:36.001: Attempting to connect to measurement device type 52...
02-10-2023 15:21:36.603: Analyzer identity response of 34 bytes: YOKOGAWA,WT333,C2SF14011V,F1.01EM

02-10-2023 15:21:52.917: Uncertainty checking for YokogawaWT330 is activated
02-10-2023 15:21:52.917: Connected to YokogawaWT330 successfully
02-10-2023 15:21:52.917: Establishing the listener on port 8888...
02-10-2023 15:21:52.917: Waiting for a connection...
ptd-server 2023-02-10 09:21:52,925 [INFO] Sending to ptd: 'Hello'
02-10-2023 15:21:52.925: Accepted connection from 127.0.0.1:48274
02-10-2023 15:21:52.926: Response to client sent: Hello, PTDaemon here!
ptd-server 2023-02-10 09:21:52,926 [INFO] Reply from ptd: 'Hello, PTDaemon here!'
ptd-server 2023-02-10 09:21:52,926 [INFO] Sending to ptd: 'Identify'
ptd-server 2023-02-10 09:21:52,927 [INFO] Reply from ptd: 'YokogawaWT330,1000,1,1,1,1,0,0,1,version=1.10.0-ed9a21d2-20220817,OS=Linux 5.4.0-120-generic #136~18.04.1-Ubuntu SMP Fri Jun 10 18:00:44 UTC 2022 x86_64,mode=power,1,1,1,0,0,no_cal_date,YOKOGAWA;WT333;C2SF14011V;F1.01EM'
ptd-server 2023-02-10 09:21:52,927 [INFO] Connected to PTDaemon
ptd-server 2023-02-10 09:21:52,927 [INFO] Sending to ptd: 'RR'
02-10-2023 15:21:52.927: Response to client sent: YokogawaWT330,1000,1,1,1,1,0,0,1,version=1.10.0-ed9a21d2-20220817,OS=Linux 5.4.0-120-generic #136~18.04.1-Ubuntu SMP Fri Jun 10 18:00:44 UTC 2022 x86_64,mode=power,1,1,1,0,0,no_cal_date,YOKOGAWA;WT333;C2SF14011V;F1.01EM
02-10-2023 15:21:52.928: Response to client sent: Ranges,0,5.000000,0,150.000000
ptd-server 2023-02-10 09:21:52,928 [INFO] Reply from ptd: 'Ranges,0,5.000000,0,150.000000'
ptd-server 2023-02-10 09:21:52,928 [INFO] Initial range for Amps is 5.000000 for Volts is 150.000000
ptd-server 2023-02-10 09:21:52,928 [INFO] Sending to ptd: 'Stop'
02-10-2023 15:21:52.928: Response to client sent: Error: no measurement to stop
ptd-server 2023-02-10 09:21:52,929 [INFO] Reply from ptd: 'Error: no measurement to stop'
ptd-server 2023-02-10 09:21:52,929 [INFO] Sending to ptd: 'SR,V,150.000000'
02-10-2023 15:21:52.929: Volt range set to 150.000000V
02-10-2023 15:21:54.431: Response to client sent: Range V changed
ptd-server 2023-02-10 09:21:54,431 [INFO] Reply from ptd: 'Range V changed'
ptd-server 2023-02-10 09:21:54,431 [INFO] Sending to ptd: 'SR,A,5.000000'
02-10-2023 15:21:54.432: Amp range set to 5.000000A
02-10-2023 15:21:55.933: Response to client sent: Range A changed
ptd-server 2023-02-10 09:21:55,934 [INFO] Reply from ptd: 'Range A changed'
ptd-server 2023-02-10 09:21:55,934 [INFO] Set initial values for Amps 5.000000 and Volts 150.000000
ptd-server 2023-02-10 09:21:55,934 [INFO] Stopping ptd...
02-10-2023 15:21:55.934: No data returned by socket read.
02-10-2023 15:21:55.934: Shutting connection down...
02-10-2023 15:21:55.934: Connection is down.
02-10-2023 15:21:55.934: Waiting for a connection...
ptd-server 2023-02-10 09:21:55,936 [INFO] Ready to accept connections at 10.222.146.209:4915
psyhtest commented 1 year ago

I see the same with the latest unmerged PR from OctoML:

02-10-2023 15:27:12.558: Uncertainty checking for YokogawaWT330 is activated
02-10-2023 15:27:12.558: Connected to YokogawaWT330 successfully
02-10-2023 15:27:12.558: Establishing the listener on port 8888...
02-10-2023 15:27:12.558: Waiting for a connection...
ptd-server 2023-02-10 09:27:12,562 [INFO] Sending to ptd: 'Hello'
02-10-2023 15:27:12.562: Accepted connection from 127.0.0.1:50180
02-10-2023 15:27:12.563: Response to client sent: Hello, PTDaemon here!
ptd-server 2023-02-10 09:27:12,563 [INFO] Reply from ptd: 'Hello, PTDaemon here!'
ptd-server 2023-02-10 09:27:12,564 [INFO] Sending to ptd: 'Identify'
02-10-2023 15:27:12.564: Response to client sent: YokogawaWT330,1000,1,1,1,1,0,0,1,version=1.10.0-ed9a21d2-20220817,OS=Linux 5.4.0-120-generic #136~18.04.1-Ubuntu SMP Fri Jun 10 18:00:44 UTC 2022 x86_64,mode=power,1,1,1,0,0,no_cal_date,YOKOGAWA;WT333;C2SF14011V;F1.01EM
ptd-server 2023-02-10 09:27:12,564 [INFO] Reply from ptd: 'YokogawaWT330,1000,1,1,1,1,0,0,1,version=1.10.0-ed9a21d2-20220817,OS=Linux 5.4.0-120-generic #136~18.04.1-Ubuntu SMP Fri Jun 10 18:00:44 UTC 2022 x86_64,mode=power,1,1,1,0,0,no_cal_date,YOKOGAWA;WT333;C2SF14011V;F1.01EM'
ptd-server 2023-02-10 09:27:12,565 [INFO] Connected to PTDaemon
ptd-server 2023-02-10 09:27:12,565 [INFO] Sending to ptd: 'RR'
02-10-2023 15:27:12.565: Response to client sent: Ranges,0,5.000000,0,150.000000
ptd-server 2023-02-10 09:27:12,565 [INFO] Reply from ptd: 'Ranges,0,5.000000,0,150.000000'
ptd-server 2023-02-10 09:27:12,566 [INFO] Initial range for Amps is 5.000000 for Volts is 150.000000
ptd-server 2023-02-10 09:27:12,566 [INFO] Sending to ptd: 'Stop'
02-10-2023 15:27:12.566: Response to client sent: Error: no measurement to stop
ptd-server 2023-02-10 09:27:12,566 [INFO] Reply from ptd: 'Error: no measurement to stop'
ptd-server 2023-02-10 09:27:12,567 [INFO] Sending to ptd: 'SR,V,150.000000'
02-10-2023 15:27:12.567: Volt range set to 150.000000V
02-10-2023 15:27:14.068: Response to client sent: Range V changed
ptd-server 2023-02-10 09:27:14,069 [INFO] Reply from ptd: 'Range V changed'
ptd-server 2023-02-10 09:27:14,069 [INFO] Sending to ptd: 'SR,A,5.000000'
02-10-2023 15:27:14.070: Amp range set to 5.000000A
02-10-2023 15:27:15.571: Response to client sent: Range A changed
ptd-server 2023-02-10 09:27:15,571 [INFO] Reply from ptd: 'Range A changed'
ptd-server 2023-02-10 09:27:15,572 [INFO] Set initial values for Amps 5.000000 and Volts 150.000000
ptd-server 2023-02-10 09:27:15,572 [INFO] Stopping ptd...
02-10-2023 15:27:15.572: No data returned by socket read.
02-10-2023 15:27:15.572: Shutting connection down...
02-10-2023 15:27:15.572: Connection is down.
02-10-2023 15:27:15.572: Waiting for a connection...
ptd-server 2023-02-10 09:27:15,574 [INFO] Ready to accept connections at 10.222.146.209:4915
arjunsuresh commented 1 year ago

@psyhtest with https://github.com/mlcommons/power-dev/pull/274 PR, an extra connection to PTD is done when we start the server. So the second response is expected. If this PR is the problem for the first issue, please feel free to revert it.

psyhtest commented 1 year ago

@arjunsuresh It's a good hint, but I'm afraid we have no time to debug which PR causes the workflow to fail.

arjunsuresh commented 1 year ago

I can do it, but I need to get to the power analyzer :)

arjunsuresh commented 1 year ago

@psyhtest Can you please confirm the commit hash on both the server and SUT machines for the power-dev repository where this error is obtained?

arjunsuresh commented 1 year ago

I'm getting the same error when I revert the server to 2.1 stage and client to the latest. This is expected as ptd_out.txt is added as a required file in a later commit and if the client requests this and the server is not aware, it'll fail.

client 2023-02-10 22:14:31,375 [INFO] Fetching file '/home/arjun/mypower_results/test_results/arjun_orin-reference-cpu-onnxruntime-v1.13.1-default_config/resnet50/offline/performance/tmp_power/run_1/ptd_out.txt'
Traceback (most recent call last):
  File "/media/arjun/T7/local/cache/89eac83e5c294242/power-dev/ptd_client_server/client.py", line 27, in <module>
    client.main()
  File "/media/arjun/T7/local/cache/89eac83e5c294242/power-dev/ptd_client_server/../ptd_client_server/lib/client.py", line 357, in main
    command.download(f"download,{session},{fname}", os.path.join(out_dir, fname))
  File "/media/arjun/T7/local/cache/89eac83e5c294242/power-dev/ptd_client_server/../ptd_client_server/lib/client.py", line 63, in download
    self._server.recv_file(fname)
  File "/media/arjun/T7/local/cache/89eac83e5c294242/power-dev/ptd_client_server/../ptd_client_server/lib/common.py", line 110, in recv_file
    chunk_len = int(line, 10)
ValueError: invalid literal for int() with base 10: 'Error'

Once I update the server repository to latest commit there are no errors.

client 2023-02-10 22:18:33,925 [INFO] Fetching file '/home/arjun/mypower_results/test_results/arjun_orin-reference-cpu-onnxruntime-v1.13.1-default_config/resnet50/offline/performance/tmp_power/run_1/ptd_out.txt'
client 2023-02-10 22:18:33,988 [INFO] Received '/home/arjun/mypower_results/test_results/arjun_orin-reference-cpu-onnxruntime-v1.13.1-default_config/resnet50/offline/performance/tmp_power/run_1/ptd_out.txt'
client 2023-02-10 22:18:33,988 [INFO] Sending command to the server: 'cleanup,2023-02-10_22-16-53'
client 2023-02-10 22:18:33,996 [INFO] Got response: 'OK'
client 2023-02-10 22:18:33,996 [INFO] Successful exit
arjunsuresh commented 1 year ago

@psyhtest @araghun

This is with the r3.0 branch which is same as r2.1 branch for the server/client codes. I'm getting the same error of double "stop" but it is not affecting the power measurements. So, this issue has nothing to do with any of the reverted PRs.

In fact the same error can be seen on Krai 2.1 submission logs like this

ptd-server 2023-02-11 02:23:36,080 [INFO] Got command from the client 'session,2023-02-11_01-59-08,stop,testing'

ptd-server 2023-02-11 02:23:46,090 [INFO] Sending to ptd: 'Stop'

ptd-server 2023-02-11 02:23:46,090 [INFO] Reply from ptd: 'Stopping untimed measurement'
ptd-server 2023-02-11 02:23:46,097 [INFO] Sending reply to client 'OK'
ptd-server 2023-02-11 02:23:46,111 [INFO] Got command from the client 'session,2023-02-11_01-59-08,done'

ptd-server 2023-02-11 02:23:46,111 [INFO] Sending to ptd: 'Stop'

02-11-2023 02:23:47.038: Avg watts 40.403611, min watts 15.040000, max watts 42.660000, samples 720, errors 0, valid 720

02-11-2023 02:23:47.038: Response to client sent: Error: no measurement to stop ptd-server 2023-02-11 02:23:47,038 [INFO] Reply from ptd: 'Error: no measurement to stop'

ptd-server 2023-02-11 02:23:47,038 [INFO] Sending to ptd: 'SR,V,600.000000'
02-11-2023 02:23:47.038: Volt range set to 600.000000V
02-11-2023 02:23:48.539: Response to client sent: Range V changed
ptd-server 2023-02-11 02:23:48,539 [INFO] Reply from ptd: 'Range V changed'
ptd-server 2023-02-11 02:23:48,539 [INFO] Sending to ptd: 'SR,A,0.100000'
02-11-2023 02:23:48.540: Amp range set to 0.100000A
ptd-server 2023-02-11 02:23:50,040 [INFO] Reply from ptd: 'Range A changed'
02-11-2023 02:23:50.040: Response to client sent: Range A changed
ptd-server 2023-02-11 02:23:50,041 [INFO] Set initial values for Amps 0.100000 and Volts 600.000000
ptd-server 2023-02-11 02:23:50,041 [INFO] Stopping ptd...
02-11-2023 02:23:50.041: No data returned by socket read.
02-11-2023 02:23:50.041: Shutting connection down...
02-11-2023 02:23:50.041: Connection is down.
arjunsuresh commented 1 year ago

@psyhtest The response to client sent message is also there in old logs- it is in ptd_logs.txt and not server.log.

07-18-2022 23:21:54.910: Response to client sent: Error: no measurement to stop

https://github.com/mlcommons/inference_results_v2.1/blob/master/closed/Krai/results/firefly-armnn-v22.05-neon/resnet50/offline/performance/power/ptd_logs.txt#L1364

arjunsuresh commented 1 year ago

Having said that this issue is certainly a bug and needs to be resolved.

arjunsuresh commented 1 year ago

@psyhtest This issue can be closed right?