sustainable-computing-io / kepler-model-server

Model Server for Kepler
Apache License 2.0

Model training fails for Kepler release-0.7.11 #324

Closed by vprashar2929 3 months ago

vprashar2929 commented 3 months ago

What happened?

Training the model locally while Kepler release-0.7.11 is deployed fails with the trace below:

valid feature group:  [<FeatureGroup.CounterOnly: 3>, <FeatureGroup.BPFOnly: 5>, <FeatureGroup.BPFIRQ: 9>, <FeatureGroup.CounterIRQCombined: 7>]
/home/fserver/repos/kepler-model-server-vp/src/train/extractor/extractor.py:234: RuntimeWarning: Mean of empty slice.
  time_diff_values = df.reset_index()[[TIMESTAMP_COL]].diff().dropna().values.mean()
/home/fserver/.local/share/hatch/env/virtual/kepler-model-server/_v8KwWUN/kepler-model-server/lib/python3.10/site-packages/numpy/core/_methods.py:189: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/home/fserver/repos/kepler-model-server-vp/src/train/extractor/extractor.py:234: RuntimeWarning: Mean of empty slice.
  time_diff_values = df.reset_index()[[TIMESTAMP_COL]].diff().dropna().values.mean()
/home/fserver/.local/share/hatch/env/virtual/kepler-model-server/_v8KwWUN/kepler-model-server/lib/python3.10/site-packages/numpy/core/_methods.py:189: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/home/fserver/repos/kepler-model-server-vp/src/train/extractor/extractor.py:234: RuntimeWarning: Mean of empty slice.
  time_diff_values = df.reset_index()[[TIMESTAMP_COL]].diff().dropna().values.mean()
/home/fserver/.local/share/hatch/env/virtual/kepler-model-server/_v8KwWUN/kepler-model-server/lib/python3.10/site-packages/numpy/core/_methods.py:189: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/home/fserver/repos/kepler-model-server-vp/src/train/extractor/extractor.py:234: RuntimeWarning: Mean of empty slice.
  time_diff_values = df.reset_index()[[TIMESTAMP_COL]].diff().dropna().values.mean()
/home/fserver/.local/share/hatch/env/virtual/kepler-model-server/_v8KwWUN/kepler-model-server/lib/python3.10/site-packages/numpy/core/_methods.py:189: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
test-pipe pipeline: CounterOnly extraction done.
/home/fserver/repos/kepler-model-server-vp/src/train/extractor/extractor.py:234: RuntimeWarning: Mean of empty slice.
  time_diff_values = df.reset_index()[[TIMESTAMP_COL]].diff().dropna().values.mean()
/home/fserver/.local/share/hatch/env/virtual/kepler-model-server/_v8KwWUN/kepler-model-server/lib/python3.10/site-packages/numpy/core/_methods.py:189: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/home/fserver/repos/kepler-model-server-vp/src/train/extractor/extractor.py:234: RuntimeWarning: Mean of empty slice.
  time_diff_values = df.reset_index()[[TIMESTAMP_COL]].diff().dropna().values.mean()
/home/fserver/.local/share/hatch/env/virtual/kepler-model-server/_v8KwWUN/kepler-model-server/lib/python3.10/site-packages/numpy/core/_methods.py:189: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/home/fserver/repos/kepler-model-server-vp/src/train/extractor/extractor.py:234: RuntimeWarning: Mean of empty slice.
  time_diff_values = df.reset_index()[[TIMESTAMP_COL]].diff().dropna().values.mean()
/home/fserver/.local/share/hatch/env/virtual/kepler-model-server/_v8KwWUN/kepler-model-server/lib/python3.10/site-packages/numpy/core/_methods.py:189: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/home/fserver/repos/kepler-model-server-vp/src/train/extractor/extractor.py:234: RuntimeWarning: Mean of empty slice.
  time_diff_values = df.reset_index()[[TIMESTAMP_COL]].diff().dropna().values.mean()
/home/fserver/.local/share/hatch/env/virtual/kepler-model-server/_v8KwWUN/kepler-model-server/lib/python3.10/site-packages/numpy/core/_methods.py:189: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
test-pipe pipeline: CounterOnly isolation done.
Traceback (most recent call last):
  File "/home/fserver/repos/kepler-model-server-vp/model_training/../cmd/main.py", line 924, in <module>
    getattr(sys.modules[__name__], args.command)(args)
  File "/home/fserver/repos/kepler-model-server-vp/model_training/../cmd/main.py", line 417, in train
    assert success, "failed to process pipeline {}".format(pipeline.name)
AssertionError: failed to process pipeline test-pipe
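
The repeated `Mean of empty slice` warning suggests that the frame reaching line 234 of extractor.py has fewer than two usable rows, so `.diff().dropna()` leaves nothing to average and the mean becomes NaN. The following is a plain-Python analogue of that computation with a guard for the empty case. It is only a sketch: the function name `mean_time_diff` is hypothetical and is not part of kepler-model-server.

```python
from statistics import fmean
from typing import Optional, Sequence


def mean_time_diff(timestamps: Sequence[float]) -> Optional[float]:
    """Average gap between consecutive timestamps, or None if undefined."""
    # Consecutive differences: the plain-Python counterpart of .diff().dropna()
    diffs = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    if not diffs:
        # Zero or one sample: there is nothing to average
        return None
    return fmean(diffs)


print(mean_time_diff([0.0, 3.0, 6.0]))  # 3.0
print(mean_time_diff([]))               # None
```

Returning `None` (or raising a descriptive error) for an empty input makes the missing data visible at the point where it first matters, instead of letting a NaN travel further down the pipeline.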

Command used to run train:

PIPELINE_NAME=test-pipe 
COLLECT_ID=fserver

DATAPATH=$(pwd)/data MODEL_PATH=$(pwd)/data python ../cmd/main.py train --pipeline-name $PIPELINE_NAME --input kepler_query --id $COLLECT_ID

The same steps work fine with Kepler release-0.7.7.

What did you expect to happen?

Model training should work regardless of the Kepler version, unless a specific change between 0.7.7 and 0.7.11 breaks it.

How can we reproduce it (as minimally and precisely as possible)?

  1. Set up and deploy Kepler locally using script.sh
  2. Collect the metrics
  3. Run the training locally

Anything else we need to know?

No response

Kepler image tag

release-0.7.11

Kepler deployment config


  BIND_ADDRESS: 0.0.0.0:9102
  CGROUP_METRICS: '*'
  CPU_ARCH_OVERRIDE: ""
  ENABLE_EBPF_CGROUPID: "true"
  ENABLE_GPU: "true"
  ENABLE_PROCESS_METRICS: "false"
  EXPOSE_CGROUP_METRICS: "true"
  EXPOSE_HW_COUNTER_METRICS: "true"
  EXPOSE_IRQ_COUNTER_METRICS: "true"
  KEPLER_LOG_LEVEL: "1"
  KEPLER_NAMESPACE: kepler
  MAX_LOOKUP_RETRY: "1000"
  METRIC_PATH: /metrics
  MODEL_CONFIG: |
    CONTAINER_COMPONENTS_ESTIMATOR=false
  REDFISH_PROBE_INTERVAL_IN_SECONDS: "60"
  REDFISH_SKIP_SSL_VERIFY: "true"

sunya-ch commented 3 months ago

Found the cause. The latest Kepler reports the power source shown in the screenshot below. It is supposed to be intel_rapl or acpi.

[Screenshot 2024-07-25 at 9 46 03]

sunya-ch commented 3 months ago

@sthaha also shared that in his case the source is "rapl-sysfs", which is not yet supported on the model-server side. I will check the latest source labels and update train_types.
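
A mismatch like this could be caught early by comparing the power source labels found in the collected data against the ones the trainer knows. The following is a minimal sketch, assuming the supported set is exactly the two names mentioned in this thread. `SUPPORTED_SOURCES` and `unsupported_sources` are hypothetical names for illustration, not the model server's actual `train_types` API.

```python
from typing import Iterable, List

# Assumption: only the two sources named in this thread are supported.
SUPPORTED_SOURCES = frozenset({"intel_rapl", "acpi"})


def unsupported_sources(labels: Iterable[str]) -> List[str]:
    """Return the distinct source labels that are not supported, sorted."""
    return sorted(set(labels) - SUPPORTED_SOURCES)


# Example labels; "rapl-sysfs" is the one reported in this issue.
found = unsupported_sources(["intel_rapl", "rapl-sysfs", "rapl-sysfs"])
if found:
    print(f"unsupported power source labels: {found}")
```

An explicit message of this kind would point at the unexpected label rather than at an empty slice further down the pipeline.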

vprashar2929 commented 3 months ago

Closing as #325 is now merged.