ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[release][CI] air_benchmark_xgboost_cpu_10 failure #28974

Closed: rickyyx closed this issue 2 years ago

rickyyx commented 2 years ago

What happened + What you expected to happen

Build failure (Cluster):

run_xgboost_prediction takes 531.6400554740001 seconds.
Results: {'training_time': 793.1882077000001, 'prediction_time': 531.6400554740001}
Traceback (most recent call last):
  File "workloads/xgboost_benchmark.py", line 153, in <module>
    main(args)
  File "workloads/xgboost_benchmark.py", line 134, in main
    f"Batch prediction on XGBoost is taking {prediction_time} seconds, "
RuntimeError: Batch prediction on XGBoost is taking 531.6400554740001 seconds, which is longer than expected (450 seconds).
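
For reference, the check that raises this error lives in workloads/xgboost_benchmark.py; the snippet below is only a minimal sketch of its shape, assuming a timing guard around the prediction run. The 450-second threshold and the message wording come from the traceback above, while check_prediction_time and timed_fn are hypothetical names used for illustration.

import time


def check_prediction_time(timed_fn, threshold_s: float = 450.0) -> float:
    # Time the prediction step and fail the release test if it exceeds the budget.
    start = time.perf_counter()
    timed_fn()  # e.g. the benchmark's run_xgboost_prediction call
    prediction_time = time.perf_counter() - start
    print(f"run_xgboost_prediction takes {prediction_time} seconds.")
    if prediction_time > threshold_s:
        raise RuntimeError(
            f"Batch prediction on XGBoost is taking {prediction_time} seconds, "
            f"which is longer than expected ({threshold_s} seconds)."
        )
    return prediction_time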

Versions / Dependencies

NA

Reproduction script

NA

Issue Severity

No response

c21 commented 2 years ago

Still failing now - https://buildkite.com/ray-project/release-tests-branch/builds/1090#01839fe4-545c-4598-9c70-0a0eb95e6df3 .

c21 commented 2 years ago

Still failing yesterday - https://console.anyscale-staging.com/o/anyscale-internal/projects/prj_qC3ZfndQWYYjx2cz8KWGNUL4/clusters/ses_LmPgPzNA7TLdJ32Wmd1AwKdH?command-history-section=command_history .

Verified the change is taking effect on the run - https://github.com/ray-project/ray/commit/8fd6a5be144cf8fe21c4507807f306db69e67034 .

jiaodong commented 2 years ago

We have a regression on the prediction side with the cluster env on commit fd01488; the node memory pattern from training to prediction is:

[Screenshot: node memory pattern, 2022-10-07 4:39 PM]

Whereas for our known, stable good run on commit c8dbbf3, the node memory pattern from training to prediction is:

[Screenshot: node memory pattern, 2022-10-07 4:40 PM]

As a result, the 10-worker, 100 GB data prediction regressed:

From Results: {'training_time': 778.026989177, 'prediction_time': 306.5530205929999}
To Results: {'training_time': 765.7612613899998, 'prediction_time': 501.24525937299995}

Both training and prediction for this release test used ~15GB more memory.
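
To quantify the comparison above, here is a quick, purely illustrative script over the two result dicts (numbers copied from this comment):

# Compare the good and regressed benchmark results.
good = {"training_time": 778.026989177, "prediction_time": 306.5530205929999}
bad = {"training_time": 765.7612613899998, "prediction_time": 501.24525937299995}

for key in good:
    delta = bad[key] - good[key]
    pct = 100.0 * delta / good[key]
    print(f"{key}: {good[key]:.1f}s -> {bad[key]:.1f}s ({delta:+.1f}s, {pct:+.1f}%)")
# prediction_time regresses by roughly +195 s (about +64%); training_time is essentially flat.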

c21 commented 2 years ago

CC @amogkam (ML oncall) FYI for @jiaodong's finding.

jiaodong commented 2 years ago

@clarkzinzow

I did a few more release test bisections with prediction batch_size = 8192.

Latency-wise, we're good with the larger batch size on the prediction side; end-to-end latency can be cut to under 200 seconds:

run_xgboost_prediction takes 191.02347457200085 seconds.
run_xgboost_prediction takes 183.74121447799916 seconds.

But the memory footprint suggests each node consistently used ~15 GB more RAM:

[Screenshot: per-node memory usage, 2022-10-11 12:47 PM]

Compared to the good commit from Oct 5th (last Wednesday):

[Screenshot: per-node memory usage, 2022-10-11 12:50 PM]
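
For context on the knob being bisected here, below is a minimal sketch of raising the prediction batch size via ray.data.Dataset.map_batches. The real benchmark goes through Ray AIR's batch-prediction path, so the predict function, the commented-out dataset path, and the wiring of the 8192 value are illustrative assumptions rather than the actual workload code.

import pandas as pd
import ray
import xgboost as xgb


def make_predict_fn(booster: xgb.Booster):
    # Build a per-batch inference function for Dataset.map_batches.
    def predict(batch: pd.DataFrame) -> pd.DataFrame:
        dmatrix = xgb.DMatrix(batch)
        return pd.DataFrame({"predictions": booster.predict(dmatrix)})

    return predict


# Larger batches mean fewer, bigger DMatrix constructions per block, cutting
# end-to-end latency at the cost of more memory held per task.
# ds = ray.data.read_parquet("s3://<benchmark-data>")  # hypothetical path
# predictions = ds.map_batches(
#     make_predict_fn(booster), batch_size=8192, batch_format="pandas"
# )
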
jiaodong commented 2 years ago

Root-caused to https://github.com/ray-project/ray/pull/29103. cc @clarng: is this expected?

For the full bisection log, see https://docs.google.com/document/d/1SfbHV5AFZe3P_VA_snDve6yeCh8cobVAydn7MRqRSIE/edit#

clarng commented 2 years ago

I think this is expected. There are several changes to the product + OSS that are causing this.

jiaodong commented 2 years ago

^ The PR to increase the batch size should be all we need; all other investigations and discussions are completed.

clarkzinzow commented 2 years ago

@jiaodong With this PR merged, and release tests passing on both the PR and in master, I'm closing this as fixed.