Closed: rickyyx closed this issue 2 years ago.
Still failing as of yesterday: https://console.anyscale-staging.com/o/anyscale-internal/projects/prj_qC3ZfndQWYYjx2cz8KWGNUL4/clusters/ses_LmPgPzNA7TLdJ32Wmd1AwKdH?command-history-section=command_history
Verified that the change is taking effect in that run: https://github.com/ray-project/ray/commit/8fd6a5be144cf8fe21c4507807f306db69e67034
We had a regression on the prediction side with the cluster env at commit fd01488; the node memory pattern from training to prediction is:
[node memory chart]
Whereas for our known, stable good run at commit c8dbbf3, the node memory pattern from training to prediction is:
[node memory chart]
As a result, the 10-worker, 100 GB data prediction test regressed:
From: {'training_time': 778.026989177, 'prediction_time': 306.5530205929999}
To: {'training_time': 765.7612613899998, 'prediction_time': 501.24525937299995}
Both training and prediction for this release test used ~15GB more memory.
CC @amogkam (ML oncall) FYI for @jiaodong's finding.
@clarkzinzow
I did a few more release test bisections with prediction batch_size = 8192.
Latency-wise we're good: with the larger batch size on the prediction side, e2e latency can be cut to under 200 seconds (see the sketch after the timing output below).
run_xgboost_prediction takes 191.02347457200085 seconds.
run_xgboost_prediction takes 183.74121447799916 seconds.
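For context, a minimal sketch of what running prediction with the larger batch size looks like; the dataset path, model file, and predict function are illustrative placeholders, not the actual release-test code:

```python
import pandas as pd
import ray
import xgboost as xgb

# Hypothetical stand-ins for the release test's prediction data and trained model.
ds = ray.data.read_parquet("s3://example-bucket/prediction-data")  # illustrative path
booster = xgb.Booster(model_file="model.json")                     # illustrative model file

def predict_batch(batch: pd.DataFrame) -> pd.DataFrame:
    # Score one whole batch at a time; a larger batch amortizes per-batch overhead.
    return pd.DataFrame({"predictions": booster.predict(xgb.DMatrix(batch))})

# batch_size=8192 (vs. the previous smaller default) is what brought e2e
# prediction latency back under ~200s in the runs above.
predictions = ds.map_batches(predict_batch, batch_size=8192, batch_format="pandas")
predictions.show(5)
```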
But the memory footprint suggests each node consistently used ~15 GB more RAM compared to the known good commit from Oct 5th (last Wed).
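As a rough way to spot-check that per-node delta between the two commits, one can log used RAM during the prediction phase of each run; this is just a psutil-based probe, not how the release test collects its metrics:

```python
import time
import psutil

def log_node_memory(tag: str, interval_s: int = 10, duration_s: int = 120) -> None:
    """Periodically print this node's used RAM so two runs can be compared side by side."""
    deadline = time.time() + duration_s
    while time.time() < deadline:
        mem = psutil.virtual_memory()
        print(f"[{tag}] used={mem.used / 1e9:.1f} GB ({mem.percent}%)")
        time.sleep(interval_s)

# e.g. run once per commit during the prediction phase and diff the output:
# log_node_memory("c8dbbf3-prediction")
# log_node_memory("fd01488-prediction")
```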
Root-caused to https://github.com/ray-project/ray/pull/29103. cc @clarng: is this expected?
For the full bisection log, see https://docs.google.com/document/d/1SfbHV5AFZe3P_VA_snDve6yeCh8cobVAydn7MRqRSIE/edit#
I think this is expected. There are several changes to the product + OSS that are causing this.
^ The PR above to increase the batch size should be all we need; all other investigations and discussions are complete.
What happened + What you expected to happen
Build failure: Cluster
Versions / Dependencies
NA
Reproduction script
NA
Issue Severity
No response