opensearch-project / ml-commons

ml-commons provides a set of common machine learning algorithms, e.g. k-means or linear regression, to help developers build ML-related features within OpenSearch.
Apache License 2.0

[BUG] ML model not in cache. Remove all of its cache files. model id #2158

Open manzke opened 6 months ago

manzke commented 6 months ago

When using ML commons and Docker I'm hitting a problem, which is either a bug or not documented.

Setup:

Whenever the Docker container restarts, the model is not loaded. That's fine; this is where auto redeploy should kick in. The service seems to be trying to do it, but all I get is "ML model not in cache. Remove all of its cache files. model id:".

I tried:

Both work until the instance is restarted. One possible bug: I saw a few posts about a bug, where the

Is any other information written or stored (.huggingfaces?, ...)? If not, I guess it is all because of 844.

zane-neo commented 6 months ago

@manzke To confirm: the issue you encountered is that the model is not auto redeployed when Docker restarts?

manzke commented 6 months ago

I can confirm:

zane-neo commented 6 months ago

@manzke I can partially reproduce the issue. The "trying to deploy, failed, deploy, failed" loop and the NPE for mltask I was not able to reproduce, but the "model is not in cache..." part I can reproduce. I used a single instance to reproduce this, so I assume you're using the same. There are several issues:

  1. Before 2.12, model auto redeploy did not cover cluster-level restarts, which is why auto redeploy didn't kick in.
  2. The "ML model not in cache. Remove all of its cache files..." message is triggered by a cron job that syncs the status of all running models (which nodes run which models) across the cluster. It is not an error log, only a cleanup.
  3. When you try to deploy the model manually, it says the model content changed. This is a bug in the cron job above: it always deletes the model cache files, but on macOS the delete operation fails quietly because the code lacks permission to delete files. I have created this PR to fix it: https://github.com/opensearch-project/ml-commons/pull/2180. Currently there isn't a simple workaround, but you can try running Docker without mapping the data folder to macOS; in that case the deletion might go through without the permission check.
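
The "delete fails quietly" behavior in point 3 can be illustrated with a minimal sketch. This is not the ml-commons code (which is Java) and not the change in the PR above; the function name `remove_cache_dir` is hypothetical. It only shows the general idea of verifying that a cache directory is really gone instead of trusting the delete call:

```python
import shutil
from pathlib import Path


def remove_cache_dir(cache_dir: Path) -> bool:
    """Delete cache_dir and report whether it is really gone.

    Returns True when the directory no longer exists afterwards,
    False when the delete failed (for example, due to permissions).
    """
    try:
        shutil.rmtree(cache_dir)
    except FileNotFoundError:
        # Nothing to clean up; treat this as success.
        return True
    except OSError as err:
        # Surface the failure instead of swallowing it.
        print(f"could not delete {cache_dir}: {err}")
    # Verify the outcome rather than trusting the call.
    return not cache_dir.exists()
```

A caller that gets `False` back knows stale cache files are still on disk, which is the situation that later surfaces as a "model content changed" failure on manual deploy.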

manzke commented 6 months ago

I'll check it out. Actually it wasn't Mac related. I could reproduce it on an AWS Linux instance (Fargate) using EFS storage. Having the same copy on a local server with local storage mapped into the Docker container had no issues at all. You are right, we have a single instance. I wasn't even trying model distribution yet 😁

We will first upgrade to 2.12 and test your fix afterwards.

zane-neo commented 5 months ago

Sure, this should be a permission issue and the fix should work. 2.12 is already released, so the fix won't be in it, but you can try 2.13, which will be released soon.

manzke commented 5 months ago

@zane-neo I updated to 2.12 and still have massive issues with model loading. While uploading a model and trying to deploy it, it tries to download stuff from Hugging Face. Only after that failed and the task timed out did it actually switch the state. os-2.12.log

zane-neo commented 5 months ago

@manzke, 2.12 won't have this fix since it was released before the fix was merged; 2.13 should include it. BTW, your attached log doesn't seem to be UTF-8 encoded, so I'm not able to open it correctly.
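
Re-encoding a log file to UTF-8 before attaching it can be done with a few lines of Python. This is only a sketch: the helper name `reencode_to_utf8` is hypothetical, and the source encoding of the attached log is not known from this thread, so it has to be supplied by whoever produced the file.

```python
from pathlib import Path


def reencode_to_utf8(src: Path, dst: Path, source_encoding: str) -> None:
    """Read src using source_encoding and write the same text to dst as UTF-8."""
    text = src.read_text(encoding=source_encoding)
    dst.write_text(text, encoding="utf-8")
```

For example, `reencode_to_utf8(Path("os-2.12.log"), Path("os-2.12.utf8.log"), "utf-16")` would convert the file, where `"utf-16"` is an assumed source encoding and the output file name is made up for illustration.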

manzke commented 5 months ago

Thanks. Will change the encoding and upload again. The advantage of 2.12 is that at least I can redeploy the model. This wasn't possible with 2.11.

That means I get the cache error, which goes away after the model has been deployed again.
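
The manual redeploy after a restart could be scripted roughly as follows. This is a sketch under stated assumptions, not a tested workaround: it assumes a node reachable at `http://localhost:9200` without TLS or authentication, uses the ML Commons Get Model and Deploy Model REST endpoints, and assumes the Get Model response carries a `model_state` field. The helper names are hypothetical, and treating every state other than `DEPLOYED` as "needs deploy" is a simplification.

```python
import json
import urllib.request

# Assumption: local single node, no TLS and no authentication.
BASE_URL = "http://localhost:9200"


def model_url(model_id: str) -> str:
    """Build the ML Commons model endpoint for the given model id."""
    return f"{BASE_URL}/_plugins/_ml/models/{model_id}"


def needs_redeploy(model_doc: dict) -> bool:
    """Return True unless the model is reported as DEPLOYED."""
    # Assumption: the Get Model response has a "model_state" field.
    return model_doc.get("model_state") != "DEPLOYED"


def redeploy_if_needed(model_id: str) -> bool:
    """Send a deploy request if the model is not deployed; return True if sent."""
    with urllib.request.urlopen(model_url(model_id)) as resp:
        model_doc = json.load(resp)
    if not needs_redeploy(model_doc):
        return False
    req = urllib.request.Request(model_url(model_id) + "/_deploy", method="POST")
    with urllib.request.urlopen(req) as resp:
        resp.read()  # the response describes the deploy task; ignored here
    return True
```

Calling `redeploy_if_needed("<model id>")` once the container is back up would trigger the same manual deploy described above; the model id has to be filled in by the user.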

dblock commented 2 months ago

@zane-neo Was this fixed in a newer version?
