kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0

[backend] the cache server does not check whether the cached artifacts have been deleted #7939

Open juliusvonkohout opened 2 years ago

juliusvonkohout commented 2 years ago

Environment

Kubeflow 1.5.1

Steps to reproduce

The cache server wrongly states that a step is cached even if the artifacts have been deleted from S3. We propose https://github.com/kubeflow/pipelines/pull/7938, which checks whether the folder actually still exists on S3. If the folder does not exist, we do not wrongfully pretend that the step is cached; instead we delete the cachedb entry and let the pipeline step run again.
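
For illustration, here is a minimal sketch of the existence check that #7938 proposes, written against the MinIO Go client (minio-go v7). The endpoint, credentials, bucket, and prefix parameters are placeholders for whatever the cache server is actually configured with, so treat this as a sketch of the idea rather than the PR's code:

    // Sketch only: report whether any object still exists under the artifact's
    // S3 prefix; if not, the cachedb entry should be invalidated and the step re-run.
    package cacheserver

    import (
        "context"

        "github.com/minio/minio-go/v7"
        "github.com/minio/minio-go/v7/pkg/credentials"
    )

    func artifactPrefixExists(ctx context.Context, endpoint, accessKey, secretKey, bucket, prefix string) (bool, error) {
        client, err := minio.New(endpoint, &minio.Options{
            Creds:  credentials.NewStaticV4(accessKey, secretKey, ""),
            Secure: false, // in-cluster MinIO typically runs without TLS
        })
        if err != nil {
            return false, err
        }
        for object := range client.ListObjects(ctx, bucket, minio.ListObjectsOptions{Prefix: prefix, Recursive: true}) {
            if object.Err != nil {
                return false, object.Err
            }
            return true, nil // at least one artifact object still exists
        }
        return false, nil // the "folder" is gone: do not treat the step as cached
    }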


Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

chensun commented 2 years ago

I'm not entirely sure about v1 behavior, but in v2, we don't make any assumption on whether an artifact has content in its path. For instance, Metrics artifact has nothing in its cloud storage path but only metadata in the MLMD database. So it's by design that we don't check the "existence" of artifacts with or without caching.

You could disable caching when submitting a pipeline run though, the SDK client has a parameter for that.

juliusvonkohout commented 2 years ago

I'm not entirely sure about v1 behavior, but in v2, we don't make any assumption on whether an artifact has content in its path. For instance, Metrics artifact has nothing in its cloud storage path but only metadata in the MLMD database. So it's by design that we don't check the "existence" of artifacts with or without caching.

You could disable caching when submitting a pipeline run though, the SDK client has a parameter for that.

This is a fix for v1 behavior only; I do not know about the v2 behavior. The pipeline just breaks if the S3 content is deleted, so that must be fixed somehow. Do you have another proposal?

chensun commented 2 years ago

No, I think it's by design. If users delete artifacts, shouldn't that be considered a user error? The solution would be to run again with the cache option disabled.

MatthiasCarnein commented 2 years ago

I believe that @juliusvonkohout raised a very important issue here. In many scenarios users are legally required to clean up old records, and keeping the cachedb in sync is definitely a pain point.

In the past, the kfp project has also not considered this a user error and it was even recommended as best practice for production grade deployments by @Bobgy (https://github.com/kubeflow/pipelines/issues/6204#issuecomment-903011330).

juliusvonkohout commented 2 years ago

I believe that @juliusvonkohout raised a very important issue here. In many scenarios users are legally required to clean up old records, and keeping the cachedb in sync is definitely a pain point.

In the past, the kfp project has also not considered this a user error and it was even recommended as best practice for production grade deployments by @Bobgy (#6204 (comment)).

Yes, @chensun please reopen, it is clearly a bug.

juliusvonkohout commented 2 years ago

@chensun as a follow-up to the meeting: I do not see a global setting here: https://www.kubeflow.org/docs/components/pipelines/overview/caching/

chensun commented 2 years ago

We discussed this in today's KFP community meeting, and I want to summarize it here:

In the past, the kfp project has also not considered this a user error and it was even recommended as best practice for production grade deployments by @Bobgy (https://github.com/kubeflow/pipelines/issues/6204#issuecomment-903011330).

Yes, @chensun please reopen, it is clearly a bug.

First of all, what you quoted from @Bobgy (https://github.com/kubeflow/pipelines/issues/6204#issuecomment-903011330) doesn't conflict with what I said above.

I didn't mean that the action of deleting artifacts from cloud storage itself is a user error. My reply is within the specific context of this issue: deleting the artifacts while relying on caching to continue working is a user error. It's like in C: you have a pointer that points to some allocated memory; freeing the memory is absolutely normal, but continuing to access the pointer afterwards would be a user error.

If users delete artifacts periodically, they should probably configure pipeline runs with a cache staleness setting (understood, it's different from a cache lifecycle, but I think it still solves this use case). Or they can disable caching for new runs after artifact deletion. From this point of view, this is not a bug.

I'm against the proposed fix because, as I mentioned, an artifact has a remote storage URI and metadata, and both are optional. We can't assume there's content at the URI, and I'm not sure we can assume the URI always exists even if users never write to it--it's rather an implementation detail that may vary for different platforms. We could check and verify this, but I don't think the cache server should care about an external system at all.

I personally would prefer to improve the caching configuration story--if the existing caching setting is not that obvious or easy to use, maybe we can add a global config. Or, as @connor-mccarthy mentioned, maybe we provide a way for users to delete cache entries in the database--it could be as simple as a script to begin with, as long as it's well documented.

MatthiasCarnein commented 2 years ago

I think the difference is that the user is not actively "relying on caching". For the user it just happens automatically when using the same component definition and arguments. Having this trigger automatically but then fail is confusing for users.

If I understand you correctly, you are suggesting that a user should manually disable caching when the corresponding artifact was deleted. But how would that work in practice? Does the user have to figure out the component definition and arguments that produced the artifact and note them down to disable caching whenever accidentally using the same definitions again in the future? It would be impossible to keep track of all items that the cache server still considers cached but are already deleted from storage (the cachedb is always growing and never cleaned after all).

I agree that there are other approaches to solve this problem. For example you could have an option somewhere in the UI to delete an artifact which would clean the artifact from S3, mlmd and the cachedb (but then again that's not supported by mlmd as far as I know).

But recommending to delete artifacts from S3 is incompatible with the current caching behaviour in my eyes.

juliusvonkohout commented 2 years ago

I personally would prefer to improve the caching configuration story--if the existing caching setting is not that obvious or easy to use, maybe we can add a global config. Or, as @connor-mccarthy mentioned, maybe we provide a way for users to delete cache entries in the database--it could be as simple as a script to begin with, as long as it's well documented.

We cannot expect users to set it for every pipeline themselves; that is too error prone, as @MatthiasCarnein explains. We need a global expiration setting (maximum cache lifetime), for example an environment variable in the cache server that is respected by V1 and V2 pipelines. This setting could also enforce the deletion of old database entries according to the expiration setting, so we could reuse the parts of my PR that clean the database entries (cachedb, and maybe MLMD for V2). Then we can use a lifecycle policy on the object storage with the same timeframe. The user can then use a cache staleness up to the cache expiration duration. It would only be a user error if they manually delete artifacts before the lifecycle policy does.

Does this sound reasonable to you?

chensun commented 2 years ago

I think the difference is that the user is not actively "relying on caching". For the user it just happens automatically when using the same component definition and arguments. Having this trigger automatically but then fail is confusing for users.

That's a fair argument. On the other hand, though, I don't think it's too much to ask of users to understand that KFP has caching on by default. After all, I perceive KFP as an advanced developer tool. Users probably should understand the possible impacts of deleting artifacts.

If I understand you correctly, you are suggesting that a user should manually disable caching when the corresponding artifact was deleted. But how would that work in practice? Does the user have to figure out the component definition and arguments that produced the artifact and note them down to disable caching whenever accidentally using the same definitions again in the future?

In practice, how do users delete artifacts? I guess they don't cherry-pick specific artifacts to delete but rather delete artifacts in bulk under some root path. Then they could disable caching for all their pipelines or a set of them. If they do delete artifacts in a precise, cherry-picking manner, they probably already track which pipelines and inputs produced those artifacts.

I agree that there are other approaches to solve this problem. For example you could have an option somewhere in the UI to delete an artifact which would clean the artifact from S3, mlmd and the cachedb (but then again that's not supported by mlmd as far as I know).

But recommending to delete artifacts from S3 is incompatible with the current caching behaviour in my eyes.

Indeed, if KFP provides a feature for users to delete artifacts, which is what Bobgy suggested in the other thread, then it should be responsible for cleaning up all the related items that should and can be deleted--cache entries definitely fall into this category. But the current state is that KFP doesn't offer such a feature, and users are deleting artifacts on their own, which happens outside the KFP system. I don't think KFP should be required to respond to external actions.

In summary, I'm supportive of the following options:

* Make artifact deleting a KFP feature so that KFP should do the full cleanup

* Provide a script or API for user to delete cache entry, so they can easily do so after they delete artifacts on their own.

* Make a global caching config, so users can set cache lifecycle/staleness that matches their artifact deletion policy or need.

juliusvonkohout commented 2 years ago

In summary, I'm supportive of the following options:

* Make artifact deleting a KFP feature so that KFP should do the full cleanup

* Provide a script or API for user to delete cache entry, so they can easily do so after they delete artifacts on their own.

* Make a global caching config, so users can set cache lifecycle/staleness that matches their artifact deletion policy or need.

Shall I implement the last item, or do you want to do it? It would just be an environment variable MAX_CACHE_STALENESS in the cache server, derived from the configmap pipeline-install-config, that uses the same syntax as the per-pipeline cache settings (e.g. P30D). Then in the cache server we modify https://github.com/kubeflow/pipelines/blob/555447f5c6453609c59db1a890cca1ec38156847/backend/src/cache/server/mutation.go#L120-L129 and https://github.com/kubeflow/pipelines/blob/555447f5c6453609c59db1a890cca1ec38156847/backend/src/cache/server/watcher.go#L129-L136 to read the environment variable instead of -1.

For example with

    if exists {
        maxCacheStalenessInSeconds = getMaxCacheStaleness(maxCacheStaleness)
    } else {
        // placeholder: fall back to the proposed MAX_CACHE_STALENESS environment variable
        maxCacheStalenessInSeconds = getMaxCacheStaleness("pod Environment variable maxCacheStaleness")
    }

Then in https://github.com/kubeflow/pipelines/blob/555447f5c6453609c59db1a890cca1ec38156847/backend/src/cache/storage/execution_cache_store.go#L38-L47 and https://github.com/kubeflow/pipelines/blob/555447f5c6453609c59db1a890cca1ec38156847/backend/src/cache/storage/execution_cache_store.go#L79-L93

it will just be properly used. So we add three lines and an environment variable via the global configmap to satisfy most of the use cases.
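
For clarity, here is a minimal sketch of that fallback, assuming the existing getMaxCacheStaleness helper referenced above parses ISO-8601 durations such as P30D into seconds; the function name effectiveMaxCacheStaleness and the exact variable name MAX_CACHE_STALENESS are illustrative, not a final implementation:

    // Sketch only: combine the per-pipeline annotation with the proposed global
    // default. getMaxCacheStaleness is the existing ISO-8601 duration parser in
    // backend/src/cache/server; "os" would need to be imported there.
    func effectiveMaxCacheStaleness(annotationValue string, annotationExists bool) int64 {
        if annotationExists {
            // A per-pipeline or per-step setting always wins.
            return getMaxCacheStaleness(annotationValue)
        }
        if global := os.Getenv("MAX_CACHE_STALENESS"); global != "" {
            // Global default from the pipeline-install-config configmap, e.g. "P30D".
            return getMaxCacheStaleness(global)
        }
        // Current behaviour: -1 means cache entries never expire.
        return -1
    }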

If desired we could also add a CACHE_EXPIRATION variable and delete the database entry with https://github.com/kubeflow/pipelines/blob/555447f5c6453609c59db1a890cca1ec38156847/backend/src/cache/storage/execution_cache_store.go#L137-L143 if it is too old.

For example after https://github.com/kubeflow/pipelines/blob/555447f5c6453609c59db1a890cca1ec38156847/backend/src/cache/storage/execution_cache_store.go#L61-L89 we could add

    if maxCacheStaleness == -1 || s.time.Now().UTC().Unix()-startedAtInSec <= podMaxCacheStaleness {
        ...
    }
    if s.time.Now().UTC().Unix()-startedAtInSec > getMaxCacheStaleness("CACHE_EXPIRATION variable") {
        err = executionCacheStore.DeleteExecutionCache(id)
    }

Maybe it would even be better if we would just periodically execute an SQL command that cleans all entries older than CACHE_EXPIRATION in the cachedb table.
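
As an illustration, here is a minimal sketch of such a periodic cleanup, assuming the cachedb is MySQL-compatible, that the table and column are named execution_caches and StartedAtInSec as in execution_cache_store.go (worth double-checking), and that CACHE_EXPIRATION has already been parsed into seconds:

    // Sketch only: delete every cachedb entry older than the configured expiration.
    package cachecleanup

    import (
        "database/sql"
        "log"
        "time"

        _ "github.com/go-sql-driver/mysql" // assumes a MySQL-compatible cachedb
    )

    func cleanExpiredCacheEntries(db *sql.DB, cacheExpirationInSeconds int64) error {
        cutoff := time.Now().UTC().Unix() - cacheExpirationInSeconds
        // Table and column names are assumptions based on execution_cache_store.go.
        res, err := db.Exec("DELETE FROM execution_caches WHERE StartedAtInSec < ?", cutoff)
        if err != nil {
            return err
        }
        if rows, err := res.RowsAffected(); err == nil {
            log.Printf("deleted %d expired cache entries", rows)
        }
        return nil
    }

Such a function could run on a ticker inside the cache server or in a small CronJob next to it.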

juliusvonkohout commented 2 years ago

And is there documentation for V2 caching? https://github.com/kubeflow/pipelines/blob/555447f5c6453609c59db1a890cca1ec38156847/backend/src/cache/server/mutation.go#L97-L101

TobiasGoerke commented 2 years ago

Kubeflow definitely misses the ability to manage artifacts and I'm glad I am not the only one thinking so (also see #5783). I'd love to see invalidating the cache and deleting artifacts in the UI become a feature.

However, MLMD doesn't seem to support deleting artifacts anytime soon and it wouldn't help us managing Kubeflow's cache, anyways.

Thus, I've written a short PoC suggesting a way this feature could be implemented. Its integration should be self-explanatory:

kubeflow_cache

Code and concept are still in a very early stage. For more info on how the backend manages the cache, see here.

I'm looking for feedback and discussions on this proposal and will gladly contribute to implementing it.

chensun commented 2 years ago

In summary, I'm supportive of the following options:

* Make artifact deleting a KFP feature so that KFP should do the full cleanup

* Provide a script or API for user to delete cache entry, so they can easily do so after they delete artifacts on their own.

* Make a global caching config, so users can set cache lifecycle/staleness that matches their artifact deletion policy or need.

Shall I implement the last item, or do you want to do it? It would just be an environment variable MAX_CACHE_STALENESS in the cache server, derived from the configmap pipeline-install-config, that uses the same syntax as the per-pipeline cache settings (e.g. P30D). Then in the cache server we modify

https://github.com/kubeflow/pipelines/blob/555447f5c6453609c59db1a890cca1ec38156847/backend/src/cache/server/mutation.go#L120-L129

and https://github.com/kubeflow/pipelines/blob/555447f5c6453609c59db1a890cca1ec38156847/backend/src/cache/server/watcher.go#L129-L136

to read the environment variable instead of -1. For example with

    if exists {
        maxCacheStalenessInSeconds = getMaxCacheStaleness(maxCacheStaleness)
    } else {
        // placeholder: fall back to the proposed MAX_CACHE_STALENESS environment variable
        maxCacheStalenessInSeconds = getMaxCacheStaleness("pod Environment variable maxCacheStaleness")
    }

Then in

https://github.com/kubeflow/pipelines/blob/555447f5c6453609c59db1a890cca1ec38156847/backend/src/cache/storage/execution_cache_store.go#L38-L47

and https://github.com/kubeflow/pipelines/blob/555447f5c6453609c59db1a890cca1ec38156847/backend/src/cache/storage/execution_cache_store.go#L79-L93

it will just be properly used. So we add three lines and an environment variable via the global configmap to satisfy most of the use cases.

If desired we could also add a CACHE_EXPIRATION variable and delete the database entry with

https://github.com/kubeflow/pipelines/blob/555447f5c6453609c59db1a890cca1ec38156847/backend/src/cache/storage/execution_cache_store.go#L137-L143

if it is too old. For example after

https://github.com/kubeflow/pipelines/blob/555447f5c6453609c59db1a890cca1ec38156847/backend/src/cache/storage/execution_cache_store.go#L61-L89

we could add

    if maxCacheStaleness == -1 || s.time.Now().UTC().Unix()-startedAtInSec <= podMaxCacheStaleness {
        ...
    }
    if s.time.Now().UTC().Unix()-startedAtInSec > getMaxCacheStaleness("CACHE_EXPIRATION variable") {
        err = executionCacheStore.DeleteExecutionCache(id)
    }

Maybe it would even be better if we would just periodically execute an SQL command that cleans all entries older than CACHE_EXPIRATION in the cachedb table.

@juliusvonkohout Thank you for the write-up. Given this is a nontrivial user-interface change, I would suggest you write a design doc so that we can conduct a proper design review with our team and the community.

chensun commented 2 years ago

Kubeflow definitely misses the ability to manage artifacts and I'm glad I am not the only one thinking so (also see #5783). I'd love to see invalidating the cache and deleting artifacts in the UI become a feature.

However, MLMD doesn't seem to support deleting artifacts anytime soon and it wouldn't help us managing Kubeflow's cache, anyways.

Thus, I've written a short PoC suggesting a way this feature could be implemented. Its integration should be self-explanatory:

kubeflow_cache

Code and concept are still in a very early stage. For more info on how the backend manages the cache, see here.

I'm looking for feedback and discussions on this proposal and will gladly contribute to implementing it.

Thanks @TobiasGoerke, and the same suggestion: please write a design doc so that we can review it properly.

TobiasGoerke commented 2 years ago

Kubeflow definitely misses the ability to manage artifacts and I'm glad I am not the only one thinking so (also see #5783). I'd love to see invalidating the cache and deleting artifacts in the UI become a feature. However, MLMD doesn't seem to support deleting artifacts anytime soon and it wouldn't help us managing Kubeflow's cache, anyways. Thus, I've written a short PoC suggesting a way this feature could be implemented. Its integration should be self-explanatory: kubeflow_cache

![kubeflow_cache](https://user-images.githubusercontent.com/13769461/182807832-5fcd0b49-5b84-4172-9b0a-4904c190dace.gif)

Code and concept are still in a very early stage. For more info on how the backend manages the cache, see here. I'm looking for feedback and discussions on this proposal and will gladly contribute to implementing it.

Thanks @TobiasGoerke, and the same suggestion: please write a design doc so that we can review it properly.

I've created a draft. What's the workflow for proposals, and where should I post it? Thanks.

juliusvonkohout commented 2 years ago

@TobiasGoerke just write to me or Diana Atanasova on the Kubeflow Slack. We have some experience with those drafts.

IronPan commented 2 years ago

In summary, I'm supportive of the following options:

* Make artifact deleting a KFP feature so that KFP should do the full cleanup

* Provide a script or API for user to delete cache entry, so they can easily do so after they delete artifacts on their own.

* Make a global caching config, so users can set cache lifecycle/staleness that matches their artifact deletion policy or need.

Is a job-level caching config an option? A global caching config might be inflexible if different pipelines' artifacts have different TTLs, especially if multiple users are sharing the same environment.

juliusvonkohout commented 2 years ago

@IronPan

You can still set a per-pipeline and per-step cache staleness. Please check https://github.com/kubeflow/pipelines/pull/8270#discussion_r980508499 and the design document linked there.

juliusvonkohout commented 2 years ago

Sadly we have to stay agnostic of the storage backend until the new unified storage architecture from https://github.com/kubeflow/pipelines/pull/7725#issuecomment-1277334000 is there. So I got https://github.com/kubeflow/pipelines/pull/8270 merged instead of #7938. V2 pipelines are not yet covered.

juliusvonkohout commented 2 years ago

V2 is discussed in https://docs.google.com/document/d/1_Hy1_KvuKh-heydg8qIyXYTiMfNPxePx4BHGpZT4xMk/edit?resourcekey=0-PTtkRPF5xpS3lD8dssqX1Q#

github-actions[bot] commented 6 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 5 months ago

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

AndersBennedsgaard commented 5 months ago

@juliusvonkohout could we have this opened again?

juliusvonkohout commented 5 months ago

@AndersBennedsgaard

/reopen

google-oss-prow[bot] commented 5 months ago

@juliusvonkohout: Reopened this issue.

In response to [this](https://github.com/kubeflow/pipelines/issues/7939#issuecomment-2157595433):

> @AndersBennedsgaard
>
> /reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

juliusvonkohout commented 5 months ago

/lifecycle frozen