iterative / studio-support

❓ DVC Studio Issues, Question, and Discussions
https://studio.iterative.ai

Missing or broken params and/or metrics file(s) #57

Closed courentin closed 2 years ago

courentin commented 2 years ago

When force-importing through Studio, I got this error for all my commits:

Missing or broken params and/or metrics file(s): models/fr/common_voice/metrics/test/per_corpus.json; models/fr_debug/common_voice/metrics/test/per_corpus.json; models/fr/common_voice/metrics/valid/per_corpus.json; models/fr_debug/common_voice/metrics/valid/per_corpus.json; speech_to_text/models/hparams/dump_train.yaml

I've looked at all files on a specific commit and:

I know that some of these files were broken in previous commits, but I don't see how that could impact newer commits.

Suor commented 2 years ago

If those files reside in a remote, one possible reason for them being reported as missing is that the remote configuration or credentials don't work.

courentin commented 2 years ago

I can see their size in the Studio UI, so I guess studio can pull them?

Suor commented 2 years ago

Sizes are stored as metadata in the DVC files, so you don't need access to the cache to see them.
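
To illustrate the point: a `.dvc` file records the hash and size of each output next to its path, so the size can be read from Git alone, without touching the remote. The sketch below assumes a typical `.dvc` file layout; the file contents and the `read_sizes` helper are made up for illustration, and a real tool would use a proper YAML parser rather than line matching.

```python
import re

# Hypothetical contents of a .dvc file tracked in Git (illustrative values).
DVC_FILE = """\
outs:
- md5: a304afb96060aad90176268345e10355
  size: 14445097
  path: per_corpus.json
"""


def read_sizes(dvc_text):
    """Map each output path to its recorded size, using only the metadata."""
    entries, current = [], None
    for line in dvc_text.splitlines():
        # A line starting with "- " opens a new entry in the `outs` list.
        if line.lstrip().startswith("- "):
            current = {}
            entries.append(current)
        # Pick up simple `key: value` pairs; bare keys such as `outs:` are skipped.
        match = re.match(r"[\s-]*(\w+):\s*(\S+)\s*$", line)
        if match and current is not None:
            current[match.group(1)] = match.group(2)
    return {e["path"]: int(e["size"]) for e in entries if "path" in e and "size" in e}


print(read_sizes(DVC_FILE))
```

No credentials are involved at any point, which is why a visible size says nothing about whether the file itself can be pulled.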

courentin commented 2 years ago

Hum, indeed! I ran some tests with my credentials, and it seems that whatever AWS access key and secret key I set, I always get the same error. Looking at the AWS console, I see that my key was never used, so it seems that Studio does not even try to pull the files. Any ideas?

Suor commented 2 years ago

I need to look into this more closely. For now I see PermissionErrors and FileNotFoundErrors happening while trying to collect metrics and params.

courentin commented 2 years ago

Seems like I have read access to the s3 dvc repository (s3:GetObject and s3:ListBucket), should I set other permissions? The "Default encryption" of my bucket is enabled as well.

UPDATE: I've granted S3 full access to all our buckets and the error still appears after a force import.

Suor commented 2 years ago

Still on it, if anyone has this question :)

Suor commented 2 years ago

@courentin I found the culprit. It's the out.remote feature of DVC that you are using. Studio does not currently support it.

I find it problematic to support in the general case. I.e. if you have metric/plot outputs with different remotes, it may mean you need different credentials (or even different types of them: s3/gs/...) for different outputs. This means at least two things will need to be solved to support that:

On the latter: DVC usually works with one commit at a time, and the cache is mostly present. It currently fetches absent metric and plot files one by one when the cache is absent. Studio needs to do that in bulk because we do it for many commits at a time, with all the metrics and an empty cache. We have this implemented for a single global remote already, but not for this scenario. @pared maybe I am wrong and DVC does have a bulk mechanism like that? Or maybe that could be added?
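
The difference between fetching file by file and fetching in bulk can be sketched as follows. This only illustrates the idea under assumed data structures; it is not Studio's or DVC's actual code, and `MetricRef`, `plan_bulk_fetch`, and the sample remotes and hashes are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class MetricRef(NamedTuple):
    """One metric/plot file needed for one commit (hypothetical structure)."""
    commit: str
    path: str
    remote: str
    md5: str


def plan_bulk_fetch(refs):
    """Group needed objects by remote, de-duplicating hashes shared across commits."""
    plan = defaultdict(set)
    for ref in refs:
        plan[ref.remote].add(ref.md5)
    return {remote: sorted(hashes) for remote, hashes in plan.items()}


refs = [
    MetricRef("c1", "metrics/test.json", "fr", "aaa"),
    MetricRef("c2", "metrics/test.json", "fr", "aaa"),  # unchanged file, same hash
    MetricRef("c2", "metrics/valid.json", "fr", "bbb"),
    MetricRef("c2", "metrics/test.json", "de", "ccc"),
]
plan = plan_bulk_fetch(refs)
print(plan)
```

With a single global remote the plan has one batch. With per-output remotes there is one batch per remote, and each batch may need its own credentials, which is where both problems meet.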

In the meantime I will explore supporting the simpler case where all out.remote values refer to the default/chosen remote anyway. This may be enough for many cases.

BTW, @courentin how/why do you use this feature?

courentin commented 2 years ago

Thank you for the investigation!

We use the multi-remote feature to comply with a data residency constraint we have (as described in https://github.com/iterative/studio-support/issues/50).

Each stage is parametrized and we duplicate the pipeline across multiple remotes, something like:

stages:
  train:
    foreach: ${languages}
    do:
      cmd: python ..
      outs:
        - my_out:
            remote: ${key}
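
As an illustration of what such a template produces: with `foreach` over a dictionary, DVC generates one stage per key, named `train@<key>`, and substitutes `${key}` in the body. The sketch below mimics that expansion in plain Python. The `languages` mapping, the remote names, and the `expand_foreach` helper are assumptions made for the example; this is not DVC's implementation.

```python
def expand_foreach(name, foreach, do):
    """Expand a foreach/do template over a dict into concrete stages."""

    def substitute(node, key):
        # Recursively replace the ${key} placeholder in strings, lists and dicts.
        if isinstance(node, str):
            return node.replace("${key}", key)
        if isinstance(node, list):
            return [substitute(item, key) for item in node]
        if isinstance(node, dict):
            return {k: substitute(v, key) for k, v in node.items()}
        return node

    return {f"{name}@{key}": substitute(do, key) for key in foreach}


# Hypothetical params: the keys double as remote names.
languages = {"fr": {}, "de": {}}
template = {"cmd": "python ..", "outs": [{"my_out": {"remote": "${key}"}}]}

stages = expand_foreach("train", languages, template)
for stage_name, stage in stages.items():
    print(stage_name, stage["outs"])
```

Each generated stage ends up with its own remote on its outputs, so one pipeline spans several remotes.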

> if you have metric/plot outputs with different remotes it may mean you need different credentials

In our use case this should not happen: all outputs of one stage have the same remote. Ideally we'd like to specify the remote option at the stage level rather than at the output level, but that is not how DVC was designed.

Unfortunately, it is a legal constraint that we can't adjust.

Suor commented 2 years ago

> In our use case this should not happen: all outputs of one stage have the same remote

So for different stages you still have different remotes? Then this raises both questions from my previous comment too.

Studio parses all the stages and all the history. And for now we only work with one remote at a time in Studio.

courentin commented 2 years ago

> So for different stages you still have different remotes?

Yes I do!

Got it! This will be pretty limiting for us if we can't work with multi-remotes.

Maybe the solution suggested in https://github.com/iterative/studio-support/issues/50#issuecomment-1207126034 would solve the issue. Would it work if Studio continues to work with one remote at a time and we handle multi-remote with multiple projects?

pared commented 2 years ago

> DVC does have a bulk mechanism like that? Or maybe that could be added?

I will double-check that, but considering that brancher is still a thing, I am afraid you are not wrong.

Suor commented 2 years ago

> Maybe the solution suggested in https://github.com/iterative/studio-support/issues/50#issuecomment-1207126034 would solve the issue. Would it work if Studio continues to work with one remote at a time and we handle multi-remote with multiple projects?

This is the one we are working on. The mechanism is there already; the UI is in progress.

For this particular issue, we'll need to update our parsing procedure. Particularly the cache collection/fetching part. I am working on it.

Suor commented 2 years ago

The fix for this was released, so metrics and plots with remote set to the same as default (.dvc/config:core.remote) will be collected now.

The case where out.remote is different from the default needs to wait until #50 is shipped. It will still only work if the output's remote is the same as the selected one, i.e. we won't collect from multiple remotes in that case either.
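
The resulting behaviour can be summarised as a small predicate. This is a sketch of the rule as described in this comment, not Studio's code: `is_collected` and its parameters are hypothetical names, and it assumes that an output without a `remote` field falls back to the default remote.

```python
def is_collected(out_remote, default_remote, selected_remote=None):
    """Return True if a metric/plot output would be collected.

    out_remote: the output's `remote` value, or None if it is not set.
    default_remote: the value of core.remote in .dvc/config.
    selected_remote: the remote chosen for the project once that is possible;
        until then the default remote is the only one in play.
    """
    selected = selected_remote or default_remote
    # Assumption: an output without an explicit remote uses the default one.
    effective = out_remote or default_remote
    return effective == selected


print(is_collected(None, "fr"))   # no out.remote set
print(is_collected("fr", "fr"))   # out.remote equals the default
print(is_collected("de", "fr"))   # out.remote differs from the default
```

In every case only one remote is consulted per project, so outputs that live on any other remote are skipped.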

I am closing this since it's working in my artificial scenario. Please check how it works for you and reopen if needed.