Closed courentin closed 2 years ago
If those files reside in a remote, then one possible reason for them being missing is that the remote configuration or credentials don't work.
I can see their size in the Studio UI, so I guess Studio can pull them?
Sizes are stored as metadata in DVC files, so you don't need access to the cache to see them.
Hum, indeed! I ran some tests with my credentials, and it seems that whatever AWS_ACCESS_KEY and SECRET KEY I set, I always get the same error. Looking at the AWS console, I see that my key was never used, so it seems that Studio does not even try to pull the files. Any ideas?
I need to look into this more closely. For now I see PermissionErrors and FileNotFoundErrors happening while trying to collect metrics and params.
Seems like I have read access to the S3 DVC repository (s3:GetObject and s3:ListBucket). Should I set other permissions? The "Default encryption" of my bucket is enabled as well.
UPDATE: I've set S3 full access to all our buckets and the error still appears after a force import.
Still on it, if anyone has this question :)
@courentin I found the culprit. It's the out.remote feature of DVC that you are using. Studio does not currently support it.
I find it problematic to support in a general case. I.e. if you have metric/plot outputs with different remotes it may mean you need different credentials (or even different types of them: s3/gs/...) for different stuff. This means at least two things will need to be solved to support that:
On the latter: DVC usually works with one commit at a time, and the cache is mostly present. Currently it fetches absent metric and plot files one by one when the cache is absent. Studio needs to do that in bulk because we do it for many commits at a time, with all the metrics and an empty cache. We have this implemented for a single global remote already, but not for this scenario. @pared maybe I am wrong and DVC does have a bulk mechanism like that? Or maybe that could be added?
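To make the credentials concern concrete, here is a minimal sketch of grouping outputs by their effective remote, so that each group could be fetched in bulk with a single set of credentials. This is not how Studio or DVC implements it: the `Out` class, the `group_by_remote` helper, the file paths, and the remote names are all hypothetical and only serve as an illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Out:
    """Hypothetical stand-in for a DVC output (not DVC's real class)."""
    path: str
    remote: Optional[str] = None  # None means "use the default remote"


def group_by_remote(outs: List[Out], default_remote: str) -> Dict[str, List[str]]:
    """Map each effective remote name to the output paths stored there."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for out in outs:
        # Fall back to the default remote when the output has no explicit one.
        groups[out.remote or default_remote].append(out.path)
    return dict(groups)


# Hypothetical outputs and remote names, for illustration only.
outs = [
    Out("metrics/en.json", remote="eu-remote"),
    Out("metrics/us.json", remote="us-remote"),
    Out("params.yaml"),
]
groups = group_by_remote(outs, default_remote="eu-remote")
print(groups)
```

Each key of `groups` would need its own configuration and credentials, which is why more than one key is the hard case described above.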
In the meantime I will explore supporting the simpler case where all out.remote values refer to the default/chosen remote anyway. This may be enough for many cases.
BTW, @courentin how/why do you use this feature?
Thank you for the investigation!
We use the multi-remote feature to comply with a data residency constraint we have (as described in https://github.com/iterative/studio-support/issues/50).
Each stage is parametrized and we duplicate the pipeline across multiple remotes, something like:
stages:
  train:
    foreach: languages
    do:
      cmd: python ..
      outs:
        - my_out:
            remote: ${key}
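As a rough illustration of what this kind of foreach stage produces, the sketch below expands a stage template over the keys of a dictionary, substituting `${key}` in each string value. It only mimics the idea and is not DVC's actual templating code. The `expand_foreach` helper and the language keys `en` and `fr` are hypothetical, and the `train@<key>` naming is assumed to follow DVC's convention for generated stages.

```python
from typing import Any, Dict


def expand_foreach(stage_name: str, items: Dict[str, Any], do: Dict[str, Any]) -> Dict[str, Any]:
    """Expand a foreach stage over dict keys, substituting ${key} in string values."""

    def substitute(value: Any, key: str) -> Any:
        if isinstance(value, str):
            return value.replace("${key}", key)
        if isinstance(value, list):
            return [substitute(v, key) for v in value]
        if isinstance(value, dict):
            return {k: substitute(v, key) for k, v in value.items()}
        return value

    # One generated stage per key, named "<stage>@<key>" (assumed convention).
    return {f"{stage_name}@{key}": substitute(do, key) for key in items}


# Hypothetical language keys; in this setup each key doubles as a remote name.
languages = {"en": {}, "fr": {}}
do = {"cmd": "python ..", "outs": [{"my_out": {"remote": "${key}"}}]}

stages = expand_foreach("train", languages, do)
for name, stage in stages.items():
    print(name, "->", stage["outs"][0]["my_out"]["remote"])
```

Under these assumptions, every generated stage writes its output to a different remote, which is the multi-remote situation discussed in this thread.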
if you have metric/plot outputs with different remotes it may mean you need different credentials
In our use case this should not happen: all outputs of one stage have the same remote. Ideally we'd like to specify the remote option at the stage level rather than at the output level, but that is not how DVC was designed.
Unfortunately, it is a legal constraint that we can't adjust.
In our use case this should not happen: all outputs of one stage all have the same remote
So for different stages you still have different remotes? This raises both questions in the prev comment too then.
Studio parses all the stages and all the history. And for now we only work with one remote at a time in Studio.
So for different stages you still have different remotes?
Yes I do!
Got it! This will be pretty limiting for us if we can't work with multi-remotes.
Maybe the solution suggested in https://github.com/iterative/studio-support/issues/50#issuecomment-1207126034 would solve the issue. Would it work if Studio still continues to work with one remote at a time and we handle multi-remote with multiple projects?
DVC does have a bulk mechanism like that? Or maybe that could be added?
I will double-check that, but considering that brancher is still a thing, I am afraid you are not wrong.
Maybe the solution suggested in https://github.com/iterative/studio-support/issues/50#issuecomment-1207126034 would solve the issue. Would it work if Studio still continues to work with one remote at a time and we handle multi-remote with multiple projects?
This is the one we are working on. The mechanism is there already, and the UI is in progress.
For this particular issue, we'll need to update our parsing procedure. Particularly the cache collection/fetching part. I am working on it.
The fix for this was released, so metrics and plots with remote set to the same as the default (.dvc/config:core.remote) will be collected now.
The case when out.remote is different from the default needs to wait until #50 is shipped. It will still only work if the out's remote is the same as the selected one. I.e. we won't collect from multiple remotes in that case either.
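The rule described here can be sketched as a small predicate. This is only an illustration of the stated behavior, not Studio's code: the `is_collected` helper and the remote names are hypothetical, and it assumes that an output without an explicit remote uses the selected one.

```python
from typing import Optional


def is_collected(out_remote: Optional[str], selected_remote: str) -> bool:
    """True when an output has no explicit remote or it matches the selected one."""
    return out_remote is None or out_remote == selected_remote


# Hypothetical remote names, for illustration only.
selected = "eu-remote"
for remote in (None, "eu-remote", "us-remote"):
    print(remote, "->", "collected" if is_collected(remote, selected) else "skipped")
```

Under this rule, a project that mixes several remotes only has the outputs of one remote collected at a time.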
I am closing this since it's working in my artificial scenario. Please check how it works for you and reopen if needed.
When force-importing through Studio, I got these errors for all my commits:
I've looked at all files on a specific commit and:
dvc metric diff
I know that some of these files were broken in a previous commit, but I don't know how that could impact newer commits.