iterative / studio-support

❓ DVC Studio Issues, Question, and Discussions
https://studio.iterative.ai
16 stars 1 forks source link

Show missing data for pushed experiments #81

Open dberenbaum opened 1 year ago

dberenbaum commented 1 year ago

VS Code: show when commit has not been pushed to remote

@dberenbaum it might not fall under this issue but we need the same thing for data, right? Should be covered under https://github.com/iterative/studio/issues/5050 / https://github.com/iterative/dvc/issues/9074?

Originally posted by @mattseddon in https://github.com/iterative/studio/issues/4940#issuecomment-1457461091

dberenbaum commented 1 year ago

It would be useful to have a way to show what data is missing from the remote for any given experiment. I don't think we could show this for all experiments in the DVC CLI, but possibly in VS Code and definitely in Studio it would be useful to see.

@mattseddon Let me know if I understood your request correctly. It might be too expensive to check the remote for every local experiment in VS Code by default. On the other hand, Studio has a more static state and might be able to show missing data for every experiment/commit.

mattseddon commented 1 year ago

On the other hand, Studio has a more static state and might be able to show missing data for every experiment/commit.

Definite value add/collaboration multiplier for Studio.

This would be another way to address https://github.com/iterative/vscode-dvc/issues/1769 by using the three products.

dberenbaum commented 1 year ago

IMO DVC/VS Code can reflect remote status for the current workspace but I would only expect Studio to show it for all experiments and commits.

dberenbaum commented 1 year ago

@tapadipti @shcheklein Should we move this to a Studio feature request?

shcheklein commented 1 year ago

yes, not sure what is the user scenario and idea in Studio tbh, but I'm fine to move it

dberenbaum commented 1 year ago

@mattseddon First raised the issue, but IMO the idea would be to show some indicator for files which can't be found in the remote. It could look similar to how we show commits that failed to parse, except for specific file columns.

Screenshot 2023-03-09 at 4 16 58 PM

By showing in Studio what's missing from the commit, we make it more obvious to users that something is missing and what it is. For example, the recent issue you were helping with @shcheklein where the user updated the params.yaml file and DVC then expected a different file which hadn't been generated or pushed yet.

tapadipti commented 1 year ago

From what I know, Studio does not try to access the remote files, except for files with metrics/params. @iterative/studio-backend can confirm.

Is this issue about Studio checking access to all the files and displaying whether or not each file was found?

dberenbaum commented 1 year ago

Is this issue about Studio checking access to all the files and displaying whether or not each file was found?

Yes, I think it should be like running dvc data status/dvc status for each commit in Studio so it's clear whether any files are missing.

tapadipti commented 1 year ago

How important is this going to be - do we expect the user to want to download those files at some point?

dberenbaum commented 1 year ago

How important is this going to be

I don't think it's p1, but I agree with @mattseddon above:

Definite value add/collaboration multiplier for Studio.

Re:

do we expect the user to want to download those files at some point?

Not from Studio. I think it saves the user hassle in knowing which commits are "complete" and whether they or their colleague forgot to push something, or which old commits have been garbage collected.

I mentioned above that I think it would have been useful for https://github.com/iterative/studio-support/issues/75#issuecomment-1452204326, for example. There was confusion about whether model 320 or model 1280 was expected (only model 1280 was pushed). With this warning in Studio, the user would have been alerted that DVC saw model 320 as missing in the latest commit, and while they probably would have still been confused why DVC expected model 320, it would have saved time in narrowing down the problem.

dberenbaum commented 1 year ago

This would basically take care of the "pushed" columns in this request: https://discord.com/channels/485586884165107732/563406153334128681/1085582627806920714. It also relates to https://github.com/iterative/dvc/issues/5369 (checking that everything is up to date without having to pull anything).


Screenshot 2023-03-16 at 5 12 41 PM
ostromann commented 1 year ago

This would basically take care of the "pushed" columns in this request

I have continued working on this (me = Le_Olleaux on Discord). There are several complications to this matter, when it comes to frozen stages, import stages, sub-pipelines etc.. But, using a similar approach like the dvc dag command (https://github.com/iterative/dvc/blob/main/dvc/commands/dag.py), I have now managed to build a tool that can check all these things and insert an overview into a markdown file both as mermaid graph and as table.

It checks for clean and pushed deps and outs, changed commands, and invalidates stages downstream to put them into a "grey/unknown" state. It supports multiple pipelines, specifying targets, ignoring deps-checking on import stages (which can be slow). Adding a simple check and proper return statements should enable a dvc verify command (as proposed in iterative/dvc#5369) quite easily. Not sure if it fits you're plan for it.

Right now, we are using this tool a as pre-commit hook (https://pre-commit.com/) to auto-generate documentation about our pipeline states on each commit. I think it is a feature that belongs to DVC rather than Studio. I can make a pull request of my tool referring to to iterative/dvc#5369.

dberenbaum commented 1 year ago

@ostromann Could you please add that comment to https://github.com/iterative/dvc/issues/5369? It is in our plans to add something like this DVC, so it would be great to hear your thoughts.

I think it is a feature that belongs to DVC rather than Studio.

Yes, sorry for any confusion. There's no intention to limit this functionality to Studio. The reason I mentioned it here is because of my comment above:

IMO DVC/VS Code can reflect remote status for the current workspace but I would only expect Studio to show it for all experiments and commits.

In Studio, it can be helpful to show a similar status check for every commit in your repo, so that you are assured that each commit is "clean" without having to pull and run a DVC command, and you can quickly spot issues with older commits that have some "unclean" status.

dberenbaum commented 1 year ago

I think this is important for model registry artifacts. Right now, I can register a model version and copy the dvc get command shown in Studio for that version, only to find that it fails because the file was never pushed. It would be useful to check ahead of time if the artifact can be found on the remote and warn on forbid registering models if it can't be found. cc @aguschin

efiop commented 1 year ago

Thanks to @dberenbaum for pointing me to this. I have a similar need for cache in https://github.com/iterative/studio/pull/6541, so thinking that we could introduce

$ dvc cache/remote check-missing [--rev REV]
path/metric1
foo
bar
...

to suit both needs in CLI and API.