iterative / dvc

🦉 ML Experiments and Data Management with Git
https://dvc.org
Apache License 2.0
13.7k stars 1.18k forks source link

plots: support images #6145

Closed dberenbaum closed 3 years ago

dberenbaum commented 3 years ago

Adding support for handling images as plots outputs so they can be compared between experiment revisions.

The design spec is available in https://www.notion.so/iterative/Experiment-Images-3fd1a4f5df7e4c27be005a1e1e7807bb#d37c871a3c9440d79589b0861b24541b.

This issue should cover research into implementation and should result in a clear implementation plan, including the creation of new issues needed to actually implement the feature.

For example, one unresolved question is how to collect images for the html? If they are inside <img src=some_image.png> tags, where is some_image.png for a particular revision? Do they all get copied to some static directory locally? Do we use the url where it is stored in a dvc or git remote?

pared commented 3 years ago

1. Major issue is what will repo.plots return after this change?

Till now it has been returining vega json specification, which later was wrapped into html. If we want to start supporting images, returning vega specification is no longer the option. I see 2 ways how we can proceed with that:

  1. Make plots "aggregated api.get" that will return revision:path_in_repo data content and probably return this data alongside additional config avialable in dvc.yaml and handle visualization outside of dvc repo (create some utils called from within CmdPlots)
  2. Make repo.plots create the HTML report

Worth mentioning that in both cases supporting --show-json does not make sense

For this issue I would lean towards solution 1, as we already have been talking about letting more advanced users handle their own visualizations and that would be some first step to actually do that. In later stages (3.0?) we could be returning revision:path_in_repo:url mapping, but currently I would refrain from doing that, because we are loading the data during iteration over brancher. So this will probably require change of how we handle brancher (related #6039 )

Another thing to note that using brancher and collecting images over multiple revisions might drain machine memory. While it is probably not a very common use case to have that much plots, i think that if someone will think about using dvc plots to visualize intermediate results of some CNN model, it will be hard to keep up. That is also a reason for proposed "later stages".

2. The format of resulting HTML

In this case I see 2 potential options:

  1. Keep current single-html approach and paste images as base64 encoded strings - possible but inconvinient, also makes editing of html nearly impossible due to sheer amount of text that will be present in even small html with few image plots.
  2. Make dvc plots generate directory instead of single-file. That way we can have the images as static resources. Also going this way makes sense with user customization of the webpage.

I think that in this case going with 2 makes the most sense.

Those are major issues that need decision before proceeding with implementation. Besides that it seems to me that we need to abstract render method to be able to handle images. That does not look too complicated, as having read the files in collect step we just need to save them in proper place and prepare <div> containing proper img links.

dberenbaum commented 3 years ago

Great summary! I'm adding some UI questions:

Commands and options

There are 3 different dvc plots commands:

  1. show: Options like --all-branches and --num could show multiple commits. Commits could be side-by-side in this view or some other configuration that works well for large numbers of commits. Would this be helpful for images or vega or both?
  2. diff: Could we "merge" images somehow, like overlay them on top of each other with some transparency? Would this be helpful? For example, if the images are identical except for the lines of the plots, they might look similar to vega diffs when overlaid, or it could show how bounding boxes on top of a raw image input differ.
  3. config: Do users need to specify whether the plots are vega or images, or can we just use file extensions? Are any other config options supported or needed for images?

Layout of the HTML

  1. Within a single output/path, how to show lots of commits side-by-side?
  2. Between many outputs/paths, how to organize (both vega and images) within the html?
dberenbaum commented 3 years ago

Recap after talking with @pared

Major issue is what will repo.plots return after this change?

The suggested solution 1 makes sense. One issue is needing to coordinate with Studio and VSCode teams.

The format of resulting HTML

The directory structure in solution 2 makes sense. One detail that needs more research is how best to copy all the files from the cache to both limit having to store duplicates but also make the resulting directory something that can be easily shared with others without needing to rely on the local cache.

Commands and options

  1. show should be kept simple with the existing options for now, showing vega plots and images from the current workspace.
  2. diff should also be kept simple with images side by side for now.
  3. config should not require users to specify whether plots are vega or images (file extensions should be sufficient).

Layout of the HTML

Keep this as simple as possible for now. We can iterate later. Other teams like Studio might lead the way here and DVC may be able to implement their designs later.

dberenbaum commented 3 years ago

@Suor @duijf @mattseddon @rogermparent Please see above for the plan to implement image support in dvc and leave any comments/questions below.

shcheklein commented 3 years ago

@dberenbaum @pared - may be I'm missing this, but do we have any suggestion on how do implement this on the dvc.yaml level?

@efiop btw, why do we have this labeled external-product? It looks like it's about DVC first, external product will benefit on top of that (and their requirements should be taken into account of course).

Suor commented 3 years ago

Do I understand correctly that you intend to load all images into memory and return a huge structure on repo.plots.collect()? This looks problematic since people may have a lot of plots/images and images may be big themselves, esp. if uncompressed. People working locally may partially workaround this problem by specifying a target, but for Studio and VSCode this may explode for some monorepo or simply a bigger repo. Can we load contents of files lazily and also have a filename for each, at least if that is checked out? This may be beneficial for DVC itself, i.e. one may use low-level flie copy instead of loading into memory and then dumping back.

pared commented 3 years ago

@Suor We have been discussing it with @dberenbaum and decided to start with an implementation based on loading the data during collect, as low-hanging fruit. My current thoughts on that are that, in the case of dvc cached files, we could make collect return cache URL (url key in result and no data key means that loading has to be performed during rendering). In this case, "big" use case memory consumption could be limited. In the case of git - controlled files we would probably have to retain loading straight into memory as I currently do not see a straightforward method of obtaining data location that could be returned via API.

pared commented 3 years ago

@shcheklein As my current experiments with this feature progress, I do not see any changes to dvc.yaml. Images support will probably ignore most (if not all) parameters that can be provided for vega-based plots.

dberenbaum commented 3 years ago

In the case of git - controlled files we would probably have to retain loading straight into memory as I currently do not see a straightforward method of obtaining data location that could be returned via API.

The current approach makes sense to me as a first step.

Some ideas if we need to optimize (that might be terrible ideas 😄 ):

Could we do the following during collect?

  1. Copy the file to the static resource directory needed for dvc plots show (from the dvc cache where possible, otherwise using something like git show rev:path > plots/static).
  2. Return the path in the static resource directory.

Is that mixing collection and rendering logic too much? If so, can collect return the revision, the path, and the fact that it's stored in git? The downstream rendering function can do the copying to the static resource directory with that info.

Since we need a static resource directory, in the future we could also consider copying non-image plot files there the same way and reference those as vega data urls instead of inline data (https://vega.github.io/vega-lite/docs/data.html#types-of-data-sources). Not sure that's needed for dvc, but I wonder if it would be helpful for Studio (cc @Suor).

pared commented 3 years ago

Since we need a static resource directory, in the future we could also consider copying non-image plot files there the same way and reference those as vega data urls instead of inline data

Web browsers are treating it (providing URL as data source) as cross-origin requests, which are allowed only for HTTP, and for locally created HTML data won't be loaded into vega, we will get empty plots.

shcheklein commented 3 years ago

Okay to understand this better. Do we actually load every possible image in the repo (and consider it a potential plot), or do we load images from the plots outputs?

Suor commented 3 years ago

Not sure that's needed for dvc, but I wonder if it would be helpful for Studio

I imagined some FileRef objects inside a dict returned by collect. Then you may .read() it or get url, filepath, rev whatever identifies it. It would also be helpful to have a way to know file size before reading it.

pared commented 3 years ago

@shcheklein the logic that file has to be marked as plot remains, so only files found in dvc.yaml under plots will be gathered

Suor commented 3 years ago

@pared can we mark a directory as plots?

pared commented 3 years ago

@Suor yes, in this case properties will be applied to all the files inside.

Suor commented 3 years ago

What kind of props do we expect images might have?

pared commented 3 years ago

@Suor sorry for the delay, for now, we don't support any properties for images, one can modify them but it will not have any effect. Not restricting it for now as we allow dir plots which might have mixed content.

dberenbaum commented 3 years ago

Closed by #6431