Closed dberenbaum closed 3 years ago
repo.plots
return after this change?Till now it has been returining vega json specification, which later was wrapped into html. If we want to start supporting images, returning vega specification is no longer the option. I see 2 ways how we can proceed with that:
revision:path_in_repo
data content and probably return this data alongside additional config avialable in dvc.yaml
and handle visualization outside of dvc repo (create some utils called from within CmdPlots)repo.plots
create the HTML reportWorth mentioning that in both cases supporting --show-json
does not make sense
For this issue I would lean towards solution 1, as we already have been talking about letting more advanced users handle their own visualizations and that would be some first step to actually do that.
In later stages (3.0?) we could be returning revision:path_in_repo:url
mapping, but currently I would refrain from doing that, because we are loading the data during iteration over brancher. So this will probably require change of how we handle brancher (related #6039 )
Another thing to note that using brancher and collecting images over multiple revisions might drain machine memory. While it is probably not a very common use case to have that much plots, i think that if someone will think about using dvc plots
to visualize intermediate results of some CNN model, it will be hard to keep up. That is also a reason for proposed "later stages".
In this case I see 2 potential options:
dvc plots
generate directory instead of single-file. That way we can have the images as static resources. Also going this way makes sense with user customization of the webpage. I think that in this case going with 2 makes the most sense.
Those are major issues that need decision before proceeding with implementation. Besides that it seems to me that we need to abstract render
method to be able to handle images. That does not look too complicated, as having read the files in collect
step we just need to save them in proper place and prepare <div>
containing proper img
links.
Great summary! I'm adding some UI questions:
There are 3 different dvc plots
commands:
show
: Options like --all-branches
and --num
could show multiple commits. Commits could be side-by-side in this view or some other configuration that works well for large numbers of commits. Would this be helpful for images or vega or both?diff
: Could we "merge" images somehow, like overlay them on top of each other with some transparency? Would this be helpful? For example, if the images are identical except for the lines of the plots, they might look similar to vega diffs when overlaid, or it could show how bounding boxes on top of a raw image input differ.config
: Do users need to specify whether the plots are vega or images, or can we just use file extensions? Are any other config options supported or needed for images?Recap after talking with @pared
Major issue is what will
repo.plots
return after this change?
The suggested solution 1 makes sense. One issue is needing to coordinate with Studio and VSCode teams.
The format of resulting HTML
The directory structure in solution 2 makes sense. One detail that needs more research is how best to copy all the files from the cache to both limit having to store duplicates but also make the resulting directory something that can be easily shared with others without needing to rely on the local cache.
Commands and options
show
should be kept simple with the existing options for now, showing vega plots and images from the current workspace.diff
should also be kept simple with images side by side for now.config
should not require users to specify whether plots are vega or images (file extensions should be sufficient).Layout of the HTML
Keep this as simple as possible for now. We can iterate later. Other teams like Studio might lead the way here and DVC may be able to implement their designs later.
@Suor @duijf @mattseddon @rogermparent Please see above for the plan to implement image support in dvc and leave any comments/questions below.
@dberenbaum @pared - may be I'm missing this, but do we have any suggestion on how do implement this on the dvc.yaml level?
@efiop btw, why do we have this labeled external-product
? It looks like it's about DVC first, external product will benefit on top of that (and their requirements should be taken into account of course).
Do I understand correctly that you intend to load all images into memory and return a huge structure on repo.plots.collect()
? This looks problematic since people may have a lot of plots/images and images may be big themselves, esp. if uncompressed. People working locally may partially workaround this problem by specifying a target, but for Studio and VSCode this may explode for some monorepo or simply a bigger repo. Can we load contents of files lazily and also have a filename for each, at least if that is checked out? This may be beneficial for DVC itself, i.e. one may use low-level flie copy instead of loading into memory and then dumping back.
@Suor
We have been discussing it with @dberenbaum and decided to start with an implementation based on loading the data during collect
, as low-hanging fruit. My current thoughts on that are that, in the case of dvc
cached files, we could make collect
return cache URL (url
key in result and no data
key means that loading has to be performed during rendering). In this case, "big" use case memory consumption could be limited. In the case of git
- controlled files we would probably have to retain loading straight into memory as I currently do not see a straightforward method of obtaining data location that could be returned via API.
@shcheklein
As my current experiments with this feature progress, I do not see any changes to dvc.yaml
. Images support will probably ignore most (if not all) parameters that can be provided for vega-based plots.
In the case of git - controlled files we would probably have to retain loading straight into memory as I currently do not see a straightforward method of obtaining data location that could be returned via API.
The current approach makes sense to me as a first step.
Some ideas if we need to optimize (that might be terrible ideas 😄 ):
Could we do the following during collect
?
dvc plots show
(from the dvc cache where possible, otherwise using something like git show rev:path > plots/static
).Is that mixing collection and rendering logic too much? If so, can collect
return the revision, the path, and the fact that it's stored in git? The downstream rendering function can do the copying to the static resource directory with that info.
Since we need a static resource directory, in the future we could also consider copying non-image plot files there the same way and reference those as vega data urls instead of inline data (https://vega.github.io/vega-lite/docs/data.html#types-of-data-sources). Not sure that's needed for dvc, but I wonder if it would be helpful for Studio (cc @Suor).
Since we need a static resource directory, in the future we could also consider copying non-image plot files there the same way and reference those as vega data urls instead of inline data
Web browsers are treating it (providing URL as data source) as cross-origin requests, which are allowed only for HTTP, and for locally created HTML data won't be loaded into vega, we will get empty plots.
Okay to understand this better. Do we actually load every possible image in the repo (and consider it a potential plot), or do we load images from the plots
outputs?
Not sure that's needed for dvc, but I wonder if it would be helpful for Studio
I imagined some FileRef
objects inside a dict returned by collect. Then you may .read()
it or get url
, filepath
, rev
whatever identifies it. It would also be helpful to have a way to know file size before reading it.
@shcheklein the logic that file has to be marked as plot remains, so only files found in dvc.yaml
under plots
will be gathered
@pared can we mark a directory as plots?
@Suor yes, in this case properties will be applied to all the files inside.
What kind of props do we expect images might have?
@Suor sorry for the delay, for now, we don't support any properties for images, one can modify them but it will not have any effect. Not restricting it for now as we allow dir plots which might have mixed content.
Closed by #6431
Adding support for handling images as plots outputs so they can be compared between experiment revisions.
The design spec is available in https://www.notion.so/iterative/Experiment-Images-3fd1a4f5df7e4c27be005a1e1e7807bb#d37c871a3c9440d79589b0861b24541b.
This issue should cover research into implementation and should result in a clear implementation plan, including the creation of new issues needed to actually implement the feature.
For example, one unresolved question is how to collect images for the html? If they are inside
<img src=some_image.png>
tags, where issome_image.png
for a particular revision? Do they all get copied to some static directory locally? Do we use the url where it is stored in a dvc or git remote?