iterative / dvc

🦉 ML Experiments and Data Management with Git
https://dvc.org
Apache License 2.0
13.36k stars 1.16k forks source link

dvc metrics diff --all: on same branch is empty #10429

Open MaximilianTunk opened 1 month ago

MaximilianTunk commented 1 month ago

Bug Report

Description

Hello, we found that dvc metrics diff --alloutputs nothing, if a_revand b_rev refer to the same git commit. No matter if they are exactly the same or different types of references (HEAD vs branch_name, etc.)

Reproduce

Expected

output metrics-diff table with all values with diff = 0.0

Environment information

Output of dvc doctor:

-------------------------
Platform: Python 3.8.10 on Linux-6.5.0-28-generic-x86_64-with-glibc2.34
Subprojects:
        dvc_data = 2.22.6
        dvc_objects = 1.4.9
        dvc_render = 0.7.0
        dvc_task = 0.4.0
        scmrepo = 1.6.0
Supports:
        http (aiohttp = 3.9.3, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.9.3, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2024.3.1, boto3 = 1.34.51)
Config:
        Global: /home/mtunkowitsch/.config/dvc
        System: /etc/xdg/xdg-ubuntu/dvc
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/nvme0n1p6
Caches: local
Remotes: s3
Workspace directory: ext4 on /dev/nvme0n1p6
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/6a15ed68432f0e8e4dba7e407082545a

Additional Information (if any):

Running the debugger, we noticed that metrics/diff.py:diff expects the results of metrics.show() to have the exact rev keys extracted. However metrics/show.py:show uses the brancher to extract the revs to use. However the brancher groups revs with the same sha and joins them.

This means that when we call dvc metrics diff --all main main the brancher would group main and main and return main,main. Hence the repo.metrics.show() outputs all metrics with the key main,main and the repo.metrics.diff() doesn't find results for main

dberenbaum commented 1 month ago

@MaximilianTunk Great research! Since you have already dug this far, do you think you would be able to contribute a fix?

MaximilianTunk commented 1 month ago

@dberenbaum Thank you :) I'm interested in contributing a fix, but i have to admit it'll be my first one ever to an open source github project. I have a really nice colleague that is here to help any time if I need help or have questions, so fixing it should'nt be any problem. If I have any questions or updates about possible solutions, I'll add a comment to this thread!

adamliter commented 1 week ago

It looks like this issue is also present for dvc params diff --all.

@MaximilianTunk Are you working on this? If you haven't gotten to it, I'd be happy to take a shot at it.

Thinking out loud about the implementation, it seems desirable that dvc.scm.iter_revs returns a Mapping with one entry per commit hash even if passed multiple references/revisions that point to the same commit hash. This seems desirable because there are a few places in the codebase where iteration over the result of Repo.brancher is done (e.g., [1], [2]), and there's not much point in iterating over the same reference/revision multiple times, just to get the same result.

But because of that, it's kind of unclear what a good solution here is. My first thought would be to use some character that's not allowed in git references to join the revision names in dvc.repo.brancher (e.g., : instead of ,), and then later split the dictionary keys on that character, if it's present in any of the keys. But that would have to be done in a few places in the codebase. Writing a wrapper around it seems clunky. Not sure.