iterative / dvc

🦉 ML Experiments and Data Management with Git
https://dvc.org
Apache License 2.0
13.67k stars 1.17k forks source link

dvc status --json can output non-json #10242

Open gregstarr opened 8 months ago

gregstarr commented 8 months ago

Bug Report

Description

When there are large files to hash which are not cached, dvc status --json will still print out the message, which makes the output not valid json. I believe the use case of dvc status --json is to be able to pipe the output to a file and easily read it with another program, so extra messages make this inconvenient.

I accidentally erased the output I had but I think this is the message that is printed out: https://github.com/iterative/dvc-data/blob/300a3e072e5baba50f7ac5f91240891c0e30d030/src/dvc_data/hashfile/hash.py#L174

Reproduce

  1. large data file stage dependency
  2. dvc status --json for the first time

Expected

dvc status --json only outputs valid json

Environment information

Output of dvc doctor:

DVC version: 3.33.4 (choco)
---------------------------
Platform: Python 3.11.6 on Windows-10-10.0.19045-SP0
Subprojects:
        dvc_data = 2.24.0
        dvc_objects = 2.0.1
        dvc_render = 1.0.0
        dvc_task = 0.3.0
        scmrepo = 1.6.0
Supports:
        azure (adlfs = 2023.12.0, knack = 0.11.0, azure-identity = 1.15.0),
        gdrive (pydrive2 = 1.19.0),
        gs (gcsfs = 2023.12.2.post1),
        http (aiohttp = 3.9.1, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.9.1, aiohttp-retry = 2.8.3),
        oss (ossfs = 2023.12.0),
        s3 (s3fs = 2023.12.2, boto3 = 1.33.13),
        ssh (sshfs = 2023.10.0)
Config:
        Global: C:\Users\starrgw1\AppData\Local\iterative\dvc
        System: C:\ProgramData\iterative\dvc
dberenbaum commented 8 months ago

I think dvc status -q --json should be returning what you expect (json only) instead of only a return code. We have similar behavior in other commands like dvc data status -q --json. @skshetry What do you think?

gregstarr commented 8 months ago

I saw that in the documentation but it seemed like that would result in no output at all

do not write anything to standard output. Exit with 0 if data and pipelines are up to date, otherwise 1.

https://dvc.org/doc/command-reference/status#-q

dberenbaum commented 8 months ago

You are right, @gregstarr, it outputs nothing today. I mean that we should change the behavior to work this way, since there's no reason to do dvc status -q --json unless you want JSON output. We generally will still show JSON even with -q in these scenarios, like in dvc data status -q --json for example.

gregstarr commented 8 months ago

Personally, I think just --json should basically imply "-q --json" because I can't think of any use case for using --json but wanting extra output

dberenbaum commented 8 months ago

Discussed with the team. There are two issues here:

  1. We should not be writing those additional non-json logging messages to stdout
  2. We should be showing json regardless of -q when --json flag is used

@gregstarr How are you using the command?

gregstarr commented 8 months ago

I have been using it like this:

dvc status --json <target> > status.json

I am using it to check the status of my pipelines and pass the json data to a minimal flask server for viewing.

Basically came as a result of this discussion post: https://discuss.dvc.org/t/ignore-files-in-stage-external-dependency-output/1889/2

I have a periodic task which clears out all the DS_Store files from my remote and checks the status of my pipelines.

dberenbaum commented 8 months ago

@gregstarr dvc status -q --json should now provide only the JSON output.

@skshetry Do you want to keep it open as a reminder to stream logs to stderr here?

gregstarr commented 8 months ago

Thanks for looking into this!

anunayasri commented 3 weeks ago

I was looking at this to contribute the fix. But the fix has already been merged. Request the maintainers to kindly close the issue. cc: @dberenbaum

shcheklein commented 3 weeks ago

@skshetry Do you want to keep it open as a reminder to stream logs to stderr here?

@anunayasri I think that part is left here