iterative / dvc

🦉 ML Experiments and Data Management with Git
https://dvc.org
Apache License 2.0

Support CSV as Metrics Data Format #10184

Open mnrozhkov opened 6 months ago

mnrozhkov commented 6 months ago

Summary:

We propose adding CSV support as a metrics file format. This would let users take advantage of the flexibility and familiarity of CSV and of the DataFrame libraries widely used to calculate metrics.

Current Limitations:

Presently, DVC supports JSON, TOML 1.0, and YAML 1.2 metrics files. The absence of CSV support restricts compatibility and integration with common data workflows, and converting tabular data into nested JSON is tedious.

Proposed Solution:

Benefits:

Use Case Example:

A data scientist needs to log metrics for a CV model (e.g. vehicle inspection) that are stored in a CSV file.

skshetry commented 6 months ago

Should params support CSV too?

mnrozhkov commented 6 months ago

> Should params support CSV too?

I don't think so. The only case I can imagine where one would have a table of parameters is hyperparameter tuning, but that is out of scope, and I have not heard any such requests.

dberenbaum commented 6 months ago

Thanks @mnrozhkov! Note that we had an issue open until a couple weeks ago for this: https://github.com/iterative/dvc/issues/5409. It was open for almost 3 years, but there was no discussion or thumbs there, so let's keep in mind that general demand for this feature may be limited.

How should DVC know how to treat each column?

mnrozhkov commented 6 months ago

Good points, @dberenbaum!

> Should it try to infer the data type of each column and assume numeric columns are values and the others are keys? Unfortunately, CSV has no defined types, so not sure how we will do this without a heavier package like pandas.

  • I think we may assume that "metrics" fields contain numeric values and "keys" are converted to text

In my mind, the following dvc.yaml config should be sufficient:

metrics:
  - metrics.csv:
      keys: ["Vehicle", "Part"]
      metrics: ["Accuracy", "Count"]

> What if the structure is not as simple as text columns on the left and numeric columns on the right?

Do you have any specific example of such a complex structure?
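To make the proposed `keys` / `metrics` schema concrete, here is a minimal sketch of how such a config could be interpreted, using only the Python standard library. This is not DVC's implementation: `csv_to_nested_metrics` is a hypothetical helper, the sample rows are made up, and the choice to nest by key columns is an assumption about how the result would map onto the nested formats DVC already reads.

```python
import csv
import io
import json


def csv_to_nested_metrics(text, keys, metrics):
    """Nest rows by the key columns and parse metric columns as floats.

    Hypothetical helper for illustration; not part of DVC.
    """
    result = {}
    for row in csv.DictReader(io.StringIO(text)):
        node = result
        for key in keys:
            # Key columns stay as text and become nesting levels.
            node = node.setdefault(row[key], {})
        for name in metrics:
            # Metric columns are assumed numeric; float() raises
            # ValueError if a cell cannot be parsed.
            node[name] = float(row[name])
    return result


# Made-up data using the column names from the config above.
SAMPLE = """Vehicle,Part,Accuracy,Count
car,door,0.91,120
car,hood,0.88,95
truck,door,0.79,40
"""

nested = csv_to_nested_metrics(
    SAMPLE, keys=["Vehicle", "Part"], metrics=["Accuracy", "Count"]
)
print(json.dumps(nested, indent=2))
```

Because the schema is explicit, no type inference is needed: anything listed under `keys` is treated as text and anything under `metrics` must parse as a number.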

dberenbaum commented 6 months ago

> In my mind, the following dvc.yaml config should be sufficient

Sorry, I misunderstood and thought you were suggesting that DVC infer the keys and metrics columns. If we specify those in the dvc.yaml, it makes sense. My only question would be the level of effort it would take (cc @skshetry).

uditrana commented 6 months ago

Just wanted to chime in and say that my company is running into this use case decently often across many repositories... mainly because we monitor the same metrics over many slices of the dataset.

shcheklein commented 1 month ago

Folks, I suggest two simple steps first:

I feel that if people are ready to do something like:

metrics:
  - metrics.csv:
      keys: ["Vehicle", "Part"]
      metrics: ["Accuracy", "Count"]

(learning it is itself a lot of time), they should be fine dumping to JSON; it's +/- two (?) lines of code

Also, we would still need some default behavior in case people don't provide this schema. Raise an exception? Treat as I described?
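One possible default when no schema is given is the type inference raised earlier in the thread, done without pandas. The sketch below is only an illustration under that assumption: `infer_columns` is a hypothetical helper, not DVC behavior, and it raises an exception when nothing numeric is found.

```python
import csv
import io


def _is_number(value):
    """True if the cell parses as a float (missing cells count as non-numeric)."""
    try:
        float(value)
    except (TypeError, ValueError):
        return False
    return True


def infer_columns(text):
    """Split CSV columns into (keys, metrics) by trying float() on every cell.

    Hypothetical helper for illustration; not part of DVC.
    """
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    if not rows:
        raise ValueError("cannot infer a schema from an empty CSV")
    keys, metrics = [], []
    for name in reader.fieldnames:
        # A column is a metric only if every one of its cells is numeric.
        is_metric = all(_is_number(row[name]) for row in rows)
        (metrics if is_metric else keys).append(name)
    if not metrics:
        raise ValueError("no numeric columns found; specify the schema explicitly")
    return keys, metrics


# Made-up data with the column names used in the thread.
SAMPLE = """Vehicle,Part,Accuracy,Count
car,door,0.91,120
truck,door,0.79,40
"""

keys, metrics = infer_columns(SAMPLE)
print(keys, metrics)
```

The weakness of this heuristic is visible in the rule itself: a key column that happens to hold numbers (a year, an ID) would be classified as a metric, which is one argument for requiring the explicit schema instead.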