iterative / dvc

🦉 ML Experiments and Data Management with Git
https://dvc.org
Apache License 2.0
13.67k stars 1.18k forks source link

dvc: windows dir out containing files with / in their name will explode #2059

Open Suor opened 5 years ago

Suor commented 5 years ago

Using master branch.

When we save dir cache we convert filenames in it to posix style in a lossy way:

"hey\there" -> "hey/there"
"hey/there" -> "hey/there"

which will lead to hey/there becoming hey\there after back transform, making filename contain path sep, which will lead to cache mismatch and other interesting behavior.

Suor commented 5 years ago

Same issue with other places posix path strings are used as serialization:

efiop commented 5 years ago

@Suor Good catch! Do you suggest that we raise an error if we encounter illegal slash on particular os?

Suor commented 5 years ago

So we may just not permit having / in filenames on Windows or escape it somehow when converting it to posix. That latter could be messy though.

shcheklein commented 5 years ago

/ is not allowed on Windows, right?

how does GIt deal with \ in a path? does not allow them?

Suor commented 5 years ago

You can use / on Windows, but not \, Unix is reverse.

As far as I can see git does nothing to mitigate that. It commits paths containing \ on Unix and / on Windows normally and then everything fails when you checkout on different system.

blakeNaccarato commented 1 year ago

I think this issue may be related to a plot generation bug that I have encountered on Windows. In params.yaml:dirs, I specify file paths. A JSON schema generated by Pydantic facilitates auto-completion of keys in params.yaml:dirs with sensible default paths, and uses \ as the path separator (due to underlying pathlib usage). For example, I can use params.yaml:dirs:results in a dvc.yaml stage as ${dirs:my_plot_file}, and that will lead to, say .\data\results\plot.png being used in dvc.yaml. If a plot is specified in this manner, it will not show up properly in experiments, or Iterative Studio, or via dvc plots show.

I will try to produce a minimal reproduction repo soon, perhaps all it takes is \ in a path in dvc.yaml without all the ${...} funny-business, to break plots. It seems to only break plots for me, as metrics and outs specified with \ don't fail. Its seemingly limited to plots.

This behavior is actually quite difficult to override in Python, due to pathlib uses some dynamic attribute loading to coerce Path to WindowsPath on Windows. I wish I could just tell pathlib to let me use PosixPath on Windows, because Windows supports such paths just fine.

The workaround is to just never use \ in your yaml files, but since tooling like pydantic and pathlib pass around WindowsPath by default, it makes it difficult to automate/de-duplicate things when coupling these tools with dvc.

skshetry commented 1 year ago

@blakeNaccarato, we don't use pydantic. Not sure about the issue here, but most likely we have some plots collection bug where we are assuming posixpath. Would be great if you could post a repro. Thanks.

blakeNaccarato commented 1 year ago

I will find time to repro this bug sometime this weekend, and also determine whether it's related to the issue at hand. Since my issue is related to ingesting \ in dvc.yaml causing issues, where the root issue is about /. The over-arching issue seems to be that DVC is not platform-agnostic regarding path separators in some corner cases.

I figured out a workaround for my specific issue, using Pydantic to generate schemas for params.yaml which then provide templated values to dvc.yaml deps/outs/plots/etc. In case anyone is using this particular combo, this is my workaround below...

Workaround for users of Pydantic and DVC on Windows with path issues

I orchestrate stages inside an editable-installed package in my repo, to facilitate Pylance/pyright auto-completion and refactoring project-wide. Fixing Pydantic schema generation to produce POSIX paths even on Windows amounts to using the schema_extra configuration attribute (see near the bottom of this page in Pydantic docs) to replace \\ with / in the default paths dumped to schema.

src/mypackage/models/dirs.py ```Python from __future__ import annotations from pathlib import Path from typing import Any, Optional from pydantic import BaseModel, DirectoryPath, FilePath, validator from mypackage.models.common import StrPath class Dirs(BaseModel): """Directories relevant to the project.""" class Config: @staticmethod def schema_extra(schema: dict[str, Any], model: Dirs): for prop in schema.get("properties", {}).values(): if default := prop.get("default"): prop["default"] = default.replace("\\", "/") base: DirectoryPath = Path(".") package: DirectoryPath = base / "src/mypackage" stages: DirectoryPath = package / "stages" models: DirectoryPath = package / "models" data: DirectoryPath = base / "data" project_schema: DirectoryPath = data / "schema" ... literature: DirectoryPath = data / "literature" literature_results: DirectoryPath = data / "literature_results" file_literature_results: Path = literature_results / "lit.csv" modelfun: DirectoryPath = data / "modelfun" file_model: Path = modelfun / "model.dillpickle" ... # "always" so it'll run even if not in YAML # "pre" because dir must exist pre-validation @validator( "project_schema", "literature_results", "modelfun", ..., always=True, pre=True, ) def validate_output_directories(cls, directory: StrPath): """Re-create designated output directories each run, for reproducibility.""" directory = Path(directory) directory.mkdir(parents=True, exist_ok=True) return directory ```
blakeNaccarato commented 1 year ago

Okay here's a repro that breaks a plot when there's a \ in the path. The plot path\fail.png doesn't show up in rendered HTML of dvc plots show, while path/pass.png does. The same goes for the plot rendering in the DVC VSCode extension.

The repro is in main, then in fix-offending-plot the \ is swapped for / in the offending plot, then in remove-offending-plot the offending plot is commented out of dvc.yaml. Each branch has the output of dvc plots show committed as well.

daavoo commented 1 year ago

Okay here's a repro that breaks a plot when there's a \ in the path. The plot path\fail.png doesn't show up in rendered HTML of dvc plots show, while path/pass.png does. The same goes for the plot rendering in the DVC VSCode extension.

The repro is in main, then in fix-offending-plot the \ is swapped for / in the offending plot, then in remove-offending-plot the offending plot is commented out of dvc.yaml. Each branch has the output of dvc plots show committed as well.

Closing in favor of https://github.com/iterative/dvc/issues/8689