iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0

ISO 8859-1 filenames break functionality such as dvc exp show #8015

Open ldelphinpoulat opened 2 years ago

ldelphinpoulat commented 2 years ago

Bug Report

ISO 8859-1 filenames break functionality such as dvc exp show

Description

A file whose name contains an ISO 8859-1 character (in my case 'ç') was committed to the Git repository. The repository was pushed to a remote server and then retrieved with a pull. After that, dvc exp show does not work properly. The filename causes a problem in scmrepo/git/backend/pygit2.py at line 57 (scmrepo==0.0.25, pygit2==1.9.2).

Reproduce

#!/bin/bash

set -exu
wsp=test_wspace
rep=test_repo

rm -rf $wsp && mkdir $wsp && pushd $wsp
main=$(pwd)

mkdir $rep && pushd $rep

git init
dvc init

echo "m: 1" > params.yaml

dvc run -d params.yaml -o output -n train cp params.yaml output

#git add -A
echo "breaking file" >> 'Fran'$'\347''ais.txt' 

git add -A
git commit -am "initial"

dvc exp show
echo "m: 2" > params.yaml

dvc exp run
dvc exp show

Expected

After typing 'q' to exit the first dvc exp show, which is already broken, the second dvc exp show fails with the following error message:

ERROR: unexpected error - 'data'

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!

Environment information

The bug was generated within a conda environment, where dvc 2.12.1 was installed with pip.

Output of dvc doctor:

$ dvc doctor
DVC version: 2.12.1 (pip)
---------------------------------
Platform: Python 3.10.4 on Linux-5.4.0-121-generic-x86_64-with-glibc2.31
Supports:
        webhdfs (fsspec = 2022.3.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/sda3
Caches: local
Remotes: None
Workspace directory: ext4 on /dev/sda3
Repo: dvc, git

Additional Information (if any):

skshetry commented 2 years ago

@ldelphinpoulat, can you please share the verbose output from dvc exp show -v? It has tracebacks and more logging information.

ldelphinpoulat commented 2 years ago

Here is the log: dvc_exp_show.log

ldelphinpoulat commented 2 years ago

@skshetry a workaround is to rename the file 'Fran'$'\347''ais.txt' to 'Francais.txt'. However, Git itself handles the original name correctly.
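
To locate names that may need this workaround, one option is to check which filenames are not valid UTF-8. The following is a minimal sketch with a hypothetical helper, non_utf8_names; it assumes the failure is tied to names that cannot be decoded as UTF-8, which this report suggests but does not prove.

```python
import os
from typing import Iterable, List


def non_utf8_names(names: Iterable[bytes]) -> List[bytes]:
    """Return the names that cannot be decoded as strict UTF-8."""
    bad = []
    for name in names:
        try:
            name.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(name)
    return bad


if __name__ == "__main__":
    # Passing a bytes path makes os.listdir() return raw bytes names,
    # so no decoding happens before the check.
    for name in non_utf8_names(os.listdir(b".")):
        print(name)
```

This only inspects one directory; a recursive check would need os.walk() with a bytes path.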

pmrowla commented 2 years ago

The issue is that exp show output is always UTF-8, but Git filenames are encoding-agnostic (and use the system encoding). We should be handling Git filenames with os.fsdecode() in the pygit2 backend of scmrepo before passing them back to the caller (dvc).
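
A minimal sketch of this mismatch, using only the Python standard library. It does not touch scmrepo or pygit2: the byte string stands in for the filename from the reproduction script (octal \347 is byte 0xE7, which is 'ç' in ISO 8859-1), and the helper names are hypothetical. On POSIX systems os.fsdecode() decodes with the filesystem encoding and the surrogateescape error handler; the sketch spells out the UTF-8 case explicitly so the result does not depend on the local locale.

```python
# Filename bytes as stored by Git: "Français.txt" encoded as ISO 8859-1.
RAW_NAME = b"Fran\xe7ais.txt"


def decode_like_fsdecode(name: bytes) -> str:
    """Hypothetical helper: mirrors os.fsdecode() on POSIX with a UTF-8 filesystem encoding."""
    return name.decode("utf-8", errors="surrogateescape")


def can_decode_strict(name: bytes) -> bool:
    """True if the bytes are valid UTF-8."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def can_encode_strict(text: str) -> bool:
    """True if the string can be written out as strict UTF-8."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


decoded = decode_like_fsdecode(RAW_NAME)
# Encoding with the same error handler restores the original bytes.
restored = decoded.encode("utf-8", errors="surrogateescape")

print(can_decode_strict(RAW_NAME))  # False: 0xE7 is not valid UTF-8 here
print(ascii(decoded))               # the undecodable byte becomes a lone surrogate
print(restored == RAW_NAME)         # True: the decoding is lossless
print(can_encode_strict(decoded))   # False: lone surrogates are not valid UTF-8
```

The decoded string round-trips to the original bytes, but it still cannot be emitted as strict UTF-8. Code that always writes UTF-8 would therefore need its own handling when displaying such names, for example encoding with errors="replace".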