iterative / dvc

🦉 ML Experiments and Data Management with Git
https://dvc.org
Apache License 2.0

pull: fails on HDFS after removing `.dvc/cache` #10583

Open zsaladin opened 1 day ago

zsaladin commented 1 day ago

Bug Report

Description

dvc pull fails on HDFS after removing .dvc/cache. This means that whenever someone freshly clones the repository, dvc pull always fails. However, dvc pull -q succeeds, so it seems that some log printing causes the problem.

Here are some details that may help with debugging.

  1. The variable total is not a number, which causes the error.
  2. The variable **d contains total, which comes from size.
  3. In this case, however, size is not a number; it is a bound method.
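The failure mode described in the list above can be reproduced in isolation. The sketch below is hypothetical: `FakeHdfsFile` and `check_total` are illustrative stand-ins, not code from DVC, fsspec, or tqdm. It assumes only that the file object exposes `size` as a method and that the progress code evaluates `n >= (total + 0.5)`, as the traceback suggests.

```python
class FakeHdfsFile:
    """Hypothetical stand-in for a file object whose `size` is a method."""

    def size(self):
        return 1024


def check_total(n, total):
    """Mimics the comparison seen in the traceback: n >= (total + 0.5)."""
    return n >= (total + 0.5)


f = FakeHdfsFile()

# getattr returns the bound method itself, not the number it would return.
total = getattr(f, "size", None)

try:
    check_total(0, total)
    error = None
except TypeError as exc:
    error = str(exc)

print(error)
```

If the progress bar is never rendered, the comparison is never evaluated, which would be consistent with dvc pull -q succeeding.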

Reproduce

  1. dvc init
  2. Copy dataset.zip to the directory
  3. dvc remote add -d storage hdfs://user/dvc/mystorage
  4. dvc add dataset.zip
  5. dvc push
  6. rm -rf dataset.zip .dvc/cache
  7. dvc pull

Expected

dvc pull and dvc fetch execute successfully on HDFS.

Environment information

Output of dvc doctor:

$ dvc doctor

DVC version: 3.55.2 (pip)
-------------------------
Platform: Python 3.10.12 on Linux-6.10.4-linuxkit-x86_64-with-glibc2.28
Subprojects:
    dvc_data = 3.16.5
    dvc_objects = 5.1.0
    dvc_render = 1.0.2
    dvc_task = 0.4.0
    scmrepo = 3.3.7
Supports:
    azure (adlfs = 2024.7.0, knack = 0.12.0, azure-identity = 1.17.1),
    gdrive (pydrive2 = 1.20.0),
    gs (gcsfs = 2024.9.0.post1),
    hdfs (fsspec = 2024.9.0, pyarrow = 17.0.0),
    http (aiohttp = 3.10.5, aiohttp-retry = 2.8.3),
    https (aiohttp = 3.10.5, aiohttp-retry = 2.8.3),
    oss (ossfs = 2023.12.0),
    s3 (s3fs = 2024.9.0, boto3 = 1.35.16),
    ssh (sshfs = 2024.6.0),
    webdav (webdav4 = 0.10.0),
    webdavs (webdav4 = 0.10.0),
    webhdfs (fsspec = 2024.9.0)
Config:
    Global: /home/user/.config/dvc
    System: /etc/xdg/dvc
Cache types: symlink
Cache directory: fuse.osxfs on osxfs
Caches: local
Remotes: hdfs
Workspace directory: fuse.osxfs on osxfs
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/19c955812b0a09cd409a3779f4e4d774

Additional Information (if any):

The error log is attached below.

```
$ dvc pull -v
2024-10-08 15:51:47,388 DEBUG: v3.55.2 (pip), CPython 3.10.12 on Linux-6.10.4-linuxkit-x86_64-with-glibc2.28
2024-10-08 15:51:47,390 DEBUG: command: /home/user/.local/bin/dvc pull -v
Collecting |0.00 [00:00, ?entry/s]
Fetching
2024-10-08 15:51:49,343 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2024-10-08 15:51:50,297 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
2024-10-08 15:51:50,625 DEBUG: Preparing to transfer data from 'hdfs://user/dvc/mystorage/files/md5' to '/home/user/repo/.dvc/cache/files/md5'
2024-10-08 15:51:50,625 DEBUG: Preparing to collect status from '/home/user/repo/.dvc/cache/files/md5'
2024-10-08 15:51:50,625 DEBUG: Collecting status from '/home/user/repo/.dvc/cache/files/md5'
2024-10-08 15:51:50,629 DEBUG: Preparing to collect status from '/user/dvc/mystorage/files/md5'
2024-10-08 15:51:50,630 DEBUG: Collecting status from '/user/dvc/mystorage/files/md5'
2024-10-08 15:51:50,691 DEBUG: Estimated remote size: 256 files
2024-10-08 15:51:50,692 DEBUG: Querying 2 oids via traverse
Fetching 0%| |Fetching from hdfs 0/1 [00:00= (total + 0.5): # allow float imprecision (#849)
TypeError: unsupported operand type(s) for +: 'method' and 'float'
Fetching Exception ignored in: 0/1 [00:00
```

skshetry commented 11 hours ago
  File "/home/user/.local/share/uv/tools/dvc/lib/python3.10/site-packages/fsspec/spec.py", line 904, in get_file
    callback.set_size(getattr(f1, "size", None))

From the traceback above, this looks like a bug in fsspec. Could you please open an issue at https://github.com/fsspec/filesystem_spec?

I agree size should be a property, not a method.
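To illustrate the property-versus-method distinction, here is a minimal sketch. All names in it (`SizeAsMethod`, `SizeAsProperty`, `resolve_size`) are hypothetical and are not part of fsspec, pyarrow, or DVC. The helper shows one defensive way a caller could normalize either form before passing the value to a progress callback; it is not the fix adopted by any of these projects.

```python
class SizeAsMethod:
    """Hypothetical file object: `size` must be called to get a number."""

    def size(self):
        return 1024


class SizeAsProperty:
    """Hypothetical file object: `size` is already a number on access."""

    @property
    def size(self):
        return 1024


def resolve_size(fobj):
    """Return the size as a number, or None if the object has no `size`."""
    size = getattr(fobj, "size", None)
    if callable(size):
        # `size` is a method, so call it to obtain the actual value.
        size = size()
    return size


print(resolve_size(SizeAsMethod()))
print(resolve_size(SizeAsProperty()))
print(resolve_size(object()))
```

With a helper like this, a call such as `callback.set_size(resolve_size(f1))` would receive a number in both cases, assuming the method takes no arguments.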