iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0

dvc pull crashing on a FSx Lustre file system #10502

Open rrazavipour opened 3 months ago

rrazavipour commented 3 months ago

Bug Report

dvc pull

Description

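dvc pull crashes with sqlite3.OperationalError: disk I/O error, raised while DVC was handling an earlier MemoryError (see the tracebacks under Additional Information).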

Reproduce

This happens when pulling about 420 GB of data onto an Amazon FSx for Lustre file system. After completing the git clone, I run only dvc pull. After many hours of operation, I get the error mentioned above.

Expected

dvc pull to complete

Environment information

[ec2-user@ip-10-0-1-122 ~]$ dvc doctor
DVC version: 3.53.0 (pip)
Platform: Python 3.9.16 on Linux-6.1.97-104.177.amzn2023.x86_64-x86_64-with-glibc2.34
Subprojects:
    dvc_data = 3.15.1
    dvc_objects = 5.1.0
    dvc_render = 1.0.2
    dvc_task = 0.4.0
    scmrepo = 3.3.6
Supports:
    http (aiohttp = 3.10.0, aiohttp-retry = 2.8.3),
    https (aiohttp = 3.10.0, aiohttp-retry = 2.8.3),
    s3 (s3fs = 2024.6.1, boto3 = 1.34.131)
Config:
    Global: /home/ec2-user/.config/dvc
    System: /etc/xdg/dvc


Additional Information (if any):

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/dvc/cli/__init__.py", line 211, in main
    ret = cmd.do_run()
  File "/usr/local/lib/python3.9/site-packages/dvc/cli/command.py", line 27, in do_run
    return self.run()
  File "/usr/local/lib/python3.9/site-packages/dvc/commands/data_sync.py", line 35, in run
    stats = self.repo.pull(
  File "/usr/local/lib/python3.9/site-packages/dvc/repo/__init__.py", line 58, in wrapper
    return f(repo, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/dvc/repo/pull.py", line 42, in pull
    stats = self.checkout(
  File "/usr/local/lib/python3.9/site-packages/dvc/repo/__init__.py", line 58, in wrapper
    return f(repo, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/dvc/repo/checkout.py", line 142, in checkout
    diff = compare(old, new, relink=relink, delete=True, callback=pb.as_callback())
  File "/usr/local/lib/python3.9/site-packages/dvc_data/index/checkout.py", line 315, in compare
    ret = _compare(
  File "/usr/local/lib/python3.9/site-packages/dvc_data/index/checkout.py", line 243, in _compare
    for change in idiff(
  File "/usr/local/lib/python3.9/site-packages/dvc_data/index/diff.py", line 320, in diff
    yield from changes
  File "/usr/local/lib/python3.9/site-packages/dvc_data/index/diff.py", line 230, in _diff
    new_dir_items, new_unknown = _get_items(new, key, new_entry, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/dvc_data/index/diff.py", line 152, in _get_items
    items = dict(index.ls(key, detail=True))
  File "/usr/local/lib/python3.9/site-packages/dvc_data/index/view.py", line 128, in ls
    self._index._ensure_loaded(root_key)
  File "/usr/local/lib/python3.9/site-packages/dvc_data/index/index.py", line 759, in _ensure_loaded
    entry = self.get(prefix)
  File "/usr/lib64/python3.9/_collections_abc.py", line 763, in get
    return self[key]
  File "/usr/local/lib/python3.9/site-packages/dvc_data/index/index.py", line 671, in __getitem__
    item = self._trie.get(key)
  File "/usr/lib64/python3.9/_collections_abc.py", line 763, in get
    return self[key]
  File "/usr/local/lib/python3.9/site-packages/sqltrie/serialized.py", line 58, in __getitem__
    raw = self._trie[key]
  File "/usr/local/lib/python3.9/site-packages/sqltrie/sqlite/sqlite.py", line 266, in __getitem__
    row = self._get_node(key)
  File "/usr/local/lib/python3.9/site-packages/sqltrie/sqlite/sqlite.py", line 202, in _get_node
    rows = list(self._traverse(key))
  File "/usr/local/lib/python3.9/site-packages/sqltrie/sqlite/sqlite.py", line 191, in _traverse
    self._conn.executescript(STEPS_SQL.format(path=path, root=self._root_id))
MemoryError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/dvc", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/site-packages/dvc/cli/__init__.py", line 236, in main
    ret = _log_exceptions(exc) or 255
  File "/usr/local/lib/python3.9/site-packages/dvc/cli/__init__.py", line 147, in _log_exceptions
    _log_unknown_exceptions()
  File "/usr/local/lib/python3.9/site-packages/dvc/cli/__init__.py", line 49, in _log_unknown_exceptions
    logger.debug("Version info for developers:\n%s", get_dvc_info())
  File "/usr/local/lib/python3.9/site-packages/dvc/info.py", line 38, in get_dvc_info
    with Repo() as repo:
  File "/usr/local/lib/python3.9/site-packages/dvc/repo/__init__.py", line 209, in __init__
    self.state = State(self.root_dir, self.site_cache_dir, self.dvcignore)
  File "/usr/local/lib/python3.9/site-packages/dvc_data/hashfile/state.py", line 92, in __init__
    self.links = Cache(links_dir)
  File "/usr/local/lib/python3.9/site-packages/dvc_data/hashfile/cache.py", line 59, in __init__
    super().__init__(directory=directory, timeout=timeout, disk=disk, **settings)
  File "/usr/local/lib/python3.9/site-packages/diskcache/core.py", line 478, in __init__
    self.reset(key, value, update=False)
  File "/usr/local/lib/python3.9/site-packages/diskcache/core.py", line 2431, in reset
    ((old_value,),) = sql(
sqlite3.OperationalError: disk I/O error
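The second traceback ends inside a SQLite call made by diskcache, so one way to narrow things down might be to check whether plain sqlite3 can create, write, and read a database in a given directory, independent of DVC. Below is a minimal sketch using only the Python standard library. The helper name probe_sqlite is hypothetical and not part of DVC, and which directory to probe is an assumption: pass whichever directory holds DVC's SQLite files in your setup.

```python
import os
import sqlite3
import tempfile


def probe_sqlite(directory, rows=1000):
    """Create, fill, and query a throwaway SQLite database in `directory`.

    Hypothetical diagnostic helper (not part of DVC).
    Returns None on success, or a short error description on failure.
    """
    fd, path = tempfile.mkstemp(suffix=".db", dir=directory)
    os.close(fd)
    try:
        conn = sqlite3.connect(path, timeout=5)
        try:
            conn.execute("CREATE TABLE probe (k TEXT PRIMARY KEY, v INTEGER)")
            conn.executemany(
                "INSERT INTO probe VALUES (?, ?)",
                [(str(i), i) for i in range(rows)],
            )
            conn.commit()
            (count,) = conn.execute("SELECT COUNT(*) FROM probe").fetchone()
            if count != rows:
                return f"unexpected row count: {count}"
        finally:
            conn.close()
    except sqlite3.Error as exc:
        # sqlite3.OperationalError (e.g. "disk I/O error") is a subclass.
        return f"{type(exc).__name__}: {exc}"
    finally:
        # Remove the database and any journal/WAL side files SQLite left.
        for suffix in ("", "-journal", "-wal", "-shm"):
            try:
                os.remove(path + suffix)
            except FileNotFoundError:
                pass
    return None
```

Calling probe_sqlite on a directory of the FSx for Lustre mount and on a local directory, then comparing the two results, would show whether basic SQLite access differs between them. A clean result does not rule out a failure that only appears after hours of load, as in this report.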

shcheklein commented 3 months ago

@rrazavipour is there something specific to the structure of this data (e.g. very deeply nested, or too many directories)? How many files are there overall? Is it happening only on this FSx Lustre file system? What instance size are you using on AWS?
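The numbers asked for here (file count, directory count, nesting depth) could be gathered with a short script. This is a sketch using only the standard library. The function name summarize_tree and the choice to skip .git and .dvc are assumptions made for illustration, not anything DVC provides.

```python
import os


def summarize_tree(root, skip=(".git", ".dvc")):
    """Return (file_count, dir_count, max_depth) for the tree under `root`.

    Directories named in `skip` are not descended into or counted.
    Depth is measured relative to `root`, which has depth 0.
    """
    root = os.path.abspath(root)
    base_depth = root.rstrip(os.sep).count(os.sep)
    files = dirs = max_depth = 0
    for current, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk does not descend into skipped dirs.
        dirnames[:] = [d for d in dirnames if d not in skip]
        dirs += len(dirnames)
        files += len(filenames)
        depth = current.rstrip(os.sep).count(os.sep) - base_depth
        max_depth = max(max_depth, depth)
    return files, dirs, max_depth
```

For example, print(summarize_tree(".")) run from the top of the workspace prints the three numbers as a tuple. Walking a tree of this size on a network file system may itself take a while.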

rrazavipour commented 3 months ago

I don't have the exact numbers, but there are a large number of directories, about 420 GB altogether. It has worked on Mac, Windows, and our own GPU machine. This is the first time we are using DVC with AWS FSx for Lustre, and the first time we are seeing these problems. The EC2 instance is a 2xlarge with 32 GB of RAM.