XiaohanZhangCMU closed 4 months ago
@XiaohanZhangCMU can you make the PR title and description more informative? In the description, mind adding what the issue was and why this addresses it? thanks :)
@XiaohanZhangCMU can you make the PR title more descriptive before merging?
Description of changes:
Problem:
When calling dataframe_to_mds and writing files to dbfs:/Volumes, the resulting datasets can contain zero-byte shards or index files. The problem stems from two aspects:
Fix: After each upload, we check the size of the uploaded file as reported by the remote metadata and compare it with the local file size. A mismatch raises an exception so that Streaming's upload can retry. In our experiments, retry=2 reliably minimized the chance of a zero-byte upload.
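For reviewers, here is a minimal sketch of the check-and-retry idea described in the fix. It is not the code in this PR: `upload_with_size_check`, `UploadSizeMismatchError`, and the `upload` and `get_remote_size` callables are hypothetical stand-ins for the real uploader and remote-metadata calls, and reading `retry` as the number of extra attempts after the first is an assumption.

```python
import os
import tempfile
from typing import Callable, Optional


class UploadSizeMismatchError(Exception):
    """Hypothetical error: remote size differs from the local file size."""


def upload_with_size_check(
    local_path: str,
    upload: Callable[[str], None],
    get_remote_size: Callable[[], int],
    retry: int = 2,
) -> int:
    """Upload a file, verify the remote size, and retry on a mismatch.

    `upload` and `get_remote_size` are placeholders for the real client
    calls. Returns the number of attempts that were needed.
    """
    if retry < 0:
        raise ValueError("retry must be non-negative")
    local_size = os.path.getsize(local_path)
    last_error: Optional[Exception] = None
    # Assumption: one initial attempt plus `retry` additional attempts.
    for attempt in range(1, retry + 2):
        upload(local_path)
        remote_size = get_remote_size()
        if remote_size == local_size:
            return attempt
        last_error = UploadSizeMismatchError(
            f"attempt {attempt}: remote size {remote_size} != local size {local_size}"
        )
    assert last_error is not None
    raise last_error


def demo(zero_byte_uploads: int = 1, retry: int = 2) -> int:
    """Simulate a remote that stores zero bytes for the first N uploads."""
    remote = {"size": 0, "calls": 0}

    def flaky_upload(path: str) -> None:
        remote["calls"] += 1
        ok = remote["calls"] > zero_byte_uploads
        remote["size"] = os.path.getsize(path) if ok else 0

    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"shard bytes")
        path = f.name
    try:
        return upload_with_size_check(
            path, flaky_upload, lambda: remote["size"], retry=retry
        )
    finally:
        os.remove(path)
```

In this simulation, `demo()` succeeds on the second attempt because only the first upload is zero-byte; if every attempt produces a zero-byte file, the mismatch error is raised to the caller once the retries are exhausted.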
Issue #, if available:
Merge Checklist:
Put an x without space in the boxes that apply. If you are unsure about any checklist, please don't hesitate to ask. We are here to help! This is simply a reminder of what we are going to look for before merging your pull request.
General
Tests
pre-commit on my change. (check out the pre-commit section of prerequisites)