mosaicml / streaming

A Data Streaming Library for Efficient Neural Network Training
https://streaming.docs.mosaicml.com
Apache License 2.0
1.09k stars, 136 forks

Fix: having zero bytes files after converting spark dataframe to MDS saved on dbfs:/Volumes #668

Closed. XiaohanZhangCMU closed this 4 months ago

XiaohanZhangCMU commented 4 months ago

Description of changes:

Problem:

When calling dataframe_to_mds and writing files to dbfs:/Volumes, the resulting dataset can contain zero-byte shard or index files. Two factors contribute to the problem:

  1. When mapInPandas is called, each executor is assigned several tasks, and a single Python thread per executor works through them. Each task initializes its own MDSWriter, which in turn instantiates a ThreadPoolExecutor for file uploads, with each thread responsible for uploading one file. When there are more tasks than available processes, however, the tasks share threads. Several upload futures can then queue up behind one thread, and some uploads are intermittently throttled and fail: the file exists remotely but has zero bytes.
  2. Why was there no retry? Databricks' Files API does not appear to raise any exception when such an upload fails, so our code never attempted one.
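The queuing described in point 1 can be illustrated with a minimal standalone sketch. This is not Streaming's actual code: `fake_upload` and `run_uploads` are hypothetical names, and the sleep merely stands in for network I/O. It only shows that when more upload futures are submitted than there are worker threads, the extra futures wait in line behind the same threads.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def fake_upload(shard_id: int) -> tuple[int, str]:
    """Hypothetical stand-in for a file upload; records which thread ran it."""
    time.sleep(0.01)  # simulate network latency
    return shard_id, threading.current_thread().name


def run_uploads(num_files: int, max_workers: int) -> list[tuple[int, str]]:
    """Submit num_files uploads to a pool with only max_workers threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fake_upload, i) for i in range(num_files)]
        # Futures beyond max_workers queue until a thread becomes free.
        return [f.result() for f in futures]
```

Calling `run_uploads(6, 2)` completes all six uploads, but at most two distinct thread names appear in the results, so several uploads share each thread.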

Fix: After each upload, we check the size of the uploaded file from the remote metadata and compare it with the local file size. A mismatch raises an exception so that Streaming's upload logic can retry. In our experiments, retry=2 reliably minimizes the chance of a zero-byte upload.
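The idea behind the fix can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in this PR: `upload_with_size_check`, `UploadSizeMismatch`, and `make_flaky_upload` are hypothetical names, the `upload` and `remote_size` callables stand in for the real Databricks calls, and `retries=2` is interpreted here as two additional attempts after the first.

```python
import os
import shutil
from typing import Callable


class UploadSizeMismatch(Exception):
    """Raised when the remote file size differs from the local file size."""


def upload_with_size_check(
    local_path: str,
    remote_path: str,
    upload: Callable[[str, str], None],
    remote_size: Callable[[str], int],
    retries: int = 2,
) -> int:
    """Upload a file, verify its remote size, and retry on mismatch.

    Returns the number of attempts used. `upload` and `remote_size` are
    caller-supplied (hypothetical) hooks for the storage backend.
    """
    local_size = os.path.getsize(local_path)
    last_error = None
    for attempt in range(1, retries + 2):  # first try plus `retries` retries
        try:
            upload(local_path, remote_path)
            uploaded = remote_size(remote_path)
            if uploaded != local_size:
                # The backend reported no error, so we raise one ourselves.
                raise UploadSizeMismatch(
                    f"{remote_path}: expected {local_size} bytes, got {uploaded}"
                )
            return attempt
        except UploadSizeMismatch as err:
            last_error = err
    raise RuntimeError(f"upload failed after {retries + 1} attempts") from last_error


def make_flaky_upload(silent_failures: int) -> Callable[[str, str], None]:
    """Build a fake uploader that silently writes zero bytes N times."""
    state = {"calls": 0}

    def upload(src: str, dst: str) -> None:
        state["calls"] += 1
        if state["calls"] <= silent_failures:
            open(dst, "wb").close()  # zero-byte file, no exception raised
        else:
            shutil.copyfile(src, dst)

    return upload
```

With `make_flaky_upload(1)` as the uploader and `os.path.getsize` as the size check on a local directory, the first attempt leaves a zero-byte file, the mismatch is detected, and the second attempt succeeds.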

Issue #, if available:

snarayan21 commented 4 months ago

@XiaohanZhangCMU can you make the PR title and description more informative? In the description, mind adding what the issue was and why this addresses it? thanks :)

snarayan21 commented 4 months ago

@XiaohanZhangCMU can you make the PR title more descriptive before merging?