tensorflow / datasets

TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
https://www.tensorflow.org/datasets
Apache License 2.0

Make register_checksums work with Beam datasets #1785

Open Conchylicultor opened 4 years ago

Conchylicultor commented 4 years ago

Currently, when dl_manager is used inside a Beam pipeline, checksums registration is disabled.

In theory, each Beam worker should register its checksums and save them in a shared directory. Once all pipelines are complete, the checksums from the workers would be merged together and saved.
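A minimal sketch of that idea, using only the Python standard library. The function names, the JSON file layout, and the `{url: [size, sha256]}` record shape are all hypothetical and are not the TFDS implementation; a real version would also need a shared filesystem reachable from every worker.

```python
import json
import os
import tempfile
import uuid


def save_worker_checksums(shared_dir, checksums):
    """Write one worker's {url: [size, sha256]} mapping to its own file."""
    os.makedirs(shared_dir, exist_ok=True)
    # A unique file name per worker avoids concurrent writes to one file.
    path = os.path.join(shared_dir, f"checksums-{uuid.uuid4().hex}.json")
    with open(path, "w") as f:
        json.dump(checksums, f)
    return path


def merge_worker_checksums(shared_dir):
    """Merge all per-worker files; fail if two workers disagree on a URL."""
    merged = {}
    for name in sorted(os.listdir(shared_dir)):
        if not (name.startswith("checksums-") and name.endswith(".json")):
            continue
        with open(os.path.join(shared_dir, name)) as f:
            for url, info in json.load(f).items():
                # setdefault keeps the first record seen for each URL.
                if merged.setdefault(url, info) != info:
                    raise ValueError(f"Conflicting checksums for {url}")
    return merged


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as shared_dir:
        # Two simulated workers, each recording a different download.
        save_worker_checksums(shared_dir, {"http://a/x.zip": [10, "aa"]})
        save_worker_checksums(shared_dir, {"http://a/y.zip": [20, "bb"]})
        print(merge_worker_checksums(shared_dir))
```

Writing one file per worker and merging only after the pipeline finishes keeps the workers from having to coordinate with each other while they run.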

vijayphoenix commented 4 years ago

@Conchylicultor I would like to take up this issue.

> Currently, when dl_manager is used inside a Beam pipeline, checksums registration is disabled.

By this, do you mean register_checksums is always set to False? If yes, how are the checksums generated for Beam datasets?

Also, could you point out the location of the issue in code? Sorry for the trouble.

Conchylicultor commented 4 years ago

When the dl_manager is sent to the remote workers with Beam, it is pickled and unpickled, which calls the __getstate__ function here: https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/core/download/download_manager.py#L194. That function raises an error if register_checksums is True, so it is currently not possible to use dl_manager with register_checksums inside a Beam pipeline.
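The mechanism can be illustrated with a toy class. This is a hypothetical stand-in, not the real TFDS `DownloadManager`; the exception type and message are assumptions, since the comment only says that an error is raised.

```python
import pickle


class ToyDownloadManager:
    """Hypothetical stand-in, not the real TFDS DownloadManager."""

    def __init__(self, register_checksums=False):
        self.register_checksums = register_checksums

    def __getstate__(self):
        # pickle calls this hook to obtain the state to serialize.
        if self.register_checksums:
            # Assumed error type; the real code may raise something else.
            raise NotImplementedError(
                "Cannot pickle a manager with register_checksums=True")
        return self.__dict__


# With register_checksums=False the object round-trips through pickle.
pickle.loads(pickle.dumps(ToyDownloadManager()))

# With register_checksums=True serialization fails before anything is
# sent to a worker.
try:
    pickle.dumps(ToyDownloadManager(register_checksums=True))
except NotImplementedError as e:
    print("pickling failed:", e)
```

Because the error comes from the serialization hook, it surfaces as soon as the object is shipped to a worker, regardless of what the worker would have done with it.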

Conchylicultor commented 4 years ago

Currently, datasets that use this have to special-case register_checksums, as in c4: https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/text/c4.py#L206