Open Conchylicultor opened 4 years ago
@Conchylicultor I would like to take up this issue.
Currently, when dl_manager is used inside a Beam pipeline, checksums registration is disabled.
By this, do you mean register_checksums is always set to False?
If yes, how are the checksums generated for Beam datasets?
Also, could you point out where this happens in the code? Sorry for the trouble.
When the dl_manager gets sent to the remote workers with Beam, it is pickled/unpickled, which calls the __getstate__ function here: https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/core/download/download_manager.py#L194
That function raises an error if register_checksums is True, so it is currently not possible to use dl_manager with register_checksums inside a Beam pipeline.
Currently, the datasets that use dl_manager inside Beam have to special-case register_checksums, like in c4: https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/text/c4.py#L206
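To illustrate the mechanism described above, here is a minimal sketch of how a __getstate__ hook can block pickling when a flag is set. ToyDownloadManager and can_pickle are hypothetical names made up for this example. This is not the actual tensorflow_datasets code, and the real error type and message may differ.

```python
import pickle


class ToyDownloadManager:
    """Hypothetical stand-in for a download manager (not the real TFDS class)."""

    def __init__(self, register_checksums=False):
        self.register_checksums = register_checksums

    def __getstate__(self):
        # pickle calls this hook when the object is serialized, which is what
        # happens when an object is shipped to remote workers.
        if self.register_checksums:
            # Assumed error type, chosen for illustration only.
            raise NotImplementedError(
                "Checksum registration is not supported when the manager is pickled."
            )
        return self.__dict__.copy()


def can_pickle(manager):
    """Return True if the manager can be serialized, False if the hook refuses."""
    try:
        pickle.dumps(manager)
    except NotImplementedError:
        return False
    return True
```

With this toy class, an instance created with register_checksums=False can be pickled, while one created with register_checksums=True fails at serialization time, before any worker ever sees it.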
Currently, when dl_manager is used inside a Beam pipeline, checksums registration is disabled. In theory, each Beam worker instance should register its checksums and save them inside some shared directory; then, once all pipelines are complete, the checksums from the workers are merged together and saved.
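The proposal above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the one-JSON-file-per-worker layout, and the url -> {size, sha256} record format are all hypothetical and not part of tensorflow_datasets. A real implementation would also need to write to storage that every worker can reach.

```python
import json
from pathlib import Path


def write_worker_checksums(shared_dir, worker_id, checksums):
    """Save one worker's checksums to its own file in the shared directory.

    `checksums` maps a URL to a record such as {"size": 10, "sha256": "..."}.
    Writing one file per worker avoids concurrent writes to a single file.
    """
    path = Path(shared_dir) / f"checksums-{worker_id}.json"
    path.write_text(json.dumps(checksums, sort_keys=True))
    return path


def merge_worker_checksums(shared_dir):
    """Merge all per-worker files once every pipeline has finished."""
    merged = {}
    for path in sorted(Path(shared_dir).glob("checksums-*.json")):
        for url, info in json.loads(path.read_text()).items():
            # Two workers may download the same URL; their records must agree.
            if url in merged and merged[url] != info:
                raise ValueError(f"Conflicting checksums for {url}")
            merged[url] = info
    return merged
```

The merge step raises on conflicting records instead of silently picking one, since a mismatch between workers would suggest a corrupted or changed download.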