With the current version of the importer (`v0.11.0`), importing `FedGazDe` produces severe inconsistencies in the issue data and non-reproducible errors.
After lengthy troubleshooting, we traced the problem to the way dask partitions the issues during the import. Depending on the partition size, the period-wise chunking, and the number of issues, the issues of a single year may be assigned to different partitions. Dask seems to proceed with the upload and the subsequent removal of the compressed issue file as soon as it has finished one partition. This behavior may lead to 1) files being overwritten on S3 and 2) local `FileNotFoundError`s, because some of the files may already have been deleted. Although these errors occur systematically, they show up for different years on each run because the dask scheduler is not deterministic.
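The partitioning effect described above can be illustrated without dask. The following is a minimal sketch in plain Python, assuming simple fixed-size chunking; the `partition` helper, the years, and the partition size are hypothetical and only mimic how a year can end up straddling a partition boundary.

```python
from collections import defaultdict


def partition(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# Hypothetical issues as (year, running number) pairs:
# three issues for 1849 and four issues for 1850.
issues = [(1849, n) for n in range(3)] + [(1850, n) for n in range(4)]

# Assumed partition size of 5; the real value depends on the import settings.
partitions = partition(issues, size=5)

# Record which partitions contain issues of each year.
year_to_partitions = defaultdict(set)
for index, chunk in enumerate(partitions):
    for year, _ in chunk:
        year_to_partitions[year].add(index)

# Years whose issues are spread over more than one partition.
split_years = sorted(
    year for year, parts in year_to_partitions.items() if len(parts) > 1
)
print(split_years)
```

If each partition writes, uploads, and deletes the file of a year on its own, a year listed in `split_years` is handled twice, which matches the overwriting and the missing-file errors described above.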
To ensure that no data is processed further before all issues of a single year are compressed, the compression step needs to be separated from the upload/delete step.
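The separation could look roughly like the sketch below. It is not the importer's actual code: `compress_by_year`, `upload_and_remove`, `fake_upload`, the file naming, and the issue identifiers are all hypothetical, and a dictionary stands in for the S3 bucket. The point is only that the second phase starts after the first phase has finished for every year.

```python
import bz2
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def compress_by_year(issues, out_dir):
    """Phase 1: write exactly one compressed JSON-lines file per year."""
    by_year = defaultdict(list)
    for issue in issues:
        by_year[issue["year"]].append(issue)

    paths = []
    for year, year_issues in sorted(by_year.items()):
        path = Path(out_dir) / f"issues-{year}.jsonl.bz2"  # hypothetical naming
        with bz2.open(path, "wt", encoding="utf-8") as handle:
            for issue in year_issues:
                handle.write(json.dumps(issue) + "\n")
        paths.append(path)
    return paths


def upload_and_remove(paths, upload):
    """Phase 2: upload each file, then delete the local copy."""
    uploaded = []
    for path in paths:
        upload(path)
        path.unlink()
        uploaded.append(path.name)
    return uploaded


# Stand-in for the S3 bucket (assumption: the real upload goes to S3).
fake_bucket = {}


def fake_upload(path):
    fake_bucket[path.name] = path.read_bytes()


# Hypothetical issues: three per year for two years.
issues = [
    {"id": f"FedGazDe-{year}-{n}", "year": year}
    for year in (1849, 1850)
    for n in range(3)
]

with tempfile.TemporaryDirectory() as tmp:
    # Phase 2 is called only after phase 1 has returned for all years.
    paths = compress_by_year(issues, tmp)
    uploaded = upload_and_remove(paths, fake_upload)
```

Because every year is written to a single file before any upload starts, no file is uploaded twice and no file is deleted while another task still expects it.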
We cannot be certain about the exact cause, which seems to affect only FedGaz. The bug may be related to internal changes in dask rather than to our code or data.
The following commands were used:
Used dask version: