impresso / impresso-text-acquisition

Python library to import OCR data in various formats into the canonical JSON format defined by the Impresso project.
https://impresso.github.io/impresso-text-acquisition/
GNU Affero General Public License v3.0

partitioning by dask may lead to severe data inconsistencies #108

Closed: aflueckiger closed this issue 3 years ago

aflueckiger commented 3 years ago

When using the current version of the importer (v0.11.0), the import of FedGazDe reveals severe inconsistencies in the issue data as well as non-reproducible errors.

After lengthy troubleshooting, we tracked the problem down to the partitioning of issues that dask performs during the import. Depending on the partition size, the period-wise chunking, and the number of issues, the issues of a single year may be assigned to different partitions. Dask seems to proceed with the upload and the subsequent removal of the compressed issue file as soon as it has finished one partition. This behavior may lead to 1) files being overwritten on S3 and 2) local FileNotFoundErrors, since some of the files may already have been deleted. Even though these errors occur systematically, they show up for ever-changing years because the dask scheduler does not work deterministically.
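
Conceptually, the failure mode looks like the following minimal sketch (stand-in code with toy data and a fake_upload helper, not the importer's actual implementation): two partitions hold issues of the same year, and each one compresses, uploads, and deletes the year's archive on its own, so the later partition overwrites the earlier upload; with concurrent workers, the os.remove() can additionally race ahead of another partition and trigger a FileNotFoundError.

import bz2
import os
import dask.bag as db

uploaded = {}  # stands in for the S3 bucket: later puts overwrite earlier ones

def fake_upload(path):
    with open(path, "rb") as f:
        uploaded[path] = f.read()

def process_partition(issues):
    # issues: (year, payload) pairs; a year may span several partitions
    for year, payload in issues:
        path = f"{year}.jsonl.bz2"
        with bz2.open(path, "wt") as f:   # each partition rewrites the file
            f.write(payload + "\n")
        fake_upload(path)                 # premature: other partitions may
        os.remove(path)                   # still hold issues of this year
    return []

issues = [(1900, "a"), (1900, "b"), (1901, "c")]
# partition_size=1 forces the two 1900 issues into different partitions
db.from_sequence(issues, partition_size=1) \
  .map_partitions(process_partition) \
  .compute(scheduler="synchronous")

# the archive for 1900 now contains only issue "b"; issue "a" is lost
print(bz2.decompress(uploaded["1900.jsonl.bz2"]).decode())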

To ensure that the data is not processed any further before all the issues of a single year have been compressed, the compression step needs to be separated from the uploading/deleting step, as sketched below.
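
A minimal sketch of that separation, reusing the toy data from above (compress_year is a hypothetical stand-in, not the importer's API): phase 1 groups the issues so that each year is compressed by exactly one task and uses compute() as a barrier; phase 2 uploads and deletes strictly afterwards.

import bz2
import os
import dask.bag as db

def compress_year(year, payloads):
    # writes one complete archive per year
    path = f"{year}.jsonl.bz2"
    with bz2.open(path, "wt") as f:
        for payload in payloads:
            f.write(payload + "\n")
    return path

issues = [(1900, "a"), (1900, "b"), (1901, "c")]

# Phase 1: compress everything; nothing is uploaded or deleted yet.
archives = (
    db.from_sequence(issues, partition_size=1)
      .groupby(lambda pair: pair[0])                       # key: year
      .map(lambda kv: compress_year(kv[0], [p for _, p in kv[1]]))
      .compute(scheduler="synchronous")
)

# Phase 2: upload (stubbed) and remove, only after all archives exist.
for path in archives:
    print("uploading", path, os.path.getsize(path), "bytes")
    os.remove(path)

The essential design point is the barrier between the two phases: uploading and deleting are kept out of the dask graph that produces the archives, so no archive can be removed or overwritten while issues of its year are still in flight.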

We cannot be certain about the exact cause, which seems to affect only FedGaz. The bug appears to be related to internal changes in dask rather than to our code or data.

The following commands were used:

mkdir dask-space
cd dask-space

screen -dmS dask-sched-importer dask-scheduler
screen -dmS dask-work-importer dask-worker localhost:8786 --nprocs 10 --nthreads 1 --memory-limit 7G

cd ..

cp article-info2-FedGazDe.tsv data_tetml-word/FedGazDe/metadata.tsv

python3 impresso-text-acquisition/text_importer/scripts/fedgazimporter.py \
--input-dir=/home/user/aflueck/impresso/data_tetml-word \
--clear --output-dir=/home/user/aflueck/impresso/canonical_json --s3-bucket=TRANSFER \
--log-file=log_data-ingest-FedGazDe.txt \
--access-rights=/home/user/aflueck/impresso/impresso-text-acquisition/text_importer/data/sample_data/Tetml/access_rights.json \
--config-file=/home/user/aflueck/impresso/data_ingestion_config_FedGazDe.json \
--scheduler=127.0.0.1:8786 \
--chunk-size=10

Dask version used:

 % pipenv graph | grep dask
  - dask [required: Any, installed: 2.28.0]
    - dask [required: Any, installed: 2.28.0]
    - dask-k8 [required: Any, installed: 0.1.1]
      - dask [required: >=2.9.0, installed: 2.28.0]
aflueckiger commented 3 years ago

The merged PR entirely disentangles the compression step from the uploading step.