impresso / impresso-text-acquisition

Python library to import OCR data in various formats into the canonical JSON format defined by the Impresso project.
https://impresso.github.io/impresso-text-acquisition/
GNU Affero General Public License v3.0

partitioning by dask may lead to severe data inconsistencies #108

Closed: aflueckiger closed this issue 3 years ago

aflueckiger commented 3 years ago

When using the current version of the importer (v0.11.0), the import of FedGazDe reveals severe inconsistencies in the issue data as well as non-reproducible errors.

After lengthy troubleshooting, we tracked the problem down to the partitioning of issues that dask performs during the import. Depending on the partition size, the period-wise chunking, and the number of issues, the issues of a single year may be assigned to different partitions. Dask seems to proceed with the upload and the subsequent removal of the compressed issue file as soon as it has finished one partition. This behavior may lead to 1) files being overwritten on S3 and 2) local FileNotFoundErrors, since some of the files may already have been deleted. Even though these errors occur systematically, they show up for ever-changing years because the dask scheduler does not work deterministically.
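
Conceptually, the failure mode looks like the following minimal sketch (stand-in code with toy data and a fake_upload helper, not the importer's actual implementation): two partitions hold issues of the same year, and each one compresses, uploads, and deletes the year's archive on its own, so the later partition overwrites the earlier upload; with concurrent workers, the os.remove() can additionally race ahead of another partition and trigger a FileNotFoundError.

import bz2
import os
import dask.bag as db

uploaded = {}  # stands in for the S3 bucket: later puts overwrite earlier ones

def fake_upload(path):
    with open(path, "rb") as f:
        uploaded[path] = f.read()

def process_partition(issues):
    # issues: (year, payload) pairs; a year may span several partitions
    for year, payload in issues:
        path = f"{year}.jsonl.bz2"
        with bz2.open(path, "wt") as f:   # each partition rewrites the file
            f.write(payload + "\n")
        fake_upload(path)                 # premature: other partitions may
        os.remove(path)                   # still hold issues of this year
    return []

issues = [(1900, "a"), (1900, "b"), (1901, "c")]
# partition_size=1 forces the two 1900 issues into different partitions
db.from_sequence(issues, partition_size=1) \
  .map_partitions(process_partition) \
  .compute(scheduler="synchronous")

# the archive for 1900 now contains only issue "b"; issue "a" is lost
print(bz2.decompress(uploaded["1900.jsonl.bz2"]).decode())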

To ensure that the data is not processed any further before all the issues of a single year have been compressed, the compression step needs to be separated from the uploading/deleting step, as sketched below.
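
A minimal sketch of that separation, reusing the toy data from above (compress_year is a hypothetical stand-in, not the importer's API): phase 1 groups the issues so that each year is compressed by exactly one task and uses compute() as a barrier; phase 2 uploads and deletes strictly afterwards.

import bz2
import os
import dask.bag as db

def compress_year(year, payloads):
    # writes one complete archive per year
    path = f"{year}.jsonl.bz2"
    with bz2.open(path, "wt") as f:
        for payload in payloads:
            f.write(payload + "\n")
    return path

issues = [(1900, "a"), (1900, "b"), (1901, "c")]

# Phase 1: compress everything; nothing is uploaded or deleted yet.
archives = (
    db.from_sequence(issues, partition_size=1)
      .groupby(lambda pair: pair[0])                       # key: year
      .map(lambda kv: compress_year(kv[0], [p for _, p in kv[1]]))
      .compute(scheduler="synchronous")
)

# Phase 2: upload (stubbed) and remove, only after all archives exist.
for path in archives:
    print("uploading", path, os.path.getsize(path), "bytes")
    os.remove(path)

The essential design point is the barrier between the two phases: uploading and deleting are kept out of the dask graph that produces the archives, so no archive can be removed or overwritten while issues of its year are still in flight.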

We cannot be certain about the exact cause, which seems to affect only FedGaz. The bug appears to be related to internal changes in dask rather than to our code or data.

The following commands were used:

mkdir dask-space
cd dask-space

screen -dmS dask-sched-importer dask-scheduler
screen -dmS dask-work-importer dask-worker localhost:8786 --nprocs 10 --nthreads 1 --memory-limit 7G

cd ..

cp article-info2-FedGazDe.tsv data_tetml-word/FedGazDe/metadata.tsv

python3 impresso-text-acquisition/text_importer/scripts/fedgazimporter.py \
--input-dir=/home/user/aflueck/impresso/data_tetml-word \
--clear --output-dir=/home/user/aflueck/impresso/canonical_json --s3-bucket=TRANSFER \
--log-file=log_data-ingest-FedGazDe.txt \
--access-rights=/home/user/aflueck/impresso/impresso-text-acquisition/text_importer/data/sample_data/Tetml/access_rights.json \
--config-file=/home/user/aflueck/impresso/data_ingestion_config_FedGazDe.json \
--scheduler=127.0.0.1:8786 \
--chunk-size=10

Dask version used:

 % pipenv graph | grep dask
  - dask [required: Any, installed: 2.28.0]
    - dask [required: Any, installed: 2.28.0]
    - dask-k8 [required: Any, installed: 0.1.1]
      - dask [required: >=2.9.0, installed: 2.28.0]
aflueckiger commented 3 years ago

The merged PR entirely disentangles the compression step from the uploading step.