inidun / unesco_data_collection

Scripts and code for collecting data (scraping) from the UNESCO website
MIT License

unesco_data_collection

Scripts and code related to collecting and curating the UNESCO Courier corpus.

Scripts

Export tagged issues

python courier/elements/export_tagged_issues.py

Extract articles from tagged issues

python courier/cli/tagged2article.py
Usage: tagged2article.py [OPTIONS] SOURCE TARGET_FOLDER [ARTICLE_INDEX]

Options:
  --editorials / --no-editorials
  --supplements / --no-supplements
  --unindexed / --no-unindexed
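
Each option above is an on/off pair. As a rough illustration only, the sketch below reproduces the same command-line shape with the standard library's `argparse.BooleanOptionalAction` (Python 3.9+). It is not the script's actual implementation: the defaults (all `True`) and the example paths are assumptions made for the sake of the example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented usage line (illustrative only)."""
    parser = argparse.ArgumentParser(prog="tagged2article.py")
    parser.add_argument("source")
    parser.add_argument("target_folder")
    # ARTICLE_INDEX is optional in the usage line, so it may be omitted.
    parser.add_argument("article_index", nargs="?", default=None)
    for name in ("editorials", "supplements", "unindexed"):
        # BooleanOptionalAction creates both --<name> and --no-<name>.
        # default=True is an assumption; the real defaults are not documented here.
        parser.add_argument(
            f"--{name}", action=argparse.BooleanOptionalAction, default=True
        )
    return parser


# Hypothetical paths, used only to show how the flags parse.
args = build_parser().parse_args(["tagged_issues/", "articles/", "--no-supplements"])
print(args.editorials, args.supplements, args.unindexed)
```

With these assumed defaults, passing `--no-supplements` switches off only that category and leaves the other two enabled.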

Generate corpus report

python courier/scripts/corpus_report.py [OUTPUT_FOLDER]

Extract raw issue and page corpora

python courier/scripts/extract_raw_corpora.py

Find double spreads/centerfolds in PDFs

find_double_pages.sh <dir>
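
How `find_double_pages.sh` detects spreads is not described here. One plausible heuristic, shown purely as a sketch, is to flag pages that are much wider than the typical page in the same file. The function name, the `ratio` threshold, and the sample page sizes below are all hypothetical, and the page dimensions are assumed to have been read from the PDF beforehand.

```python
from statistics import median


def find_double_pages(page_sizes, ratio=1.5):
    """Return 1-based numbers of pages wider than `ratio` times the median width.

    `page_sizes` is a list of (width, height) tuples, one per page.
    The 1.5 threshold is an assumption, not a value taken from the repository.
    """
    typical_width = median(width for width, _ in page_sizes)
    return [
        number
        for number, (width, _) in enumerate(page_sizes, start=1)
        if width > ratio * typical_width
    ]


# Hypothetical issue: three single pages and one double-width spread.
sizes = [(595, 842), (595, 842), (1190, 842), (595, 842)]
print(find_double_pages(sizes))  # prints [3]
```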