ituob / itu-ob-data

ITU Operational Bulletin Data

Scrape ITU OB issues in MS Word format from itu.int #16

Open strogonoff opened 5 years ago

strogonoff commented 5 years ago

The goal is to write a utility that scrapes all ITU OB issues in .docx format (English versions only for now).

The issues are here: https://www.itu.int/en/publications/ITU-T/pages/publications.aspx?parent=T-SP-OB&version_date=2019

The site was recently down, so we should throttle requests and be gentle on it.

The utility should store .docx files as <issue ID>/en.docx, where issue ID is a simple integer like 1023.

If the MS Word download link on the ITU site leads to an archive, it means the issue has annexes, and the download will contain the issue itself and its annexes as separate .docx files. In such cases, the utility should expand the archive and place the annexes in the same directory as the issue, as <issue ID>/annex<N>-en.docx.
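A minimal sketch of the storage layout described above. The helper name `store_issue` is hypothetical, and it assumes (since member naming inside the archive is not specified here) that the first .docx member in sorted order is the issue body and the rest are annexes:

```python
import zipfile
from pathlib import Path

def store_issue(issue_id: int, downloaded: Path, out_root: Path) -> None:
    """Place a downloaded file under <issue ID>/, expanding archives if needed."""
    dest = out_root / str(issue_id)
    dest.mkdir(parents=True, exist_ok=True)
    if zipfile.is_zipfile(downloaded):
        with zipfile.ZipFile(downloaded) as zf:
            members = sorted(
                m for m in zf.namelist()
                if m.lower().endswith(('.doc', '.docx'))
            )
            # Assumption: first member (sorted) is the issue body,
            # the remaining members are annexes.
            issue_member, *annexes = members
            (dest / 'en.docx').write_bytes(zf.read(issue_member))
            for n, member in enumerate(annexes, start=1):
                (dest / ('annex%d-en.docx' % n)).write_bytes(zf.read(member))
    else:
        (dest / 'en.docx').write_bytes(downloaded.read_bytes())
```

If the archive's member names turn out to follow a known convention, the "first member is the issue" guess should be replaced with a check against that convention.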

Utility output should not be versioned. If we need to share the downloaded .docx file archive, we can upload it somewhere else.

ronaldtse commented 5 years ago

@strogonoff we should probably scrape all the languages available...?

ronaldtse commented 5 years ago

@andrew2net do you have time for this?

strogonoff commented 5 years ago

@ronaldtse I’d rather have English versions sooner than all languages later (and I don’t want us to accidentally DoS ITU’s site), so I vote to handle it incrementally.

strogonoff commented 5 years ago

It’s obvious that we’ll need other languages eventually, so if we have them, I won’t mind. They’ll be useless for now, though; merging translations will be a challenge for later…

ronaldtse commented 5 years ago

Maybe Relaton-ITU should also provide the links for Word/PDF docs in English/other languages? Thoughts @andrew2net ?

Then this can be a simple wrapper script.

strogonoff commented 5 years ago

Updated issue description to add a note about annex handling, and updated path specification for downloaded contents.

strogonoff commented 5 years ago

Note that OB IDs are sequential integers, but unfortunately each issue’s URL also contains the year of that issue’s publication date (for example, https://www.itu.int/en/publications/ITU-T/pages/publications.aspx?parent=T-SP-OB.1165-2019). I don’t think we can reliably infer the year without actually going through OB archive pages year by year.
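One workaround for the ID-plus-year URL scheme is to generate candidate URLs for a small window of years per issue and try each. A sketch with a hypothetical helper name; the URL pattern is taken verbatim from the example above:

```python
def candidate_urls(issue_id: int, years) -> list:
    """Build candidate issue-page URLs for each possible publication year.

    The caller would request each URL in turn until one resolves, since
    the year cannot be inferred from the issue ID alone.
    """
    base = ('https://www.itu.int/en/publications/ITU-T/pages/'
            'publications.aspx?parent=T-SP-OB.{id}-{year}')
    return [base.format(id=issue_id, year=y) for y in years]
```

Since issues within a year are consecutive, the window can usually be just the last known year and the one after it.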

From what I saw in Relaton-ITU implementation, it’s outside of what it was intended to do (naturally), but perhaps there is a use for it somewhere…

ronaldtse commented 5 years ago

Note that OB IDs are sequential integers, but unfortunately each issue’s URL also contains the year of that issue’s publication date (for example, https://www.itu.int/en/publications/ITU-T/pages/publications.aspx?parent=T-SP-OB.1165-2019). I don’t think we can reliably infer the year without actually going through OB archive pages year by year.

That's fine. Moreover, some OB issues have translations and some don't, so it is necessary to visit every page anyway.

From what I saw in Relaton-ITU implementation, it’s outside of what it was intended to do (naturally), but perhaps there is a use for it somewhere…

It isn't -- Relaton-ITU is supposed to also provide accessible links for the document. DOC/PDF links are all in scope.

andrew2net commented 5 years ago

@andrew2net do you have time for this?

@ronaldtse I have a bunch of uncompleted tasks. But of course, you can change the prioritization.

strogonoff commented 5 years ago

@ronaldtse

some OB issues have translations, and some not

Didn’t know, that’s unfortunate.

It isn't -- Relaton-ITU is supposed to also provide accessible links for the document. DOC/PDF links are all in scope.

If your point is that Relaton-ITU users may want to reference OB issues, then I can see how this may be in scope.

If we could give Relaton-ITU integer ID of OB issue and get document links in return, this would be easy. We can simply iterate over integers from 1 and gather links until it returns an error. Even if Relaton requires a year in addition to issue ID (OB issue URLs contain the year), that can be worked around.
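The iterate-until-error idea can be sketched as follows. `get_links` is a hypothetical callable standing in for a Relaton-ITU lookup (which does not exist for OB yet); it is assumed to raise a `LookupError` for nonexistent issue IDs:

```python
def gather_links(get_links) -> dict:
    """Collect document links for issue IDs 1, 2, 3, ... until lookup fails."""
    links = {}
    ob = 1
    while True:
        try:
            links[ob] = get_links(ob)
        except LookupError:
            # First missing ID means we ran past the newest issue.
            break
        ob += 1
    return links
```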

That said, if Relaton-ITU doesn’t have ITU OB support yet, I believe it may be much faster to write a quick bespoke scraper just for this purpose.

strogonoff commented 5 years ago

This should be on hold until we sort out #20.

strogonoff commented 5 years ago

Rough logic for scraping OB issues from itu.int: approach one

This does not handle issues older than 567, since they are not available through the same archive index on itu.int.

Assumptions

Preliminary tasks

Logic

This pattern implies non-parallel, sequential execution. It can be parallelized (but be careful to throttle downloads to avoid taking the ITU site down by accident).
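A minimal sketch of parallel downloads with a global rate limit, as a hedge against hammering the site. All names here are hypothetical; `fetch` stands in for whatever download function is used:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

class RateLimiter:
    """Allow at most one acquisition per `interval` seconds across threads."""

    def __init__(self, interval: float):
        self.interval = interval
        self._lock = threading.Lock()
        self._next_ok = 0.0

    def wait(self) -> None:
        with self._lock:
            now = time.monotonic()
            delay = max(0.0, self._next_ok - now)
            self._next_ok = max(now, self._next_ok) + self.interval
        if delay:
            time.sleep(delay)

limiter = RateLimiter(interval=1.0)  # at most one request per second

def polite_fetch(url, fetch):
    limiter.wait()
    return fetch(url)

def fetch_all(urls, fetch, workers=4):
    """Download several URLs concurrently, sharing one rate limit."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda u: polite_fetch(u, fetch), urls))
```

The limiter is shared by all workers, so adding threads speeds up the non-network parts without increasing the request rate against itu.int.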

Pseudocode

The pseudocode is clumsy: it tries URL variations with conditionals and keeps state in global variables, but the idea should be clear.

cur_ob = 567
cur_yr = 1994
cur_fmt = 'MSW'
cur_ext = 'doc'
use_oas = false

def download_issue():
  # Try format/extension combinations in order
  succeeded = False
  for cur_fmt, cur_ext in [('MSW', 'doc'), ('MSW', 'docx'), ('ZIP', 'zip')]:
    succeeded = download_all_languages()
    if succeeded:
      break

  # Try toggling OAS
  if not succeeded:
    use_oas = not use_oas
    succeeded = download_all_languages()

  # Perhaps we ran out of issues for the year, increment year
  if not succeeded:
    cur_yr += 1
    succeeded = download_all_languages()

  # Ran out of ideas
  if not succeeded:
    return report_error("Failed to download issue")

  # One of the tries succeeded, increment issue ID and continue from the top
  cur_ob  = cur_ob + 1
  return download_issue()

def download_all_languages():
  for language in ['E', 'F', …]:
    url = format_url(cur_ob, cur_yr, cur_fmt, use_oas, language, cur_ext)

    try:
      downloaded_file = try_download_file(url)

    except NotFound:
      # English version not found means the URL is broken.
      if language == 'E':
        return False
      # Otherwise we may be fine, not all the languages are always present so we’ll try the next one.
      else:
        continue

    else:
      if is_archive(downloaded_file):
        expand_archive(downloaded_file)
      move_files_in_place(downloaded_file)

  return True

def try_download_file(url):
  try:
    return download_file(url)
  except ServerDownOrThrottling:
    sleep(10)
    return try_download_file(url)

def format_url(ob, yr, fmt, use_oas, language, ext):
  pass # Outputs a URL according to the format string

def download_file(url):
  pass # Does the download
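One caveat with the sketch above: `download_issue` calls itself once per issue, so a literal Python implementation would exceed the default recursion limit well before reaching current issue numbers (567 onward is over a thousand issues). An equivalent iterative skeleton, with `download_issue_once` as a hypothetical callable that attempts a single issue and returns `True` on success:

```python
def download_all_issues(start_ob: int, download_issue_once) -> int:
    """Loop over issue IDs instead of recursing; stop at the first failure.

    Returns the first issue ID that failed, i.e. one past the newest
    issue that was downloaded successfully.
    """
    cur_ob = start_ob
    while download_issue_once(cur_ob):
        cur_ob += 1
    return cur_ob
```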

strogonoff commented 5 years ago

Rough scraping logic: approach two