ituob / itu-ob-data

ITU Operational Bulletin Data

Scrape ITU OB issues in MS Word format from itu.int #16

Open strogonoff opened 5 years ago

strogonoff commented 5 years ago

The goal is to write a utility that scrapes all ITU OB issues in .docx format (English versions only for now).

The issues are here: https://www.itu.int/en/publications/ITU-T/pages/publications.aspx?parent=T-SP-OB&version_date=2019

The site was recently down, so we should throttle requests and be gentle on it.

The utility should store .docx files as <issue ID>/en.docx, where issue ID is a simple integer like 1023.

If the MS Word download link on the ITU site leads to an archive, it means the issue has annexes, and the download will contain the issue itself and its annexes as separate .docx files. In such cases, the utility should expand the archive and place the annexes in the same directory as the issue, as <issue ID>/annex<N>-en.docx.
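A minimal sketch of the storage layout described above. The helper name `store_issue` is hypothetical, and it assumes (since member naming inside the archive is not specified here) that the first .docx member in sorted order is the issue body and the rest are annexes:

```python
import zipfile
from pathlib import Path

def store_issue(issue_id: int, downloaded: Path, out_root: Path) -> None:
    """Place a downloaded file under <issue ID>/, expanding archives if needed."""
    dest = out_root / str(issue_id)
    dest.mkdir(parents=True, exist_ok=True)
    if zipfile.is_zipfile(downloaded):
        with zipfile.ZipFile(downloaded) as zf:
            members = sorted(
                m for m in zf.namelist()
                if m.lower().endswith(('.doc', '.docx'))
            )
            # Assumption: first member (sorted) is the issue body,
            # the remaining members are annexes.
            issue_member, *annexes = members
            (dest / 'en.docx').write_bytes(zf.read(issue_member))
            for n, member in enumerate(annexes, start=1):
                (dest / ('annex%d-en.docx' % n)).write_bytes(zf.read(member))
    else:
        (dest / 'en.docx').write_bytes(downloaded.read_bytes())
```

If the archive's member names turn out to follow a known convention, the "first member is the issue" guess should be replaced with a check against that convention.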

Utility output should not be versioned. If we need to share the downloaded .docx file archive, we can upload it somewhere else.

ronaldtse commented 5 years ago

@strogonoff we should probably scrape all the languages available...?

ronaldtse commented 5 years ago

@andrew2net do you have time for this?

strogonoff commented 5 years ago

@ronaldtse I’d rather have English versions sooner than all languages later (and I don’t want us to accidentally DoS ITU’s site), so I vote to handle it incrementally.

strogonoff commented 5 years ago

It’s obvious that we’ll need other languages eventually, so if we have them, I won’t mind. They’ll be useless for now, though; merging translations will be a challenge for later…

ronaldtse commented 5 years ago

Maybe Relaton-ITU should also provide the links for Word/PDF docs in English/other languages? Thoughts @andrew2net ?

Then this can be a simple wrapper script.

strogonoff commented 5 years ago

Updated issue description to add a note about annex handling, and updated path specification for downloaded contents.

strogonoff commented 5 years ago

Note that OB IDs are sequential integers, but unfortunately each issue’s URL also contains the year of that issue’s publication date (for example, https://www.itu.int/en/publications/ITU-T/pages/publications.aspx?parent=T-SP-OB.1165-2019). I don’t think we can reliably infer the year without actually going through OB archive pages year by year.
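One workaround for the ID-plus-year URL scheme is to generate candidate URLs for a small window of years per issue and try each. A sketch with a hypothetical helper name; the URL pattern is taken verbatim from the example above:

```python
def candidate_urls(issue_id: int, years) -> list:
    """Build candidate issue-page URLs for each possible publication year.

    The caller would request each URL in turn until one resolves, since
    the year cannot be inferred from the issue ID alone.
    """
    base = ('https://www.itu.int/en/publications/ITU-T/pages/'
            'publications.aspx?parent=T-SP-OB.{id}-{year}')
    return [base.format(id=issue_id, year=y) for y in years]
```

Since issues within a year are consecutive, the window can usually be just the last known year and the one after it.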

From what I saw in Relaton-ITU implementation, it’s outside of what it was intended to do (naturally), but perhaps there is a use for it somewhere…

ronaldtse commented 5 years ago

Note that OB IDs are sequential integers, but unfortunately each issue’s URL also contains the year of that issue’s publication date (for example, https://www.itu.int/en/publications/ITU-T/pages/publications.aspx?parent=T-SP-OB.1165-2019). I don’t think we can reliably infer the year without actually going through OB archive pages year by year.

That's fine. Moreover, some OB issues have translations and some don't, so it is necessary to visit every page anyway.

From what I saw in Relaton-ITU implementation, it’s outside of what it was intended to do (naturally), but perhaps there is a use for it somewhere…

It isn't -- Relaton-ITU is supposed to also provide accessible links for the document. DOC/PDF links are all in scope.

andrew2net commented 5 years ago

@andrew2net do you have time for this?

@ronaldtse I have a bunch of uncompleted tasks. But of course, you can change the prioritization.

strogonoff commented 5 years ago

@ronaldtse

some OB issues have translations, and some not

Didn’t know, that’s unfortunate.

It isn't -- Relaton-ITU is supposed to also provide accessible links for the document. DOC/PDF links are all in scope.

If your point is that Relaton-ITU users may want to reference OB issues, then I can see how this may be in scope.

If we could give Relaton-ITU integer ID of OB issue and get document links in return, this would be easy. We can simply iterate over integers from 1 and gather links until it returns an error. Even if Relaton requires a year in addition to issue ID (OB issue URLs contain the year), that can be worked around.
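The iterate-until-error idea can be sketched as follows. `get_links` is a hypothetical callable standing in for a Relaton-ITU lookup (which does not exist for OB yet); it is assumed to raise a `LookupError` for nonexistent issue IDs:

```python
def gather_links(get_links) -> dict:
    """Collect document links for issue IDs 1, 2, 3, ... until lookup fails."""
    links = {}
    ob = 1
    while True:
        try:
            links[ob] = get_links(ob)
        except LookupError:
            # First missing ID means we ran past the newest issue.
            break
        ob += 1
    return links
```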

That said, if Relaton-ITU doesn’t have ITU OB support yet, I believe it may be much faster to write a quick bespoke scraper just for this purpose.

strogonoff commented 5 years ago

This should be on hold until we sort out #20.

strogonoff commented 5 years ago

Rough logic for scraping OB issues from itu.int: approach one

This does not handle issues older than 567, since they are not available through the same archive index on itu.int.

Assumptions

Preliminary tasks

Logic

This pattern implies non-parallel, sequential execution. It can be parallelized (but be careful to throttle downloads to avoid taking the ITU site down by accident).
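A minimal sketch of parallel downloads with a global rate limit, as a hedge against hammering the site. All names here are hypothetical; `fetch` stands in for whatever download function is used:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

class RateLimiter:
    """Allow at most one acquisition per `interval` seconds across threads."""

    def __init__(self, interval: float):
        self.interval = interval
        self._lock = threading.Lock()
        self._next_ok = 0.0

    def wait(self) -> None:
        with self._lock:
            now = time.monotonic()
            delay = max(0.0, self._next_ok - now)
            self._next_ok = max(now, self._next_ok) + self.interval
        if delay:
            time.sleep(delay)

limiter = RateLimiter(interval=1.0)  # at most one request per second

def polite_fetch(url, fetch):
    limiter.wait()
    return fetch(url)

def fetch_all(urls, fetch, workers=4):
    """Download several URLs concurrently, sharing one rate limit."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda u: polite_fetch(u, fetch), urls))
```

The limiter is shared by all workers, so adding threads speeds up the non-network parts without increasing the request rate against itu.int.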

Pseudocode

The pseudocode is clumsy: it tries URL variations with conditionals and keeps state in global variables, but the idea should be clear.

cur_ob = 567
cur_yr = 1994
cur_fmt = 'MSW'
cur_ext = 'doc'
use_oas = false

def download_issue():
  # Try format/extension combinations in order
  succeeded = False
  for cur_fmt, cur_ext in [('MSW', 'doc'), ('MSW', 'docx'), ('ZIP', 'zip')]:
    succeeded = download_all_languages()
    if succeeded:
      break

  # Try toggling OAS
  if not succeeded:
    use_oas = not use_oas
    succeeded = download_all_languages()

  # Perhaps we ran out of issues for the year, increment year
  if not succeeded:
    cur_yr += 1
    succeeded = download_all_languages()

  # Ran out of ideas
  if not succeeded:
    return report_error("Failed to download issue")

  # One of the tries succeeded, increment issue ID and continue from the top
  cur_ob  = cur_ob + 1
  return download_issue()

def download_all_languages():
  for language in ['E', 'F', …]:
    url = format_url(cur_ob, cur_yr, cur_fmt, use_oas, language, cur_ext)

    try:
      downloaded_file = try_download_file(url)

    except NotFound:
      # English version not found means the URL is broken.
      if language == 'E':
        return False
      # Otherwise we may be fine, not all the languages are always present so we’ll try the next one.
      else:
        continue

    else:
      if is_archive(downloaded_file):
        expand_archive(downloaded_file)
      move_files_in_place(downloaded_file)

  return True

def try_download_file(url):
  try:
    return download_file(url)
  except ServerDownOrThrottling:
    sleep(10)
    return try_download_file(url)

def format_url(ob, yr, fmt, use_oas, language, ext):
  pass # Outputs a URL according to the format string

def download_file(url):
  pass # Does the download
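One caveat with the sketch above: `download_issue` calls itself once per issue, so a literal Python implementation would exceed the default recursion limit well before reaching current issue numbers (567 onward is over a thousand issues). An equivalent iterative skeleton, with `download_issue_once` as a hypothetical callable that attempts a single issue and returns `True` on success:

```python
def download_all_issues(start_ob: int, download_issue_once) -> int:
    """Loop over issue IDs instead of recursing; stop at the first failure.

    Returns the first issue ID that failed, i.e. one past the newest
    issue that was downloaded successfully.
    """
    cur_ob = start_ob
    while download_issue_once(cur_ob):
        cur_ob += 1
    return cur_ob
```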

strogonoff commented 5 years ago

Rough scraping logic: approach two