Parse publication date, cutoff date from OB issue in .docx format

strogonoff commented 5 years ago

The idea is to write a utility that would:

1. Take as input a local path to a .docx file[0]
2. Parse that file’s contents into some traversable structure (candidate from Ronald: https://github.com/chrahunt/docx)
3. Obtain the OB issue’s publication date and cutoff date (“Information received by…”) from that structure, converting them to proper date/datetime class instances for the chosen implementation language
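
The date-extraction step could be sketched roughly as follows, in Ruby since the candidate library is a Ruby gem. This is only an illustration under assumptions: it presumes the cutoff date appears as “Information received by 17 May 2019” and the publication date as day, Roman-numeral month, year (e.g. “1.VI.2019”); both formats should be checked against a real issue. The name `parse_issue_dates` is hypothetical, and the function works on plain paragraph strings so it does not depend on any particular .docx library.

```ruby
require 'date'

# Assumed formats, to be verified against a real OB issue:
#   cutoff date:      "Information received by 17 May 2019"
#   publication date: "1.VI.2019" (day.ROMAN_MONTH.year)
ROMAN_MONTHS = %w[I II III IV V VI VII VIII IX X XI XII].freeze
CUTOFF_RE = /Information received by\s+(\d{1,2}\s+[A-Za-z]+\s+\d{4})/
PUBLICATION_RE = /\b(\d{1,2})\.([IVX]+)\.(\d{4})\b/

# Hypothetical helper: takes an array of paragraph strings and returns a
# hash of Date instances (nil where nothing matched). With the docx gem,
# the strings might come from something like
#   Docx::Document.open(path).paragraphs.map(&:to_s)
def parse_issue_dates(paragraphs)
  result = { publication_date: nil, cutoff_date: nil }

  paragraphs.each do |text|
    if result[:cutoff_date].nil? && (m = CUTOFF_RE.match(text))
      # Normalize whitespace before handing the string to strptime.
      result[:cutoff_date] = Date.strptime(m[1].split.join(' '), '%d %B %Y')
    end

    if result[:publication_date].nil? && (m = PUBLICATION_RE.match(text))
      month_index = ROMAN_MONTHS.index(m[2])
      # Skip strings that look similar but carry an invalid Roman month.
      if month_index
        result[:publication_date] = Date.new(m[3].to_i, month_index + 1, m[1].to_i)
      end
    end
  end

  result
end
```

Only the first match for each date is kept, on the assumption that both appear near the top of the first page.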

The utility will deal with English versions for now, but should be able to handle non-English versions too.

In future, the utility will have to parse the contents from all document pages.

Target output will be YAML such as https://github.com/ituob/itu-ob-data/blob/master/issues/1173/meta.yaml, though for now the focus is on getting around .docx parsing.

[0] For example .docx file, see links on this page (look for MS Word icons): https://www.itu.int/en/publications/ITU-T/pages/publications.aspx?parent=T-SP-OB.1173-2019

ronaldtse commented 5 years ago

@andrew2net do you have time for this?

This gem seems to do the job for DOCX: https://github.com/chrahunt/docx

@strogonoff can you provide a real example as attachment e.g. 1173 DOCX => YAML?

> Parse that file’s contents into some traversable structure

Can we assume the traversable structure is just a generic YAML?

strogonoff commented 5 years ago

> @strogonoff can you provide a real example as attachment e.g. 1173 DOCX => YAML?

Updated issue description to provide matching links to 1173 docx & YAML.

> > Parse that file’s contents into some traversable structure
>
> Can we assume the traversable structure is just a generic YAML?

My point was about in-memory data structure at runtime. https://github.com/chrahunt/docx can do that (it offers document.paragraphs and such).

The end output is not specified at this point; dumping YAML or just printing something is fine. The idea is to get some robust traversal of the document going in order to obtain data structures of the necessary shape (e.g., hashes/arrays if Ruby), and later add to that.

In the end the output will be YAML, but the mechanism we’ll use to create YAML will probably not be a dump() function. We want to format YAML consistently for basic human readability and, more importantly, diffing, so we will probably use some simple templating there. It’s not yet certain, but in any case it’ll be a straightforward task. The actual challenges coming up next will be in parsing the rest of the document.
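
For illustration, templated output might look something like the sketch below, using ERB from Ruby's standard library. The key names (`id`, `publication_date`, `cutoff_date`) and the helper name `render_meta` are assumptions made for the example, not the confirmed layout of meta.yaml.

```ruby
require 'date'
require 'erb'

# Hypothetical template: the key names and their order are assumptions.
# A fixed template keeps ordering and formatting stable, so diffs between
# regenerated files stay small.
META_TEMPLATE = ERB.new(<<~YAML)
  id: <%= issue_id %>
  publication_date: <%= publication_date.iso8601 %>
  cutoff_date: <%= cutoff_date.iso8601 %>
YAML

# Renders the metadata of one issue as a YAML string.
def render_meta(issue_id:, publication_date:, cutoff_date:)
  META_TEMPLATE.result_with_hash(
    issue_id: issue_id,
    publication_date: publication_date,
    cutoff_date: cutoff_date
  )
end
```

Calling `render_meta(issue_id: 1173, publication_date: Date.new(2019, 6, 1), cutoff_date: Date.new(2019, 5, 17))` returns a three-line YAML string with ISO 8601 dates.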

ronaldtse commented 5 years ago

> My point was about in-memory data structure at runtime. https://github.com/chrahunt/docx can do that (it offers document.paragraphs and such).

We should not depend on in-memory structure. We can easily use a per-document YAML to represent a parsed file. i.e. DOCX =(independent parser)=> Per-document YAML =(import into OB)=> OB YAML.
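
A rough sketch of that two-stage flow, assuming the per-document YAML carries just the two dates as ISO 8601 strings. The function names and the key names are hypothetical, and the real OB schema may differ.

```ruby
require 'date'
require 'yaml'

# Stage 1 (parser side): serialize what the parser found into a
# per-document YAML string. Dates are written as ISO 8601 strings so the
# file has no dependency on Ruby-specific types.
def dump_parsed_document(parsed)
  YAML.dump({
    'publication_date' => parsed[:publication_date].iso8601,
    'cutoff_date' => parsed[:cutoff_date].iso8601
  })
end

# Stage 2 (import side): read the per-document YAML and map it onto
# whatever shape the OB data needs. The shape here is an assumption.
def import_into_ob(per_document_yaml)
  doc = YAML.safe_load(per_document_yaml, permitted_classes: [Date])
  {
    # to_s makes this work whether the loader returns a String or a Date.
    'publication_date' => Date.iso8601(doc.fetch('publication_date').to_s),
    'cutoff_date' => Date.iso8601(doc.fetch('cutoff_date').to_s)
  }
end
```

The only thing the two stages share is the intermediate YAML text, so the parser does not need to know the OB schema.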

andrew2net commented 5 years ago

> @andrew2net do you have time for this?

@ronaldtse I have a bunch of uncompleted tasks. But of course, you can change the priorities.

strogonoff commented 5 years ago

> > My point was about in-memory data structure at runtime. https://github.com/chrahunt/docx can do that (it offers document.paragraphs and such).
>
> We should not depend on in-memory structure. We can easily use a per-document YAML to represent a parsed file. i.e. DOCX =(independent parser)=> Per-document YAML =(import into OB)=> OB YAML.

@ronaldtse I do not understand what value that extra intermediate serialization step brings. It is not part of my plan. If it is as easy as dump() followed by load(), then I don’t mind whether it happens or not.

ronaldtse commented 5 years ago

> @ronaldtse I do not understand what value that extra intermediate serialization step brings. It is not part of my plan. If it is as easy as dump() followed by load(), then I don’t mind whether it happens or not.

@strogonoff It helps whoever implements the parser stay independent of the schema used in ituob.org, i.e. the implementer of the parser does not need to know the internals of ituob.org. It allows a good separation of responsibilities.

strogonoff commented 5 years ago

Then I’m indifferent as to whether we output YAML or just print hashes during utility runtime. At this point we are simply trying to prove that parsing .docx is possible at all; the output generated will not be used for any other purpose.

Later the utility can be modified to play well with ITU OB data format as a separate task.

By the way, the message format isn’t really an ITU OB site internal. It is used by the ITU OB site, the ITU OB editor app, and possibly more later. (The ITU OB site parses the given YAML into its own internal data structures during build.)

ronaldtse commented 5 years ago

> Then I’m indifferent as to whether we output YAML or just print hashes during utility runtime.

I'd much rather output YAML because "printing hashes" (1) does not leave a record for comparison and (2) we probably want to deal with serializing all internal components of that hash.

> Later the utility can be modified to play well with ITU OB data format as a separate task.

Right.

ituob / itu-ob-data

Parse publication date, cutoff date from OB issue in .docx format #17