Parse Policy Doc PDFs

lunakv commented 2 years ago

The API currently only creates diffs of the CR, because the policy docs are only available as PDFs. To be able to diff policy docs, we must first transform them into some machine-readable representation. There are a number of PDF parsers available, all working slightly differently, so some research should be done into which one works best for this use case.

lunakv commented 2 years ago

@multimeric I tried to integrate the MTR grammar you wrote for Venser's Journal, but I ran into some issues regarding bullet lists. Take the current MTR as an example.

The newline after the last list item is erroneously removed during the cleanup process. For example, in section 1.3, the list should end
```
•  Player 
•  Spectator 
The first four roles above are [...].
```
but instead it's parsed as
```
•  Player 
•  Spectator The first four roles above are [...].
```
which makes it impossible to detect where the last item ends and the following paragraph begins.

Sometimes the list itself is parsed incorrectly, inserting extra bullet points. In section 1.4, the first list should read

[...]
• Individuals currently suspended by the DCI. Individuals currently suspended from the DCI may not act as tournament officials;
• Other individuals specifically prohibited from participation by DCI or Wizards of the Coast policy (such determination is at Wizards of the Coast’s sole discretion);
[...]

Instead, presumably because each point spans multiple lines, the second one is parsed as

• Other individuals specifically prohibited from participation by DCI or Wizards of the Coast policy 
• (such determination is at Wizards of the Coast’s sole discretion);

which is obviously wrong. Sometimes a rogue bullet point is also inserted on an empty line, either out of nowhere or after the list item instead of at the beginning of it (this happens for the first item of the second list in section 1.4).

Since you wrote the thing (and I don't have much experience with this kind of parsing), I was wondering if you had any insight into how to solve these issues before I go digging too deep into it.
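To make the three symptoms above concrete, here is a minimal sketch of a post-processing pass over raw extracted lines. It is not the existing grammar or cleanup code; the function name `regroup` and the `CONTINUATION` heuristic are assumptions made purely for illustration. The heuristic treats a line that starts with a lowercase letter or an opening bracket as the continuation of a wrapped list item, and anything else as the start of a new paragraph. That happens to cover the two MTR examples quoted here, but it would likely misfire on a wrapped line that begins with a capitalized word.

```python
import re

BULLET = "•"
# Assumption (heuristic, not taken from the existing parser): a wrapped
# continuation of a list item starts with a lowercase letter or an opening
# bracket; any other line begins a new block.
CONTINUATION = re.compile(r"^[a-z(\[]")


def regroup(lines):
    """Group raw extracted lines into ("item", text) and ("para", text) blocks."""
    blocks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines carry no content here
        if line.startswith(BULLET):
            text = line[len(BULLET):].strip()
            if not text:
                continue  # rogue bullet on an otherwise empty line
            if blocks and blocks[-1][0] == "item" and CONTINUATION.match(text):
                # Spurious bullet inserted in the middle of a wrapped item:
                # fold the text back into the previous item.
                blocks[-1] = ("item", blocks[-1][1] + " " + text)
            else:
                blocks.append(("item", text))
        elif blocks and blocks[-1][0] == "item" and CONTINUATION.match(line):
            # Unbulleted continuation line of the previous item.
            blocks[-1] = ("item", blocks[-1][1] + " " + line)
        else:
            # Anything else ends the list and is kept as its own paragraph line.
            blocks.append(("para", line))
    return blocks
```

Calling `regroup` on the three lines from section 1.3 keeps "Spectator" and the following sentence as separate blocks, and calling it on the two-line fragment from section 1.4 merges the parenthetical back into its item. The sketch only drops bullet-only lines; it does not try to reattach a bullet that was emitted after its item, and it leaves wrapped paragraph lines unmerged.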

multimeric commented 2 years ago

Cool, thanks for looking into this. I remember there were some issues with the parser, but I never got around to fixing them because the VJ guy wasn't actually hosting it anyway.

KingSupernova31 commented 1 year ago

It occurs to me that the MTG Judge Core app is successfully parsing the IPG and MTR. Maybe Andrew Teo would share their data/method for doing that?

lunakv / academyruins-api

Parse Policy Doc PDFs #1