lunakv / academyruins-api

Rules API for Magic: The Gathering
https://api.academyruins.com/docs
GNU Affero General Public License v3.0

Parse Policy Doc PDFs #1

Open lunakv opened 2 years ago

lunakv commented 2 years ago

The API currently only creates diffs of the CR, because all policy docs are available only as PDFs. To be able to diff policy docs, we must first transform them into some machine-readable representation. There are a number of PDF parsers available, each working slightly differently, so some research is needed to determine which one works best for this use case.
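To illustrate why the machine-readable representation is the hard part: once each version of a policy doc exists as a list of text blocks, the diff itself can be done with the standard library. This is only a minimal sketch, not how the API diffs the CR; `diff_sections` and the sample strings are hypothetical.

```python
import difflib


def diff_sections(old, new):
    """Return (tag, old_blocks, new_blocks) for every changed run of blocks.

    `old` and `new` are lists of strings, e.g. one string per paragraph
    or section of an already-parsed policy doc (an assumption here).
    """
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, old[i1:i2], new[j1:j2]))
    return changes


# Placeholder text, not taken from any real policy doc.
old_doc = ["section A text", "section B text", "section C text"]
new_doc = ["section A text", "section B text, reworded", "section C text"]

for tag, before, after in diff_sections(old_doc, new_doc):
    print(tag, before, after)
```

The quality of such a diff depends entirely on how cleanly the PDF text is split into blocks, which is why the choice of parser matters.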

lunakv commented 2 years ago

@multimeric I tried to integrate the MTR grammar you wrote for Venser's Journal, but I ran into some issues regarding bullet lists. Take the current MTR as an example.

Instead, presumably because each point spans multiple lines, the second one is parsed as

• Other individuals specifically prohibited from participation by DCI or Wizards of the Coast policy 
• (such determination is at Wizards of the Coast’s sole discretion);

which is just obviously wrong. Sometimes a rogue bullet point is inserted on an empty line, either out of nowhere or behind the list item instead of at the beginning of it (this happens for the first item of the second list in section 1.4).
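One possible workaround is to post-process the extracted lines instead of fixing the grammar. The following is a rough sketch under several assumptions: the input contains only the lines of a single list, every real list item ends with `;`, `.` or `:`, and a bullet may appear at either end of a line. `merge_bullets` is a hypothetical helper, not part of the existing grammar.

```python
BULLET = "•"


def merge_bullets(lines):
    """Rejoin list items that PDF extraction split across several lines."""
    items = []
    for raw in lines:
        line = raw.strip()
        # The bullet sometimes ends up behind the item instead of in front.
        has_bullet = line.startswith(BULLET) or line.endswith(BULLET)
        text = line.strip(BULLET).strip()
        if not text:
            # Empty line or a rogue bullet on its own: drop it.
            continue
        # Heuristic: an item that lacks closing punctuation is unfinished.
        prev_open = bool(items) and not items[-1].endswith((";", ".", ":"))
        if items and (not has_bullet or prev_open):
            items[-1] += " " + text  # continuation of the previous item
        else:
            items.append(text)  # start of a new item
    return items


parsed = [
    "• Other individuals specifically prohibited from participation by DCI or Wizards of the Coast policy ",
    "• (such determination is at Wizards of the Coast’s sole discretion);",
    "•",
]
print(merge_bullets(parsed))
```

The punctuation heuristic is fragile: it would wrongly merge items in any list whose entries are not terminated, so it should be checked against the other lists in the document before being relied on.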

Since you wrote the thing (and I don't have much experience with this kind of parsing), I was wondering if you had any insight into how to solve these issues before I go digging too deep into it.

multimeric commented 2 years ago

Cool, thanks for looking into this. I remember there were some issues with the parser, but I never finished them off because the VJ guy wasn't actually hosting it anyway.

KingSupernova31 commented 1 year ago

It occurs to me that the MTG Judge Core app is successfully parsing the IPG and MTR. Maybe Andrew Teo would share their data/method for doing that?