legal-nlp / oab-exams

data about OAB Exams

AREA attribute validation in XML exams #131

Open odanoburu opened 6 years ago

odanoburu commented 6 years ago

a question can have more than one AREA value, although this currently only happens for question 24 in 2013-10. we are currently encoding AREA in XML as an attribute area whose value must be exactly one of the enumerated values (one per possible AREA value), so a two-area question like the one below does not validate against the DTD that follows it:

<question number="24" valid="true" area="INTERNATIONAL CONSTITUTIONAL">...</question>
<!ATTLIST question
      number CDATA #REQUIRED
      valid (true | false) #REQUIRED
      area (ETHICS | PHILOSOPHY | CONSTITUTIONAL | HUMAN-RIGHTS
            | INTERNATIONAL | TAXES | ADMINISTRATIVE | ENVIRONMENTAL
            | CIVIL | CHILDREN | CONSUMER | BUSINESS | CIVIL-PROCEDURE
            | CRIMINAL | CRIMINAL-PROCEDURE | LABOUR | LABOUR-PROCEDURE) #IMPLIED>

if the question has not been tagged with an AREA yet, we don't include the attribute.

but DTDs do not allow an attribute value to be one or more of the enumerated values: an enumerated attribute takes exactly one token, and NMTOKENS accepts a space-separated list but cannot restrict its tokens to an enumeration. so we either move to XML Schema/RELAX NG, or we encode it differently. another possible encoding is to make area an element inside the question element; questions with multiple areas simply get multiple area elements.
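a third option, not proposed above and only sketched here as an assumption, is to keep the space-separated attribute and check its tokens outside the DTD. the helper name invalid_areas is hypothetical; the set of values is copied from the ATTLIST declaration.

```python
import xml.etree.ElementTree as ET

# Enumerated AREA values, copied from the DTD's ATTLIST declaration.
AREAS = {
    "ETHICS", "PHILOSOPHY", "CONSTITUTIONAL", "HUMAN-RIGHTS",
    "INTERNATIONAL", "TAXES", "ADMINISTRATIVE", "ENVIRONMENTAL",
    "CIVIL", "CHILDREN", "CONSUMER", "BUSINESS", "CIVIL-PROCEDURE",
    "CRIMINAL", "CRIMINAL-PROCEDURE", "LABOUR", "LABOUR-PROCEDURE",
}


def invalid_areas(question):
    """Return the tokens of the area attribute that are not known AREA values."""
    value = question.get("area")
    if value is None:  # question not tagged with an AREA yet: nothing to check
        return []
    # Treat the attribute as a whitespace-separated list of tokens.
    return [token for token in value.split() if token not in AREAS]


question = ET.fromstring(
    '<question number="24" valid="true" '
    'area="INTERNATIONAL CONSTITUTIONAL">...</question>'
)
print(invalid_areas(question))  # prints []
```

this keeps the XML compact, but the check lives in a script rather than in the schema, so a plain DTD validator would still have to declare area as NMTOKENS (or CDATA) and would not catch a misspelled value.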

<question number="24" valid="true">
  <area>INTERNATIONAL</area>
  <area>CONSTITUTIONAL</area>
...
</question>

however, I do not like this approach because it breaks the invariant that the content of every element is text from the exam. (this invariant is worth keeping because it makes it easy to retrieve the text.)
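to make the concern concrete, here is a small sketch of what naive text retrieval returns with area as a text-bearing element. the statement element and its placeholder content are assumptions for illustration only; the real content of the question is elided above.

```python
import xml.etree.ElementTree as ET

# Hypothetical question: <statement> and its text are placeholders,
# not taken from the actual exam files.
question = ET.fromstring(
    '<question number="24" valid="true">'
    '<area>INTERNATIONAL</area>'
    '<area>CONSTITUTIONAL</area>'
    '<statement>placeholder statement text</statement>'
    '</question>'
)


def exam_text(element):
    """Concatenate all text content found under element."""
    return "".join(element.itertext())


# The area labels are mixed into what should be exam text only.
print(exam_text(question))
# prints INTERNATIONALCONSTITUTIONALplaceholder statement text
```

with this encoding, any code that retrieves the text has to know about area and skip it explicitly.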

a similar idea is to still use area as an element, but leave it empty and put the AREA in an attribute named value, which works well, even if it seems very verbose.

<question number="24" valid="true">
  <area value="INTERNATIONAL"></area>
  <area value="CONSTITUTIONAL"></area>
...
</question>
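a minimal sketch of reading this encoding, again assuming a hypothetical statement element with placeholder text: the areas come from the value attributes, and the text content of the question contains exam text only.

```python
import xml.etree.ElementTree as ET

# Hypothetical question: <statement> and its text are placeholders,
# not taken from the actual exam files.
question = ET.fromstring(
    '<question number="24" valid="true">'
    '<area value="INTERNATIONAL"></area>'
    '<area value="CONSTITUTIONAL"></area>'
    '<statement>placeholder statement text</statement>'
    '</question>'
)

# Areas live in attributes of the direct <area> children.
areas = [area.get("value") for area in question.findall("area")]

# Empty <area> elements contribute nothing to the text content.
text = "".join(question.itertext())

print(areas)  # prints ['INTERNATIONAL', 'CONSTITUTIONAL']
print(text)   # prints placeholder statement text
```

in a DTD, this would presumably pair with declaring area as an empty element whose value attribute carries the same enumeration, so each value is still validated individually.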