Improve handling of Roman numerals

waldoj commented 11 years ago

@twneale points out a use case that is not allowed for, but that should be:

(h) Lorem ipsum dolor sit amet, consectetur adipiscing elit.
    (i) Integer tincidunt, sem eu pretium condimentum.
    (ii) Sed dui justo, euismod nec mattis a, aliquet quis ante.
(i) Nulla dapibus sem et ligula consectetur vitae sagittis arcu varius.
(j) Proin a mauris sit amet enim ullamcorper ultricies vitae id lectus.

This is a non-trivial modification, because it requires statefulness—an understanding, upon “realizing” that it’s in the midst of a list of Roman numerals, that it must backtrack, reevaluate where that list began, and modify the ancestry of those subsections accordingly. If it encountered only a single subsection of (i), that's especially problematic, because it’s two “i”s in a row, and there’s no hint available that one of them should be a Roman numeral and, thus, a child of (h). That requires an understanding of order (alphabetic, numeric, and Roman numeric) that is not currently present in this, but that seems conceptually straightforward to add.

Thom has found the example problem within the U.S. Code, so it’s not merely hypothetical.

Realistically, this is two problems. The first is the ability to recognize and handle Roman numerals properly, which is to say to understand that "i" (the letter) isn't necessarily the same as "i" (the Roman numeral). The second is the ability to look ahead and handle the unusual-but-extant case of the Roman numeral "i" immediately following "h."
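To make the first of those two problems concrete, here is a minimal Python sketch of working out which readings a given identifier admits. It is not code from this project (which is PHP); the names `roman_to_int` and `possible_kinds` are hypothetical, and it assumes lowercase identifiers with the parentheses already stripped.

```python
import re

# Strict pattern for lowercase Roman numerals from 1 to 3999.
_ROMAN_RE = re.compile(r"m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})")
_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def roman_to_int(label):
    """Return the value of a lowercase Roman numeral, or None if it is not one."""
    # The pattern also matches the empty string, so reject that explicitly.
    if not label or not _ROMAN_RE.fullmatch(label):
        return None
    total = 0
    for pos, char in enumerate(label):
        value = _VALUES[char]
        following = _VALUES[label[pos + 1]] if pos + 1 < len(label) else 0
        # A smaller numeral before a larger one is subtracted (as in "ix").
        total += -value if value < following else value
    return total


def possible_kinds(label):
    """List every numbering scheme under which the identifier is valid."""
    kinds = []
    if len(label) == 1 and "a" <= label <= "z":
        kinds.append("alpha")
    if roman_to_int(label) is not None:
        kinds.append("roman")
    if label.isdigit():
        kinds.append("numeric")
    return kinds
```

An identifier for which `possible_kinds` returns more than one entry (such as "i", "v", or "x") is exactly the kind that cannot be resolved without looking at its neighbors.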

twneale commented 11 years ago

This is an interesting puzzle for sure. If you're at all inclined to work in Python on this one, I have some oldish code lying around that does a reasonably effective job modeling these ("enumerations", I'm in the habit of calling them): https://github.com/unitedstates/uscode/blob/master/uscode/schemes.py

It needs some upkeep, which I'd be happy to provide, but its basic use is to model an enumeration like "a" or "1" or "i" and also "a-3" or "ccc" or "dd-3", etc. It tries to break them into tokens ("a-1" --> ["a", "-", "1"]) that can be ordered, or at least the order of which can be guessed, so that it's possible to say things like "a-3 precedes a-5." But I sense this is for State Decoded and you'll need PHP. But for what it's worth, this is an issue that seems to keep coming up for me too in random ways, so I'm in if you want to collaborate.
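As an illustration of the tokenize-then-order idea, here is a small self-contained Python sketch. It is not the schemes.py API; `tokenize` and `sort_key` are hypothetical names, and the ordering rules (digits compare numerically, a repeated-letter run such as "aa" follows "z") are assumptions made for the example. Roman numerals are deliberately not handled, so "i" is treated as a plain letter.

```python
import re

# Runs of letters, runs of digits, or a single separator character.
_TOKEN_RE = re.compile(r"[a-z]+|[0-9]+|[^a-z0-9\s]")


def tokenize(label):
    """Split an enumeration such as "a-1" into letter, digit, and separator tokens."""
    return _TOKEN_RE.findall(label.lower())


def _token_key(token):
    """Map one token to a (kind, value) pair that sorts consistently."""
    if token[0] in "0123456789":
        return (0, int(token))
    if "a" <= token[0] <= "z":
        if len(set(token)) == 1:
            # Assumption: "aa" follows "z", "aaa" follows "zz", and so on.
            return (1, (len(token) - 1) * 26 + ord(token[0]) - ord("a") + 1)
        # Mixed letter runs fall back to plain string order.
        return (2, token)
    return (3, token)


def sort_key(label):
    """Build a tuple key so that enumerations can be compared or sorted."""
    return tuple(_token_key(token) for token in tokenize(label))
```

With a key like this, `sort_key("a-3") < sort_key("a-5")` expresses "a-3 precedes a-5", and `sorted(labels, key=sort_key)` orders a whole list.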

One slightly creative way to come at this might be to just detect ambiguity in your parser. If so, optionally have the program write out a human-editable markup file that can be deserialized back into a tree. Then most cases would be covered, but truly weird things could still be flagged and given manual attention?
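A rough Python sketch of what such a round trip could look like, assuming a made-up plain-text format: one line per subsection, four spaces of indentation per level, and a leading "?" on anything the parser was unsure about. The function names and the dictionary keys (`label`, `text`, `children`, `ambiguous`) are all hypothetical.

```python
def dump_outline(nodes, depth=0):
    """Serialize nested nodes to indented lines; unsure nodes get a '?' flag."""
    lines = []
    for node in nodes:
        flag = "? " if node.get("ambiguous") else ""
        lines.append("    " * depth + flag + "(" + node["label"] + ") " + node["text"])
        lines.extend(dump_outline(node.get("children", []), depth + 1))
    return lines


def load_outline(lines):
    """Rebuild the tree from (possibly hand-edited) indented lines."""
    root = []
    stack = [(-1, root)]  # pairs of (depth, list that receives children)
    for line in lines:
        stripped = line.lstrip(" ")
        if not stripped:
            continue
        depth = (len(line) - len(stripped)) // 4
        if stripped.startswith("? "):
            stripped = stripped[2:]  # the flag is advisory only
        # Assumes every line starts with a parenthesized label, e.g. "(h) ...".
        label, _, text = stripped[1:].partition(")")
        node = {"label": label, "text": text.strip(), "children": []}
        # Climb back up until the top of the stack is this node's parent.
        while stack[-1][0] >= depth:
            stack.pop()
        stack[-1][1].append(node)
        stack.append((depth, node["children"]))
    return root
```

A person would fix the indentation of any flagged line in the dumped file, and `load_outline` would then take the corrected indentation as the ancestry.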

On Wed, Jun 12, 2013 at 2:53 PM, Waldo Jaquith notifications@github.com wrote:

@twneale (https://github.com/twneale) points out a use case (https://twitter.com/twneale/status/306080682491396096) that is not allowed for, but that should be:

(h) Lorem ipsum dolor sit amet, consectetur adipiscing elit.
    (i) Integer tincidunt, sem eu pretium condimentum.
    (ii) Sed dui justo, euismod nec mattis a, aliquet quis ante.
(i) Nulla dapibus sem et ligula consectetur vitae sagittis arcu varius.
(j) Proin a mauris sit amet enim ullamcorper ultricies vitae id lectus.


waldoj commented 11 years ago

Because this is to be used within The State Decoded, unfortunately it really should be PHP. The good news is I've solved this conceptually—it only remains to execute it. I'm going to break up what's now one pass into two, with the second pass looking both back and forward to see if the identified structural unit is preceded and followed by the expected identifiers, giving special attention to any Roman numerals that could plausibly be letters, and vice-versa. "x" should have been preceded by a "w," and followed by a "y" (if, indeed, the document continues to that point). If "x" is preceded by an "ix," then we know that it's actually a Roman numeral. That's why I'm storing the list of viable identifiers in order, which I'm barely using at this point. All of which sounds a lot like what you've already done in schemes.py—that seems like a good sign. :)
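A rough Python sketch of that second pass, for illustration only (the real implementation would be PHP, and none of these names come from the project). It assumes the first pass has already produced a flat list of lowercase identifiers in document order, and it makes one explicit assumption about tie-breaking: when an identifier is followed by the next Roman numeral, the look-ahead wins, because a Roman list can begin after any letter.

```python
import re

_ROMAN_RE = re.compile(r"m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})")
_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def roman_to_int(label):
    """Value of a lowercase Roman numeral, or None if the label is not one."""
    if not label or not _ROMAN_RE.fullmatch(label):
        return None
    total = 0
    for pos, char in enumerate(label):
        value = _VALUES[char]
        following = _VALUES[label[pos + 1]] if pos + 1 < len(label) else 0
        total += -value if value < following else value
    return total


def alpha_to_int(label):
    """Position of a single lowercase letter in the alphabet, or None."""
    if len(label) == 1 and "a" <= label <= "z":
        return ord(label) - ord("a") + 1
    return None


def classify(labels):
    """Resolve each label to a kind by looking one step back and one ahead."""
    kinds = []
    for idx, label in enumerate(labels):
        alpha, roman = alpha_to_int(label), roman_to_int(label)
        if alpha is None or roman is None:
            # At most one reading is possible, so there is nothing to resolve.
            kinds.append("roman" if roman is not None
                         else "alpha" if alpha is not None else "other")
            continue
        before = labels[idx - 1] if idx > 0 else ""
        after = labels[idx + 1] if idx + 1 < len(labels) else ""
        if label in (before, after):
            # Two identical identifiers in a row: no hint about which is which.
            kinds.append("ambiguous")
            continue
        roman_ahead = roman_to_int(after) == roman + 1
        roman_fits = roman_ahead or (roman > 1 and roman_to_int(before) == roman - 1)
        alpha_fits = (alpha_to_int(before) == alpha - 1
                      or alpha_to_int(after) == alpha + 1)
        if roman_ahead or (roman_fits and not alpha_fits):
            kinds.append("roman")
        elif alpha_fits and not roman_fits:
            kinds.append("alpha")
        else:
            kinds.append("ambiguous")
    return kinds
```

On the example from the top of this issue, `classify(["h", "i", "ii", "i", "j"])` reads the first "i" as Roman (it is followed by "ii") and the second as a letter (it is followed by "j"). The single-child case, `["h", "i", "i", "j"]`, comes back with both "i"s marked ambiguous, which is where a fallback would still be needed.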

The trick is going to be recognizing that hierarchical documents don't necessarily proceed properly, and being able to deal with that. Mistakes happen, as I'm sure you've seen in the structures of laws. Requiring a human to touch it would be a worst-case scenario (as you can imagine, that could be a real mess when importing 40,000 laws), but I think you're right that such circumstances will inevitably arise.

statedecoded / subsection-identifier

Improve handling of Roman numerals #1