opengovfoundation / sanfranciscocode

StateDecoded for San Francisco

American Legal Parser : Known Issues #4

Open krusynth opened 10 years ago

krusynth commented 10 years ago

This issue is to document everything we know about issues that our parser has with American Legal's data.

SubChapter Titles may be hidden, numbering may be confusing.

A good example here is Charter, Article XVI (0-0-0-1327.xml), where the "THE ARTS, MUSIC, SPORTS, AND PRE-SCHOOL FOR EVERY CHILD AMENDMENT OF 2003" section shows up after SEC. 16.123 with section numbering of the SEC. 16.123.X pattern, even though it has nothing to do with SEC. 16.123.

Expired sections not displayed as sections

In the Administrative Code, Chapter 5, Article II (0-0-0-1708.xml), Article II is expired, but rather than preserving the sections listed within it, the export outputs the whole article as a single chunk of content, like a list. Reading these in as sections fails as a result.

Building Code

All parts of the building code (Plumbing, Mechanical, etc.) are grouped within the building code; only the title in each file differentiates them. Currently, we're scraping that title to create one building code with substructures for each part; this is less than ideal.

Inconsistent Naming

Some sections begin with "Sec.", some with "Section", and some with "Secs." (when a heading covers multiple sections), and these may be all caps or natural case, interchangeably within a file. The Charter and Fire Code don't always start titles with "Section", so we use custom parsers that can handle these.
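To make the prefix variations concrete, here is a minimal sketch of a case-insensitive matcher. It is not the parser's actual code: the pattern, the function name `parse_section_number`, and the assumption that section numbers start with a digit are all illustrative. Headings that use no prefix at all (as in parts of the Charter and Fire Code) would still need their own handling.

```python
import re

# Hypothetical pattern: "Sec.", "Secs." or "Section" in any case,
# followed by a number such as "16.123" or a range such as "2.1-2.5".
# Assumes the number starts with a digit, which may not hold everywhere.
SECTION_HEADING = re.compile(
    r"^\s*(?:secs?\.|section)\s+(?P<number>\d+[\w.\-]*)",
    re.IGNORECASE,
)


def parse_section_number(heading):
    """Return the section number from a heading, or None if no prefix matches."""
    match = SECTION_HEADING.match(heading)
    if match is None:
        return None
    # Drop the trailing period that often separates number and catch line.
    return match.group("number").rstrip(".")
```

A heading such as "SEC. 16.123. ..." and one such as "Section 5.1" both go through the same pattern, while a structure heading like "Chapter 5" returns `None` and can be passed to a different parser.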

Similar problems exist for Structure names, and structure types may include Chapter, Division, Part, Section (where the actual sections are SubSections), or Appendix.

Subparagraphs

The nested subparagraphs, subsubparagraphs, etc, are not actually nested in the data that we receive. As a result, we have no way of knowing where the nesting should be performed for text sections. These sections generally begin with <TAB tab-count="1"/>#<TAB tab-count="1"/> where # is the letter of the paragraph.
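As a rough illustration, the marker can at least be split off from the paragraph text. The sketch below assumes the serialized line starts with exactly the TAB markup quoted above and that the label may be written as "a", "(a)" or "1."; the label formats and the name `split_marker` are assumptions, not something the data guarantees. Note that this only recovers the label: it does not tell us the nesting depth, which is the actual problem.

```python
import re

# Only the TAB markup comes from the data described above; the accepted
# label shapes ("a", "(a)", "1.") are assumptions for illustration.
MARKER = re.compile(
    r'^<TAB tab-count="1"/>\(?(?P<label>[A-Za-z0-9]+)\)?\.?'
    r'<TAB tab-count="1"/>(?P<text>.*)$',
    re.DOTALL,
)


def split_marker(line):
    """Return (label, text) for a marked paragraph, or (None, line) otherwise."""
    match = MARKER.match(line)
    if match is None:
        return None, line
    return match.group("label"), match.group("text")
```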

Table of Contents

In most files, the first sub LEVEL encountered is a table of contents. In some cases, this may even be the first two LEVELs. We skip the first one by default, and ignore content that is only a big table with nothing else in the section. Note this may be problematic later. Strangely, these sections always have the toc-section="false" flag set, as do normal sections.
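The skipping rule could be sketched roughly as follows. This is a simplified stand-in rather than our parser: `LEVEL` comes from the data, but the `TABLE` element name and the helper names are assumptions, and "only a big table" is interpreted here as "contains a table and no text outside of tables".

```python
import xml.etree.ElementTree as ET


def text_outside_tables(element):
    """Collect all text under element that is not inside a TABLE subtree."""
    parts = [element.text or ""]
    for child in element:
        if child.tag != "TABLE":  # assumed element name
            parts.append(text_outside_tables(child))
        # Tail text follows the child, so it is outside the table either way.
        parts.append(child.tail or "")
    return "".join(parts)


def is_table_only(level):
    """True if the LEVEL has a table and no other text content."""
    has_table = level.find(".//TABLE") is not None
    return has_table and not text_outside_tables(level).strip()


def content_levels(parent):
    """Skip the first child LEVEL, then drop LEVELs that are only a table."""
    levels = parent.findall("LEVEL")
    return [level for level in levels[1:] if not is_table_only(level)]
```

The second check is what catches files where the first two LEVELs are both tables of contents, but it would also drop a legitimate section that consists solely of a table, which is one way this may be problematic later.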

Tables

Tables with heading rows are displayed as two different tables - one for the head, and one for the body. We deal with this by checking for two consecutive tables where the first table only has one row - when this is encountered, we make it all into one table.
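The merge rule above can be sketched like this, assuming each table has already been reduced to a list of rows and that the input list contains only tables that are directly adjacent in the document. The function name and data shape are illustrative, not taken from the parser.

```python
def merge_header_tables(tables):
    """Merge each one-row table into the table that immediately follows it.

    Each table is a list of rows; each row is a list of cell strings.
    """
    merged = []
    i = 0
    while i < len(tables):
        current = tables[i]
        if len(current) == 1 and i + 1 < len(tables):
            # Treat the single row as the heading of the next table.
            merged.append(current + tables[i + 1])
            i += 2
        else:
            merged.append(current)
            i += 1
    return merged
```

One known weakness of this heuristic: a genuine one-row table that happens to be followed by an unrelated table would be merged as well.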

Images

Images are all exported as JPEGs, regardless of the original format. We cannot show the Seal of SF legally and we probably do not want to show the ALP Icon where we're not authorized to do so, so we skip over images that match.

krusynth commented 10 years ago

Empty Structures

There are quite a few "Reserved" structures in Chicago's code that have no children. We need to be able to handle these. The obvious solution is to check and see if the section parsing was successful, and if not attempt to parse it as a structure.
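A minimal sketch of that fallback, using stand-in parsers: `parse_section`, `parse_structure`, `SectionParseError`, and the dictionary node shape are all hypothetical and only show the control flow, not how the real parsers decide success.

```python
class SectionParseError(Exception):
    """Raised by the stand-in section parser when a node is not a section."""


def parse_section(node):
    # Stand-in rule: a section needs body text in addition to a heading.
    if not node.get("text"):
        raise SectionParseError(node.get("heading", ""))
    return {"type": "section", "heading": node["heading"], "text": node["text"]}


def parse_structure(node):
    # A structure may legitimately have no children (e.g. "Reserved").
    return {
        "type": "structure",
        "heading": node["heading"],
        "children": node.get("children", []),
    }


def parse_node(node):
    """Try the section parser first and fall back to the structure parser."""
    try:
        return parse_section(node)
    except SectionParseError:
        return parse_structure(node)
```

With this ordering, a childless "Reserved" node fails section parsing and ends up as an empty structure instead of being dropped.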