usgpo / bulk-data

User Guides for XML on the govinfo Bulk Data Repository. For information about Bill Status XML Bulk Data, see https://github.com/usgpo/bill-status.
https://www.govinfo.gov/bulkdata

formatting indentation for clauses #96

Open bwnwtg opened 2 years ago

bwnwtg commented 2 years ago

Hi, I'm looking at ECFR-title19.xml and noticed some issues with the formatting.

At <DIV8 N="§ 4.7" NODE="19:1.0.1.1.3.0.1.9" TYPE="SECTION">, the numbered sub-clauses are lumped into the <P> for the alphabetical clause (line breaks are mine):

<P>
(a) The master of every vessel arriving in the United States and required to make entry must have on board the vessel a manifest, as required by section 431, Tariff Act of 1930 (19 U.S.C. 1431), and by this section. The manifest must be legible and complete. If it is in a foreign language, an English translation must be furnished with the original and with any required copies. The required manifest consists of a Vessel Entrance or Clearance Statement, CBP Form 1300, and the following documents:
 (1) Cargo Declaration, CBP Form 1302,
 (2) Ship's Stores Declaration, CBP Form 1303, and
 (3) Crew's Effects Declaration, CBP Form 1304, to which are attached crewmembers' declarations on CBP Form 5129, if the articles will be landed in the United States. Unless the exception at 8 CFR 251.1(a)(6) applies and a paper form is submitted, the master must also electronically submit the data elements required on CBP Form I-418 via an electronic data interchange system approved by CBP, which will be considered part of the manifest. Any document which is not required may be omitted from the manifest provided the word “None” is inserted in items 16, 18, and/or 19 of the Vessel Entrance or Clearance Statement, as appropriate. If a vessel arrives in ballast and therefore the Cargo Declaration is omitted, the legend “No merchandise on board” must be inserted in item 16 of the Vessel Entrance or Clearance Statement.
</P>

In places where there is some formatting, the alphabetical clause is mixed in with the first of a series of numbered sub-clauses. For example, at <DIV8 N="§ 0.1" NODE="19:1.0.1.1.1.0.1.1" TYPE="SECTION">:

<P>
(a)
<I>
  Regulations requiring signatures of Treasury and Homeland Security.
</I>
(1) By Treasury Department Order No. 100-16, set forth in the appendix to this part, the Secretary of the Treasury has delegated to the Secretary of Homeland Security the authority to prescribe all CBP regulations relating to customs revenue functions, except that the Secretary of the Treasury retains the sole authority to approve such CBP regulations concerning subject matters listed in paragraph 1(a)(i) of the order. Regulations for which the Secretary of the Treasury retains the sole authority to approve will be signed by the Secretary of Homeland Security (or his or her DHS delegate), and by the Secretary of the Treasury (or his or her Treasury delegate) to indicate approval.
</P>
<P>
(2) When a regulation described in paragraph (a)(1) of this section is published in the
  <E T="04">Federal Register,</E>
the preamble of the document accompanying the regulation will clearly indicate that it is being issued in accordance with paragraph (a)(1) of this section.
</P>

Can you please make the XML structure reflect the indentation of the clauses?

jonquandt commented 2 years ago

Looking into this.

balldarrens commented 2 years ago

I second this, and I am curious why the full data is available only as XML. Why not JSON as well? ecfr.org has an API endpoint for full XML but not JSON?

Is the bulk data available in JSON, with object nesting that reflects the indentation levels?

For example, I am looking at https://www.govinfo.gov/bulkdata/CFR/2021/title-40/CFR-2021-title40-vol29.xml, specifically section 268.3.

The lists here have no indentation class. See clauses (c)(1)-(c)(6):

<SECTION>
<SECTNO>§ 268.3</SECTNO>
<SUBJECT>Dilution prohibited as a substitute for treatment.</SUBJECT>
<P>
(a) Except as provided in paragraph (b) of this section, no generator, transporter, handler, or owner or operator of a treatment, storage, or disposal facility shall in any way dilute a restricted waste or the residual from treatment of a restricted waste as a substitute for adequate treatment to achieve compliance with subpart D of this part, to circumvent the effective date of a prohibition in subpart C of this part, to otherwise avoid a prohibition in subpart C of
<PRTPAGE P="175"/>
this part, or to circumvent a land disposal prohibition imposed by RCRA section 3004.
</P>
<P>(b) Dilution of wastes that are hazardous only because they exhibit a characteristic in treatment systems which include land- based units which treat wastes subsequently discharged to a water of the United States pursuant to a permit issued under section 402 of the Clean Water Act (CWA), or which treat wastes in a CWA-equivalent treatment system, or which treat wastes for the purposes of pretreatment requirements under section 307 of the CWA is not impermissible dilution for purposes of this section unless a method other than DEACT has been specified in § 268.40 as the treatment standard, or unless the waste is a D003 reactive cyanide wastewater or nonwastewater.</P>
<P>(c) Combustion of the hazardous waste codes listed in Appendix XI of this part is prohibited, unless the waste, at the point of generation, or after any bona fide treatment such as cyanide destruction prior to combustion, can be demonstrated to comply with one or more of the following criteria (unless otherwise specifically prohibited from combustion):</P>
<P>(1) The waste contains hazardous organic constituents or cyanide at levels exceeding the constituent-specific treatment standard found in § 268.48;</P>
<P>(2) The waste consists of organic, debris-like materials (e.g., wood, paper, plastic, or cloth) contaminated with an inorganic metal-bearing hazardous waste;</P>
<P>(3) The waste, at point of generation, has reasonable heating value such as greater than or equal to 5000 BTU per pound;</P>
<P>(4) The waste is co-generated with wastes for which combustion is a required method of treatment;</P>
<P>(5) The waste is subject to Federal and/or State requirements necessitating reduction of organics (including biological agents); or</P>
<P>(6) The waste contains greater than 1% Total Organic Carbon (TOC).</P>
<P>(d) It is a form of impermissible dilution, and therefore prohibited, to add iron filings or other metallic forms of iron to lead-containing hazardous wastes in order to achieve any land disposal restriction treatment standard for lead. Lead-containing wastes include D008 wastes (wastes exhibiting a characteristic due to the presence of lead), all characteristic wastes containing lead as an underlying hazardous constituent, listed wastes containing lead as a regulated constituent, and hazardous media containing any of the aforementioned lead-containing wastes.</P>
<CITA>[61 FR 15663, Apr. 8, 1996, as amended at 61 FR 33682, June 28, 1996; 63 FR 28639, May 26, 1998]</CITA>
</SECTION>

You can observe the formatting hierarchy here: https://www.ecfr.gov/current/title-40/chapter-I/subchapter-I/part-268/subpart-A/section-268.3

jonquandt commented 2 years ago

@bwnwtg - I apologize for the delay in getting back to you.

This is intentional. Please see the ECFR XML user guide - section 2.4:

2.4. Paragraphs
The <P> element and other elements that are closely related to it (examples: <PSPACE>, <FP>,
<P-1>, etc.) are used extensively in the content sections to separate paragraphs. While
paragraphs are often itemized points in an enumerated list with nested sub-lists, the numbering
scheme is hardcoded in the content and there is no nesting of elements to preserve indentation
levels.
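Because the numbering is hardcoded in the paragraph text, a consumer has to infer the indentation levels from the leading markers. The following is a minimal sketch of one way to do that, not a GPO-provided method. It assumes the common CFR marker order (a), (1), (i), (A), that each <P> begins with a single marker, and that a lowercase marker such as (i) is a roman numeral unless it directly continues the letter sequence. The names `marker_level` and `outline` are hypothetical, and the sample text is abbreviated from section 268.3 quoted above.

```python
import re
import xml.etree.ElementTree as ET

MARKER = re.compile(r"^\s*\(([A-Za-z]+|\d+)\)")  # leading "(a)", "(1)", "(ii)", "(A)"
ROMAN = re.compile(r"^[ivx]+$")                  # small roman numerals only (assumption)

def marker_level(marker, prev_level, last_letter):
    """Guess the nesting level (1-4) of a marker; a heuristic, not a full parser."""
    if marker.isdigit():
        return 2
    if marker.isupper():
        return 4
    # A lowercase marker like "i" may be a letter (after "h") or a roman numeral.
    continues_letters = (
        last_letter is not None
        and len(marker) == 1
        and len(last_letter) == 1
        and ord(marker) == ord(last_letter) + 1
    )
    if ROMAN.match(marker) and prev_level >= 2 and not continues_letters:
        return 3
    return 1

def outline(paragraphs):
    """Return (level, marker, text) tuples in document order."""
    result = []
    prev_level, last_letter = 0, None
    for text in paragraphs:
        match = MARKER.match(text)
        if not match:
            # Unnumbered text stays at the level of the preceding paragraph.
            result.append((prev_level, None, text.strip()))
            continue
        marker = match.group(1)
        level = marker_level(marker, prev_level, last_letter)
        if level == 1:
            last_letter = marker
        result.append((level, marker, text.strip()))
        prev_level = level
    return result

# Abbreviated sample in the shape of the flat <P> markup (text shortened with "...").
SAMPLE = """<SECTION>
<P>(b) Dilution of wastes ...</P>
<P>(c) Combustion ... is prohibited, unless ...:</P>
<P>(1) The waste contains hazardous organic constituents ...;</P>
<P>(2) The waste consists of organic, debris-like materials ...;</P>
<P>(d) It is a form of impermissible dilution ...</P>
</SECTION>"""

texts = ["".join(p.itertext()) for p in ET.fromstring(SAMPLE).iter("P")]
for level, marker, text in outline(texts):
    print("    " * (level - 1) + text)
```

This heuristic does not handle a <P> that carries two markers at once, such as the "(a) ... (1) ..." paragraph in § 0.1, or sub-clauses lumped into a single <P> as in § 4.7. Those cases would need the text to be split before levels are assigned.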

GPO has a major initiative, called XPub, to natively compose more documents in XML. As we move additional documents, such as the Federal Register and Code of Federal Regulations, into our XML-based composition system, support for use cases like this is being considered. Here is a snippet from an early sample CFR for public comment as part of the development of the USLM schema:

                     <section style="-uslm-sgm:SECTION" identifier="/us/cfr/t5/s1200.10">
                        <num value="1200.10" style="-uslm-sgm:SECTNO">§ 1200.10</num>
                        <heading style="-uslm-sgm:SUBJECT">Staff organization and functions.</heading>
                        <paragraph identifier="/us/cfr/t5/s1200.10/a" style="-uslm-sgm:P">
                           <num value="a">(a)</num>
                           <chapeau> The Board's headquarters staff is organized into the following offices and divisions:</chapeau>
                           <paragraph identifier="/us/cfr/t5/s1200.10/a/1" style="-uslm-sgm:P">
                              <num value="1">(1)</num>
                              <content> Office of Regional Operations.</content>
                           </paragraph>
                           <paragraph identifier="/us/cfr/t5/s1200.10/a/2" style="-uslm-sgm:P">
                              <num value="2">(2)</num>
                              <content> Office of the Administrative Law Judge.</content>
                           </paragraph>
                           <paragraph identifier="/us/cfr/t5/s1200.10/a/3" style="-uslm-sgm:P">
                              <num value="3">(3)</num>
                              <content> Office of Appeals Counsel.</content>
                           </paragraph>
                           <paragraph identifier="/us/cfr/t5/s1200.10/a/4" style="-uslm-sgm:P">
                              <num value="4">(4)</num>
                              <content> Office of the Clerk of the Board.</content>
                           </paragraph>
                           <paragraph identifier="/us/cfr/t5/s1200.10/a/5" style="-uslm-sgm:P">
                              <num value="5">(5)</num>
                              <content> Office of the General Counsel.</content>
                           </paragraph>
                           <paragraph identifier="/us/cfr/t5/s1200.10/a/6" style="-uslm-sgm:P">
                              <num value="6">(6)</num>
                              <content> Office of Policy and Evaluation.</content>
                           </paragraph>
                           <paragraph identifier="/us/cfr/t5/s1200.10/a/7" style="-uslm-sgm:P">
                              <num value="7">(7)</num>
                              <content> Office of Equal Employment Opportunity.</content>
                           </paragraph>
                           <paragraph identifier="/us/cfr/t5/s1200.10/a/8" style="-uslm-sgm:P">
                              <num value="8">(8)</num>
                              <content> Office of Financial and Administrative Management.</content>
                           </paragraph>
                           <paragraph identifier="/us/cfr/t5/s1200.10/a/9" style="-uslm-sgm:P">
                              <num value="9">(9)</num>
                              <content> Office of Information Resources Management.</content>
                           </paragraph>
                        </paragraph>
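With nesting like this, the indentation can be read directly from the element tree instead of being inferred from the text. Below is a minimal sketch using Python's standard library on a trimmed copy of the sample above. It assumes the elements carry no XML namespace, as in the snippet; published USLM files may declare one, in which case the tag names would need to be qualified. The function name `walk` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample (style attributes dropped, two sub-paragraphs kept).
USLM_SAMPLE = """<section identifier="/us/cfr/t5/s1200.10">
  <num value="1200.10">§ 1200.10</num>
  <heading>Staff organization and functions.</heading>
  <paragraph identifier="/us/cfr/t5/s1200.10/a">
    <num value="a">(a)</num>
    <chapeau> The Board's headquarters staff is organized into the following offices and divisions:</chapeau>
    <paragraph identifier="/us/cfr/t5/s1200.10/a/1">
      <num value="1">(1)</num>
      <content> Office of Regional Operations.</content>
    </paragraph>
    <paragraph identifier="/us/cfr/t5/s1200.10/a/2">
      <num value="2">(2)</num>
      <content> Office of the Administrative Law Judge.</content>
    </paragraph>
  </paragraph>
</section>"""

def walk(element, depth=0):
    """Yield (depth, identifier, text) for each <paragraph>, in document order."""
    for child in element:
        if child.tag != "paragraph":
            continue
        num = child.findtext("num", default="")
        # A paragraph holds either an introductory <chapeau> or a <content> body.
        body = child.findtext("chapeau") or child.findtext("content") or ""
        yield depth, child.get("identifier"), f"{num}{body}"
        yield from walk(child, depth + 1)

for depth, identifier, text in walk(ET.fromstring(USLM_SAMPLE)):
    print("  " * depth + text)
```

The `identifier` attribute also encodes the full path (for example `/us/cfr/t5/s1200.10/a/2`), so the depth could alternatively be derived from it.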

jonquandt commented 2 years ago

@balldarrens

The govinfo bulkdata repository currently provides XML content, not JSON. At this time, there are no specific plans for us to directly provide JSON equivalents to the existing XML bulk data content for the ECFR. As more of our content is composed natively in XML via XPub, there could be an effort to provide a transform into JSON if there is sufficient community desire for that format. That being said, I wouldn't think that JSON is an ideal format for highly structured and complex content where the order is relevant. I suppose those could be represented within a series of nested arrays.
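As an illustration of the nested-array idea, here is a minimal sketch that turns an ordered list of paragraphs into nested JSON. JSON arrays preserve element order, so each `children` array keeps the clauses in document order. This is not an existing govinfo format: the function name `nest` and the field names `marker`, `text`, and `children` are hypothetical, and the input is assumed to be (level, marker, text) tuples with levels starting at 1.

```python
import json

def nest(items):
    """Turn ordered (level, marker, text) tuples into nested dicts with 'children' arrays."""
    root = {"children": []}
    stack = [(0, root)]  # (level, node) pairs for the currently open ancestors
    for level, marker, text in items:
        # Close any open nodes at the same or a deeper level.
        while stack[-1][0] >= level:
            stack.pop()
        node = {"marker": marker, "text": text, "children": []}
        stack[-1][1]["children"].append(node)
        stack.append((level, node))
    return root["children"]

# Abbreviated example in the shape of section 268.3 (text shortened with "...").
items = [
    (1, "c", "(c) Combustion ... is prohibited, unless ...:"),
    (2, "1", "(1) The waste contains hazardous organic constituents ...;"),
    (2, "2", "(2) The waste consists of organic, debris-like materials ...;"),
    (1, "d", "(d) It is a form of impermissible dilution ..."),
]
tree = nest(items)
print(json.dumps(tree, indent=2))
```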

We do provide JSON endpoints via the govinfo API to access content, including the ECFR, but the final content is still XML. Here are some examples so you can see what's possible:

Interactive Documentation
ECFR collections request -- to crawl for new/updated content
ECFR-title19 json package summary