Open bwnwtg opened 2 years ago
Looking into this.
I second this, and I am curious why the full data is available only as XML. Why not JSON as well? ecfr.org has an API endpoint for full XML but not JSON?
Is the bulk data available in JSON, with object levels that correctly reflect the indentation paths?
For example, I am looking at: https://www.govinfo.gov/bulkdata/CFR/2021/title-40/CFR-2021-title40-vol29.xml Specifically, section 268.3.
The lists here have no indentation class. See clauses (c)(1)-(c)(6):
<SECTION>
<SECTNO>§ 268.3</SECTNO>
<SUBJECT>Dilution prohibited as a substitute for treatment.</SUBJECT>
<P>
(a) Except as provided in paragraph (b) of this section, no generator, transporter, handler, or owner or operator of a treatment, storage, or disposal facility shall in any way dilute a restricted waste or the residual from treatment of a restricted waste as a substitute for adequate treatment to achieve compliance with subpart D of this part, to circumvent the effective date of a prohibition in subpart C of this part, to otherwise avoid a prohibition in subpart C of
<PRTPAGE P="175"/>
this part, or to circumvent a land disposal prohibition imposed by RCRA section 3004.
</P>
<P>(b) Dilution of wastes that are hazardous only because they exhibit a characteristic in treatment systems which include land- based units which treat wastes subsequently discharged to a water of the United States pursuant to a permit issued under section 402 of the Clean Water Act (CWA), or which treat wastes in a CWA-equivalent treatment system, or which treat wastes for the purposes of pretreatment requirements under section 307 of the CWA is not impermissible dilution for purposes of this section unless a method other than DEACT has been specified in § 268.40 as the treatment standard, or unless the waste is a D003 reactive cyanide wastewater or nonwastewater.</P>
<P>(c) Combustion of the hazardous waste codes listed in Appendix XI of this part is prohibited, unless the waste, at the point of generation, or after any bona fide treatment such as cyanide destruction prior to combustion, can be demonstrated to comply with one or more of the following criteria (unless otherwise specifically prohibited from combustion):</P>
<P>(1) The waste contains hazardous organic constituents or cyanide at levels exceeding the constituent-specific treatment standard found in § 268.48;</P>
<P>(2) The waste consists of organic, debris-like materials (e.g., wood, paper, plastic, or cloth) contaminated with an inorganic metal-bearing hazardous waste;</P>
<P>(3) The waste, at point of generation, has reasonable heating value such as greater than or equal to 5000 BTU per pound;</P>
<P>(4) The waste is co-generated with wastes for which combustion is a required method of treatment;</P>
<P>(5) The waste is subject to Federal and/or State requirements necessitating reduction of organics (including biological agents); or</P>
<P>(6) The waste contains greater than 1% Total Organic Carbon (TOC).</P>
<P>(d) It is a form of impermissible dilution, and therefore prohibited, to add iron filings or other metallic forms of iron to lead-containing hazardous wastes in order to achieve any land disposal restriction treatment standard for lead. Lead-containing wastes include D008 wastes (wastes exhibiting a characteristic due to the presence of lead), all characteristic wastes containing lead as an underlying hazardous constituent, listed wastes containing lead as a regulated constituent, and hazardous media containing any of the aforementioned lead-containing wastes.</P>
<CITA>[61 FR 15663, Apr. 8, 1996, as amended at 61 FR 33682, June 28, 1996; 63 FR 28639, May 26, 1998]</CITA>
</SECTION>
You can observe the formatting hierarchy here: https://www.ecfr.gov/current/title-40/chapter-I/subchapter-I/part-268/subpart-A/section-268.3
@bwnwtg - I apologize for the delay in getting back to you.
This is intentional. Please see the ECFR XML user guide - section 2.4:
2.4. Paragraphs
The <P> element and other elements that are closely related to it (examples: <PSPACE>, <FP>,
<P-1>, etc.) are used extensively in the content sections to separate paragraphs. While
paragraphs are often itemized points in an enumerated list with nested sub-lists, the numbering
scheme is hardcoded in the content and there is no nesting of elements to preserve indentation
levels.
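Since the numbering is hardcoded in the text, one workaround is to infer the nesting from the leading designators. Below is a minimal sketch of that idea, not anything GPO provides. The names level_of and nest_paragraphs are hypothetical, the sample is an abbreviated stand-in for § 268.3, and the level order (a), (1), (i), (A) is an assumption about the usual CFR paragraph scheme. The heuristic treats "(i)" as a letter only when it directly follows "(h)" in the lettered sequence, so it will misread cases such as (h)(1)(i), and it only looks at the first designator in a paragraph.

```python
import re
import xml.etree.ElementTree as ET

# Leading designator such as "(a)", "(12)", "(iv)" or "(B)".
MARKER = re.compile(r"^\s*\(([A-Za-z]+|\d+)\)")
ROMAN = re.compile(r"^[ivx]+$")


def level_of(marker, last_letter):
    """Guess the depth of a designator: 1=(a), 2=(1), 3=(i), 4=(A)."""
    if marker.isdigit():
        return 2
    if marker.isupper():
        return 4
    # "(i)", "(v)" and "(x)" are ambiguous; treat them as letters only
    # when they directly continue the level-1 letter sequence.
    is_next_letter = (
        len(marker) == 1
        and last_letter is not None
        and ord(marker) == ord(last_letter) + 1
    )
    if ROMAN.match(marker) and not is_next_letter:
        return 3
    return 1


def nest_paragraphs(section_xml):
    """Turn the flat <P> elements of one <SECTION> into a nested dict."""
    root = ET.fromstring(section_xml)
    tree = {"marker": None, "text": "", "children": []}
    stack = [(0, tree)]  # (level, node); the root is never popped
    last_letter = None
    for p in root.iter("P"):
        # itertext() also picks up text after inline tags like <PRTPAGE/>.
        text = " ".join("".join(p.itertext()).split())
        node = {"marker": None, "text": text, "children": []}
        match = MARKER.match(text)
        if not match:
            # Unnumbered paragraph: attach it to the current deepest node.
            stack[-1][1]["children"].append(node)
            continue
        node["marker"] = match.group(1)
        level = level_of(node["marker"], last_letter)
        if level == 1:
            last_letter = node["marker"]
        while stack[-1][0] >= level:
            stack.pop()
        stack[-1][1]["children"].append(node)
        stack.append((level, node))
    return tree


# Abbreviated, made-up stand-in for the section quoted in this thread.
SAMPLE = """<SECTION>
<SECTNO>§ 268.3</SECTNO>
<P>(a) No dilution of
<PRTPAGE P="175"/>
restricted waste.</P>
<P>(b) Dilution in CWA treatment systems.</P>
<P>(c) Combustion is prohibited unless:</P>
<P>(1) First criterion;</P>
<P>(2) Second criterion; or</P>
<P>(d) Iron filings.</P>
</SECTION>"""

tree = nest_paragraphs(SAMPLE)
```

With this sample, (1) and (2) end up as children of (c), and (d) returns to the top level.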
GPO has a major initiative, called XPub, to natively compose more documents in XML. As we move additional documents, such as the Federal Register and Code of Federal Regulations, into our XML-based composition system, support for use cases like this is being considered. Here is a snippet from an early sample CFR for public comment as part of the development of the USLM schema:
<section style="-uslm-sgm:SECTION" identifier="/us/cfr/t5/s1200.10">
<num value="1200.10" style="-uslm-sgm:SECTNO">§ 1200.10</num>
<heading style="-uslm-sgm:SUBJECT">Staff organization and functions.</heading>
<paragraph identifier="/us/cfr/t5/s1200.10/a" style="-uslm-sgm:P">
<num value="a">(a)</num>
<chapeau> The Board's headquarters staff is organized into the following offices and divisions:</chapeau>
<paragraph identifier="/us/cfr/t5/s1200.10/a/1" style="-uslm-sgm:P">
<num value="1">(1)</num>
<content> Office of Regional Operations.</content>
</paragraph>
<paragraph identifier="/us/cfr/t5/s1200.10/a/2" style="-uslm-sgm:P">
<num value="2">(2)</num>
<content> Office of the Administrative Law Judge.</content>
</paragraph>
<paragraph identifier="/us/cfr/t5/s1200.10/a/3" style="-uslm-sgm:P">
<num value="3">(3)</num>
<content> Office of Appeals Counsel.</content>
</paragraph>
<paragraph identifier="/us/cfr/t5/s1200.10/a/4" style="-uslm-sgm:P">
<num value="4">(4)</num>
<content> Office of the Clerk of the Board.</content>
</paragraph>
<paragraph identifier="/us/cfr/t5/s1200.10/a/5" style="-uslm-sgm:P">
<num value="5">(5)</num>
<content> Office of the General Counsel.</content>
</paragraph>
<paragraph identifier="/us/cfr/t5/s1200.10/a/6" style="-uslm-sgm:P">
<num value="6">(6)</num>
<content> Office of Policy and Evaluation.</content>
</paragraph>
<paragraph identifier="/us/cfr/t5/s1200.10/a/7" style="-uslm-sgm:P">
<num value="7">(7)</num>
<content> Office of Equal Employment Opportunity.</content>
</paragraph>
<paragraph identifier="/us/cfr/t5/s1200.10/a/8" style="-uslm-sgm:P">
<num value="8">(8)</num>
<content> Office of Financial and Administrative Management.</content>
</paragraph>
<paragraph identifier="/us/cfr/t5/s1200.10/a/9" style="-uslm-sgm:P">
<num value="9">(9)</num>
<content> Office of Information Resources Management.</content>
</paragraph>
</paragraph>
@balldarrens
The govinfo bulkdata repository currently provides XML content, not JSON. At this time, there are no specific plans for us to directly provide JSON equivalents to the existing XML bulk data content for the ECFR. As more of our content is composed natively in XML via XPub, there could be an effort to provide a transform into JSON if there is sufficient community desire for that format. That being said, I wouldn't think that JSON is an ideal format for highly structured and complex content where the order is relevant. I suppose those could be represented within a series of nested arrays.
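To illustrate the nested-array idea, here is a minimal sketch that turns USLM-style nested paragraph elements into JSON, using arrays so that the order of sub-paragraphs is preserved. It is not an official transform. The function name to_json_node and the output field names are hypothetical, the sample is a shortened copy of the snippet above with the closing tags added, and it assumes un-namespaced tags as shown in that snippet (a real USLM file may declare a namespace).

```python
import json
import xml.etree.ElementTree as ET


def to_json_node(elem):
    """Convert a <section> or <paragraph> element into a JSON-ready dict."""
    node = {
        "identifier": elem.get("identifier"),
        "num": None,
        "text": None,
        "children": [],  # a list keeps sub-paragraphs in document order
    }
    for child in elem:
        if child.tag == "num":
            node["num"] = child.get("value")
        elif child.tag in ("heading", "chapeau", "content"):
            node["text"] = "".join(child.itertext()).strip()
        elif child.tag == "paragraph":
            node["children"].append(to_json_node(child))
    return node


# Shortened copy of the sample CFR snippet quoted in this thread.
SAMPLE = """<section identifier="/us/cfr/t5/s1200.10">
<num value="1200.10">§ 1200.10</num>
<heading>Staff organization and functions.</heading>
<paragraph identifier="/us/cfr/t5/s1200.10/a">
<num value="a">(a)</num>
<chapeau> The Board's headquarters staff is organized into the following offices and divisions:</chapeau>
<paragraph identifier="/us/cfr/t5/s1200.10/a/1">
<num value="1">(1)</num>
<content> Office of Regional Operations.</content>
</paragraph>
<paragraph identifier="/us/cfr/t5/s1200.10/a/2">
<num value="2">(2)</num>
<content> Office of the Administrative Law Judge.</content>
</paragraph>
</paragraph>
</section>"""

doc = to_json_node(ET.fromstring(SAMPLE))
as_json = json.dumps(doc, indent=2, ensure_ascii=False)
```

Each node carries its own identifier path, so a consumer can address a paragraph such as /us/cfr/t5/s1200.10/a/2 without relying on array positions alone.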
We do provide JSON endpoints via the govinfo API to access content, including the ECFR, but the final content is still XML. Here are some examples so you can see what's possible.
Interactive Documentation
ECFR collections request -- to crawl for new/updated content
ECFR-title19 JSON package summary
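For readers who want to try the two kinds of requests mentioned above, here is a rough sketch that builds the request URLs. The helper names are hypothetical, and the paths and query parameters (collections/{collection}/{start date}, packages/{packageId}/summary, offsetMark, pageSize, api_key, and the DEMO_KEY placeholder) reflect my understanding of the govinfo API, so please check them against the interactive documentation before relying on them.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE = "https://api.govinfo.gov"  # assumed base URL of the govinfo API


def collection_url(collection, since, api_key="DEMO_KEY", page_size=100):
    """URL listing packages in a collection modified since an ISO timestamp."""
    query = urlencode(
        {"offsetMark": "*", "pageSize": page_size, "api_key": api_key}
    )
    path = f"/collections/{quote(collection, safe='')}/{quote(since, safe='')}"
    return f"{BASE}{path}?{query}"


def package_summary_url(package_id, api_key="DEMO_KEY"):
    """URL of the JSON summary for one package, e.g. ECFR-title19."""
    query = urlencode({"api_key": api_key})
    return f"{BASE}/packages/{quote(package_id, safe='')}/summary?{query}"


def fetch_json(url):
    """Fetch a URL and decode the JSON body (performs a network request)."""
    with urlopen(url) as response:
        return json.load(response)


crawl_url = collection_url("ECFR", "2024-01-01T00:00:00Z")
summary_url = package_summary_url("ECFR-title19")
```

Calling fetch_json(summary_url) should return the package summary as a dict. The summary is JSON metadata; the regulation text it points to is still XML.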
Hi, I'm looking at ECFR-title19.xml and noticed some formatting issues.
At <DIV8 N="§ 4.7" NODE="19:1.0.1.1.3.0.1.9" TYPE="SECTION">, numbered sub-clauses are lumped into the <P> for the alphabetical clause (line breaks are mine):
And in places where there is some formatting, the alphabetical clause is mixed with a series of numbered sub-clauses, for example at <DIV8 N="§ 0.1" NODE="19:1.0.1.1.1.0.1.1" TYPE="SECTION">.
Can you please make the XML structure reflect the indentation of the clauses?