jzelinskie / faq

Format Agnostic jQ -- process various formats with libjq
Apache License 2.0

problem decoding XML, invalid attribute key label #85

Closed polvi closed 3 years ago

polvi commented 3 years ago
$ curl -s https://www.govinfo.gov/bulkdata/CFR/2020/title-14/CFR-2020-title14-vol2.xml | faq
Error: failed to encode as: invalid attribute key label: #text - due to attributes not being prefixed
Usage:
...

This works...

$ curl -s https://www.govinfo.gov/content/pkg/CFR-2020-title14-vol2/xml/CFR-2020-title14-vol2-sec91-205.xml | faq

jzelinskie commented 3 years ago

I reduced this down to this failure:

faq <<EOF
<p>
  <span>something</span>
  text here
</p>
EOF
Error: failed to encode as: invalid attribute key label: #text - due to attributes not being prefixed
...

Here's the upstream error in the XML library we're using: https://github.com/clbanning/mxj/blob/13245dc365b0de3547c9845087941f04817e7936/xml.go#L1125-L1131

Taking a deeper look in a bit.

polvi commented 3 years ago

This also looks like wrong behavior:

faq <<EOF
<p>
  text here
  <span>something</span>
</p>
EOF
<p>text here</p>

jzelinskie commented 3 years ago

The author fixed this issue upstream.

I updated the dependency and discovered another upstream bug by running the same document through it: https://github.com/clbanning/mxj/issues/91

Right now, that specific document parses fine and jq expressions can be run against it, but it cannot be converted back into XML.

# Blocked on #91
curl -s https://www.govinfo.gov/bulkdata/CFR/2020/title-14/CFR-2020-title14-vol2.xml | ./faq
Error: failed to encode as pretty: xml.Decoder.Token() - XML syntax error on line 1: invalid character entity & (no semicolon)

# Works
curl -s https://www.govinfo.gov/bulkdata/CFR/2020/title-14/CFR-2020-title14-vol2.xml | ./faq -o json | head
{
  "CFRDOC": {
    "-noNamespaceSchemaLocation": "CFRMergedXML.xsd",
    "-xsi": "http://www.w3.org/2001/XMLSchema-instance",
    "AMDDATE": "Jan. 1, 2020",
    "BMTR": {
      "ALPHLIST": {
        "AGENCY": [
          "Administrative Conference of the United States",
          "Advisory Council on Historic Preservation",

jzelinskie commented 3 years ago

OK, both of these were fixed in fb4f6a4c352298b10c7677c9acfaa0dd78ec97d2 and 7f3a4184279af050fb1ee3ae146da716cea243f8.