kislyuk / yq

Command-line YAML, XML, TOML processor - jq wrapper for YAML/XML/TOML documents
https://kislyuk.github.io/yq/
Apache License 2.0

xq doesn't handle &nbsp; #108

Closed BlackthornYugen closed 3 years ago

BlackthornYugen commented 3 years ago

Tidy seems happy with my XML, but xq won't parse it until I remove &nbsp;.

╭─🕙─> 
╰─$ grep -n nbsp a.xml             
11182:      <td>&nbsp;</td>
11184:      <td align="right">&nbsp;</td>
11192:      <td>&nbsp;</td>
11194:      <td align="right">&nbsp;</td>

╭─🕙─> 
╰─$ tidy -xml a.xml > /dev/null
No warnings or errors were found.

To learn more about HTML Tidy see http://tidy.sourceforge.net
Please send bug reports to html-tidy@w3.org
HTML and CSS specifications are available from http://www.w3.org/
Lobby your company to join W3C, see http://www.w3.org/Consortium

╭─🕙─>
╰─$ xq '.' a.xml | nl | (head;tail)
xq: Error running jq: ExpatError: undefined entity: line 11182, column 10.

╭─🕙─>
╰─$ sed -i.old 's/&nbsp;//' a.xml

╭─🕙─>
╰─$ xq '.' a.xml | nl | (head;tail)
1   {
2     "html": {
3       "@lang": "en",
4       "@xmlns": "http://www.w3.org/1999/xhtml",
5       "head": {
6         "meta": [
7           {
8             "@name": "generator",
9             "@content": "HTML Tidy for Mac OS X (vers 31 October 2006 - Apple Inc. build 17.2), see www.w3.org"
10          },
14944                   "@align": "right"
14945                 },
14946                 null
14947               ]
14948             }
14949           ]
14950         }
14951       }
14952     }
14953   }

kislyuk commented 3 years ago

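You're trying to parse HTML using an XML parser. Unfortunately that won't work out of the box. XML predefines only five named entities (&amp;, &lt;, &gt;, &apos; and &quot;), so HTML entities such as &nbsp; are undefined unless the document declares them. Searching for the error message turns up extensive discussion of this.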

There are many hacky workarounds for this, but none are built into the Python standard library XML parsers. I'm on the fence about what to do here: I'd like yq to be able to parse HTML files out of the box, but I'd also like to find a minimally hacky way to configure the XML parser to do so.
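
For illustration, one such workaround can be sketched in Python: rewrite HTML named entities as numeric character references before the text reaches the XML parser. This is only a sketch, not what yq does. The helper name `html_entities_to_numeric` is made up for this example, and the regex is deliberately naive, so it would also rewrite matches inside comments or CDATA sections.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities XML predefines; these must be left untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "apos", "quot"}


def html_entities_to_numeric(text):
    """Rewrite HTML named entities (e.g. &nbsp;) as numeric references (&#160;).

    Hypothetical helper for this sketch; not part of yq or xmltodict.
    """
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave predefined or unknown names as they are
        return "&#%d;" % name2codepoint[name]

    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, text)


doc = '<td align="right">&nbsp;</td>'

try:
    ET.fromstring(doc)
except ET.ParseError as err:
    print("before:", err)  # the parser rejects the undeclared entity

fixed = html_entities_to_numeric(doc)
print("fixed: ", fixed)
print("after: ", repr(ET.fromstring(fixed).text))
```

Unlike deleting the entity, this keeps the non-breaking space in the parsed text, which may or may not be what a given pipeline wants.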

BlackthornYugen commented 3 years ago

Ah, so & isn't valid?

╭─🕐─>
╰─$ echo "<xml>nbsp;</xml>"  | xq
{
  "xml": "nbsp;"
}
╭─🕐─> 
╰─$ echo "<xml>&nbsp;</xml>"  | xq
xq: Error running jq: ExpatError: undefined entity: line 1, column 5.
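
The two runs above can be extended into a small check of which references an XML parser accepts. The sketch below calls Python's `xml.parsers.expat` directly, on the assumption (suggested by the ExpatError in the output) that xq ends up in the same parser. The helper `parses` and the sample strings are made up for this example.

```python
import xml.parsers.expat


def parses(document):
    """Return True if expat accepts the document, False on any ExpatError."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(document, True)
        return True
    except xml.parsers.expat.ExpatError:
        return False


samples = [
    "<xml>nbsp;</xml>",   # plain text, no entity reference at all
    "<xml>&amp;</xml>",   # one of XML's five predefined entities
    "<xml>&#160;</xml>",  # numeric character reference for U+00A0
    "<xml>&nbsp;</xml>",  # HTML named entity, never declared in this document
    "<xml>a & b</xml>",   # bare ampersand, not well-formed
    # The same entity, declared in an internal DTD subset first.
    '<!DOCTYPE xml [<!ENTITY nbsp "&#160;">]><xml>&nbsp;</xml>',
]

for sample in samples:
    print("ok      " if parses(sample) else "rejected", sample)
```

So the ampersand itself is legal when it starts a reference the parser knows: a predefined entity, a numeric reference, or an entity the document declares. Only the undeclared &nbsp; and the bare & are rejected.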
BlackthornYugen commented 3 years ago

Not sure what the least hacky way to solve this in yq would be. For now, piping through sed first is fine for me. Here's what I went with:

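function parseNexus() {
    # Fetch the Nexus browse page for repository $1, convert it to XHTML with
    # tidy, strip every &nbsp; (g flag: all occurrences per line, not just the
    # first), then print the href of the first cell of each four-cell row.
    curl -s "https://nexus.mysite.tld/service/rest/repository/browse/${1}/" | \
    tidy -asxml 2> /dev/null | \
    sed 's/&nbsp;//g' | \
    xq -r '.html.body.table.tr | map(select(. | has("td") and (.td | length) == 4 )) | .[] | .td | .[0] | .a["@href"]'
}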
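
For readers less used to jq, here is a rough Python rendering of what that filter selects. The `page` dictionary is an invented, trimmed-down stand-in for the JSON xq would emit, not real Nexus output, and `first_cell_links` is a name made up for this sketch.

```python
# Hypothetical stand-in for xq's JSON output; the real page has more rows and fields.
page = {
    "html": {"body": {"table": {"tr": [
        # Header row: has no "td" key, so the filter skips it.
        {"th": ["Name", "Last Modified", "Size", "Description"]},
        # Row with a single cell: "td" is an object of length 1, so it is skipped.
        {"td": {"a": {"@href": "../", "#text": "Parent Directory"}}},
        # Four-cell rows: these are the ones the filter keeps.
        {"td": [{"a": {"@href": "1.0.0/", "#text": "1.0.0"}}, None, None, None]},
        {"td": [{"a": {"@href": "1.1.0/", "#text": "1.1.0"}}, None, None, None]},
    ]}}}
}


def first_cell_links(document):
    """Mirror the jq filter: the href in the first cell of every four-cell row."""
    rows = document["html"]["body"]["table"]["tr"]
    return [
        row["td"][0]["a"]["@href"]
        for row in rows
        if "td" in row and len(row["td"]) == 4
    ]


print("\n".join(first_cell_links(page)))
```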

PS: I know that Nexus has a REST API, but I didn't want to deal with its pagination. :)