ietf-tools / bibxml-data-archive

rsync mirror of bibxml files from xml2rfc.tools.ietf.org

Files not using UTF-8 and files with NUL characters #1

Open strogonoff opened 2 years ago

strogonoff commented 2 years ago

I haven’t looked deeply into this yet, and it may be possible to handle it at the indexing stage, e.g. by trying to guess the encoding.

For now, the upcoming xml2rfc_compat.source indexing logic will skip the affected files. This means that if any of those files is needed in an xml2rfc-style path fallback scenario, the BibXML service will return an unexpected 404.

strogonoff commented 2 years ago

To clarify, the logic reading each file is:

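for xml_fpath in xml_fpaths:  # xml_fpaths: the paths being indexed (name assumed here)
    try:
        with open(xml_fpath, 'r', encoding='utf-8') as xml_fhandler:
            xml_data = xml_fhandler.read()
    except UnicodeDecodeError as err:
        # This is what happens with some files; they are skipped.
        continue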

And the total number of problematic files across all subdirectories is 159, out of 183806 files total.
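
A count like this could be reproduced with a small scan over the mirror. The following is only a sketch, not the actual xml2rfc_compat.source code: the function name `find_problem_files` and the `*.xml` glob are assumptions made for illustration. It reads raw bytes, attempts a strict UTF-8 decode, and also flags NUL characters, which decode successfully and so would not raise an error on their own.

```python
from pathlib import Path


def find_problem_files(root):
    """Return {path: reason} for files that are not valid UTF-8 or contain NUL.

    Hypothetical helper for illustration; not part of the indexing code.
    """
    problems = {}
    for path in sorted(Path(root).rglob("*.xml")):
        raw = path.read_bytes()
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError as err:
            # str(err) carries the offending byte value and its position.
            problems[str(path)] = f"UnicodeDecodeError ({err})"
            continue
        if "\x00" in text:
            # NUL is valid UTF-8 but not allowed in XML documents.
            problems[str(path)] = "contains NUL character"
    return problems
```

Calling `len(find_problem_files(mirror_root))` on a local checkout (where `mirror_root` is whatever directory holds the bibxml subdirectories) would give the number of affected files.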

TonyLHansen commented 2 years ago

The old scripts would take the data literally from the 1id-index.txt file input and create an XML, with no checking of the data in there. The 1id-index.txt files themselves were created back then from input that wasn't necessarily checked.

For 159 files, we might be able to deal with them by hand. Unfortunately, it's a guessing game as to which encoding they DO use. Most likely it'll be one of the ISO-8859 variants, but the Windows encodings probably snuck in a lot as well.

strogonoff commented 2 years ago

Understood, I’ll see if it’s low-effort to auto-detect and/or fix encoding in Python and will compile a list of problematic files later…
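
One low-effort approach in Python would be to try a short list of candidate encodings in order. This is a minimal sketch under assumptions: the helper name `decode_with_fallback` and the candidate order (UTF-8, then Windows-1252, then ISO-8859-1) are illustrative guesses based on the encodings mentioned in this thread, not what the indexer actually does. Note that ISO-8859-1 accepts every byte, so it works only as a last resort and says nothing about whether the result is correct.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252", "latin-1")):
    """Decode bytes with the first candidate encoding that succeeds.

    Returns (text, encoding_used). The candidate list is an assumption;
    a successful decode does not prove the guess was right.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")
```

For example, `decode_with_fallback(b"it\x92s")` falls through UTF-8 and decodes the 0x92 byte as a right single quotation mark under cp1252. Recording the encoding that was used makes it easier to review the guesses by hand afterwards.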

strogonoff commented 2 years ago

Complete list of problematic files (these are currently skipped during indexing, meaning that if an xml2rfc path needs to fall back to one of these, the request will fail):

/bibxml3/reference.I-D.draft-terrell-math-quant-ternary-logic-of-binary-sys-09.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 1570: invalid start byte)
/bibxml3/reference.I-D.draft-bombadil-netlemmings-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x96 in position 113: invalid start byte)
/bibxml3/reference.I-D.draft-ietf-ancp-protocol-01.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x93 in position 600: invalid start byte)
/bibxml3/reference.I-D.draft-ietf-ancp-protocol-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x93 in position 600: invalid start byte)
/bibxml3/reference.I-D.draft-irtf-dtnrg-tcp-clayer-01.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xf6 in position 311: invalid start byte)
/bibxml3/reference.I-D.draft-terrell-math-quant-ternary-logic-of-binary-sys-08.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 1566: invalid start byte)
/bibxml3/reference.I-D.draft-ietf-ancp-protocol-02.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x93 in position 604: invalid start byte)
/bibxml3/reference.I-D.draft-ietf-smime-ibcs-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 645: invalid start byte)
/bibxml3/reference.I-D.draft-irtf-dtnrg-tcp-clayer-02.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xf6 in position 311: invalid start byte)
/bibxml3/reference.I-D.draft-xu-yang-retargeting-security-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 442: invalid start byte)
/bibxml3/reference.I-D.draft-martin-bfibecms-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 688: invalid start byte)
/bibxml3/reference.I-D.draft-ietf-smime-ibearch-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 454: invalid start byte)
/bibxml3/reference.I-D.draft-ietf-l2vpn-vpls-bridge-interop-01.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x96 in position 1011: invalid start byte)
/bibxml3/reference.I-D.draft-naseh-scaling-slb-01.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 494: invalid start byte)
/bibxml3/reference.I-D.draft-sajassi-l2vpn-vpls-bridge-interop-03.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x96 in position 1011: invalid start byte)
/bibxml3/reference.I-D.draft-naseh-scaling-slb-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 496: invalid start byte)
/bibxml3/reference.I-D.draft-maes-lemonade-p-imap-12.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xe9 in position 204: invalid continuation byte)
/bibxml3/reference.I-D.draft-ietf-l2vpn-vpls-bridge-interop-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x96 in position 1008: invalid start byte)
/bibxml3/reference.I-D.draft-naseh-scaling-slb-02.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 495: invalid start byte)
/bibxml3/reference.I-D.draft-irtf-dtnrg-dtn-uri-scheme-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xf6 in position 462: invalid start byte)
/bibxml3/reference.I-D.draft-tian-sa-mip-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 772: invalid start byte)
/bibxml3/reference.I-D.draft-martinbeckman-ietf-ipv6-amp-ipv6hcamp-01.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x93 in position 584: invalid start byte)
/bibxml3/reference.I-D.draft-yu-tel-dai-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x93 in position 302: invalid start byte)
/bibxml3/reference.I-D.draft-ietf-smime-bfibecms-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 689: invalid start byte)
/bibxml3/reference.I-D.draft-martinbeckman-ietf-ipv6-amp-ipv6hcamp-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x93 in position 589: invalid start byte)
/bibxml3/reference.I-D.draft-maes-lemonade-http-binding-04.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xe9 in position 191: invalid continuation byte)
/bibxml3/reference.I-D.draft-chen-afec-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 489: invalid start byte)
/bibxml3/reference.I-D.draft-mnapierala-mvpn-rev-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 595: invalid start byte)
/bibxml3/reference.I-D.draft-uruena-xbe32-02.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xf1 in position 171: invalid continuation byte)
/bibxml3/reference.I-D.draft-thomas-hunter-reed-ospf-lite-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 2192: invalid start byte)
/bibxml3/reference.I-D.draft-janardhan-naveen-rtgwg-equalcostroutes-rip-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 601: invalid start byte)
/bibxml3/reference.I-D.draft-chen-afec-01.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 485: invalid start byte)
/bibxml3/reference.I-D.draft-ietf-lemonade-profile-07.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xe9 in position 176: invalid continuation byte)
/bibxml3/reference.I-D.draft-yang-mpls-resouce-sharing-mbb-cspf-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xa3 in position 356: invalid start byte)
/bibxml3/reference.I-D.draft-holla-ospf-update-graceful-restart-01.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 573: invalid start byte)
/bibxml3/reference.I-D.draft-mnapierala-mvpn-part-reqt-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 601: invalid start byte)
/bibxml3/reference.I-D.draft-sethom-dynamic-router-selection-01.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 455: invalid start byte)
/bibxml3/reference.I-D.uruena-xbe32.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xf1 in position 171: invalid continuation byte)
/bibxml3/reference.I-D.draft-perera-mobopts-motivation-cam-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x93 in position 431: invalid start byte)
/bibxml3/reference.I-D.draft-holla-ospf-update-graceful-restart-02.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 577: invalid start byte)
/bibxml3/reference.I-D.draft-sethom-adhoc-gateway-selection-01.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 807: invalid start byte)
/bibxml3/reference.I-D.draft-ietf-mip4-radius-requirements-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x91 in position 439: invalid start byte)
/bibxml3/reference.I-D.miloucheva-udlr-mipv6.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x96 in position 830: invalid start byte)
/bibxml3/reference.I-D.draft-ietf-enum-cnam-02.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x91 in position 211: invalid start byte)
/bibxml3/reference.I-D.draft-terrell-math-quant-ternary-logic-of-binary-sys-06.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 1569: invalid start byte)
/bibxml3/reference.I-D.draft-terrell-math-quant-ternary-logic-of-binary-sys-12.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 1573: invalid start byte)
/bibxml3/reference.I-D.draft-martinbeckman-ietf-ipv6-fls-ipv6flowswitching-01.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x93 in position 772: invalid start byte)
/bibxml3/reference.I-D.draft-ietf-enum-infrastructure-02.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x93 in position 487: invalid start byte)
/bibxml3/reference.I-D.draft-ietf-enum-infrastructure-03.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x93 in position 487: invalid start byte)
/bibxml3/reference.I-D.draft-martinbeckman-ietf-ipv6-fls-ipv6flowswitching-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x93 in position 772: invalid start byte)
/bibxml3/reference.I-D.draft-terrell-math-quant-ternary-logic-of-binary-sys-07.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 1569: invalid start byte)
/bibxml3/reference.I-D.draft-terrell-math-quant-ternary-logic-of-binary-sys-11.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 1573: invalid start byte)
/bibxml3/reference.I-D.draft-terrell-math-quant-ternary-logic-of-binary-sys-05.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x91 in position 694: invalid start byte)
/bibxml3/reference.I-D.draft-martinbeckman-ietf-ipv6-fls-ipv6flowswitching-02.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x93 in position 772: invalid start byte)
/bibxml3/reference.I-D.draft-ietf-enum-infrastructure-01.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x93 in position 446: invalid start byte)
/bibxml3/reference.I-D.draft-ietf-enum-infrastructure-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x93 in position 444: invalid start byte)
/bibxml3/reference.I-D.draft-ietf-idwg-idmef-xml-16.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xe9 in position 208: invalid continuation byte)
/bibxml3/reference.I-D.draft-martinbeckman-ietf-ipv6-fls-ipv6flowswitching-03.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x93 in position 769: invalid start byte)
/bibxml3/reference.I-D.draft-terrell-math-quant-ternary-logic-of-binary-sys-10.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 1570: invalid start byte)
/bibxml4/reference.W3C.WD-ws-addr-metadata-20070627.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 442: invalid continuation byte)
/bibxml4/reference.W3C.WD-SVG11-20100622.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xe8 in position 241: invalid continuation byte)
/bibxml4/reference.W3C.WD-ws-policy-guidelines-20070928.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 880: invalid continuation byte)
/bibxml4/reference.W3C.WD-xml-media-types-20041102.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 179: invalid continuation byte)
/bibxml4/reference.W3C.WD-ws-policy-primer-20070928.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 839: invalid continuation byte)
/bibxml4/reference.W3C.WD-ws-policy-guidelines-20070330.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 881: invalid continuation byte)
/bibxml4/reference.W3C.WD-xml-media-types-20040608.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 179: invalid continuation byte)
/bibxml4/reference.W3C.WD-wsdl11elementidentifiers-20070131.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 556: invalid continuation byte)
/bibxml4/reference.W3C.WD-ws-policy-primer-20070330.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 736: invalid continuation byte)
/bibxml4/reference.W3C.WD-wsdl11elementidentifiers-20070330.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 353: invalid continuation byte)
/bibxml4/reference.W3C.PR-ws-policy-attach-20070706.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 844: invalid continuation byte)
/bibxml4/reference.W3C.WD-ws-policy-guidelines-20070810.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 880: invalid continuation byte)
/bibxml4/reference.W3C.CR-ws-policy-20070228.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 829: invalid continuation byte)
/bibxml4/reference.W3C.CR-ws-policy-attach-20070330.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 549: invalid continuation byte)
/bibxml4/reference.W3C.CR-ws-policy-20070605.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 536: invalid continuation byte)
/bibxml4/reference.W3C.WD-ws-policy-primer-20070810.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 839: invalid continuation byte)
/bibxml4/reference.W3C.WD-ws-addr-wsdl-20060216.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 459: invalid continuation byte)
/bibxml4/reference.W3C.WD-SVG11-20110512.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xe8 in position 737: invalid continuation byte)
/bibxml4/reference.W3C.WD-speech-synthesis11-20071212.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xe5 in position 369: invalid continuation byte)
/bibxml4/reference.W3C.WD-ws-policy-primer-20070605.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 244: invalid continuation byte)
/bibxml4/reference.W3C.WD-speech-synthesis11-20070904.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xe5 in position 267: invalid continuation byte)
/bibxml4/reference.W3C.PR-ws-addr-metadata-20070731.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 250: invalid continuation byte)
/bibxml4/reference.W3C.NOTE-xml-media-types-20050502.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 184: invalid continuation byte)
/bibxml4/reference.W3C.WD-ws-addr-metadata-20070516.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 250: invalid continuation byte)
/bibxml4/reference.W3C.CR-ws-policy-attach-20070228.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 844: invalid continuation byte)
/bibxml4/reference.W3C.CR-ws-policy-20070330.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 629: invalid continuation byte)
/bibxml4/reference.W3C.CR-ws-policy-attach-20070605.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 748: invalid continuation byte)
/bibxml4/reference.W3C.WD-ws-addr-metadata-20070202.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 440: invalid continuation byte)
/bibxml4/reference.W3C.CR-ws-addr-wsdl-20060529.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 246: invalid continuation byte)
/bibxml4/reference.W3C.WD-ws-policy-guidelines-20061221.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 778: invalid continuation byte)
/bibxml4/reference.W3C.PR-ws-policy-20070706.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 829: invalid continuation byte)
/bibxml4/reference.W3C.WD-ws-policy-primer-20061221.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 539: invalid continuation byte)

ronaldtse commented 2 years ago

The bibxml3/* files causing problems are encoded in Windows-1252 (CP1252), as evidenced by bytes 0x91 and 0x92 being used as apostrophes in the original texts.

The problems in the bibxml4/* files are due to a misencoded author name, "Ümit Yalçınalp". I have now repaired them manually.
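
A code-page guess like this can be sanity-checked by decoding the byte values from the error list under the candidate encodings. The sketch below is illustrative only: the helper name `compare_code_pages` and the choice of sample bytes are assumptions. The quote and dash bytes (0x91, 0x92, 0x93, 0x96) decode to the same punctuation in both Windows-1251 and Windows-1252, so the letter bytes (0xe9, 0xf1, 0xf6) are the ones that distinguish the two.

```python
# Byte values taken from the error messages listed earlier in this thread.
SAMPLE_BYTES = [0x91, 0x92, 0x93, 0x96, 0xE9, 0xF1, 0xF6]


def compare_code_pages(byte_values, code_pages=("cp1251", "cp1252")):
    """Map each byte value to the character it yields under each code page."""
    table = {}
    for value in byte_values:
        raw = bytes([value])
        table[value] = {cp: raw.decode(cp) for cp in code_pages}
    return table


for value, decoded in compare_code_pages(SAMPLE_BYTES).items():
    columns = ", ".join(f"{cp}={ch!r}" for cp, ch in decoded.items())
    print(f"0x{value:02x}: {columns}")
```

Under Windows-1252 the letter bytes come out as Latin letters with diacritics, while under Windows-1251 they come out as Cyrillic letters, so comparing the output against the expected author names and titles shows which code page fits.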

ronaldtse commented 2 years ago

Please close when #13 is merged.