Open strogonoff opened 2 years ago
To clarify, the logic reading each file is:

    # Runs once per file, inside the indexing loop.
    with open(xml_fpath, 'r', encoding='utf-8') as xml_fhandler:
        try:
            xml_data = xml_fhandler.read()
        except UnicodeDecodeError as err:
            # This is what happens with some files; they are skipped.
            continue

And the total number of problematic files across all subdirectories is 159, out of 183806 files total.
The old scripts would take the data literally from the 1id-index.txt file input and create an XML, with no checking of the data in there. The 1id-index.txt files themselves were created back then from input that wasn't necessarily checked.
For 159 files, we might be able to deal with them by hand. Unfortunately, it's a guessing game as to which encoding they DO use. Most likely it'll be one of the iso-8859 variants, but the Windows encodings probably also snuck in a lot.
Understood, I’ll see if it’s low-effort to auto-detect and/or fix encoding in Python and will compile a list of problematic files later…
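One low-effort approach is to read the raw bytes and try a short list of candidate encodings in order. The sketch below is only an illustration: the helper name `read_xml_text` and the fallback order (cp1252, then iso-8859-1) are assumptions based on the guesses in this thread, not what the indexer actually does. It also ignores any encoding declared in the XML prolog.

```python
def read_xml_text(raw, fallbacks=("cp1252", "iso-8859-1")):
    """Decode raw bytes, returning (text, encoding_used).

    UTF-8 is tried first; each fallback is then tried in order.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    for encoding in fallbacks:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Only reached if every fallback fails (iso-8859-1 accepts any byte,
    # so this matters mainly for custom fallback lists).
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"


# A Windows-style apostrophe (0x92) is not valid UTF-8 but decodes under cp1252.
text, used = read_xml_text(b"it\x92s")
print(repr(text), used)
```

In the indexing loop this would mean opening files with `open(xml_fpath, 'rb')` and logging the encoding that was used, so files decoded by a fallback can still be reviewed by hand.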
Complete list of problematic files (these are currently skipped during indexing, meaning if xml2rfc path needs to fall back to one of these the request will fail):
/bibxml3/reference.I-D.draft-terrell-math-quant-ternary-logic-of-binary-sys-09.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 1570: invalid start byte)
/bibxml3/reference.I-D.draft-bombadil-netlemmings-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x96 in position 113: invalid start byte)
/bibxml3/reference.I-D.draft-ietf-ancp-protocol-01.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x93 in position 600: invalid start byte)
/bibxml3/reference.I-D.draft-ietf-ancp-protocol-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x93 in position 600: invalid start byte)
/bibxml3/reference.I-D.draft-irtf-dtnrg-tcp-clayer-01.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xf6 in position 311: invalid start byte)
/bibxml3/reference.I-D.draft-terrell-math-quant-ternary-logic-of-binary-sys-08.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 1566: invalid start byte)
/bibxml3/reference.I-D.draft-ietf-ancp-protocol-02.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x93 in position 604: invalid start byte)
/bibxml3/reference.I-D.draft-ietf-smime-ibcs-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 645: invalid start byte)
/bibxml3/reference.I-D.draft-irtf-dtnrg-tcp-clayer-02.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xf6 in position 311: invalid start byte)
/bibxml3/reference.I-D.draft-xu-yang-retargeting-security-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 442: invalid start byte)
/bibxml3/reference.I-D.draft-martin-bfibecms-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 688: invalid start byte)
/bibxml3/reference.I-D.draft-ietf-smime-ibearch-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 454: invalid start byte)
/bibxml3/reference.I-D.draft-ietf-l2vpn-vpls-bridge-interop-01.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x96 in position 1011: invalid start byte)
/bibxml3/reference.I-D.draft-naseh-scaling-slb-01.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 494: invalid start byte)
/bibxml3/reference.I-D.draft-sajassi-l2vpn-vpls-bridge-interop-03.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x96 in position 1011: invalid start byte)
/bibxml3/reference.I-D.draft-naseh-scaling-slb-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 496: invalid start byte)
/bibxml3/reference.I-D.draft-maes-lemonade-p-imap-12.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xe9 in position 204: invalid continuation byte)
/bibxml3/reference.I-D.draft-ietf-l2vpn-vpls-bridge-interop-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x96 in position 1008: invalid start byte)
/bibxml3/reference.I-D.draft-naseh-scaling-slb-02.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 495: invalid start byte)
/bibxml3/reference.I-D.draft-irtf-dtnrg-dtn-uri-scheme-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xf6 in position 462: invalid start byte)
/bibxml3/reference.I-D.draft-tian-sa-mip-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 772: invalid start byte)
/bibxml3/reference.I-D.draft-martinbeckman-ietf-ipv6-amp-ipv6hcamp-01.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x93 in position 584: invalid start byte)
/bibxml3/reference.I-D.draft-yu-tel-dai-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x93 in position 302: invalid start byte)
/bibxml3/reference.I-D.draft-ietf-smime-bfibecms-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 689: invalid start byte)
/bibxml3/reference.I-D.draft-martinbeckman-ietf-ipv6-amp-ipv6hcamp-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x93 in position 589: invalid start byte)
/bibxml3/reference.I-D.draft-maes-lemonade-http-binding-04.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xe9 in position 191: invalid continuation byte)
/bibxml3/reference.I-D.draft-chen-afec-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 489: invalid start byte)
/bibxml3/reference.I-D.draft-mnapierala-mvpn-rev-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 595: invalid start byte)
/bibxml3/reference.I-D.draft-uruena-xbe32-02.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xf1 in position 171: invalid continuation byte)
/bibxml3/reference.I-D.draft-thomas-hunter-reed-ospf-lite-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 2192: invalid start byte)
/bibxml3/reference.I-D.draft-janardhan-naveen-rtgwg-equalcostroutes-rip-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 601: invalid start byte)
/bibxml3/reference.I-D.draft-chen-afec-01.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 485: invalid start byte)
/bibxml3/reference.I-D.draft-ietf-lemonade-profile-07.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xe9 in position 176: invalid continuation byte)
/bibxml3/reference.I-D.draft-yang-mpls-resouce-sharing-mbb-cspf-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xa3 in position 356: invalid start byte)
/bibxml3/reference.I-D.draft-holla-ospf-update-graceful-restart-01.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 573: invalid start byte)
/bibxml3/reference.I-D.draft-mnapierala-mvpn-part-reqt-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 601: invalid start byte)
/bibxml3/reference.I-D.draft-sethom-dynamic-router-selection-01.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 455: invalid start byte)
/bibxml3/reference.I-D.uruena-xbe32.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xf1 in position 171: invalid continuation byte)
/bibxml3/reference.I-D.draft-perera-mobopts-motivation-cam-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x93 in position 431: invalid start byte)
/bibxml3/reference.I-D.draft-holla-ospf-update-graceful-restart-02.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 577: invalid start byte)
/bibxml3/reference.I-D.draft-sethom-adhoc-gateway-selection-01.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 807: invalid start byte)
/bibxml3/reference.I-D.draft-ietf-mip4-radius-requirements-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x91 in position 439: invalid start byte)
/bibxml3/reference.I-D.miloucheva-udlr-mipv6.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x96 in position 830: invalid start byte)
/bibxml3/reference.I-D.draft-ietf-enum-cnam-02.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x91 in position 211: invalid start byte)
/bibxml3/reference.I-D.draft-terrell-math-quant-ternary-logic-of-binary-sys-06.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 1569: invalid start byte)
/bibxml3/reference.I-D.draft-terrell-math-quant-ternary-logic-of-binary-sys-12.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 1573: invalid start byte)
/bibxml3/reference.I-D.draft-martinbeckman-ietf-ipv6-fls-ipv6flowswitching-01.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x93 in position 772: invalid start byte)
/bibxml3/reference.I-D.draft-ietf-enum-infrastructure-02.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x93 in position 487: invalid start byte)
/bibxml3/reference.I-D.draft-ietf-enum-infrastructure-03.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x93 in position 487: invalid start byte)
/bibxml3/reference.I-D.draft-martinbeckman-ietf-ipv6-fls-ipv6flowswitching-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x93 in position 772: invalid start byte)
/bibxml3/reference.I-D.draft-terrell-math-quant-ternary-logic-of-binary-sys-07.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 1569: invalid start byte)
/bibxml3/reference.I-D.draft-terrell-math-quant-ternary-logic-of-binary-sys-11.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 1573: invalid start byte)
/bibxml3/reference.I-D.draft-terrell-math-quant-ternary-logic-of-binary-sys-05.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x91 in position 694: invalid start byte)
/bibxml3/reference.I-D.draft-martinbeckman-ietf-ipv6-fls-ipv6flowswitching-02.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x93 in position 772: invalid start byte)
/bibxml3/reference.I-D.draft-ietf-enum-infrastructure-01.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x93 in position 446: invalid start byte)
/bibxml3/reference.I-D.draft-ietf-enum-infrastructure-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x93 in position 444: invalid start byte)
/bibxml3/reference.I-D.draft-ietf-idwg-idmef-xml-16.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xe9 in position 208: invalid continuation byte)
/bibxml3/reference.I-D.draft-martinbeckman-ietf-ipv6-fls-ipv6flowswitching-03.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x93 in position 769: invalid start byte)
/bibxml3/reference.I-D.draft-terrell-math-quant-ternary-logic-of-binary-sys-10.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 1570: invalid start byte)
/bibxml4/reference.W3C.WD-ws-addr-metadata-20070627.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 442: invalid continuation byte)
/bibxml4/reference.W3C.WD-SVG11-20100622.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xe8 in position 241: invalid continuation byte)
/bibxml4/reference.W3C.WD-ws-policy-guidelines-20070928.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 880: invalid continuation byte)
/bibxml4/reference.W3C.WD-xml-media-types-20041102.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 179: invalid continuation byte)
/bibxml4/reference.W3C.WD-ws-policy-primer-20070928.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 839: invalid continuation byte)
/bibxml4/reference.W3C.WD-ws-policy-guidelines-20070330.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 881: invalid continuation byte)
/bibxml4/reference.W3C.WD-xml-media-types-20040608.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 179: invalid continuation byte)
/bibxml4/reference.W3C.WD-wsdl11elementidentifiers-20070131.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 556: invalid continuation byte)
/bibxml4/reference.W3C.WD-ws-policy-primer-20070330.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 736: invalid continuation byte)
/bibxml4/reference.W3C.WD-wsdl11elementidentifiers-20070330.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 353: invalid continuation byte)
/bibxml4/reference.W3C.PR-ws-policy-attach-20070706.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 844: invalid continuation byte)
/bibxml4/reference.W3C.WD-ws-policy-guidelines-20070810.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 880: invalid continuation byte)
/bibxml4/reference.W3C.CR-ws-policy-20070228.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 829: invalid continuation byte)
/bibxml4/reference.W3C.CR-ws-policy-attach-20070330.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 549: invalid continuation byte)
/bibxml4/reference.W3C.CR-ws-policy-20070605.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 536: invalid continuation byte)
/bibxml4/reference.W3C.WD-ws-policy-primer-20070810.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 839: invalid continuation byte)
/bibxml4/reference.W3C.WD-ws-addr-wsdl-20060216.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 459: invalid continuation byte)
/bibxml4/reference.W3C.WD-SVG11-20110512.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xe8 in position 737: invalid continuation byte)
/bibxml4/reference.W3C.WD-speech-synthesis11-20071212.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xe5 in position 369: invalid continuation byte)
/bibxml4/reference.W3C.WD-ws-policy-primer-20070605.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 244: invalid continuation byte)
/bibxml4/reference.W3C.WD-speech-synthesis11-20070904.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xe5 in position 267: invalid continuation byte)
/bibxml4/reference.W3C.PR-ws-addr-metadata-20070731.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 250: invalid continuation byte)
/bibxml4/reference.W3C.NOTE-xml-media-types-20050502.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 184: invalid continuation byte)
/bibxml4/reference.W3C.WD-ws-addr-metadata-20070516.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 250: invalid continuation byte)
/bibxml4/reference.W3C.CR-ws-policy-attach-20070228.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 844: invalid continuation byte)
/bibxml4/reference.W3C.CR-ws-policy-20070330.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 629: invalid continuation byte)
/bibxml4/reference.W3C.CR-ws-policy-attach-20070605.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 748: invalid continuation byte)
/bibxml4/reference.W3C.WD-ws-addr-metadata-20070202.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 440: invalid continuation byte)
/bibxml4/reference.W3C.CR-ws-addr-wsdl-20060529.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 246: invalid continuation byte)
/bibxml4/reference.W3C.WD-ws-policy-guidelines-20061221.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 778: invalid continuation byte)
/bibxml4/reference.W3C.PR-ws-policy-20070706.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 829: invalid continuation byte)
/bibxml4/reference.W3C.WD-ws-policy-primer-20061221.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 539: invalid continuation byte)
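To see which bytes dominate a report like the one above, the lines can be tallied by offending byte. This is a minimal sketch: `tally_bytes` is a hypothetical helper, and the regular expression assumes the exact message format shown in this list.

```python
import re
from collections import Counter

# Matches the "can't decode byte 0x.. in position N" part of each report line.
BYTE_RE = re.compile(r"can't decode byte (0x[0-9a-f]{2}) in position (\d+)")


def tally_bytes(report_lines):
    """Count how often each offending byte value appears in the report."""
    counts = Counter()
    for line in report_lines:
        match = BYTE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# A few lines copied from the list above, used as sample input.
sample = [
    "/bibxml3/reference.I-D.draft-terrell-math-quant-ternary-logic-of-binary-sys-09.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 1570: invalid start byte)",
    "/bibxml3/reference.I-D.draft-bombadil-netlemmings-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x96 in position 113: invalid start byte)",
    "/bibxml3/reference.I-D.draft-ietf-smime-ibcs-00.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0x92 in position 645: invalid start byte)",
    "/bibxml4/reference.W3C.WD-ws-addr-metadata-20070627.xml: UnicodeDecodeError ('utf-8' codec can't decode byte 0xc3 in position 442: invalid continuation byte)",
]

print(tally_bytes(sample).most_common())
```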
The bibxml3/* files causing problems are encoded in a Windows code page (CP1251 or, more likely given bytes such as 0xe9, 0xf1 and 0xf6 in author names, CP1252), as evidenced by 0x91 and 0x92 being used as apostrophes in the original texts.
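The code-page reading can be checked by decoding the offending bytes from the error list under a few candidate encodings. The snippet below is a small illustration, not part of the indexer; the choice of candidates (cp1252, cp1251, iso-8859-1) is an assumption drawn from the guesses in this thread.

```python
# Byte values taken from the error list above.
suspect_bytes = [0x91, 0x92, 0x93, 0x96, 0xE9, 0xF1, 0xF6]
candidates = ("cp1252", "cp1251", "iso-8859-1")


def decode_byte(value, encoding):
    """Decode one byte under the given encoding; None if the byte is unmapped."""
    try:
        return bytes([value]).decode(encoding)
    except UnicodeDecodeError:
        return None


table = {
    value: {enc: decode_byte(value, enc) for enc in candidates}
    for value in suspect_bytes
}

for value, decoded in table.items():
    print(f"0x{value:02x}", {enc: ascii(ch) for enc, ch in decoded.items()})
```

Both Windows code pages map 0x91 to 0x93 to curly quotes, so those bytes alone do not distinguish them. The high bytes do: under cp1252 they are accented Latin letters, while under cp1251 they are Cyrillic letters.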
The issues with the bibxml4/* files are due to a misencoded author name, "Ümit Yalçınalp". I have manually repaired them now.
Please close when #13 is merged.
UnicodeDecodeError, examples: /bibxml3/sv/reference.I-D.tian-sa-mip.xml, /bibxml3/sv/reference.I-D.levin-simple-msrp-review.xml. (I suppose it could also be caused by incorrect UTF-8 encoding?)
I haven’t looked deeply into this yet, and it may be possible to handle it at the indexing stage, e.g. by trying to guess the encoding.
For now the upcoming xml2rfc_compat.source indexing logic will skip the relevant files, but this means that if any of those files is needed in an xml2rfc-style path fallback scenario, the BibXML service will return an unexpected 404.