hassanakbar4 / tractive-test


Python3 cannot read a UTF-8 file without a BOM being present #338

Closed hassanakbar4 closed 3 years ago

hassanakbar4 commented 6 years ago

component_Version 2 cli resolution_fixed type_defect | by ietf@augustcellars.com


In Python 3, if there is no BOM at the start of the file but the file is UTF-8 rather than plain ASCII, an error occurs when the first non-ASCII UTF-8 character is reached.
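
A minimal sketch of how this kind of failure can surface, assuming the interpreter's default text encoding cannot decode UTF-8 (for example ASCII under a C/POSIX locale). The file name and contents are made up for illustration, and `encoding="ascii"` stands in for that assumed default; it is not taken from xml2rfc:

```python
import os
import tempfile

# Hypothetical sample: UTF-8 XML without a BOM, containing curly quotes.
sample = '<?xml version="1.0" encoding="utf-8"?>\n<t>\u201cquoted\u201d</t>\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "draft.xml")
    with open(path, "wb") as f:
        f.write(sample.encode("utf-8"))

    # encoding="ascii" simulates a default encoding that cannot decode UTF-8.
    try:
        with open(path, "r", encoding="ascii") as f:
            f.read()
        failed = False
    except UnicodeDecodeError:
        # Raised when the decoder reaches the first non-ASCII byte.
        failed = True
```

Whether a given system hits this depends on the encoding that text-mode `open()` picks up from the environment.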


Issue migrated from trac:338 at 2021-10-20 18:25:17 +0500

hassanakbar4 commented 6 years ago

ietf@augustcellars.com commented


I have made a fix local to my system for this in parser.py about line 440

OLD self.text = six.binary_type(open(self.source, "rU").read(), 'utf8')

NEW self.text = six.binary_type(open(self.source, "rU", encoding='utf8').read(), 'utf8')

This makes the file be read as UTF-8 even when no BOM is present. The same change may need to be pushed back into the Python 2 code as well, but it seems to work fine based on testing several files.

hassanakbar4 commented 6 years ago

julian.reschke@gmx.de commented


What's relevant should be the XML declaration, and UTF-8 with or without a BOM should work even without a declaration. See the XML spec. A proper XML parser ought to deal with all these cases.

hassanakbar4 commented 6 years ago

henrik@levkowetz.com commented


Yes, I suspect this is an incorrect bug report. There's a test to read a unicode xml file in the xml2rfc test suite, and the test suite is run for python 2.7, 3.3, 3.4, and 3.5 using tox before releases (see xml2rfc/trunk/cli/tox.ini).

I think the proposed fix will break xml2rfc under python3 if it is given files with an xml declaration that specifies for instance encoding="latin-1"; that is, any encoding different from ascii and utf-8.
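
A small sketch of that concern, using a made-up document rather than anything from the xml2rfc test suite. Decoding the raw bytes as UTF-8 here mirrors what a text-mode read with a forced `encoding='utf8'` would attempt:

```python
# Hypothetical document that declares latin-1 and is stored as latin-1.
raw = '<?xml version="1.0" encoding="latin-1"?>\n<t>caf\u00e9</t>\n'.encode("latin-1")

# Decoding as UTF-8 regardless of the declaration fails on the byte 0xE9,
# which is not a valid UTF-8 sequence when followed by "<".
try:
    raw.decode("utf8")
    forced_utf8_ok = True
except UnicodeDecodeError:
    forced_utf8_ok = False
```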

hassanakbar4 commented 6 years ago

ietf@augustcellars.com uploaded file draft-ietf-quic-http.md (62.6 KiB)

hassanakbar4 commented 6 years ago

ietf@augustcellars.com uploaded file draft-ietf-quic-http.xml (109.8 KiB)

hassanakbar4 commented 6 years ago

ietf@augustcellars.com commented


I have added the .md file which sourced the failing .xml file. It has double quotes that are angled rather than straight. The fix I gave makes this file run.

hassanakbar4 commented 6 years ago

henrik@levkowetz.com commented


Ok, but when running with these combinations:

the provided xml file, with unicode double-quotes, works as-is without any fix.

With which xml2rfc / python / OS versions do you see this failing?

hassanakbar4 commented 6 years ago

ietf@augustcellars.com commented


I am running python 3.6.3 on windows. I do not know what version is running on the Circle CI work that Martin is doing. https://circleci.com/gh/quicwg/base-drafts/3856?utm_campaign=build-failed&utm_medium=email&utm_source=notification

hassanakbar4 commented 6 years ago

henrik@levkowetz.com commented


Ok. xml2rfc version and windows version, please?

It would also be good to have the exact failure output.

hassanakbar4 commented 6 years ago

martin.thomson@gmail.com commented


The docker image I used is here: https://hub.docker.com/r/martinthomson/i-d-template/builds/bfea63z4fkjwv6bkacjbfyk/

As you can see, this is using python 3.5.2 on ubuntu 16.04 with xml2rfc 2.8.2.

hassanakbar4 commented 6 years ago

henrik@levkowetz.com commented


Replying to hassanakbar4/tractive-test#338 (comment:8):

> The docker image I used is here: https://hub.docker.com/r/martinthomson/i-d-template/builds/bfea63z4fkjwv6bkacjbfyk/
>
> As you can see, this is using python 3.5.2 on ubuntu 16.04 with xml2rfc 2.8.2.

Ahh. Splendid. Together with additional info from Jim, this makes me believe the problem could be related to environmental settings, rather than OS or python version. Could you try this patch, please:

Index: xml2rfc/parser.py
===================================================================
--- xml2rfc/parser.py   (revision 2395)
+++ xml2rfc/parser.py   (working copy)
@@ -437,7 +437,7 @@
         if six.PY2:
             self.text = open(self.source, "rU").read()
         else:
-            self.text = six.binary_type(open(self.source, "rU").read(), 'utf8')
+            self.text = open(self.source, "rUb").read()

         # Get an iterating parser object
         file = six.BytesIO(self.text)
hassanakbar4 commented 6 years ago

martin.thomson@gmail.com commented


That change worked for me.

hassanakbar4 commented 6 years ago

henrik@levkowetz.com changed resolution to `fixed`

hassanakbar4 commented 6 years ago

henrik@levkowetz.com commented


Excellent.

Fixed in [2396]:

Changed the Python 3 code that reads in an XML file to read it as binary, so that we do not run into issues with unicode conversion before we have had time to look at the encoding attribute of the XML declaration.
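
A rough illustration of why a binary read helps, using the standard library's `xml.etree.ElementTree` as a stand-in for the parser (an assumption for the sake of a self-contained example; the sample documents are hypothetical). Given undecoded bytes, the parser can pick the encoding from the XML declaration itself:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical inputs: one declared and stored as ISO-8859-1 (latin-1),
# one stored as UTF-8 without a BOM and containing curly quotes.
latin1_bytes = (
    '<?xml version="1.0" encoding="iso-8859-1"?>\n<t>caf\u00e9</t>\n'
).encode("iso-8859-1")
utf8_bytes = (
    '<?xml version="1.0" encoding="utf-8"?>\n<t>\u201cquoted\u201d</t>\n'
).encode("utf-8")

# Hand the raw bytes to the parser, as a binary-mode read would provide them.
# No decoding happens before the parser has seen the declaration.
latin1_text = ET.parse(io.BytesIO(latin1_bytes)).getroot().text
utf8_text = ET.parse(io.BytesIO(utf8_bytes)).getroot().text
```

Both documents parse to the expected text without any encoding being passed to `open()` or chosen by the environment.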

I've released 2.8.3 with this fix.


hassanakbar4 commented 6 years ago

henrik@levkowetz.com changed status from new to closed