ERROR: error reading/parsing XML: Input is not proper UTF-8, indicate encoding ! Bytes: 0xB1 0xB1 0xB1 0xEA

oetiker / rrdtool-1.x

RRDtool 1.x - Round Robin Database

http://www.rrdtool.org

GNU General Public License v2.0


ERROR: error reading/parsing XML: Input is not proper UTF-8, indicate encoding ! Bytes: 0xB1 0xB1 0xB1 0xEA #1173

Open k79e opened 2 years ago

k79e commented 2 years ago

Describe the bug A clear and concise description of what the bug is.

To Reproduce Steps to reproduce the behavior:

Use the Windows version to create an RRD file and write some data into it. Then dump it and restore the dump into a newly created RRD file. Restoring the XML to RRD fails with this error:

... parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xB1 0xB1 0xB1 0xEA
        <lastupdate>1650275305</lastupdate> <!-- 2022-04-18 17:48:25  -->
                                                                       ^
ERROR: error reading/parsing XML: Input is not proper UTF-8, indicate encoding !
 Bytes: 0xB1 0xB1 0xB1 0xEA

Expected behavior: The restore should complete without errors.


Desktop (please complete the following information): Win7 x64

Additional context: I have updated the error log.

c72578 commented 2 years ago

@k79e Thanks for your report. Could you please post the following info:

Version of rrdtool, e.g. rrdtool-1.8.0-x64_vcpkg, rrdtool-1.8.0-bin-mingw64
Codepage used in Command Prompt or Powershell. Just enter chcp
Encoding of the xml file after dump. Open the xml file e.g. using Notepad++
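The encoding check requested above can also be done without an editor. The following is a minimal Python sketch, not part of RRDtool: it reports the offset of the first byte that is not valid UTF-8, which is roughly the condition the XML parser complains about. The function name `first_invalid_utf8_offset` and the file name in the usage note are hypothetical.

```python
def first_invalid_utf8_offset(data: bytes):
    """Return the offset of the first byte that is not valid UTF-8, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first offending byte
        return exc.start
    return None
```

Usage, assuming the dump was written to test.xml: `first_invalid_utf8_offset(open("test.xml", "rb").read())` returns None for a clean UTF-8 file and a byte offset otherwise.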

k79e commented 2 years ago

I think I tried both the VC and MinGW versions, and the result is the same.

chcp returned active codepage 936

The xml file seems to be just a plain text file. It doesn't have a header. The xml encoding is UTF-8 without BOM (ANSI as UTF-8).

c72578 commented 2 years ago

Thanks for providing further details. It is unexpected that the xml file does not contain a header. By default RRDtool will add a dtd header to the xml file [1]. It should be:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE rrd SYSTEM "https://oss.oetiker.ch/rrdtool/rrdtool.dtd">
<!-- Round Robin Database Dump -->
...

Could you please post the command, how the xml dump is created?

[1] https://oss.oetiker.ch/rrdtool/doc/rrddump.en.html

k79e commented 2 years ago

About the header: I meant the byte-level (hex) header for UTF-8, not the XML declaration.

Maybe this dump can be restored normally with the Linux version.

k79e commented 2 years ago

It's the same with the Linux version. Also, an RRD file created by the new version can't be updated by the old version (1.7.1).

Anyway, I used that old version to create an RRD file, wrote some data into it, and exported it the same way as on Windows: rrdtool dump test.rrd test.xml

I found that the xml differs from the Windows version in the comment ("note"). On Windows the time format includes something like binary characters, but on Linux it is plain text.

e.g. Linux: <lastupdate>1650339082</lastupdate>

Windows: <lastupdate>1650304758</lastupdate> (in Notepad++ the text after it is shown as xb1 xb1 xb1 xea.....)

Full error log looks like

These binary characters can't be displayed in cmd, which might make the error look weird.

Hex view.

oetiker commented 2 years ago

guess switching to numeric timezones would make sense :-) or go to C locale ...
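To illustrate the idea of numeric timezones, here is a small Python sketch. It is not RRDtool's code, and it assumes the dump comment is built with a strftime-style format. A numeric UTC offset (%z) is always plain ASCII, whereas a timezone name (%Z) can come out in the system's ANSI codepage on a localized Windows. The function name `format_with_numeric_offset` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

def format_with_numeric_offset(epoch, tz=None):
    """Format a timestamp with a numeric UTC offset instead of a timezone name."""
    # tz=None means the system's local timezone
    dt = datetime.fromtimestamp(epoch, tz).astimezone(tz)
    return dt.strftime("%Y-%m-%d %H:%M:%S %z")

# Timestamp from the error log above, rendered at UTC+8
print(format_with_numeric_offset(1650275305, timezone(timedelta(hours=8))))
# prints: 2022-04-18 17:48:25 +0800
```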

c72578 commented 2 years ago

@k79e Thanks for the additional information. I could reproduce the issue now. The encoding of the dumped xml file under Windows is ANSI by default, not UTF-8. Special characters, like those in the timezone name, are encoded differently. So, for now, a solution is to convert the dumped xml file from ANSI to UTF-8 and then restore it. You can use e.g. Windows Notepad to convert the file:

Open the dumped xml file
Save As...
Encoding: UTF-8
Save

Two examples of xml files are attached, one in ANSI encoding after dumping and the other converted to UTF-8, which can be restored by rrdtool: dump_de_ansi.xml.txt dump_de_utf-8.xml.txt

Additional info concerning the topic "Using UTF-8 Encoding (CHCP 65001) in Command Prompt / Windows Powershell": https://stackoverflow.com/questions/57131654/using-utf-8-encoding-chcp-65001-in-command-prompt-windows-powershell-window
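The ANSI-to-UTF-8 conversion described above can also be scripted. This is a minimal Python sketch, assuming the ANSI codepage of the system that produced the dump is known. The default cp936 is only an assumption taken from the chcp output reported earlier; a German system would more likely need cp1252. The function name `convert_dump_to_utf8` is hypothetical.

```python
from pathlib import Path

def convert_dump_to_utf8(src, dst, source_encoding="cp936"):
    """Re-encode a dumped XML file from a legacy codepage to UTF-8."""
    raw = Path(src).read_bytes()
    # Raises UnicodeDecodeError if source_encoding does not match the file
    text = raw.decode(source_encoding)
    # Work on bytes so that line endings are preserved unchanged
    Path(dst).write_bytes(text.encode("utf-8"))
```

After the conversion, the declared encoding="utf-8" in the XML header matches the actual bytes, so the file can be passed to rrdtool restore as usual.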

k79e commented 2 years ago

Finally I tried both UTF-8 without BOM and UCS-2 little endian, and both were successful.

I also found that Notepad++ has a bug: it doesn't read the xml using the system's locale. Also, changing the encoding in Notepad++ without selecting the correct codepage produces a badly encoded file.

c72578 commented 2 years ago

OK, good.

Concerning notepad++: The header of the xml file says it is utf-8, but in fact it is ANSI encoded. In this case, the encoding needs to be set manually in notepad++, using: Encoding - ANSI. As soon as the encoding is switched, the special characters are shown correctly (not xb1 xb1 xb1 anymore). Now you can use Encoding - Convert to UTF-8 and save the file.

c72578 commented 2 years ago

Alternative solution (instead of converting to utf-8 encoding): Correct the specified encoding at the top of the dumped xml file. Example:

<?xml version="1.0" encoding="utf-8"?>
->
<?xml version="1.0" encoding="ISO-8859-1"?>

Character Sets: https://www.iana.org/assignments/character-sets/character-sets.xhtml
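The header correction above can be sketched in Python as well. This is only an illustration under the assumption that the dump starts with a standard XML declaration containing an encoding attribute. The function name `set_declared_encoding` is hypothetical, and the encoding passed in must be the one the file is really stored in.

```python
import re

# Matches the encoding attribute inside the leading <?xml ... ?> declaration
_DECL = re.compile(rb'(<\?xml[^>]*?encoding=["\'])([^"\']+)(["\'])')

def set_declared_encoding(xml_bytes, encoding):
    """Return xml_bytes with the declared encoding replaced; content bytes are untouched."""
    return _DECL.sub(
        lambda m: m.group(1) + encoding.encode("ascii") + m.group(3),
        xml_bytes,
        count=1,
    )
```

Only the declaration changes; the rest of the file keeps its original bytes, so the parser is simply told which encoding it is actually reading.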

k79e commented 2 years ago

I changed the encoding in the xml header to gb2312 and it works.

(And I found that Notepad++ also uses that to auto-select the encoding when reading an xml file.)