Parsing fails when system name contains unicode character

nomad-coe / nomad-parser-vasp

This is a NOMAD parser for VASP. It will read VASP input and output files and provide all information in NOMAD's unified Metainfo based Archive format.

Apache License 2.0


Parsing fails when system name contains unicode character #7

Open ondracka opened 3 years ago

ondracka commented 3 years ago

Another case I found when browsing the VASP entries in our Oasis with failed parsing.

I have no idea what this is (this specific character looks like the Unicode control code for line feed) or how the user managed to get it there (some sort of copy-paste?). See the attached INCAR and the first 100 lines of vasprun.xml (the file itself is 75 MB, but I can provide it if needed). So in theory it might be an improper INCAR (it is probably supposed to be ASCII only?), but VASP itself does not complain; it's just the XML parsing which fails with:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/xml/sax/expatreader.py", line 217, in feed
    self._parser.Parse(data, isFinal)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 12, column 45

the problematic line is <i type="string" name="SYSTEM"> C-B-Fe-Mo-P</i>
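
The failure mode can be reproduced outside NOMAD with a minimal sketch. The exact character in the uploaded file is not identified in this thread, so the vertical tab (U+000B) below is only an assumed stand-in for a control character that XML 1.0 forbids, and `try_parse` is a hypothetical helper, not part of the parser.

```python
import xml.parsers.expat


def try_parse(data: bytes):
    """Return None if expat accepts the document, else the error message."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(data, True)  # True marks this as the final chunk
    except xml.parsers.expat.ExpatError as err:
        return str(err)
    return None


# Same element as in the report, once clean and once with an assumed
# control character (U+000B) in front of the system name.
clean = b'<i type="string" name="SYSTEM"> C-B-Fe-Mo-P</i>'
broken = b'<i type="string" name="SYSTEM">\x0b C-B-Fe-Mo-P</i>'

print(try_parse(clean))
print(try_parse(broken))
```

The second call should report the same kind of "not well-formed (invalid token)" error as the traceback above, since expat rejects the character before any handler sees the element content.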

I found just a few cases like this, so feel free to close as invalid if you feel this is something the parser can't reasonably handle.

files.zip

markus1978 commented 3 years ago

There are two ways to solve this. We either fix the encoding of the file (assuming that it can be rewritten as UTF-8 and that the result turns out to be valid XML), or we make the XML parser deal with the bad encoding.

NOMAD tries to fix the file encoding if it thinks it is not UTF-8, but there is some guesswork involved. Obviously, it's not working in this case. Maybe we can improve this based on the given example.

I assume it is failing completely? Or is it producing results and just giving you an error in the logs? We tried to make the parser as generous as possible. Maybe we can do some more. I'll have a look at the provided example.

Ideally, users would fix the encoding before upload. Let's say you know what encoding is used for the INCAR (e.g. latin-1). VASP simply copies the non-UTF-8 encoded characters into the .xml byte by byte. Maybe you can transcode the XML from whatever non-UTF-8 encoding it uses (e.g. latin-1) into UTF-8. This might turn the character into one that is legal XML. All of this assumes that it is actually just a file encoding problem.
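
A minimal sketch of that transcoding step, assuming the source encoding is known to be latin-1. The helper name `transcode_to_utf8` and the sample byte `0xE9` ("é" in latin-1) are illustrative assumptions, not taken from the attached files.

```python
import xml.etree.ElementTree as ET


def transcode_to_utf8(raw: bytes, source_encoding: str = "latin-1") -> bytes:
    """Reinterpret the bytes in the assumed source encoding and write UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# A lone 0xE9 byte is "é" in latin-1 but an incomplete sequence in UTF-8.
raw = b'<i type="string" name="SYSTEM">Fe-\xe9</i>'

try:
    ET.fromstring(raw)  # no XML declaration, so the parser assumes UTF-8
    parsed_as_is = True
except ET.ParseError:
    parsed_as_is = False

element = ET.fromstring(transcode_to_utf8(raw))
print(parsed_as_is, element.text)
```

This only helps when the offending bytes are ordinary characters in the source encoding. A C0 control character such as U+000B is the same single byte in latin-1 and UTF-8, so transcoding alone would leave it in place and the XML would still be rejected.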

ondracka commented 3 years ago

> I assume it is failing completely? Or is it producing results and just giving you an error in the logs? We tried to make the parser as generous as possible. Maybe we can do some more. I'll have a look at the provided example.

The archive is almost completely empty, except for the program name, version, basis set type (just the stuff which is in the xml before the bad character), so it looks like it just fails when it gets to it.

> Ideally, users would fix the encoding before upload. Let's say you know what encoding is used for the INCAR (e.g. latin-1). VASP simply copies the non-UTF-8 encoded characters into the .xml byte by byte. Maybe you can transcode the XML from whatever non-UTF-8 encoding it uses (e.g. latin-1) into UTF-8. This might turn the character into one that is legal XML. All of this assumes that it is actually just a file encoding problem.

Let me know if you need a full example.

markus1978 commented 3 years ago

I added a filter for characters that are not legal in XML 1.0. It basically removes all illegal characters. I am not yet sure about the performance implications, because I have to decode, filter, and re-encode the stream. I still need to convince Python's xml.sax to consume text IO directly. Anyhow, if you want to give it a spin in the meantime, it's on the branch remove-illegal-xml-characters.
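
For readers who want to see the idea in isolation, here is a rough sketch of such a filter. It is not the code on the branch: the names `_ILLEGAL_XML_CHARS` and `strip_illegal_xml_chars` are hypothetical, the whole input is processed in memory rather than as a stream, and undecodable bytes are assumed to be acceptable as U+FFFD replacements. The character class is the complement of the `Char` production in the XML 1.0 specification.

```python
import re
import xml.etree.ElementTree as ET

# Everything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_illegal_xml_chars(raw: bytes, encoding: str = "utf-8") -> bytes:
    """Decode, drop characters XML 1.0 forbids, and encode again."""
    # Assumption: bytes that cannot be decoded become U+FFFD, which is legal XML.
    text = raw.decode(encoding, errors="replace")
    return _ILLEGAL_XML_CHARS.sub("", text).encode(encoding)


# U+000B is an assumed stand-in for the unidentified character in the report.
broken = b'<i type="string" name="SYSTEM">\x0b C-B-Fe-Mo-P</i>'
cleaned = strip_illegal_xml_chars(broken)
element = ET.fromstring(cleaned)
print(repr(element.text))
```

The decode and encode round trip is the cost mentioned above: for a 75 MB vasprun.xml the full text is materialized once as `str` and once more as `bytes`, which is why feeding a filtered text stream to the parser directly would be preferable.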

ondracka commented 3 years ago

The remove-illegal-xml-characters branch makes it work fine here. Let me know if you need more testing (or some performance comparisons).