libAtoms / extxyz

Extended XYZ specification and parsers
MIT License
16 stars 4 forks source link

Incorrect parsing of badly formatted lattice with 8 entries #6

Closed jameskermode closed 3 years ago

jameskermode commented 3 years ago

As reported at libAtoms/ExtXYZ.jl#12, this file contains a typo in the Lattice and should give an error:

[essswb@mnf144 ExtXYZ]$ cat test.xyz
8
Lattice="5.44 0.0 0.0 0.0 5.44 0.0 0.0 0.05.44" Properties=species:S:1:pos:R:3 Time=0.0
Si        0.00000000      0.00000000      0.00000000
Si        1.36000000      1.36000000      1.36000000
Si        2.72000000      2.72000000      0.00000000
Si        4.08000000      4.08000000      1.36000000
Si        2.72000000      0.00000000      2.72000000
Si        4.08000000      1.36000000      4.08000000
Si        0.00000000      2.72000000      2.72000000
Si        1.36000000      4.08000000      4.08000000

However, it is read successfully but gives a strange cell:

[essswb@mnf144 ExtXYZ]$ python3
Python 3.7.4 (default, Mar 26 2020, 11:57:44)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import extxyz
>>> extxyz.read("test.xyz")
Atoms(symbols='Si8', pbc=True, cell=[[5.44, 0.0, 0.0], [0.0, 5.44, 0.0], [0.0, 0.05, 0.44]])

For this to be the case there must be an error in the parser, or even in the grammar.

bernstei commented 3 years ago

This appears to be a bug in the grammar - the quoted whitespace-separated-float element is a Repeat(Regex..., mi=1)), which allows for a repeat without any whitespace, so it parses 0.050.44 as 0.050 followed by .44, I suspect. These are separated by a wordbreak regex \b. A List(,, delimiter=' ') doesn't seem to work. What does work for parsing is Sequence(Repeat(Regex(re_float+'\s'), mi=0), Regex(re_float))), but that breaks the C and python tree walking code, in different ways, because there's one additional depth nesting of container tree nodes.

I opened a pyleri issue https://github.com/transceptor-technology/pyleri/issues/19 to see whether this can be handled cleanly. The working option above can definitely be made to work, but requires special parsing of this new tree topology. A pyleri.List with whitespace delimiters seems much preferable.