libAtoms / extxyz

Extended XYZ specification and parsers
MIT License
16 stars 4 forks source link

Round-tripping string values that look like backward compatible arrays #5

Open Luthaf opened 3 years ago

Luthaf commented 3 years ago

I realized I was not very clear in my last email, so here is a separate issue, hopefully clearer.

I'm wondering about the possibility to round-trip string values, i.e. store some data in extended XYZ and then be able to read it again.

In particular, there is some ambiguity here between string values and backward compatible arrays. If the user asks to store the string 1 3 4 5 (with spaces inside) with the string key in an extended XYZ comment line, simply writing it as a key="1 3 4 5" is not enough, since that will be read later as an array of 4 integers. It is possible to use a single element {} array instead, i.e. write this as key={"1 3 4 5"}, which will then be read as intended by the parser. The example from my email was about a string containing a single number or boolean value (True/...) which can not be written directly as key="3", but the same applies for string that would look like arrays.

This means that any extended XYZ writer must check all string values to ensure they would parse back as string, and otherwise use the {} array trick to make sure the type of the value is preserved in the extended XYZ comment line.

Do you agree with my reading of the spec and is this the intended behaviour?

I fear it will make writing to extended XYZ a bit more complex, since now every software that needs to write to this format also needs to implement a (partial) parser for it; but I also understand this is allowed for backward compatibility reasons.

bernstei commented 3 years ago

I think your observation is correct, due to our unfortunate need for backward compatibility. And I can't think of any way to fix that right now, at least for strings that look like a single item of another primitive type (for things that look like arrays we could make the writer backslash escape the whitespace). You could backslash escape one character in all strings, but that still leaves you unable to represent the string 'n', because '\n' is special, which is ugly.

I was going to suggest using single quotes somehow, but then I realized that those are already supported by the fortran reader, so that sort of defines the backward compatibility requirements. Did we remember to support those in the new standard?

bernstei commented 3 years ago

Obviously we could do something like add a new label for an explicit string (e.g. key=S_"1 2 3"), but writers would either need to use it always, even though it's ugly, or know enough about the standard to decide when it's necessary.

BTW, did we actually make a conscious decision not to support single quotes in the new standard? FWIW the Fortran reader (C, really) currently supports it, if we think that defines our backward compatibility requirements.

Luthaf commented 3 years ago

Single quote string would solve the issue here, writer could always use single quote when writing strings that needs to be quoted.

jameskermode commented 3 years ago

I'm also happy to support single quoted strings. Need to check if they work in the current implementation, and update the spec.

bernstei commented 3 years ago

Are we OK with defining single quotes as strings only, even though the old Fortran (extxyz.c, really) implementation treats ', ", and {} equivalently?

jameskermode commented 3 years ago

Yes, I think that's fine - most writers will have used " for all arrays

bernstei commented 3 years ago

OK, let's add single quotes as a string-only container, and then writers can always use that for strings unambiguously.

gabor1 commented 3 years ago

This goes against the “human readability” a little bit. I would read ‘1 2 3’ as equivalent to “1 2 3”, and it would be confusing if it isn’t. From the original email I like key={"1 3 4 5”} to guarantee that it is read back as a string. then the writer doesn’t need a parser.

-- Gábor

Gábor Csányi Professor of Molecular Modelling Engineering Laboratory, University of Cambridge Pembroke College Cambridge

Pembroke College supports CARA. A Lifeline to Academics at Risk. http://www.cara.ngo/

On 6 Aug 2021, at 21:02, bernstei @.***> wrote:

OK, let's add single quotes as a string-only container, and then writers can always use that for strings unambiguously.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or unsubscribe. Triage notifications on the go with GitHub Mobile for iOS or Android.

noambernstein commented 3 years ago

I guess. I was hoping to basically deprecate the confusing aspects of the original syntax, though (arrays in quotes and single element arrays being scalars). This sort of enshrines it (at least the second) instead. I suppose it may still be better than single quotes.

By the way, I think in the old reader any number of nested quotes were equivalent. We’ve already chosen to break that much backward compatibility, But the old reader may well interpret the {“1 2 3“} as an array of ints anyway.