ioccc-src / mkiocccentry

Form an IOCCC entry as a compressed tarball file

Enhancement: finish the C-based general JSON parser #156

Closed: xexyl closed this issue 1 year ago

xexyl commented 2 years ago

As I said in the other thread, I am also just typing on my phone and I am about to go get some sleep, but I can answer your questions tomorrow, @lcn2.

As I also said I will be gone most of Saturday and probably will take a few days to recover.

Finally feel free to edit the title and this message or else let me know what you think is a good start.

We can then discuss what I have so far and how we should proceed with the parser.

I will say quickly before I go that I have held back a bit to see where the structs (or the functions acting on them) go, as I think it will be helpful to have something more to work on. Once the structs are sorted it should be easier to actually write the rules and actions in the lexer and parser.

I am not sure, but there is very possibly missing grammar too.

Hope this is a good start for the issue. Have a good rest of your day and I look forward to seeing more here, my friend!

Have a safe trip home when you go back.

TODO:

xexyl commented 2 years ago

Similarly I've thought to change 'zeroise' to something else. In some cases it could be 'clear' or 'cleared' but in other cases the sentence might have to be changed somewhat. For the boolean in the dynamic array facility it could be zeroed or cleared or empty or something like that.

What do you think? These are only cosmetic changes but might be worth considering on grounds of making it cleaner.

There is a good reason why the dynamic array facility does not automatically zeroize data. It is used in other applications where multi-terabyte arrays are being managed. Running calloc() or using memset() to zeroize sections of memory causes massive churn in the memory / VM system.

Sorry. I meant clearer. I was talking about the word only.

So the dynamic array facility won't zeroize by default.

TODO: We need to add macros to dyn_alloc.h for backward compatibility, BTW.

Sounds good.

Clearing a dynamic array has a very different meaning than zeroing data. It means removing the accumulated data in a dynamic array, with an optional zeroize if the dynamic array was initially set up with that mode enabled.

Good point. But it still could be zeroed or something like that. I'm only talking about using words that are in the dictionary. Or is zeroi[sz]e actually a computing term that I'm unfamiliar with? I've seen different ways of saying the same thing but never this one.

We know that zeroize is not in some dictionaries, however zeroize is a perfectly cromulent word. :-)

And that's what I'm getting at exactly. As you're sure to know I don't use Merriam-Webster - I use the OED, and if I didn't use that I would use another British English one - so I've never heard of it. More to the point, though, I was thinking of this for vim spelling more than anything else - though other reasons apply as well.

Most dictionaries lag behind common usage: they attempt to describe (instead of prescribe) the language as of the date of their publication.

So not having a definition for such a term of art as zeroize is not surprising. :-)

As a logophile who's read the dictionary since childhood (and constantly loses track of time because of getting lost in the dictionary) I'm aware of how lexicographers work, I can assure you :)

xexyl commented 2 years ago

As a logophile who's read the dictionary since childhood (and constantly loses track of time because of getting lost in the dictionary) I'm aware of how lexicographers work, I can assure you :)

That being said I am thinking that most system dictionaries won't have it either. Perhaps this should be discussed in the preparing for an official release thread though. Or even another thread on language: I don't know.

xexyl commented 2 years ago

As for the string regex, I'm starting to wonder if I need a more elaborate way to go about it. Perhaps it doesn't work because flex's regex doesn't support all the features used in the regex. The book doesn't show all these features, for example. There might be more in a chapter I haven't read yet, of course, but I see in the manual the following way to capture C strings:

         %x str

         %%
                 char string_buf[MAX_STR_CONST];
                 char *string_buf_ptr;

         \"      string_buf_ptr = string_buf; BEGIN(str);

         <str>\"        { /* saw closing quote - all done */
                 BEGIN(INITIAL);
                 *string_buf_ptr = '\0';
                 /* return string constant token type and
                  * value to parser
                  */
                 }

         <str>\n        {
                 /* error - unterminated string constant */
                 /* generate error message */
                 }

         <str>\\[0-7]{1,3} {
                 /* octal escape sequence */
                 int result;

                 (void) sscanf( yytext + 1, "%o", &result );

                 if ( result > 0xff )
                         /* error, constant is out-of-bounds */

                 *string_buf_ptr++ = result;
                 }

         <str>\\[0-9]+ {
                 /* generate error - bad escape sequence; something
                  * like '\48' or '\0777777'
                  */
                 }

         <str>\\n  *string_buf_ptr++ = '\n';
         <str>\\t  *string_buf_ptr++ = '\t';
         <str>\\r  *string_buf_ptr++ = '\r';
         <str>\\b  *string_buf_ptr++ = '\b';
         <str>\\f  *string_buf_ptr++ = '\f';

         <str>\\(.|\n)  *string_buf_ptr++ = yytext[1];

         <str>[^\\\n\"]+        {
                 char *yptr = yytext;

                 while ( *yptr )
                         *string_buf_ptr++ = *yptr++;
                 }

This deals with escape characters but also finds the final unescaped ". I don't really know; I might have to do a bit more research on it. That said, I'm done with this for the day. Hopefully I can look at it more tomorrow, or the next day if not tomorrow. Now would be a good time to work on this, since the other features needed to complete the rest of the parser are not yet finished. I only just thought of that, so hopefully tomorrow I can get at it. I'd like it to be a single regex, but from what I've seen in the book as well as the manual it might not be possible so simply.

lcn2 commented 2 years ago

Which would you prefer, adding a strip boolean argument to the JSON string conversion functions, or using a wrapper function? Probably adding the strip boolean as a final argument to those functions would be cleaner.

I think it would be cleaner to add it to the function as well, yes.

With commit eed3230e2be51a53f454ad25a00378490619c786:

Added quote arg to JSON string conversion

Added quote arg to:

    extern struct json *calloc_json_conv_string(char const *str, size_t len, bool strict, bool quote);
    extern struct json *calloc_json_conv_string_str(char const *str, size_t *retlen, bool strict, bool quote);

As in:

/*
 ...
 *      quote   true ==> ignore JSON double quotes, both str[0] & str[len-1] must be "
 *              false ==> the entire str is to be converted
 ...
 */

If calloc_json_conv_string() or calloc_json_conv_string_str() is called
with quote == true, then if str is not enclosed in JSON double-quotes,
a warning will be issued AND the conversion will not be performed AND
converted will be set to false.

We hope this helps, @xexyl.
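For illustration, a caller might use the new quote argument like this (a sketch based only on the prototypes above; how the converted flag is read back depends on struct json internals not shown here):

    #include <string.h>

    /* sketch: inside some caller */
    char const *encoded = "\"hello\\tworld\"";   /* includes the enclosing "s */
    struct json *node = calloc_json_conv_string(encoded, strlen(encoded),
                                                false, true);
    /*
     * had encoded not been enclosed in JSON double quotes, a warning
     * would be issued and converted set to false, as described above
     */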

xexyl commented 2 years ago

Which would you prefer, adding a strip boolean argument to the JSON string conversion functions, or using a wrapper function? Probably adding the strip boolean as a final argument to those functions would be cleaner.

I think it would be cleaner to add it to the function as well, yes.

[quoting the commit eed3230 details above]

Sounds good and it will be of help when I get to that point. Thank you!

lcn2 commented 2 years ago

[quoting the flex manual C string scanner example from the earlier comment]

We see an issue with the above code as it does not match the JSON spec.

JSON has very specific backslash escape sequences for JSON encoded strings:

\"
\\
\/
\b
\f
\n
\r
\t
\uXXXX

In the \uXXXX case, X is an ASCII [0-9a-fA-F] character, and there MUST be 4 of them.

The point is that JSON only allows certain backslash escape sequences in JSON encoded strings, and this differs from C syntax. All other backslash escapes are invalid, including C style \0 octal escapes and C style \x hex escapes.
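For example, that rule is small enough to check with a helper (an illustrative sketch, not repo code; is_valid_json_escape() is a hypothetical name, and p points at the byte just after the backslash):

    #include <stdbool.h>
    #include <ctype.h>

    static bool
    is_valid_json_escape(char const *p)
    {
        switch (*p) {
        case '"': case '\\': case '/': case 'b':
        case 'f': case 'n': case 'r': case 't':
            return true;        /* the 8 short escapes */
        case 'u':               /* \uXXXX needs exactly 4 hex digits */
            return isxdigit((unsigned char)p[1]) &&
                   isxdigit((unsigned char)p[2]) &&
                   isxdigit((unsigned char)p[3]) &&
                   isxdigit((unsigned char)p[4]);
        default:
            return false;       /* everything else, including \0 and \x */
        }
    }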

xexyl commented 2 years ago

[quoting the flex manual example and the JSON escape list from the comments above]

You misunderstand. This was from the flex manual, on C strings. I was just saying that I am wondering if I will have to do something like this for JSON because of the complexities involved (which might be more complex than C).

lcn2 commented 2 years ago

[quoting the flex manual example, the JSON escape list, and the clarification above]

Sorry (tm Canada :-) )

xexyl commented 2 years ago

[quoting the exchange above]

Easy enough to do when you're not perfect and sadly as my late grandfather said: 'There are hardly any of us perfect people left...' Well he didn't really believe it but it always made us laugh.

I'll be looking at this another day either way. I hope I can do more on this tomorrow but we'll see.

lcn2 commented 2 years ago

In the JSON spec, we see that JSON strings:

string
    '"' characters '""

may be an empty string:

characters
    ""

This has been addressed in commit 92df12774f7f187fc37466776a3e4d96385f241e.

UPDATE: This is valid JSON:

"empty-JSON-strings-are-OK" : ""

xexyl commented 2 years ago

In the JSON spec, we see that JSON strings:

string
    '"' characters '""

may be an empty string:

characters
    ""

This has been addressed in commit 92df127.

You know I saw that too but it never registered as a problem to me for some reason: I was focusing on the lexer/parser and not the decoding routines.

If you have any idea of how a regex might work or if you see anything wrong with the current one I'm definitely open to suggestions. I think I'll have to brush up on more advanced regex and see if I can determine if flex actually supports these features.

lcn2 commented 2 years ago

In the JSON spec, we see that JSON strings:

string
    '"' characters '""

...

characters
     ""
    '0020' . '10FFFF' - '"' - '\'

We interpret this to mean that JSON strings MUST backslash escape characters in this range:

[\x00-\x1F]

The way they are escaped depends on the character. In some cases, such as \x09, there is a short \t escape. In most cases the full \uXXXX form must be used.

The code does this; however, with commit 3477f4edd0afc7fc974dcb47f8548175b9000439 a comment was clarified.
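As a sketch of that rule (a hypothetical helper, not the repo's encoder):

    #include <stdio.h>

    /* print the JSON escape for a control byte c in [0x00,0x1F] */
    static void
    emit_ctrl_escape(unsigned char c)
    {
        switch (c) {
        case 0x08: printf("\\b"); break;
        case 0x09: printf("\\t"); break;    /* e.g. \x09 has a short form */
        case 0x0a: printf("\\n"); break;
        case 0x0c: printf("\\f"); break;
        case 0x0d: printf("\\r"); break;
        default:   printf("\\u%04x", c);    /* no short form exists */
        }
    }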

lcn2 commented 2 years ago

In the JSON spec, we see that JSON strings:

string
    '"' characters '""

may be an empty string:

characters
    ""

This has been addressed in commit 92df127.

You know I saw that too but it never registered as a problem to me for some reason: I was focusing on the lexer/parser and not the decoding routines.

If you have any idea of how a regex might work or if you see anything wrong with the current one I'm definitely open to suggestions. I think I'll have to brush up on more advanced regex and see if I can determine if flex actually supports these features.

We think all you need to do is to indicate that JSON strings have zero or more characters between the "'s.

We don't recall your original regex, but using * instead of + for the set of encodings between the enclosing "'s should do the trick.
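Concretely, the shape would be something like this (a sketch, not the actual rule; the alternation stands in for whatever per-character pattern is used):

    JTYPE_STRING    \"([^"\\\n]|\\.)*\"

The trailing * (rather than +) is what lets the empty string "" match, with zero characters between the enclosing quotes.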

lcn2 commented 2 years ago

Based on the JSON spec, this is valid JSON:

    "" : "why you would want to do this, is questionable, but it seems to be allowed by JSON"

I.e., the name in a JSON member can be empty for some odd reason.

lcn2 commented 2 years ago

In the JSON spec, we see that JSON strings may consist of the following:

string
    '"' characters '""

...

characters
     ""
    '0020' . '10FFFF' - '"' - '\'

This implies UTF-8.

So in JSON strings, this is allowed:

    "UTF-8-is-OK" : "å∫ç∂´ƒ©˙ˆ∆˚¬µ˜øπœ®ß†¨√∑≈¥Ω"

The malloc_json_encode() function turns the above value into the following:

\u00c3\u00a5\u00e2\u0088\u00ab\u00c3\u00a7\u00e2\u0088\u0082\u00c2\u00b4\u00c6\u0092\u00c2\u00a9\u00cb\u0099\u00cb\u0086\u00e2\u0088\u0086\u00cb\u009a\u00c2\u00ac\u00c2\u00b5\u00cb\u009c\u00c3\u00b8\u00cf\u0080\u00c5\u0093\u00c2\u00ae\u00c3\u009f\u00e2\u0080\u00a0\u00c2\u00a8\u00e2\u0088\u009a\u00e2\u0088\u0091\u00e2\u0089\u0088\u00c2\u00a5\u00ce\u00a9

This is technically OK. However, the malloc_json_decode() should not object to valid UTF-8 characters that are not \uXXXX escaped.

This seems to be a bug in malloc_json_decode() that needs to be addressed.
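A byte oriented decoder can stay UTF-8 clean by rejecting only what JSON actually forbids. A minimal sketch of the idea (not malloc_json_decode() itself):

    #include <stdbool.h>

    /*
     * May byte c appear unescaped inside a JSON string?  The " and \
     * bytes are handled separately by the tokenizer; control bytes must
     * be escaped; bytes >= 0x80 are UTF-8 lead/continuation bytes and
     * should be passed through untouched.
     */
    static bool
    byte_ok_unescaped(unsigned char c)
    {
        if (c <= 0x1f) {
            return false;       /* must be \uXXXX or a short escape */
        }
        return true;            /* printable ASCII and UTF-8 bytes OK */
    }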

xexyl commented 2 years ago

Based on the JSON spec, this is valid JSON:

    "" : "why you would want to do this, is questionable, but it seems to be allowed by JSON"

I.e., the name in a JSON member can be empty for some odd reason.

That seems incredibly odd indeed, and I think rather pointless too. It's true I'm not an expert in JSON, but it seems it'd be like writing, in C or some other language:

= 5;

...which is nonsense.

xexyl commented 2 years ago

[quoting the UTF-8 comment above]

I suppose this also means that what I had in mind for the parser also has to change?

lcn2 commented 2 years ago

We are questioning the strict / -S mode.

While this blog entry suggests additional encoding for languages such as Go and the web, we don't need to consider that for this repo.

What would you think about removing strict mode altogether?

xexyl commented 2 years ago

[quoting the MUST backslash escape comment above]

Interesting. That makes me doubt the regex I have which I'll include in reply to the other comment.

The way they are escaped depends on the character. In some cases, such as \x09, there is a short \t escape. In most cases the full \uXXXX form must be used.

The code does this; however, with commit 3477f4e a comment was clarified.

Good to know.

xexyl commented 2 years ago

We are questioning the strict / -S mode.

While this blog entry suggests additional encoding for languages such as Go and the web, we don't need to consider that for this repo.

What would you think about removing strict mode altogether?

Interesting thought. Would this have any effect when making the json parser separate (after the IOCCCMOCK goes as planned)? If not I don't see why it would be needed, and removing it would simplify the code, which is always a good thing (well, okay - unless you're submitting an IOCCC entry maybe).

lcn2 commented 2 years ago

= 5;

...which is nonsense.

Agreed.

Thankfully, this is not valid JSON:

{ : 5 }

while this is valid JSON for some reason:

{ "" : 5 }

xexyl commented 2 years ago

= 5;

...which is nonsense.

Agreed.

I'm glad you agree. The burning question I have is why does the JSON spec not?

Thankfully, this is not valid JSON:

{ : 5 }

while this is valid JSON for some reason:

{ "" : 5 }

Very bizarre, and I have to ask: is there much of a difference even? I mean both have no name, so what is the difference between an empty string and no string? The content is still of zero length.

xexyl commented 2 years ago

[quoting the empty string discussion and the regex suggestion above]

This is what I actually have:

JTYPE_STRING            "(\"(((?=\\)\\([\"\\\/bfnrt]|u[0-9a-fA-F]{4}))|[^\"\\\0-\x1F\x7F]+)*\")"

Any suggestions? I suppose what you discovered a short while ago about the unicode/escaped chars might change it, if nothing else does (besides it not working right, that is). Thank you.

EDIT: I did start adding some extra tokens to make it easier to parse the regex but I haven't used them yet.

xexyl commented 2 years ago
[quoting the { "" : 5 } exchange above]

Also for the parser what does this mean for name : value pairs? For the contest this won't matter but for others it might?

lcn2 commented 2 years ago

We are questioning the strict / -S mode. While this blog entry suggests additional encoding for languages such as Go and the web, we don't need to consider that for this repo. What would you think about removing strict mode altogether?

Interesting thought. Would this have any effect when making the json parser separate (after the IOCCCMOCK goes as planned)? If not I don't see why it would be needed, and removing it would simplify the code, which is always a good thing (well, okay - unless you're submitting an IOCCC entry maybe).

Right now, there is no harm in someone entering UTF-8 characters into mkiocccentry as malloc_json_encode_str() just encodes non-ASCII characters using \uXXXX.

The malloc_json_decode() function processes characters as bytes and would pass them along as it found them.

Observe what the json tools do now:

$ ./jstrencode "å∫ç∂´ƒ©˙ˆ∆˚¬µ˜øπœ®ß†¨√∑≈¥Ω"
\u00c3\u00a5\u00e2\u0088\u00ab\u00c3\u00a7\u00e2\u0088\u0082\u00c2\u00b4\u00c6\u0092\u00c2\u00a9\u00cb\u0099\u00cb\u0086\u00e2\u0088\u0086\u00cb\u009a\u00c2\u00ac\u00c2\u00b5\u00cb\u009c\u00c3\u00b8\u00cf\u0080\u00c5\u0093\u00c2\u00ae\u00c3\u009f\u00e2\u0080\u00a0\u00c2\u00a8\u00e2\u0088\u009a\u00e2\u0088\u0091\u00e2\u0089\u0088\u00c2\u00a5\u00ce\u00a9

$ ./jstrdecode "å∫ç∂´ƒ©˙ˆ∆˚¬µ˜øπœ®ß†¨√∑≈¥Ω"
å∫ç∂´ƒ©˙ˆ∆˚¬µ˜øπœ®ß†¨√∑≈¥Ω

However:

$ ./jstrdecode -S "å∫ç∂´ƒ©˙ˆ∆˚¬µ˜øπœ®ß†¨√∑≈¥Ω"
Warning: malloc_json_decode: strict encoding at 0 found unescaped char: 0xc3
Warning: main: error while encoding processing arg: 0

If we remove the strict mode, then that UTF-8 problem goes away.
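For reference, the byte-wise encoding shown above amounts to something like this sketch (not the actual malloc_json_encode_str(); the " and \ cases are omitted for brevity):

    #include <stdio.h>

    /* emit each non-ASCII or control byte as its own \u00XX escape */
    static void
    encode_bytewise(unsigned char const *s)
    {
        while (*s != '\0') {
            if (*s >= 0x80 || *s <= 0x1f) {
                printf("\\u%04x", (unsigned int)*s);
            } else {
                putchar(*s);
            }
            ++s;
        }
    }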

xexyl commented 2 years ago

[quoting the strict mode discussion and the jstrencode / jstrdecode output above]

I guess it should be removed then? It would simplify the code, and as long as it won't have any effect on the parser I don't see why it should be there. I can in time remove the strict option from jinfochk and jauthchk once I have a better idea what that looks like, though as I recall it's only used in one place right now - checking for the initial and final {/}s.

xexyl commented 2 years ago

[quoting the strict mode discussion above]

A possibility is to allow the option to remain but add an XXX note about why it's always false; this way it can be tested later on when everything is complete. The idea here: don't remove something that might possibly be needed until 100% sure it's not needed.

xexyl commented 2 years ago

[quoting the regex discussion above]

Actually if you don't have any thoughts there I think I might start over by adding one part at a time to see where it fails. I could start doing that tomorrow most likely. Perhaps that's a better idea anyway as this is really complex and confusing.

lcn2 commented 2 years ago

[quoting the regex discussion above]

We will address the larger issue in our next comment ...

BTW: That JTYPE_STRING might not be valid, at least perl says:

Invalid [] range "0-\x1F" in regex; marked by <-- HERE in m/("(((?=\)\(["\\/bfnrt]|u[0-9a-fA-F]{4}))|[^"\\x00-\x1F <-- HERE \x7F]+)*")/ at /Users/chongo/bench/sample/grep.pl line 46, <> line 1.

Perhaps this works:

JTYPE_STRING         "(\"(((?=\\)\\([\"\\\/bfnrt]|u[0-9a-fA-F]{4}))|[^\"\x00-\x1F\x7F]+)*\")"

If you recall the grep.pl script, the above regex allows for an empty string of the form:

""

xexyl commented 2 years ago

[quoting the regex discussion above]

Thanks.

[quoting the perl note and the suggested regex above]

Unfortunately it still doesn't work. I'm not sure what part of it is a problem. I initially had:

JTYPE_STRING    \"[^"\n]*\"

but JSON is more strict than that.

lcn2 commented 2 years ago

[quoting the strict mode discussion above]

We are not sure whether bison and flex handle UTF-8 characters. We suspect they are UTF-8 OK.

xexyl commented 2 years ago

[quoting the strict mode discussion above]

I was actually wondering about that as well.

EDIT: A quick search suggests UTF-8 might be okay but Unicode not. That was for flex; didn't check bison.

lcn2 commented 2 years ago

Unfortunately it still doesn't work. I'm not sure what part of it is a problem. I initially had:

JTYPE_STRING    \"[^"\n]*\"

but JSON is more strict than that.

By "still doesn't work" do you mean that the regex of flex does not understand:

"(\"(((?=\\)\\([\"\\\/bfnrt]|u[0-9a-fA-F]{4}))|[^\"\x00-\x1F\x7F]+)*\")"

???

xexyl commented 2 years ago

Unfortunately it still doesn't work. I'm not sure what part of it is a problem. I initially had:

JTYPE_STRING    \"[^"\n]*\"

but JSON is more strict than that.

By "still doesn't work" do you mean that the regex of flex does not understand:

"(\"(((?=\\)\\([\"\\\/bfnrt]|u[0-9a-fA-F]{4}))|[^\"\x00-\x1F\x7F]+)*\")"

???

Take the file:


{

"foo" : "bar"
}

Now running ./jparse on it will result in:

$ ./jparse  json 
debug[0]: Calling parse_json_file("json"):
debug[0]: *** BEGIN PARSE:
'

{

"foo" : "bar"
}

'
Starting parse
Entering state 0
Stack now 0
Reading a token

whitespace: '
'

open brace: '{'
Next token is token JTYPE_OPEN_BRACE ()
Shifting token JTYPE_OPEN_BRACE ()
Entering state 1
Stack now 0 1
Reading a token

whitespace: '

'
"foo"
whitespace: ' '

equals/colon: ':'
Next token is token JTYPE_COLON ()
LAC: initial context established for JTYPE_COLON
LAC: checking lookahead JTYPE_COLON: Err
Constructing syntax error message
LAC: checking lookahead end of file: Err
LAC: checking lookahead JTYPE_OPEN_BRACE: Err
LAC: checking lookahead JTYPE_CLOSE_BRACE: S14
LAC: checking lookahead JTYPE_OPEN_BRACKET: Err
LAC: checking lookahead JTYPE_CLOSE_BRACKET: Err
LAC: checking lookahead JTYPE_COMMA: Err
LAC: checking lookahead JTYPE_COLON: Err
LAC: checking lookahead JTYPE_NULL: Err
LAC: checking lookahead JTYPE_STRING: S15
LAC: checking lookahead JTYPE_UINTMAX: Err
LAC: checking lookahead JTYPE_INTMAX: Err
LAC: checking lookahead JTYPE_LONG_DOUBLE: Err
LAC: checking lookahead JTYPE_BOOLEAN: Err
JSON parser error (num errors: 1) on line 4: syntax error, unexpected JTYPE_COLON, expecting JTYPE_CLOSE_BRACE or JTYPE_STRING
Error: popping token JTYPE_OPEN_BRACE ()
Stack now 0
Cleanup: discarding lookahead token JTYPE_COLON ()
Stack now 0
debug[0]: *** END PARSE

So it doesn't see the string.

EDIT: Strings should be prefixed with string: btw.

I'm not sure if it's the regex itself or a problem with flex not supporting it - that's making it harder to troubleshoot as well.

lcn2 commented 2 years ago

EDIT: A quick search suggests UTF-8 might be okay but Unicode not. That was for flex; didn't check bison.

It is fine that Unicode is not OK. The UTF-8 works well.

xexyl commented 2 years ago

EDIT: A quick search suggests UTF-8 might be okay but Unicode not. That was for flex; didn't check bison.

It is fine that Unicode is not OK. The UTF-8 works well.

That's good at least.

lcn2 commented 2 years ago

[quoting the jparse debug output above]

Why not use the simpler:

JTYPE_STRING    \"[^"\n]*\"

and let the malloc_json_decode() function (via calloc_json_conv_string() and calloc_json_conv_string_str()) do the work?

If the parser observes that converted is false, then it can raise an error.
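In the scanner, that could look something like the following sketch (the action body, the yylval member name, and the converted test are illustrative, not the actual jparse.l):

    JTYPE_STRING    \"[^"\n]*\"

    %%

    {JTYPE_STRING}  {
                /* let the conversion routines validate the escapes */
                yylval.node = calloc_json_conv_string(yytext, (size_t)yyleng,
                                                      false, true);
                /*
                 * if the conversion failed (converted == false),
                 * the parser can raise an error at this point
                 */
                return JTYPE_STRING;
            }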

xexyl commented 2 years ago

[quoting the jparse debug output and the simpler regex suggestion above]

That's a great idea. I guess I forgot about that functionality. Thanks!

lcn2 commented 2 years ago

EDIT: A quick search suggests UTF-8 might be okay but Unicode not. That was for flex; didn't check bison.

It is fine that Unicode is not OK. The UTF-8 works well.

By this we mean that using UTF-8 will work for the IOCCC. This presumes, however, that bison and flex will be OK with UTF-8, and if that is true, then all is well on this issue.

lcn2 commented 2 years ago

So, OK to remove strict mode and use of -S, @xexyl?

xexyl commented 2 years ago

EDIT: A quick search suggests UTF-8 might be okay but Unicode not. That was for flex; didn't check bison.

It is fine that Unicode is not OK. The UTF-8 works well.

By this we mean that using UTF-8 will work for the IOCCC. This presumes, however, that bison and flex will be OK with UTF-8, and if that is true, then all is well on this issue.

Of course. I guess we should verify for sure that both flex and bison work with UTF-8.

xexyl commented 2 years ago

So, OK to remove strict mode and use of -S, @xexyl?

I think so, but perhaps the strict parsing could be commented out instead of removed outright, for documentation purposes as well as in case it's ever needed? Not sure: I only bring it up as an idea that might or might not have value. Either way I think removing it is good too.

lcn2 commented 2 years ago

For now, the fact that malloc_json_encode() encodes non-ASCII bytes as \uXXXX could be considered lazy coding. While it would produce encoded strings that might be longer than needed, it does not violate JSON.

The malloc_json_decode() in non-strict mode does not object to non-ASCII bytes, so that is UTF-8 OK as well.

The strict mode was mainly to enforce the suggestions by this blog post, which are not needed.

xexyl commented 2 years ago

I was about to commit this but something occurred to me.

Why not use the simpler:

JTYPE_STRING    \"[^"\n]*\"

and let the malloc_json_decode() function (via calloc_json_conv_string() and calloc_json_conv_string_str()) do the work? If the parser observes that converted is false, then it can raise an error.

That's a great idea. I guess I forgot about that functionality. Thanks!

Does the fact that there can be empty strings necessitate changing this regex in any way you can think of? I'm not sure right now. I was thinking possibly:

JTYPE_STRING            \"[^\n]*\"

I.e. removing the exclusion of ". But then I can imagine how this might also cause a different problem. I'm not sure, and that is why I am done with this for the day. I can reply to other comments but I won't try working on the parser/scanner today.
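One caution with that idea: flex takes the longest match, so without the " exclusion a single JTYPE_STRING pattern could swallow several strings at once. For example, given the input:

    "name" : "value"

the pattern \"[^\n]*\" would match the entire line as one token, from the first " to the last ". The empty string "" already matches \"[^"\n]*\", since * permits zero characters between the quotes, so the empty string case should not require a change.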

xexyl commented 2 years ago

[quoting the lazy coding comment above]

Sounds good.

xexyl commented 2 years ago

[quoting the above exchange]

I'll remove the -S option from the jinfochk and jauthchk tools later on. That very possibly will be tomorrow as it shouldn't take much effort.

With a bit of luck I will be more clear headed and alert and then I can also work out the string regex too by testing different inputs.

EDIT: The man pages will have to be updated as well: just remembered this.

lcn2 commented 2 years ago

I was about to commit this but something occurred to me.

Why not use the simpler:

JTYPE_STRING    \"[^"\n]*\"

and let the malloc_json_decode() function (via calloc_json_conv_string() and calloc_json_conv_string_str()) do the work? if the parser observes that converted is false, then it can raise an error.

That's a great idea. I guess I forgot about that functionality. Thanks!

Does the fact there can be empty strings necessitate changing this regex in any way you can think of? I'm not sure right now. I was thinking possibly:

JTYPE_STRING            \"[^\n]*\"

I.e. removed the exclusion of ". But then I can imagine how this might also cause a different problem. I'm not sure and that is why I am done for the day with this. I can reply to other comments but I won't try working on the parser/scanner today.

The issue is how to get the parser to process this as a single JSON string:

"this is \"OK\" in JSON"

Humm ....
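One way out (a sketch, not necessarily the final rule) is to let the pattern treat a backslash followed by any character as opaque, so that an escaped \" cannot terminate the match, and leave validating which escapes are legal JSON to the conversion routine:

    JTYPE_STRING    \"([^"\\\n]|\\.)*\"

Here [^"\\\n] matches ordinary characters and \\. matches any backslash escape, including \", so the scan only stops at a genuine unescaped closing quote.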

xexyl commented 2 years ago

[quoting the escaped quote question above]

That's actually one of the things I had thought of earlier on. I'm not sure how to address it. At least not yet.

lcn2 commented 2 years ago

The flex man page reads:

       -8, --8bit
              generate 8-bit scanner

Perhaps -8 needs to be given to flex via the Makefile?
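If so, the same effect can also be requested inside the scanner source itself; a one line sketch, assuming a jparse.l:

    %option 8bit

That is documented to be equivalent to running flex with -8.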

xexyl commented 2 years ago

The flex man page reads:

       -8, --8bit
              generate 8-bit scanner

Perhaps -8 needs to be given to flex via the Makefile?

Might be an idea. The question is: does bison also have something like this, and if so is it needed? Another question: will using the option cause a problem? Maybe testing both modes will be necessary?