Closed: xexyl closed this issue 1 year ago
Similarly I've thought to change 'zeroise' to something else. In some cases it could be 'clear' or 'cleared' but in other cases the sentence might have to be changed somewhat. For the boolean in the dynamic array facility it could be 'zeroed' or 'cleared' or 'empty' or something like that. What do you think? These are only cosmetic changes but might be worth considering on grounds of making it cleaner.
There is a good reason why the dynamic array facility does not automatically zeroize data. It is used in other applications where multi-terabyte arrays are being managed. Running calloc() or using memset() to zeroize sections of memory causes massive churn in the memory / VM system.
Sorry. I meant 'clearer'. I was talking about the word only.
So the dynamic array facility won't zeroize by default.
TODO: We need to add macros to dyn_alloc.h for backward compatibility, BTW.
Sounds good.
Clearing a dynamic array has a very different meaning than zeroizing data. It means removing the accumulated data in a dynamic array, with an optional zeroize if the dynamic array was initially set up with that mode enabled.
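To illustrate that distinction, here is a minimal sketch; the struct and function names below are hypothetical, not the actual dyn_alloc.h API. Clearing just resets the element count, and the backing storage is zeroized only when the array was set up with that mode enabled:

```c
#include <stdlib.h>
#include <string.h>

/* hypothetical stand-in for the real dynamic array struct */
struct dyn_array_sketch {
    void *data;         /* allocated storage */
    size_t elm_size;    /* size of a single element */
    size_t count;       /* number of elements accumulated so far */
    size_t allocated;   /* number of elements allocated */
    int zeroize;        /* true ==> zeroize storage when cleared */
};

/* clear accumulated data, zeroizing storage only if that mode was enabled */
void
dyn_array_clear_sketch(struct dyn_array_sketch *array)
{
    if (array == NULL) {
        return;
    }
    /* optional zeroize: only when the array was set up with that mode */
    if (array->zeroize && array->data != NULL) {
        memset(array->data, 0, array->allocated * array->elm_size);
    }
    /* clearing itself just forgets the accumulated elements */
    array->count = 0;
}
```

This also shows why clearing is cheap by default: without the zeroize mode, no memset() touches the (possibly multi-terabyte) storage.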
Good point. But it could still be 'zeroed' or something like that. I'm only talking about using words that are in the dictionary. Or is zeroi[sz]e actually a computer term that I'm unfamiliar with? I've seen different ways of saying the same thing but never this one.
We know that zeroize is not in some dictionaries, however zeroize is a perfectly cromulent word. :-)
And that's what I'm getting at exactly. As you're sure to know I don't use Merriam Webster - I use OED and if I didn't use that I would use another British English one - so I've never heard of it. More to the point though I was thinking of this for vim spelling more than anything else - though other reasons apply as well.
Most dictionaries lag behind common usage: they attempt to describe (rather than prescribe) the language as of the date of their publication.
So a dictionary not having a definition for such a term of art as zeroize is not surprising. :-)
As a logophile who's read the dictionary since childhood (and constantly loses track of time because of getting lost in the dictionary) I'm aware of how lexicographers work, I can assure you :)
That being said I am thinking that most system dictionaries won't have it either. Perhaps this should be discussed in the preparing for an official release thread though. Or even another thread on language: I don't know.
As for string regex I'm starting to wonder if I need to use a more elaborate way to go about it. Perhaps it doesn't work because flex's regex doesn't support all the features used in the regex. The book doesn't show all these features for example. There might be more in a chapter I haven't read yet of course but I see in the manual the following way to capture C strings:
%x str
%%
char string_buf[MAX_STR_CONST];
char *string_buf_ptr;
\" string_buf_ptr = string_buf; BEGIN(str);
<str>\" { /* saw closing quote - all done */
BEGIN(INITIAL);
*string_buf_ptr = '\0';
/* return string constant token type and
* value to parser
*/
}
<str>\n {
/* error - unterminated string constant */
/* generate error message */
}
<str>\\[0-7]{1,3} {
/* octal escape sequence */
int result;
(void) sscanf( yytext + 1, "%o", &result );
if ( result > 0xff )
/* error, constant is out-of-bounds */
*string_buf_ptr++ = result;
}
<str>\\[0-9]+ {
/* generate error - bad escape sequence; something
* like '\48' or '\0777777'
*/
}
<str>\\n *string_buf_ptr++ = '\n';
<str>\\t *string_buf_ptr++ = '\t';
<str>\\r *string_buf_ptr++ = '\r';
<str>\\b *string_buf_ptr++ = '\b';
<str>\\f *string_buf_ptr++ = '\f';
<str>\\(.|\n) *string_buf_ptr++ = yytext[1];
<str>[^\\\n\"]+ {
char *yptr = yytext;
while ( *yptr )
*string_buf_ptr++ = *yptr++;
}
This deals with escape characters as well but also finds the final unescaped ". I don't really know; I might have to do a bit more research on it. That being said I'm done with this for the day. Hopefully I can look at it more tomorrow or the next day if not tomorrow. Now would be a good time to work on this, since the other features needed to complete the rest of the parser are not yet finished. I only just thought of that, so hopefully tomorrow I can get at it. I'd like it to be a single regex but from what I've seen in the book as well as the manual it might not be possible so simply.
Which would you prefer, adding a strip boolean argument to the JSON string conversion functions, or using a wrapper function? Probably adding the strip boolean as a final argument to those functions would be cleaner.
I think it would be cleaner to add it to the function as well, yes.
With commit eed3230e2be51a53f454ad25a00378490619c786:
Added quote arg to JSON string conversion
Added quote arg to:
extern struct json *calloc_json_conv_string(char const *str, size_t len, bool strict, bool quote);
extern struct json *calloc_json_conv_string_str(char const *str, size_t *retlen, bool strict, bool quote);
As in:
/*
...
* quote true ==> ignore JSON double quotes, both str[0] & str[len-1] must be "
* false ==> the entire str is to be converted
...
*/
If calloc_json_conv_string() or calloc_json_conv_string_str() is called
with quote == true, then if str is not enclosed in JSON double quotes,
a warning will be issued AND conversion will not be performed AND
converted will be set to false.
We hope this helps, @xexyl.
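As a sketch of the quote == true precondition described above (a hypothetical helper for illustration, not part of the repo's code): with quote enabled, both str[0] and str[len-1] must be a JSON double quote before any conversion happens.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * has_json_quotes - hypothetical check mirroring the quote == true rule:
 * both str[0] and str[len-1] must be '"' or conversion is refused.
 */
static bool
has_json_quotes(char const *str, size_t len)
{
    if (str == NULL || len < 2) {
        return false;   /* too short to be an enclosed "..." string */
    }
    return str[0] == '"' && str[len - 1] == '"';
}
```

With quote == false the entire str is converted and no such check applies.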
Sounds good and it will be of help when I get to that point. Thank you!
We see an issue with the above code as it does not match the JSON spec.
JSON has very specific backslash escape characters for JSON encoded strings:
\"
\\
\/
\b
\f
\n
\r
\t
\uXXXX
In the \uXXXX case, X is an ASCII [0-9a-fA-F] character, and there MUST be 4 of them.
The point is that JSON only allows certain backslash escape sequences in JSON encoded strings, and it differs from C syntax. All other backslash escapes are invalid, including C style \0octal and C style \x.
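That rule can be sketched in C as follows; json_escape_ok is a hypothetical helper name for illustration, not the repo's actual validator.

```c
#include <stdbool.h>
#include <ctype.h>

/*
 * json_escape_ok - hypothetical check of the JSON escape rule.
 * p points at the character just after a backslash inside a JSON
 * encoded string. Only the escapes the spec lists are accepted;
 * \uXXXX requires exactly 4 ASCII hex digits.
 */
static bool
json_escape_ok(char const *p)
{
    switch (*p) {
    case '"': case '\\': case '/': case 'b':
    case 'f': case 'n': case 'r': case 't':
        return true;
    case 'u':
        /* \uXXXX: exactly 4 ASCII [0-9a-fA-F] digits must follow */
        return isxdigit((unsigned char)p[1]) &&
               isxdigit((unsigned char)p[2]) &&
               isxdigit((unsigned char)p[3]) &&
               isxdigit((unsigned char)p[4]);
    default:
        /* C style \0octal and \x escapes are NOT valid JSON */
        return false;
    }
}
```

Note that this is exactly where JSON differs from the C string rules in the flex manual example above: the octal rule <str>\\[0-7]{1,3} must not be carried over into a JSON lexer.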
You misunderstand. This was from the manual of flex on C strings. I was just saying that I am wondering if I will have to do something like this for JSON because of the complexities involved (which might be more complex than C).
Sorry (tm Canada :-) )
Easy enough to do when you're not perfect and sadly as my late grandfather said: 'There are hardly any of us perfect people left...' Well he didn't really believe it but it always made us laugh.
I'll be looking at this another day either way. I hope I can do more on this tomorrow but we'll see.
In the JSON spec, we see that JSON strings:
string
    '"' characters '"'
may be an empty string:
characters
    ""
This has been addressed in commit 92df12774f7f187fc37466776a3e4d96385f241e.
UPDATE: This is valid JSON:
"empty-JSON-strings-are-OK" : ""
You know I saw that too but it never registered as a problem to me for some reason: I was focusing on the lexer/parser and not the decoding routines.
If you have any idea of how a regex might work or if you see anything wrong with the current one I'm definitely open to suggestions. I think I'll have to brush up on more advanced regex and see if I can determine if flex actually supports these features.
In the JSON spec, we see that JSON strings:
string
    '"' characters '"'
...
characters
    ""
character
    '0020' . '10FFFF' - '"' - '\'
We interpret this to mean that JSON strings MUST backslash escape characters in this range:
[\x00-\x1F]
How they are escaped depends on the character. In some cases, such as \x09, there is a short \t escape. In most cases the full \uXXXX form must be used.
The code does this, however with commit 3477f4edd0afc7fc974dcb47f8548175b9000439 a comment is clarified.
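The escaping rule above can be sketched like this; emit_ctrl_escape is a hypothetical helper for illustration, not the repo's actual encoder. Bytes in [\x00-\x1F] must be escaped: a few have short forms, the rest need the full \u00XX form.

```c
#include <stdio.h>

/*
 * emit_ctrl_escape - hypothetical sketch: write the JSON escape for a
 * control byte (c < 0x20) into buf. Short forms where the spec has
 * them, \u00XX otherwise.
 */
static void
emit_ctrl_escape(unsigned char c, char *buf, size_t buflen)
{
    switch (c) {
    case 0x08: snprintf(buf, buflen, "\\b"); break;     /* backspace */
    case 0x09: snprintf(buf, buflen, "\\t"); break;     /* tab */
    case 0x0a: snprintf(buf, buflen, "\\n"); break;     /* newline */
    case 0x0c: snprintf(buf, buflen, "\\f"); break;     /* form feed */
    case 0x0d: snprintf(buf, buflen, "\\r"); break;     /* carriage return */
    default:   snprintf(buf, buflen, "\\u%04x", c); break;
    }
}
```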
We think all you need to do is to indicate that JSON strings have zero or more characters between the "'s.
We don't recall your original regex, but using * instead of + for the set of encodings between the enclosing "'s should do the trick.
Based on the JSON spec, this is valid JSON:
"" : "why you would want to do this, is questionable, but it seems to be allowed by JSON"
I.e., the name in a JSON member can be empty for some odd reason.
In the JSON spec, we see that JSON strings may consist of the following:
string
    '"' characters '"'
...
characters
    ""
character
    '0020' . '10FFFF' - '"' - '\'
This implies UTF-8.
So in JSON strings, this is allowed:
"UTF-8-is-OK" : "å∫ç∂´ƒ©˙ˆ∆˚¬µ˜øπœ®ß†¨√∑≈¥Ω"
The malloc_json_encode() function turns the above value into the following:
\u00c3\u00a5\u00e2\u0088\u00ab\u00c3\u00a7\u00e2\u0088\u0082\u00c2\u00b4\u00c6\u0092\u00c2\u00a9\u00cb\u0099\u00cb\u0086\u00e2\u0088\u0086\u00cb\u009a\u00c2\u00ac\u00c2\u00b5\u00cb\u009c\u00c3\u00b8\u00cf\u0080\u00c5\u0093\u00c2\u00ae\u00c3\u009f\u00e2\u0080\u00a0\u00c2\u00a8\u00e2\u0088\u009a\u00e2\u0088\u0091\u00e2\u0089\u0088\u00c2\u00a5\u00ce\u00a9
This is technically OK. However, malloc_json_decode() should not object to valid UTF-8 characters that are not \uXXXX escaped.
This seems to be a bug in malloc_json_decode() that needs to be addressed.
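For illustration, a decoder that processes the input as bytes can recognize where a multi-byte UTF-8 sequence begins instead of rejecting high-bit bytes outright. The sketch below is a hypothetical helper, not the repo's malloc_json_decode(); 0xc3 is the first byte of UTF-8 "å".

```c
/*
 * utf8_seq_len - hypothetical sketch: given the first byte of a UTF-8
 * sequence, return the sequence length (1-4), or 0 for a continuation
 * byte or invalid lead byte. The JSON spec allows raw UTF-8 in strings,
 * so these bytes are not decode errors.
 */
static int
utf8_seq_len(unsigned char c)
{
    if (c < 0x80) {
        return 1;               /* plain ASCII byte */
    }
    if ((c & 0xe0) == 0xc0) {
        return 2;               /* 110xxxxx: 2-byte sequence lead */
    }
    if ((c & 0xf0) == 0xe0) {
        return 3;               /* 1110xxxx: 3-byte sequence lead */
    }
    if ((c & 0xf8) == 0xf0) {
        return 4;               /* 11110xxx: 4-byte sequence lead */
    }
    return 0;                   /* 10xxxxxx continuation, or invalid */
}
```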
That seems incredibly odd indeed and I think rather pointless too. It's true I'm not an expert in JSON but it seems it'd be like in C or some other language:
= 5;
...which is nonsense.
I suppose this also means that what I had in mind for the parser also has to change?
We are questioning the strict / -S mode.
While this blog entry suggests additional encoding for languages such as go and the web, we don't need to consider that for this repo.
What would you think about removing strict mode altogether?
Interesting. That makes me doubt the regex I have which I'll include in reply to the other comment.
Good to know.
Interesting thought. Would this have any effect when making the json parser separate (after the IOCCCMOCK goes as planned)? If not I don't see why it would be needed and removing it would simplify the code which is always a good thing (well okay - unless you're submitting an IOCCC entry maybe).
= 5;
...which is nonsense.
Agreed.
Thankfully, this is not valid JSON:
{ : 5 }
while this is valid JSON for some reason:
{ "" : 5 }
I'm glad you agree. The burning question I have is why does the JSON spec not?
Very bizarre and I have to ask is there much of a difference even? I mean both have no name so what is the difference between an empty string and no string? The content is still of zero length.
This is what I actually have:
JTYPE_STRING "(\"(((?=\\)\\([\"\\\/bfnrt]|u[0-9a-fA-F]{4}))|[^\"\\\0-\x1F\x7F]+)*\")"
Any suggestions? I suppose what you discovered a short bit ago about the unicode/escaped chars might change it if nothing else does (besides it not working right that is). Thank you.
EDIT: I did start adding some extra tokens to make it easier to parse the regex but I haven't used them yet.
Also for the parser what does this mean for name : value pairs? For the contest this won't matter but for others it might?
Right now, there is no harm in someone entering UTF-8 characters into mkiocccentry as malloc_json_encode_str() just encodes non-ASCII characters using \uXXXX.
The malloc_json_decode() function processes characters as bytes and would pass them along as it found them.
Observe what the json tools do now:
$ ./jstrencode "å∫ç∂´ƒ©˙ˆ∆˚¬µ˜øπœ®ß†¨√∑≈¥Ω"
\u00c3\u00a5\u00e2\u0088\u00ab\u00c3\u00a7\u00e2\u0088\u0082\u00c2\u00b4\u00c6\u0092\u00c2\u00a9\u00cb\u0099\u00cb\u0086\u00e2\u0088\u0086\u00cb\u009a\u00c2\u00ac\u00c2\u00b5\u00cb\u009c\u00c3\u00b8\u00cf\u0080\u00c5\u0093\u00c2\u00ae\u00c3\u009f\u00e2\u0080\u00a0\u00c2\u00a8\u00e2\u0088\u009a\u00e2\u0088\u0091\u00e2\u0089\u0088\u00c2\u00a5\u00ce\u00a9
$ ./jstrdecode "å∫ç∂´ƒ©˙ˆ∆˚¬µ˜øπœ®ß†¨√∑≈¥Ω"
å∫ç∂´ƒ©˙ˆ∆˚¬µ˜øπœ®ß†¨√∑≈¥Ω
However:
$ ./jstrdecode -S "å∫ç∂´ƒ©˙ˆ∆˚¬µ˜øπœ®ß†¨√∑≈¥Ω"
Warning: malloc_json_decode: strict encoding at 0 found unescaped char: 0xc3
Warning: main: error while encoding processing arg: 0
If we remove the strict mode, then that UTF-8 problem goes away.
I guess it should be removed then? It would simplify the code and as long as it won't have any effect on the parser I don't see why it should be there. I can in time remove the strict option from jinfochk and jauthchk once I have a better idea what it looks like, though as I recall it's only used in one place right now: checking for the initial and final { / }.
A possibility is to allow the option to remain but add an XXX note about why it's always false; that way it can be tested later on when everything is complete. The idea here: don't remove something that might possibly be needed until 100% sure it's not needed.
Actually if you don't have any thoughts there I think I might start over by adding one part at a time to see where it fails. I could start doing that tomorrow most likely. Perhaps that's a better idea anyway as this is really complex and confusing.
We will address the larger issue in our next comment ...
BTW: That JTYPE_STRING might not be valid; at least perl says:
Invalid [] range "0-\x1F" in regex; marked by <-- HERE in m/("(((?=\)\(["\\/bfnrt]|u[0-9a-fA-F]{4}))|[^"\\x00-\x1F <-- HERE \x7F]+)*")/ at /Users/chongo/bench/sample/grep.pl line 46, <> line 1.
Perhaps this works:
JTYPE_STRING "(\"(((?=\\)\\([\"\\\/bfnrt]|u[0-9a-fA-F]{4}))|[^\"\x00-\x1F\x7F]+)*\")"
If you recall the grep.pl
script, the above regex allows for an empty string of the form:
""
Thanks.
Unfortunately it still doesn't work. I'm not sure what part of it is a problem. I initially had:
JTYPE_STRING \"[^"\n]*\"
but JSON is more strict than that.
We are questioning the strict / -S mode. While this blog entry suggests additional encoding for languages such as Go and the web, we don't need to consider that for this repo. What would you think about removing strict mode altogether?

Interesting thought. Would this have any effect when making the JSON parser separate (after the IOCCCMOCK goes as planned)? If not I don't see why it would be needed, and removing it would simplify the code, which is always a good thing (well, okay - unless you're submitting an IOCCC entry maybe).
Right now, there is no harm in someone entering UTF-8 characters into mkiocccentry as malloc_json_encode_str() just encodes non-ASCII characters using \uXXXX. The malloc_json_decode() function processes characters as bytes and would pass them along as it found them. Observe what the json tools do now:

$ ./jstrencode "å∫ç∂´ƒ©˙ˆ∆˚¬µ˜øπœ®ß†¨√∑≈¥Ω"
\u00c3\u00a5\u00e2\u0088\u00ab\u00c3\u00a7\u00e2\u0088\u0082\u00c2\u00b4\u00c6\u0092\u00c2\u00a9\u00cb\u0099\u00cb\u0086\u00e2\u0088\u0086\u00cb\u009a\u00c2\u00ac\u00c2\u00b5\u00cb\u009c\u00c3\u00b8\u00cf\u0080\u00c5\u0093\u00c2\u00ae\u00c3\u009f\u00e2\u0080\u00a0\u00c2\u00a8\u00e2\u0088\u009a\u00e2\u0088\u0091\u00e2\u0089\u0088\u00c2\u00a5\u00ce\u00a9
$ ./jstrdecode "å∫ç∂´ƒ©˙ˆ∆˚¬µ˜øπœ®ß†¨√∑≈¥Ω"
å∫ç∂´ƒ©˙ˆ∆˚¬µ˜øπœ®ß†¨√∑≈¥Ω
However:
$ ./jstrdecode -S "å∫ç∂´ƒ©˙ˆ∆˚¬µ˜øπœ®ß†¨√∑≈¥Ω"
Warning: malloc_json_decode: strict encoding at 0 found unescaped char: 0xc3
Warning: main: error while encoding processing arg: 0
If we remove the
strict
mode, then that UTF-8 problem goes away.
We are not sure if bison and flex handle UTF-8 characters. We suspect they are UTF-8 OK.
I was actually wondering about that as well.
EDIT: A quick search suggests UTF-8 might be okay but Unicode not. That was for flex; didn't check bison.
By "still doesn't work" do you mean that flex does not understand the regex:
"(\"(((?=\\)\\([\"\\\/bfnrt]|u[0-9a-fA-F]{4}))|[^\"\x00-\x1F\x7F]+)*\")"
???
Take the file:
{
"foo" : "bar"
}
Now running ./jparse
on it will result in:
$ ./jparse json
debug[0]: Calling parse_json_file("json"):
debug[0]: *** BEGIN PARSE:
'
{
"foo" : "bar"
}
'
Starting parse
Entering state 0
Stack now 0
Reading a token
whitespace: '
'
open brace: '{'
Next token is token JTYPE_OPEN_BRACE ()
Shifting token JTYPE_OPEN_BRACE ()
Entering state 1
Stack now 0 1
Reading a token
whitespace: '
'
"foo"
whitespace: ' '
equals/colon: ':'
Next token is token JTYPE_COLON ()
LAC: initial context established for JTYPE_COLON
LAC: checking lookahead JTYPE_COLON: Err
Constructing syntax error message
LAC: checking lookahead end of file: Err
LAC: checking lookahead JTYPE_OPEN_BRACE: Err
LAC: checking lookahead JTYPE_CLOSE_BRACE: S14
LAC: checking lookahead JTYPE_OPEN_BRACKET: Err
LAC: checking lookahead JTYPE_CLOSE_BRACKET: Err
LAC: checking lookahead JTYPE_COMMA: Err
LAC: checking lookahead JTYPE_COLON: Err
LAC: checking lookahead JTYPE_NULL: Err
LAC: checking lookahead JTYPE_STRING: S15
LAC: checking lookahead JTYPE_UINTMAX: Err
LAC: checking lookahead JTYPE_INTMAX: Err
LAC: checking lookahead JTYPE_LONG_DOUBLE: Err
LAC: checking lookahead JTYPE_BOOLEAN: Err
JSON parser error (num errors: 1) on line 4: syntax error, unexpected JTYPE_COLON, expecting JTYPE_CLOSE_BRACE or JTYPE_STRING
Error: popping token JTYPE_OPEN_BRACE ()
Stack now 0
Cleanup: discarding lookahead token JTYPE_COLON ()
Stack now 0
debug[0]: *** END PARSE
So it doesn't see the string.
EDIT: Strings should be prefixed with string:
btw.
I'm not sure if it's regex problem itself or a problem with flex not supporting it - that's making it harder to troubleshoot as well.
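One likely culprit, offered as an assumption rather than a confirmed diagnosis: Perl-style engines accept the (?=...) lookahead used in the rule, but flex's pattern language has no lookahead construct (it only offers trailing context via /), so flex may simply never match that rule. Python's re, which does support lookahead, accepts the pattern:

```python
import re

# The lookahead form of the JTYPE_STRING rule. PCRE-style engines
# (perl, Python re) support (?=...); flex's pattern language does not.
lookahead = r'"(((?=\\)\\(["\\/bfnrt]|u[0-9a-fA-F]{4}))|[^"\\\x00-\x1f\x7f]+)*"'

print(re.fullmatch(lookahead, '"foo"') is not None)    # True under Python re
print(re.fullmatch(lookahead, r'"a\tb"') is not None)  # True: escape handled
```

If that is the problem, dropping the lookahead loses nothing: (?=\\) only asserts a backslash that the following \\ consumes anyway.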
It is fine that Unicode is not OK. The UTF-8 works well.
That's good at least.
Why not use the simpler:

JTYPE_STRING \"[^"\n]*\"

and let the malloc_json_decode() function (via calloc_json_conv_string() and calloc_json_conv_string_str()) do the work? If the parser observes that converted is false, then it can raise an error.
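The two-phase idea — lex loosely, then let the decoder validate — can be sketched in Python, with json.loads standing in for malloc_json_decode() and lex_and_decode a made-up helper name:

```python
import re
import json

# Phase 1: a loose lexer rule, analogous to JTYPE_STRING \"[^"\n]*\"
LOOSE = re.compile(r'"[^"\n]*"')

def lex_and_decode(tok: str):
    """Return the decoded string, or None (the 'converted is false' case)."""
    if LOOSE.fullmatch(tok) is None:
        return None            # the lexer would not even emit the token
    try:
        # Phase 2: strict validation, standing in for malloc_json_decode()
        return json.loads(tok)
    except ValueError:
        return None            # decoder reports the conversion failed

print(lex_and_decode('"hello"'))     # hello
print(lex_and_decode('"bad \\x"'))   # None: invalid escape caught in phase 2
```

The design choice this illustrates: the lexer stays trivial, and all of JSON's escape rules live in one place, the decoder.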
That's a great idea. I guess I forgot about that functionality. Thanks!
By this we mean that using UTF-8 will work for the IOCCC. This presumes, however, that bison and flex will be OK with UTF-8, and if that is true, then all is well on this issue.

So, OK to remove strict mode and use of -S, @xexyl?
Of course. I guess we should verify for sure that both flex and bison work with UTF-8.
I think so, but perhaps the strict parsing could be commented out instead of removed outright, both for documentation purposes and in case it's ever needed? Not sure: I only bring it up as an idea that might or might not have value. Either way I think removing it is good too.
For now, the fact that malloc_json_encode() encodes non-ASCII bytes as \uXXXX could be considered lazy coding. While it would produce encoded strings that might be longer than needed, it does not violate JSON.

The malloc_json_decode() function in non-strict mode does not object to non-ASCII bytes, so that is UTF-8 OK as well.

The strict mode was mainly to enforce the suggestions by this blog post, which are not needed.
I was about to commit this but something occurred to me.
Does the fact that there can be empty strings necessitate changing this regex in any way you can think of? I'm not sure right now. I was thinking possibly:

JTYPE_STRING \"[^\n]*\"

i.e. removing the exclusion of ". But then I can imagine how this might also cause a different problem. I'm not sure, and that is why I am done for the day with this. I can reply to other comments but I won't try working on the parser/scanner today.
Sounds good.
I'll remove the -S option from the jinfochk and jauthchk tools later on. That very possibly will be tomorrow as it shouldn't take much effort.
With a bit of luck I will be more clear headed and alert and then I can also work out the string regex too by testing different inputs.
EDIT: The man pages will have to be updated as well: just remembered this.
The issue is how to get the parser to process this as a single JSON string:
"this is \"OK\" in JSON"
Humm ....
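The escaped-quote problem can be made concrete; the first pattern below is the simple rule, the second an escape-aware variant (a Python sketch of mine, not a drop-in flex rule, and note real JSON limits which escapes are legal while \\. here accepts any):

```python
import re

simple = re.compile(r'"[^"\n]*"')               # stops at any ", even \"
escape_aware = re.compile(r'"(\\.|[^"\\\n])*"')  # consumes \" as one unit

s = r'"this is \"OK\" in JSON"'
print(simple.match(s).group())                 # "this is \"  -- truncated
print(escape_aware.fullmatch(s) is not None)   # True: whole string matched
```

With the two-phase design discussed earlier, the loose rule would hand the decoder a truncated token here, so the escape-aware shape (or something like it) seems necessary at the lexer level for this case.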
That's actually one of the things I had thought of earlier on. I'm not sure how to address it. At least not yet.
The flex man page reads:

-8, --8bit
       generate 8-bit scanner

Perhaps -8 needs to be given via the Makefile to flex?
Might be an idea. The question is: does bison also have something like this, and if so is it needed? Another question is: will using the option cause a problem? Maybe testing both modes will be necessary.
As I said in the other thread I am also just typing on my phone and I am about to go get some sleep but I can answer your questions tomorrow @lcn2.
As I also said I will be gone most of Saturday and probably will take a few days to recover.
Finally feel free to edit the title and this message or else let me know what you think is a good start.
We can then discuss what I have so far and how we should proceed with the parser.
I will say quickly before I go that I have kind of held back a bit to see where the structs (or the functions acting on them) go as I think it will be helpful to have something more to work on. Once the structs are sorted it should be easier to actually write the rules and actions in the lexer and parser.
I am not sure but there very possibly is missing grammar too.
Hope this is a good start for the issue. Have a good rest of your day and I look forward to seeing more here my friend!
Have a safe trip home when you go back.
TODO: