Enhancement: finish the C-based general JSON parser

xexyl commented 2 years ago

As I said in the other thread I am also just typing on my phone and I am about to go get some sleep but I can answer your questions tomorrow @lcn2.

As I also said I will be gone most of Saturday and probably will take a few days to recover.

Finally feel free to edit the title and this message or else let me know what you think is a good start.

We can then discuss what I have so far and how we should proceed with the parser.

I will say quickly before I go that I have kind of held back a bit to see where the structs (or the functions acting on them) go as I think it will be helpful to have something more to work on. Once the structs are sorted it should be easier to actually write the rules and actions in the lexer and parser.

I am not sure but there very possibly is missing grammar too.

Hope this is a good start for the issue. Have a good rest of your day and I look forward to seeing more here my friend!

Have a safe trip home when you go back.

TODO:

[x] Make better error messages for invalid input (and all errors in general).

lcn2 commented 2 years ago

I was about to commit this but something occurred to me.
Why not use the simpler:
JTYPE_STRING    \"[^"\n]*\"
and let the malloc_json_decode() function (via calloc_json_conv_string() and calloc_json_conv_string_str()) do the work? If the parser observes that converted is false, then it can raise an error.
That's a great idea. I guess I forgot about that functionality. Thanks!
Does the fact there can be empty strings necessitate changing this regex in any way you can think of? I'm not sure right now. I was thinking possibly:
JTYPE_STRING            \"[^\n]*\"
I.e. removed the exclusion of ". But then I can imagine how this might also cause a different problem. I'm not sure and that is why I am done for the day with this. I can reply to other comments but I won't try working on the parser/scanner today.
The issue is how to get the parser to process this as a single JSON string:
"this is \"OK\" in JSON"
Humm ....
That's actually one of the things I had thought of earlier on. I'm not sure how to address it. At least not yet.

Does this flex regular expression cheatsheet help?

lcn2 commented 2 years ago

The flex man page reads:
       -8, --8bit
              generate 8-bit scanner
Perhaps -8 needs to be given via the Makefile to flex?
Might be an idea. The question is does bison also have something like this and if so is it needed? Another question is: will using the option cause a problem? Maybe testing both modes will be necessary?

We think you should try for getting flex --8bit to work.

xexyl commented 2 years ago

I was about to commit this but something occurred to me.
Why not use the simpler:
JTYPE_STRING    \"[^"\n]*\"
and let the malloc_json_decode() function (via calloc_json_conv_string() and calloc_json_conv_string_str()) do the work? If the parser observes that converted is false, then it can raise an error.
That's a great idea. I guess I forgot about that functionality. Thanks!
Does the fact there can be empty strings necessitate changing this regex in any way you can think of? I'm not sure right now. I was thinking possibly:
JTYPE_STRING            \"[^\n]*\"
I.e. removed the exclusion of ". But then I can imagine how this might also cause a different problem. I'm not sure and that is why I am done for the day with this. I can reply to other comments but I won't try working on the parser/scanner today.
The issue is how to get the parser to process this as a single JSON string:
"this is \"OK\" in JSON"
Humm ....
That's actually one of the things I had thought of earlier on. I'm not sure how to address it. At least not yet.
Does this flex regular expression cheatsheet help?

Not sure as it links instead to: https://amazingflex.wordpress.com/2011/08/30/latest-hot-news/. Maybe I can find it on the Internet Wayback Machine though.

xexyl commented 2 years ago

The flex man page reads:
       -8, --8bit
              generate 8-bit scanner
Perhaps -8 needs to be given via the Makefile to flex?
Might be an idea. The question is does bison also have something like this and if so is it needed? Another question is: will using the option cause a problem? Maybe testing both modes will be necessary?
We think you should try for getting flex --8bit to work.

We can just add -8 to the Makefile I guess? Or were you thinking of something else?

lcn2 commented 2 years ago

Well that link might have been a dead end as it may have been someone called flex talking about regular expressions.

A more definitive link may be the section 6 of the flex manual.

xexyl commented 2 years ago

Well that link might have been a dead end as it may have been someone called flex talking about regular expressions.

A more definitive link may be the section 6 of the flex manual.

It's also in info flex I see. Maybe it will be of use. We shall see.

EDIT: Rereading it that's what you were referring to I guess.

xexyl commented 2 years ago

Well tomorrow I'll see if I can look at the issue again. If you have any other thoughts I'm open to them and I'm happy to reply to other things here as well.

lcn2 commented 2 years ago

The flex man page reads:
       -8, --8bit
              generate 8-bit scanner
Perhaps -8 needs to be given via the Makefile to flex?
Might be an idea. The question is does bison also have something like this and if so is it needed? Another question is: will using the option cause a problem? Maybe testing both modes will be necessary?
We think you should try for getting flex --8bit to work.
We can just add -8 to the Makefile I guess? Or were you thinking of something else?

Yes, as in:

                echo "$$FLEX_PATH -8 -o jparse.c jparse.l"; \
                "$$FLEX_PATH" -8 -o jparse.c jparse.l; \

xexyl commented 2 years ago

The flex man page reads:
       -8, --8bit
              generate 8-bit scanner
Perhaps -8 needs to be given via the Makefile to flex?
Might be an idea. The question is does bison also have something like this and if so is it needed? Another question is: will using the option cause a problem? Maybe testing both modes will be necessary?
We think you should try for getting flex --8bit to work.
We can just add -8 to the Makefile I guess? Or were you thinking of something else?
Yes, as in:
                echo "$$FLEX_PATH -8 -o jparse.c jparse.l"; \
                "$$FLEX_PATH" -8 -o jparse.c jparse.l; \

Do you want to do that or should I try and remember to do it next time I work on the repo?

lcn2 commented 2 years ago

Well tomorrow I'll see if I can look at the issue again. If you have any other thoughts I'm open to them and I'm happy to reply to other things here as well.

We will see if we can find a quick solution that matches the flex manual in the next few minutes, and post it if we think it might work.

xexyl commented 2 years ago

Well tomorrow I'll see if I can look at the issue again. If you have any other thoughts I'm open to them and I'm happy to reply to other things here as well.

We will see if we can find a quick solution that matches the flex manual in the next few minutes, and post it if we think it might work.

Thank you! Hopefully you can figure out a good way to go about it.

lcn2 commented 2 years ago

The flex man page reads:
       -8, --8bit
              generate 8-bit scanner
Perhaps -8 needs to be given via the Makefile to flex?
Might be an idea. The question is does bison also have something like this and if so is it needed? Another question is: will using the option cause a problem? Maybe testing both modes will be necessary?
We think you should try for getting flex --8bit to work.
We can just add -8 to the Makefile I guess? Or were you thinking of something else?
Yes, as in:
                echo "$$FLEX_PATH -8 -o jparse.c jparse.l"; \
                "$$FLEX_PATH" -8 -o jparse.c jparse.l; \
Do you want to do that or should I try and remember to do it next time I work on the repo?

See commit dfcce01a9d538caeafc965673326a42bf8167d88 for the change.

xexyl commented 2 years ago

The flex man page reads:
       -8, --8bit
              generate 8-bit scanner
Perhaps -8 needs to be given via the Makefile to flex?
Might be an idea. The question is does bison also have something like this and if so is it needed? Another question is: will using the option cause a problem? Maybe testing both modes will be necessary?
We think you should try for getting flex --8bit to work.
We can just add -8 to the Makefile I guess? Or were you thinking of something else?
Yes, as in:
                echo "$$FLEX_PATH -8 -o jparse.c jparse.l"; \
                "$$FLEX_PATH" -8 -o jparse.c jparse.l; \
Do you want to do that or should I try and remember to do it next time I work on the repo?
See commit dfcce01 for the change.

Thanks.

xexyl commented 2 years ago

I have a minor pull request btw if you didn't notice it.

lcn2 commented 2 years ago

Well tomorrow I'll see if I can look at the issue again. If you have any other thoughts I'm open to them and I'm happy to reply to other things here as well.

We will see if we can find a quick solution that matches the flex manual in the next few minutes, and post it if we think it might work.

Thank you! Hopefully you can figure out a good way to go about it.

For grep.pl, this seems to work well:

$ grep.pl '^"([^"]|\\")*"$'

As these strings pass:

"abc"
"ab\"def"
""
"\""

but these do not pass:

abc
"
"""
"ab"def"

xexyl commented 2 years ago

Well tomorrow I'll see if I can look at the issue again. If you have any other thoughts I'm open to them and I'm happy to reply to other things here as well.

We will see if we can find a quick solution that matches the flex manual in the next few minutes, and post it if we think it might work.

Thank you! Hopefully you can figure out a good way to go about it.

For grep.pl, this seems to work well:
$ grep.pl '^"([^"]|\\")*"$'
As these strings pass:
"abc"
"ab\"def"
""
"\""
but these do not pass:
abc
"
"""
"ab"def"

However the line boundaries should be removed so I should remove the ^ and $, right? Or am I missing some important point?

xexyl commented 2 years ago

Well tomorrow I'll see if I can look at the issue again. If you have any other thoughts I'm open to them and I'm happy to reply to other things here as well.

We will see if we can find a quick solution that matches the flex manual in the next few minutes, and post it if we think it might work.

Thank you! Hopefully you can figure out a good way to go about it.

For grep.pl, this seems to work well:
$ grep.pl '^"([^"]|\\")*"$'
As these strings pass:
"abc"
"ab\"def"
""
"\""
but these do not pass:
abc
"
"""
"ab"def"
However the line boundaries should be removed so I should remove the ^ and $, right? Or am I missing some important point?

And for some reason it's still not working. I'm wondering if it's to do with having to escape the "s.

I'll have to play with this one tomorrow and let you know what I come up with.

lcn2 commented 2 years ago

Well tomorrow I'll see if I can look at the issue again. If you have any other thoughts I'm open to them and I'm happy to reply to other things here as well.

We will see if we can find a quick solution that matches the flex manual in the next few minutes, and post it if we think it might work.

Thank you! Hopefully you can figure out a good way to go about it.

For grep.pl, this seems to work well:
$ grep.pl '^"([^"]|\\")*"$'
As these strings pass:
"abc"
"ab\"def"
""
"\""
but these do not pass:
abc
"
"""
"ab"def"
However the line boundaries should be removed so I should remove the ^ and $, right? Or am I missing some important point?

Yes.

xexyl commented 2 years ago

Well tomorrow I'll see if I can look at the issue again. If you have any other thoughts I'm open to them and I'm happy to reply to other things here as well.

We will see if we can find a quick solution that matches the flex manual in the next few minutes, and post it if we think it might work.

Thank you! Hopefully you can figure out a good way to go about it.

For grep.pl, this seems to work well:
$ grep.pl '^"([^"]|\\")*"$'
As these strings pass:
"abc"
"ab\"def"
""
"\""
but these do not pass:
abc
"
"""
"ab"def"
However the line boundaries should be removed so I should remove the ^ and $, right? Or am I missing some important point?
Yes.

Right. Tomorrow I'll see if I can figure out why it's not working in flex.

lcn2 commented 2 years ago

Well tomorrow I'll see if I can look at the issue again. If you have any other thoughts I'm open to them and I'm happy to reply to other things here as well.

We will see if we can find a quick solution that matches the flex manual in the next few minutes, and post it if we think it might work.

Thank you! Hopefully you can figure out a good way to go about it.

For grep.pl, this seems to work well:
$ grep.pl '^"([^"]|\\")*"$'
As these strings pass:
"abc"
"ab\"def"
""
"\""
but these do not pass:
abc
"
"""
"ab"def"
However the line boundaries should be removed so I should remove the ^ and $, right? Or am I missing some important point?
And for some reason it's still not working. I'm wondering if it's to do with having to escape the "s.

I'll have to play with this one tomorrow and let you know what I come up with.

Given that flex wants to put "'s around the string, we guess you will have to escape the "\s.

lcn2 commented 2 years ago

There is a strict mode that does not apply to JSON encoding.

Both jint and jfloat use a different -S for struct numeric testing. Those can stay, we guess.

lcn2 commented 2 years ago

Because the JSON spec does allow for whitespace before and after the JSON stuff as in:


    { "foo" : "bar" }

The strict process done by check_first_json_char() and check_last_json_char() is no longer needed.

We will modify these functions to always skip the whitespace.

xexyl commented 2 years ago

Because the JSON spec does allow for whitespace before and after the JSON stuff as in:
    { "foo" : "bar" }    
The strict process done by check_first_json_char() and check_last_json_char() is no longer needed.

We will modify these functions to always skip the whitespace.

Yes those are the functions I was referring to. I'm not sure they'll be needed once the json parser is done though?

lcn2 commented 2 years ago

Because the JSON spec does allow for whitespace before and after the JSON stuff as in:
    { "foo" : "bar" }    
The strict process done by check_first_json_char() and check_last_json_char() is no longer needed. We will modify these functions to always skip the whitespace.
Yes those are the functions I was referring to. I'm not sure they'll be needed once the json parser is done though?

Correct.

xexyl commented 2 years ago

Well tomorrow I'll see if I can look at the issue again. If you have any other thoughts I'm open to them and I'm happy to reply to other things here as well.

We will see if we can find a quick solution that matches the flex manual in the next few minutes, and post it if we think it might work.

Thank you! Hopefully you can figure out a good way to go about it.

For grep.pl, this seems to work well:
$ grep.pl '^"([^"]|\\")*"$'
As these strings pass:
"abc"
"ab\"def"
""
"\""
but these do not pass:
abc
"
"""
"ab"def"
However the line boundaries should be removed so I should remove the ^ and $, right? Or am I missing some important point?
And for some reason it's still not working. I'm wondering if it's to do with having to escape the "s. I'll have to play with this one tomorrow and let you know what I come up with.
Given that flex wants to put "'s around the string, we guess you will have to escape the "\s.

Right. And unfortunately it doesn't seem to work then. I'll let you know what I come up with. I'll be off for the day soon.

xexyl commented 2 years ago

Because the JSON spec does allow for whitespace before and after the JSON stuff as in:
    { "foo" : "bar" }    
The strict process done by check_first_json_char() and check_last_json_char() is no longer needed. We will modify these functions to always skip the whitespace.
Yes those are the functions I was referring to. I'm not sure they'll be needed once the json parser is done though?
Correct.

In that case maybe I should also not even call them in the tools. That way the functions can be removed entirely.

xexyl commented 2 years ago

There is a strict mode that does not apply to JSON encoding.

Both jint and jfloat use a different -S for struct numeric testing. Those can stay, we guess.

That makes sense to me.

lcn2 commented 2 years ago

See commit 203546c692e2c798c9f56a62fd18013cf505ed8e

Removed strict mode in JSON processing

The strict flags and -S argument use has been removed.

JSON documents may start and end with any amount of whitespace.

Use of non-standard \-escape is no longer considered for JSON strings.

Man pages adjusted to remove use of -S where no longer needed.

JSON string test function no longer tests a -S mode.

Hopefully the default JSON processing is both the correct method to use and will make the code simpler as well.

xexyl commented 2 years ago

See commit 203546c

Removed strict mode in JSON processing

The strict flags and -S argument use has been removed.

JSON documents may start and end with any amount of whitespace.

Use of non-standard \-escape is no longer considered for JSON strings.

Man pages adjusted to remove use of -S where no longer needed.

JSON string test function no longer tests a -S mode.

Seems good to me.

Hopefully the default JSON processing is both the correct method to use and will make the code simpler as well.

I'd think that as well.

lcn2 commented 2 years ago

Because the JSON spec does allow for whitespace before and after the JSON stuff as in:
    { "foo" : "bar" }    
The strict process done by check_first_json_char() and check_last_json_char() is no longer needed. We will modify these functions to always skip the whitespace.
Yes those are the functions I was referring to. I'm not sure they'll be needed once the json parser is done though?
Correct.
In that case maybe I should also not even call them in the tools. That way the functions can be removed entirely.

That seems reasonable. Do you want to do that?

xexyl commented 2 years ago

Just a thought though: maybe add a comment to the tools that in fact do require strict mode so nobody thinks to remove it without much thought?

xexyl commented 2 years ago

Because the JSON spec does allow for whitespace before and after the JSON stuff as in:
    { "foo" : "bar" }    
The strict process done by check_first_json_char() and check_last_json_char() is no longer needed. We will modify these functions to always skip the whitespace.
Yes those are the functions I was referring to. I'm not sure they'll be needed once the json parser is done though?
Correct.
In that case maybe I should also not even call them in the tools. That way the functions can be removed entirely.
That seems reasonable. Do you want to do that?

I can do it tomorrow if that helps. Alternatively it can just be done in time and not worry about it right now: eventually it will be removed.

lcn2 commented 2 years ago

Finally for today, based in part on your suggestion, we removed leading malloc and calloc in front of JSON function names in commit 46d46659f8c648ffc69e9048581bededb9a99ac5:

Renamed JSON encode and decode functions

These functions:

    extern char *malloc_json_encode(char const *ptr, size_t len, size_t *retlen);
    extern char *malloc_json_encode_str(char const *str, size_t *retlen);
    extern char *malloc_json_decode(char const *ptr, size_t len, size_t *retlen);
    extern char *malloc_json_decode_str(char const *str, size_t *retlen);
    extern struct json *calloc_json_conv_int(char const *str, size_t len);
    extern struct json *calloc_json_conv_int_str(char const *str, size_t *retlen);
    extern struct json *calloc_json_conv_float(char const *str, size_t len);
    extern struct json *calloc_json_conv_float_str(char const *str, size_t *retlen);
    extern struct json *calloc_json_conv_string(char const *str, size_t len, bool quote);
    extern struct json *calloc_json_conv_string_str(char const *str, size_t *retlen, bool quote);
    extern struct json *calloc_json_conv_bool(char const *str, size_t len);
    extern struct json *calloc_json_conv_bool_str(char const *str, size_t *retlen);
    extern struct json *calloc_json_conv_null(char const *str, size_t len);
    extern struct json *calloc_json_conv_null_str(char const *str, size_t *retlen);

are now just:

    extern char *json_encode(char const *ptr, size_t len, size_t *retlen);
    extern char *json_encode_str(char const *str, size_t *retlen);
    extern char *json_decode(char const *ptr, size_t len, size_t *retlen);
    extern char *json_decode_str(char const *str, size_t *retlen);
    extern struct json *json_conv_int(char const *str, size_t len);
    extern struct json *json_conv_int_str(char const *str, size_t *retlen);
    extern struct json *json_conv_float(char const *str, size_t len);
    extern struct json *json_conv_float_str(char const *str, size_t *retlen);
    extern struct json *json_conv_string(char const *str, size_t len, bool quote);
    extern struct json *json_conv_string_str(char const *str, size_t *retlen, bool quote);
    extern struct json *json_conv_bool(char const *str, size_t len);
    extern struct json *json_conv_bool_str(char const *str, size_t *retlen);
    extern struct json *json_conv_null(char const *str, size_t len);
    extern struct json *json_conv_null_str(char const *str, size_t *retlen);

xexyl commented 2 years ago

Finally for today, based in part on your suggestion, we removed leading malloc and calloc in front of JSON function names in commit 46d4665:

Enjoy the rest of your day!

Renamed JSON encode and decode functions

These functions:

    extern char *malloc_json_encode(char const *ptr, size_t len, size_t *retlen);
    extern char *malloc_json_encode_str(char const *str, size_t *retlen);
    extern char *malloc_json_decode(char const *ptr, size_t len, size_t *retlen);
    extern char *malloc_json_decode_str(char const *str, size_t *retlen);
    extern struct json *calloc_json_conv_int(char const *str, size_t len);
    extern struct json *calloc_json_conv_int_str(char const *str, size_t *retlen);
    extern struct json *calloc_json_conv_float(char const *str, size_t len);
    extern struct json *calloc_json_conv_float_str(char const *str, size_t *retlen);
    extern struct json *calloc_json_conv_string(char const *str, size_t len, bool quote);
    extern struct json *calloc_json_conv_string_str(char const *str, size_t *retlen, bool quote);
    extern struct json *calloc_json_conv_bool(char const *str, size_t len);
    extern struct json *calloc_json_conv_bool_str(char const *str, size_t *retlen);
    extern struct json *calloc_json_conv_null(char const *str, size_t len);
    extern struct json *calloc_json_conv_null_str(char const *str, size_t *retlen);

are now just:

    extern char *json_encode(char const *ptr, size_t len, size_t *retlen);
    extern char *json_encode_str(char const *str, size_t *retlen);
    extern char *json_decode(char const *ptr, size_t len, size_t *retlen);
    extern char *json_decode_str(char const *str, size_t *retlen);
    extern struct json *json_conv_int(char const *str, size_t len);
    extern struct json *json_conv_int_str(char const *str, size_t *retlen);
    extern struct json *json_conv_float(char const *str, size_t len);
    extern struct json *json_conv_float_str(char const *str, size_t *retlen);
    extern struct json *json_conv_string(char const *str, size_t len, bool quote);
    extern struct json *json_conv_string_str(char const *str, size_t *retlen, bool quote);
    extern struct json *json_conv_bool(char const *str, size_t len);
    extern struct json *json_conv_bool_str(char const *str, size_t *retlen);
    extern struct json *json_conv_null(char const *str, size_t len);
    extern struct json *json_conv_null_str(char const *str, size_t *retlen);

That looks good! Clearer and simpler names. The fact it returns a struct json * should indicate it uses calloc() anyway. Thanks.

Should this be done for other functions though?

xexyl commented 2 years ago

Because the JSON spec does allow for whitespace before and after the JSON stuff as in:
    { "foo" : "bar" }    
The strict process done by check_first_json_char() and check_last_json_char() is no longer needed. We will modify these functions to always skip the whitespace.
Yes those are the functions I was referring to. I'm not sure they'll be needed once the json parser is done though?
Correct.
In that case maybe I should also not even call them in the tools. That way the functions can be removed entirely.
That seems reasonable. Do you want to do that?
I can do it tomorrow if that helps. Alternatively it can just be done in time and not worry about it right now: eventually it will be removed.

I went to do this and then it occurred to me this cannot yet be done because the function also locates the first { and then skips past that character and although the current tool is not complete it at least passes for the default make test. Thus I won't remove these functions yet - not until they're no longer needed.

I'm leaving for the day though I might be able to reply to any comments a bit later. Have a good rest of your day my friend! More from me tomorrow.

lcn2 commented 2 years ago

Should this be done for other functions though?

Which functions are you referring to?

xexyl commented 2 years ago

Should this be done for other functions though?

Which functions are you referring to?

I believe that there are other functions with that terminology. But maybe it’s only in the comments? Perhaps those like readline? Not sure: would have to be at the computer.

lcn2 commented 2 years ago

The notion that JSON must begin with { and end with } is incorrect.

According to the JSON spec, JSON is an element.

An element can be an object such as:

{ "foo" : 1, "bar" : 2 }

In this case JSON does begin with { and ends with }.

However an element can be an array such as:

[ "curds" ]

In this case JSON begins with [ and ends with ].

An element can be a string such as:

"This is valid JSON"

In this case JSON begins with " and ends with ".

An element can be a number such as:

23209

In this case JSON begins with a digit and ends with a digit.

An element can be a true such as:

true

In this case JSON begins with a t and ends with an e.

An element can be a false such as:

false

In this case JSON begins with an f and ends with an e.

An element can be a null such as:

null

In this case JSON begins with an n and ends with an l.

Moreover, each of the above examples can start and end with whitespace. I.e., the JSON element can be surrounded with whitespace.

So a JSON can begin with:

whitespace
{
[
"
digit
t
f
n

A JSON can end with:

whitespace
}
]
"
digit
e
l

Important NOTE: JSON cannot be only whitespace, nor can it be an empty file. This is because while an element can have leading and trailing whitespace, the element has to have something in between.

There is this handy duckduckgo JSON validator as well.

There is also JSONLint.

You may wish to use those tools to check for valid JSON.
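The begin and end rules above can be sketched as a quick pre-check in C. This is only a necessary condition, not a validator: it trims JSON whitespace (space, tab, newline, carriage return) and looks at the first and last remaining bytes. The name json_edges_plausible() is hypothetical and is not a function from the repo. One addition to the lists above is labeled in the code: a number may also begin with a -, so - is accepted as a first byte.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* JSON whitespace is limited to these four bytes */
static bool
is_json_ws(char c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

/*
 * json_edges_plausible - check only the first and last non-whitespace bytes
 *
 * Hypothetical helper for illustration only. Returns false for an empty or
 * whitespace-only document, and false when the first or last byte cannot
 * start or end any JSON element. A true result does NOT mean the document
 * is valid JSON.
 */
bool
json_edges_plausible(char const *s)
{
    size_t start = 0;
    size_t end;
    char first;
    char last;

    if (s == NULL) {
        return false;
    }
    end = strlen(s);
    while (start < end && is_json_ws(s[start])) {
        ++start;
    }
    while (end > start && is_json_ws(s[end - 1])) {
        --end;
    }
    if (start == end) {
        return false;           /* empty, or nothing but whitespace */
    }
    first = s[start];
    last = s[end - 1];

    /* '-' is included because a JSON number may be negative */
    if (strchr("{[\"-tfn", first) == NULL && !isdigit((unsigned char)first)) {
        return false;
    }
    if (strchr("}]\"el", last) == NULL && !isdigit((unsigned char)last)) {
        return false;
    }
    return true;
}
```

A check like this is cheap enough to run before handing the text to the parser, but the parser still has to do the real work: something like {true passes the edge check and is not valid JSON.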

lcn2 commented 2 years ago

According to the JSON spec, an array can have zero or more elements.

For example this is a valid JSON array:

[ 23209 ]

This is a valid JSON array:

[ true ]

This is a valid JSON array:

[ "hi" ]

etc.

A JSON array can be empty such as:

[]

And of course, a JSON array can have multiple elements separated by a ,.

On the other hand a JSON object can have zero or more members (not elements).

This is a valid JSON object:

{ "foo" : "bar" }

A JSON object can be empty, such as:

{}

So JSON arrays can consist of zero or more elements, whereas JSON objects can consist of zero or more members.

The key difference is that elements are things such as:

objects such as { ... }
arrays such as [ ... ]
strings such as "..."
numbers such as 21071
true
false
null

whereas members are always:

"string" : element

So a member will be a "string" followed by a : followed by an element as shown above.

Again, there is this handy duckduckgo JSON validator as well.

There is also JSONLint.

You may wish to use those tools to check for valid JSON.
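One possible way to picture the element versus member distinction in C is sketched below. All of the names here (enum jkind, struct jelem, struct jmember, jelem_count()) are hypothetical and are not the repo's struct json; the point is only that a member carries a name plus an element, while an element on its own has no name.

```c
#include <stddef.h>

/* the seven kinds of JSON element */
enum jkind {
    JK_OBJECT, JK_ARRAY, JK_STRING, JK_NUMBER, JK_TRUE, JK_FALSE, JK_NULL
};

struct jelem;

/* a member is always "string" : element */
struct jmember {
    char const *name;           /* the member name */
    struct jelem *value;        /* the element after the : */
};

/* an element has a kind and, for compound kinds, children */
struct jelem {
    enum jkind kind;
    size_t count;               /* members in an object, elements in an array */
    union {
        struct jmember *members;    /* JK_OBJECT: zero or more members */
        struct jelem **elements;    /* JK_ARRAY: zero or more elements */
        char const *text;           /* JK_STRING, JK_NUMBER: source text */
    } u;
};

/* count every element in a tree, including the root */
size_t
jelem_count(struct jelem const *e)
{
    size_t total = 1;
    size_t i;

    if (e == NULL) {
        return 0;
    }
    if (e->kind == JK_OBJECT) {
        for (i = 0; i < e->count; ++i) {
            total += jelem_count(e->u.members[i].value);
        }
    } else if (e->kind == JK_ARRAY) {
        for (i = 0; i < e->count; ++i) {
            total += jelem_count(e->u.elements[i]);
        }
    }
    return total;
}
```

In this layout a document that is just a number or a string is a single struct jelem with no name attached, and names only appear inside objects.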

lcn2 commented 2 years ago

In JSON, a number cannot end in a period.

For example a JSON number CANNOT be 23209. alone, it must be 23209.0 ending in a digit.

If a JSON number begins with 0, then it CANNOT have any other digits that follow.

For example a JSON number CANNOT be 012 as a leading 0 CANNOT be followed by more digits.

Also for example a JSON number CANNOT be .123, it must be 0.123 with a leading 0.

And of course a JSON number CANNOT be just . all by itself. A . must be both preceded by and followed by one or more digits.

A JSON number can begin with a -, and so -0 is a valid JSON number.

If a JSON number begins with -0, then it cannot be followed by more digits.

For example a JSON number CANNOT be -012 all by itself.

If a JSON NUMBER starts with -0 it can ONLY be -0.digits or -0 all by itself.
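The number rules above can be sketched as a small C check. The name is_json_number() is hypothetical and is not one of the repo's conversion functions. The exponent part (e or E, an optional sign, then digits) is not discussed above; it is included here because the JSON grammar allows it.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * is_json_number - true if s is exactly one JSON number
 *
 * Hypothetical helper for illustration only. No leading or trailing
 * whitespace is allowed here; trim before calling.
 */
bool
is_json_number(char const *s)
{
    if (s == NULL) {
        return false;
    }
    if (*s == '-') {
        ++s;                    /* optional leading minus */
    }

    /* integer part: a lone 0, or a nonzero digit followed by digits */
    if (*s == '0') {
        ++s;
    } else if (*s >= '1' && *s <= '9') {
        while (isdigit((unsigned char)*s)) {
            ++s;
        }
    } else {
        return false;           /* rejects "", "-", ".12", "-.12", "." */
    }

    /* optional fraction: a . must be followed by one or more digits */
    if (*s == '.') {
        ++s;
        if (!isdigit((unsigned char)*s)) {
            return false;       /* rejects "12." and "0." */
        }
        while (isdigit((unsigned char)*s)) {
            ++s;
        }
    }

    /* optional exponent (from the JSON grammar, not from the notes above) */
    if (*s == 'e' || *s == 'E') {
        ++s;
        if (*s == '+' || *s == '-') {
            ++s;
        }
        if (!isdigit((unsigned char)*s)) {
            return false;
        }
        while (isdigit((unsigned char)*s)) {
            ++s;
        }
    }

    /* anything left over, such as the 12 in 012, makes it invalid */
    return *s == '\0';
}
```

For 012 the integer part stops after the lone 0, so the trailing 12 is left over and the check fails, which matches the leading-zero rule.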

lcn2 commented 2 years ago

In regards to the previous comments about JSON numbers, we plan to fix the functions that convert to struct json_integer and to struct json_floating so that those functions will not convert the previously mentioned invalid JSON numbers.

So, for example, these INVALID JSON numbers:

012
-012
12.
-12.
.12
-.12
0.
-0.
.0
-.0
.
-.

will result in converted being set to false.

NOTE: You don't need to make the JSON parser regular expression for numbers overly complete if you don't wish to do so. You can pass a more general set of numerical characters to the conversion functions and then check the converted boolean value.

lcn2 commented 2 years ago

In the process of working on the JSON compound item conversion functions (element, object, array) we needed to dive into some of the fine details of the JSON spec.

This resulted in our writing the above set of comments on JSON.

We hope you find these comments useful.

xexyl commented 2 years ago

In the process of working on the JSON compound item conversion functions (element, object, array) we needed to dive into some of the fine details of the JSON spec.

This resulted in our writing the above set of comments on JSON.

We hope you find these comments useful.

Thanks. I'm sure they will be useful! I hope you're getting some sleep by now.

xexyl commented 2 years ago

Should this be done for other functions though?

Which functions are you referring to?

I believe that there are other functions with that terminology. But maybe it’s only in the comments? Perhaps those like readline? Not sure: would have to be at the computer.

I was referring to the comments like:

/* malloced JSON floating point string, whitespace trimmed if needed */

Since some of them use calloc() and not malloc() the wording could be changed to allocated? Just a thought. A simple sed invocation would fix this. Want me to do that?

lcn2 commented 2 years ago

Should this be done for other functions though?

Which functions are you referring to?

I believe that there are other functions with that terminology. But maybe it’s only in the comments? Perhaps those like readline? Not sure: would have to be at the computer.

I was referring to the comments like:
/* malloced JSON floating point string, whitespace trimmed if needed */
Since some of them use calloc() and not malloc() the wording could be changed to allocated? Just a thought. A simple sed invocation would fix this. Want me to do that?

Yes please.

xexyl commented 2 years ago

Should this be done for other functions though?

Which functions are you referring to?

I believe that there are other functions with that terminology. But maybe it’s only in the comments? Perhaps those like readline? Not sure: would have to be at the computer.

I was referring to the comments like:
/* malloced JSON floating point string, whitespace trimmed if needed */
Since some of them use calloc() and not malloc() the wording could be changed to allocated? Just a thought. A simple sed invocation would fix this. Want me to do that?
Yes please.

Done. I was expecting you to be asleep so perhaps the other comment in that commit about doing a pull request prior to you seeing it might not be true. Anyway I have several commits but not doing a pull request yet.

lcn2 commented 2 years ago

Going to sleep 🛌 again.

xexyl commented 2 years ago

Going to sleep 🛌 again.

Sleep well my friend!

xexyl commented 2 years ago

The notion that JSON must begin with { and end with } is incorrect.

According to the JSON spec, JSON is an element.

An element can be an object such as:
{ "foo" : 1, "bar" : 2 }
In this case JSON does begin with { and ends with }.

However an element can be an array such as:
[ "curds" ]
In this case JSON begins with [ and ends with ].

An element can be a string such as:
"This is valid JSON"
In this case JSON begins with " and ends with ".

An element can be a number such as:
23209
In this case JSON begins with a digit and ends with a digit.

An element can be a true such as:
true
In this case JSON begins with a t and ends with an e.

An element can be a false such as:
false
In this case JSON begins with an f and ends with an e.

An element can be a null such as:
null
In this case JSON begins with an n and ends with an l.

These are interesting. It'll require several updates to the bison rules but that's okay. These will also have to do something with a parse tree of course. The question is how to go about a json document/string that is only a number: is a parse tree even needed? Of course we can't know ahead of time so it might be that we have to go about building one anyway.

But what do we do about names when there are none?

An obvious question also is what to do about content like:

222,
222

which the validators you included below suggest is incorrect (though I thought of it first and then decided to test it - at least with one).

Moreover, each of the above examples can start and end with whitespace. I.e., the JSON element can be surrounded by whitespace.

So a JSON can begin with:

whitespace

{

[

"

digit

t

f

n

A JSON can end with:

whitespace

}

]

"

digit

e

l

What are the final two lines though? JSON can end in the letter e and also l? I must be misunderstanding this.

As for whitespace it's not a problem though: I actually ignore whitespace because it would greatly (dramatically even) complicate the parser rules. Now at least I know it's not a problem to skip so I've removed the comments about that (though I still include the whitespace and print that whitespace is found for debugging purposes: once the parser is complete I can remove it entirely).

Important NOTE: JSON cannot be only whitespace, nor can it be an empty file. This is because while an element can have leading and trailing whitespace, the element has to have something in between.

This is not a problem either. Right now it prints:

$ ./jparse -s '' blank 
debug[0]: Calling parse_json_string(""):
Warning: parse_json_string: passed empty string
debug[0]: Calling parse_json_file("blank"):
Warning: parse_json_file: blank is empty

but the bison rules would cause a problem too if these checks were removed.
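For what it's worth, the kind of pre-check discussed here (reject empty or whitespace-only input, then look at the first significant character) could be sketched in C roughly as below. This is only an illustration and not what jparse does: the names `is_json_ws()`, `json_first_char()` and `json_can_start()` are made up for this example. It also accepts a leading `-`, since a JSON number may be negative.

```c
#include <stdbool.h>
#include <stddef.h>

/* JSON whitespace is only space, tab, newline and carriage return */
static bool
is_json_ws(char c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

/*
 * json_first_char - hypothetical helper, NOT part of jparse
 *
 * Returns the first non-whitespace character of str, or '\0' if str is
 * NULL, empty or consists only of JSON whitespace.
 */
static char
json_first_char(char const *str)
{
    if (str == NULL) {
        return '\0';
    }
    while (is_json_ws(*str)) {
        ++str;
    }
    return *str;
}

/*
 * json_can_start - hypothetical helper, NOT part of jparse
 *
 * Returns true if the first significant character of str could begin a
 * JSON element: an object, array, string, number, true, false or null.
 * This only rules out obviously bad input; it does not validate the rest.
 */
static bool
json_can_start(char const *str)
{
    char c = json_first_char(str);

    switch (c) {
    case '{':   /* object */
    case '[':   /* array */
    case '"':   /* string */
    case '-':   /* negative number */
    case 't':   /* true */
    case 'f':   /* false */
    case 'n':   /* null */
        return true;
    default:
        /* a number may also start with a digit; '\0' falls through to false */
        return c >= '0' && c <= '9';
    }
}
```

With this sketch an empty string, a whitespace-only string and a lone `.` are all rejected before the parser is ever called, while input such as `  {"a":1}` passes on to the real parser.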

There is this handy duckduckgo JSON validator as well.

There is also this JSONLint as well.

You may wish to use those tools to check for valid JSON.

Thank you. Using a lint is a great idea!

xexyl commented 2 years ago

In JSON, a number cannot end in a period.

For example a JSON number CANNOT be 23209. alone; if it has a decimal point it must end in a digit, as in 23209.0.

If a JSON number begins with 0, then it CANNOT have any other digits that follow.

For example a JSON number CANNOT be 012 as a leading 0 CANNOT be followed by more digits.

Damn. This will mean there will have to be some changes in the lexer grammar. Good to know though.

Also for example a JSON number CANNOT be .123, it must be 0.123 with a leading 0.

And of course a JSON number CANNOT be just . all by itself.

Right.

A . must be both preceded by and followed by one or more digits.

Unless it's in a string of course :)

I thought this would be already taken care of implicitly but it appears not to be:

$ ./jparse -s '.'
debug[0]: Calling parse_json_string("."):
debug[0]: *** BEGIN PARSE:
'
.
'
Starting parse
Entering state 0
Stack now 0
Reading a token
Now at end of input.
LAC: initial context established for end of file
LAC: checking lookahead end of file: R1 G9 S21
Reducing stack by rule 1 (line 102):
-> $$ = nterm json ()
Entering state 9
Stack now 0 9
Now at end of input.
Shifting token end of file ()
LAC: initial context discarded due to shift
Entering state 21
Stack now 0 9 21
Stack now 0 9 21
Cleanup: popping token end of file ()
Cleanup: popping nterm json ()
debug[0]: *** END PARSE
.

I'm not sure why this is, unless it's the built-in default rule? But adding the option nodefault introduces some other issues. Will have to work this one out.

A JSON number can begin with a -, and so -0 is a valid JSON number.

If a JSON number begins with -0, then it cannot be followed by more digits.

For example a JSON number CANNOT be -012 all by itself.

If a JSON NUMBER starts with -0 it can ONLY be -0.digits or -0 all by itself.

These complicate matters as well. I thought I had all the regexes for the numbers resolved but apparently not. I'll have to address these then. Thank you (even if it's not what I wanted to see).
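To have the number rules quoted above in one place, here is a minimal hand-written C sketch of them. It is not the lexer code and `is_json_number()` is a hypothetical name; the real fix would go in the flex regexes. The exponent part is not mentioned in the comments above but is part of the JSON number grammar, so the sketch includes it.

```c
#include <stdbool.h>
#include <stddef.h>

static bool
is_digit(char c)
{
    return c >= '0' && c <= '9';
}

/*
 * is_json_number - hypothetical helper, NOT part of jparse
 *
 * Returns true if s is exactly one JSON number, i.e. it matches:
 *
 *      -?(0|[1-9][0-9]*)(\.[0-9]+)?([eE][+-]?[0-9]+)?
 *
 * No leading or trailing whitespace is allowed here.
 */
static bool
is_json_number(char const *s)
{
    if (s == NULL) {
        return false;
    }

    /* optional leading minus sign */
    if (*s == '-') {
        ++s;
    }

    /* integer part: a lone 0, or a non-zero digit followed by digits */
    if (*s == '0') {
        ++s;    /* a leading 0 cannot be followed by more digits */
    } else if (*s >= '1' && *s <= '9') {
        while (is_digit(*s)) {
            ++s;
        }
    } else {
        return false;   /* rejects "", "-", "." and ".123" */
    }

    /* optional fraction: a . must be followed by one or more digits */
    if (*s == '.') {
        ++s;
        if (!is_digit(*s)) {
            return false;       /* rejects "23209." */
        }
        while (is_digit(*s)) {
            ++s;
        }
    }

    /* optional exponent: e or E, optional sign, one or more digits */
    if (*s == 'e' || *s == 'E') {
        ++s;
        if (*s == '+' || *s == '-') {
            ++s;
        }
        if (!is_digit(*s)) {
            return false;
        }
        while (is_digit(*s)) {
            ++s;
        }
    }

    /* anything left over (such as the 12 in "012") makes it invalid */
    return *s == '\0';
}
```

Under these rules `-0`, `-0.5`, `0.123` and `23209.0` are accepted, while `012`, `-012`, `23209.`, `.123` and `.` are rejected.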

ioccc-src / mkiocccentry

Enhancement: finish the C-based general JSON parser #156