ioccc-src / mkiocccentry

Form an IOCCC entry as a compressed tarball file

Enhancement: finish the C-based general JSON parser #156

Closed xexyl closed 1 year ago

xexyl commented 2 years ago

As I said in the other thread I am also just typing on my phone and I am about to go get some sleep but I can answer your questions tomorrow @lcn2.

As I also said I will be gone most of Saturday and probably will take a few days to recover.

Finally feel free to edit the title and this message or else let me know what you think is a good start.

We can then discuss what I have so far and how we should proceed with the parser.

I will say quickly before I go that I have kind of held back a bit to see where the structs (or the functions acting on them) go as I think it will be helpful to have something more to work on. Once the structs are sorted it should be easier to actually write the rules and actions in the lexer and parser.

I am not sure but there very possibly is missing grammar too.

Hope this is a good start for the issue. Have a good rest of your day and I look forward to seeing more here my friend!

Have a safe trip home when you go back.

TODO:

lcn2 commented 2 years ago

In support of this issue, we plan to add a dynamic array interface. This will be needed to support building structures for JSON arrays, for example.

xexyl commented 2 years ago

In support of this issue, we plan to add a dynamic array interface. This will be needed to support building structures for JSON arrays, for example.

Sounds good. We can then discuss it more here.

Hope you're having a nice sleep my friend!

EDIT: I hope to in the coming days write some information on where I am with this as in what is done so far and what I'm thinking on etc. It might very well not be until the middle of next week but we'll see.

lcn2 commented 2 years ago

We will assume that compound JSON aspects (such as arrays, objects with members, member) we will presume that their sub-components will be parsed first before the parent object is parsed.

For example the JSON parser will need to parse the zero or more elements of an array before the parent JSON array is finally parsed. So when parsing:

[
     "foo" : "bar",
     "curds" : 123,
     "when" : { "fizz" : "bin" }
]

The array object will be completely parsed after the 3 elements are parsed, so the struct json_array will be finalized after the three struct json_member structures (the last one with a value that points to another struct json_member) are finalized.

CORRECTION: The cut and paste of info was botched, sorry (tm Canada :) ). Reentering this later with more info.

xexyl commented 2 years ago

We will assume that compound JSON aspects (such as arrays, objects with members, member) we will presume that their sub-components will be parsed first before the parent object is parsed.

Not sure if I understand this sentence particularly the first part. Mind clarifying?

For example the JSON parser will need to parse the zero or more elements of an array before the parent JSON array is finally parsed. So when parsing:

[
     "foo" : "bar",
     "curds" : 123,
     "when" : { "fizz" : "bin" }
]

The array object will be completely parsed after the 3 elements are parsed, so the struct json_array will be finalized after the three struct json_member structures (the last one with a value that points to another struct json_member) are finalized.

I'll wait and see what you have in mind with the array code as it'll allow me to better come up with any questions/thoughts (I think).

xexyl commented 2 years ago

Okay I'll be off for the day now. I might have a chance to reply to messages later but not sure of that. Tomorrow morning I will have a bit of time but later in the morning until the end of the day I won't be able to interact with you so I hope you have a great Saturday. I am sure I'll do some things or at least reply to some messages though.

Good day!

lcn2 commented 2 years ago

Not sure if I understand this sentence particularly the first part. Mind clarifying?

Cut and paste of text on an iPhone failed, sorry (tm Canada :) ).

When a compound JSON item is parsed, such as a JSON array, the sub-items of the compound JSON item need to be parsed before the parse of the compound JSON item is completed.

What is not clear, because it depends on how bison / flex generated C code works, is when things happen.

Take this compound JSON item:

"curds" : 123

Somewhere in the timeline, the following JSON:

"curds"

will be identified as a string and then sometime malloc_json_conv_string() will be called with the curds string.

The struct json_string pointer returned by malloc_json_conv_string() will be loaded into a struct json and the type will be set to JTYPE_STRING by a (as yet to be written) function.

And somewhere in the timeline, the following JSON:

123

will be identified as an integer and then sometime malloc_json_conv_int() will be called with the 123 string.

The struct json_integer pointer returned by malloc_json_conv_int() will be loaded into a struct json and the type will be set to JTYPE_INT by a (as yet to be written) function.

So now you have two struct json structures, one for "curds" and one for 123. One is the name and one is the value for a struct json_member.

The JSON parser, having identified a member, will call an (as yet to be written) function that creates a struct json_member and sets the name and value accordingly. The struct json_member pointer returned by that (as yet to be written) function will be loaded into a struct json and the type will be set to JTYPE_MEMBER.
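
For illustration only, the member step described above could be sketched as follows. The cut-down struct layouts and the function name sketch_json_member() are assumptions made for this example; they are not the code in the repository, and the real function had not been written at this point in the thread.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical, simplified types: the real parse tree structs may differ. */
enum json_type { JTYPE_UNSET = 0, JTYPE_STRING, JTYPE_INT, JTYPE_MEMBER };

struct json;                    /* forward declaration for the pointers below */

struct json_member {
    struct json *name;          /* JTYPE_STRING node such as "curds" */
    struct json *value;         /* any JSON value node such as 123 */
};

struct json {
    enum json_type type;
    union {
        struct json_member member;  /* other element types omitted */
    } element;
    struct json *parent;
};

/*
 * sketch_json_member - form a JTYPE_MEMBER node from a name and a value
 *
 * Both sub-nodes must already be parsed.  Does not return on error.
 */
struct json *
sketch_json_member(struct json *name, struct json *value)
{
    struct json *ret = NULL;

    if (name == NULL || value == NULL || name->type != JTYPE_STRING) {
        fprintf(stderr, "sketch_json_member: invalid name or value\n");
        exit(EXIT_FAILURE);
    }
    ret = calloc(1, sizeof(*ret));
    if (ret == NULL) {
        perror("calloc");
        exit(EXIT_FAILURE);
    }
    ret->type = JTYPE_MEMBER;
    ret->element.member.name = name;
    ret->element.member.value = value;
    ret->parent = NULL;         /* the caller links the member into the tree */
    name->parent = ret;         /* the sub-nodes now belong to the member */
    value->parent = ret;
    return ret;
}
```

The sub-nodes are finished first and only then wrapped, which matches the ordering discussed above: the parent is completed after its children.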

lcn2 commented 2 years ago

From the above comment, functions such as the malloc_json_conv_int() function should return a struct json pointer (not a struct json_integer pointer).

The struct json should also be filled out in the malloc_json_conv_int() function, along the lines of the following concept:

...
struct json_integer *foo;
struct json *ret;
...
ret = calloc(1, sizeof(struct json));
if (ret == NULL) {
    ...
}
...
foo = calloc(1, sizeof(struct json_integer));
if (foo == NULL) {
    ...
}
...
ret->type = JTYPE_INT;
ret->element.integer = foo;
ret->parent = NULL;
ret->prev = NULL;
ret->next = NULL;
...
return ret;

OK, with better variable names and more comments 👍, but hopefully you get the concept.

We will start to make the needed changes.
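
As a compilable version of the concept above, here is a minimal sketch. The struct definitions are simplified stand-ins, and sketch_json_int_node() and the as_long field are hypothetical names used only for this example; the real malloc_json_conv_int() takes a string and a length rather than a long.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical, simplified types: the real definitions may differ. */
enum json_type { JTYPE_UNSET = 0, JTYPE_INT };

struct json_integer {
    bool converted;     /* true ==> as_long holds a valid value */
    long as_long;       /* illustrative field: the value as a long */
};

struct json {
    enum json_type type;
    union {
        struct json_integer *integer;   /* other element types omitted */
    } element;
    struct json *parent;
    struct json *prev;
    struct json *next;
};

/*
 * sketch_json_int_node - allocate an unlinked JTYPE_INT parse tree node
 *
 * Never returns NULL: on allocation failure this function does not return.
 */
struct json *
sketch_json_int_node(long value)
{
    struct json *ret = NULL;
    struct json_integer *item = NULL;

    ret = calloc(1, sizeof(*ret));
    if (ret == NULL) {
        perror("calloc of struct json");
        exit(EXIT_FAILURE);
    }
    item = calloc(1, sizeof(*item));
    if (item == NULL) {
        perror("calloc of struct json_integer");
        exit(EXIT_FAILURE);
    }
    item->converted = true;
    item->as_long = value;

    /* fill out the node; linking into the tree is left to the caller */
    ret->type = JTYPE_INT;
    ret->element.integer = item;
    ret->parent = NULL;
    ret->prev = NULL;
    ret->next = NULL;
    return ret;
}
```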

xexyl commented 2 years ago

Good morning my friend! I hope you're having a nice sleep though given the time; I cannot sleep so I decided to sit up.

Not sure if I understand this sentence particularly the first part. Mind clarifying?

Cut and paste of text on an iPhone failed, sorry (tm Canada :) ).

I understand that all too well but I think we all do.

When a compound JSON item is parsed, such as a JSON array, the sub-items of the compound JSON item need to be parsed before the parse of the compound JSON item is completed.

What is not clear, because it depends on how bison / flex generated C code works, is when things happen.

That's what I was thinking with the sentence above this one (okay, the sentence above the sentence above this one! :) ): how could I force the order?

On the other hand though it makes sense now that I've read it after a long sleep (though I wish I was still asleep as I'll be awake many hours today though for good reasons).

Take this compound JSON item:

"curds" : 123

Somewhere in the timeline, the following JSON:

"curds"

will be identified as a string and then sometime malloc_json_conv_string() will be called with the curds string.

Right.

The struct json_string pointer returned by malloc_json_conv_string() will be loaded into a struct json and the type will be set to JTYPE_STRING by a (as yet to be written) function.

Okay. That makes sense. Actually though I'm still unsure how to use the:

%union json_type { 
struct json *json;
...
}

because in the lexer I'd have e.g.:

yylval->json

but this means it has to be allocated ... and then inside that there has to be the right allocations and assignments.

Thus I think I have to change it to:

%union json_type { 
struct json json;
...
}

and that would greatly simplify things. In fact I'll try doing that today (if I remember to do so after I write my replies to your comments).

And somewhere in the timeline, the following JSON:

123

will be identified as an integer and then sometime malloc_json_conv_int() will be called with the 123 string.

The struct json_integer pointer returned by malloc_json_conv_int() will be loaded into a struct json and the type will be set to JTYPE_INT by a (as yet to be written) function.

So now you have two struct json structures, one for "curds" and one for 123. One is the name and one is the value for a struct json_member.

Right. I was thinking of that as I was reading the above as well and not sure how to address that.

The JSON parser, having identified a member, will call an (as yet to be written) function that creates a struct json_member and sets the name and value accordingly. The struct json_member pointer returned by that (as yet to be written) function will be loaded into a struct json and the type will be set to JTYPE_MEMBER.

That sounds reasonable. If you're up to doing that that would be of help: I mean writing the function that sets the name and value in struct json_member *. Is that something you had in mind to do?

xexyl commented 2 years ago

From the above comment, functions such as the malloc_json_conv_int() function should return a struct json pointer (not a struct json_integer pointer).

That's an interesting point too because in the book they use a cast as well so that they have a single type (but cast when necessary). On the other hand since struct json will have the union and the type (for the union) it might not need a cast in that case. What will have to change, of course, is code that calls the function(s).

The struct json should also be filled out in the malloc_json_conv_int() function, along the lines of the following concept:

...
struct json_integer *foo;
struct json *ret;
...
ret = calloc(1, sizeof(struct json));
if (ret == NULL) {
    ...
}
...
foo = calloc(1, sizeof(struct json_integer));
if (foo == NULL) {
    ...
}
...
ret->type = JTYPE_INT;
ret->element.integer = foo;
ret->parent = NULL;
ret->prev = NULL;
ret->next = NULL;
...
return ret;

OK, with better variable names and more comments 👍, but hopefully you get the concept.

Of course.

We will start to make the needed changes.

Thanks! That will be of great help!

xexyl commented 2 years ago

Now as for the JTYPE_ versus JSON_ I'm still not sure what to do. I wish you or I had thought of opening a separate issue (this issue) earlier on because I had noted in the other thread an idea I had about this subject but finding it might be difficult: though perhaps in Mail.app I can do a search in that thread for JTYPE_.

The problem is of course that the parser needs token names as well and I figured the prefix JSON_ would be ideal. However I'm starting to wonder if for the parser that JTYPE_ would be better (esp since the code nobody will look at .. without nightmares at least :)) ) and the parse tree related struct/union should have JSON_ prefix. What do you think about that? Maybe I should do that. On the other hand maybe there's a better solution. Any thoughts on that?

UPDATE 0: commit cccbff7 takes care of changing the struct json * to struct json as well as the above prefix changes. I trust that you'll appreciate and hopefully laugh at the commit log wrt the prefixes:

    The JTYPE_ enum in jparse.h have been renamed with the prefix JSON_ and
    the JSON_ in the lexer and parser have been prefixed instead with JTYPE_
    as nobody in their right mind will look at the parser or scanner code
    but people will look at the other code. Actually nobody in their left
    (wrong) mind would look at the generated code either because anyone who
    does will have horrible nightmares about spaghetti code with very long
    tentacles attached to the pasta that tries to grab them and pull them
    into its mouth before consuming them with insatiable craving for humans
    in its eyes. Okay you get the point: JTYPE_ is a better name for the
    ugly, offensive code that can't be maintained whereas JSON_ is better
    for code that can be maintained and is not ugly or offensive.

UPDATE 1: I don't plan on doing anything else but as it's still early it's possible I do. Tomorrow will be a slow day almost certainly though if nothing else we can discuss some things.

lcn2 commented 2 years ago

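With commit 8f9e9f9621a1d4d2308776caa1609bd712e6f234 we have added code to generate JSON parse tree nodes from the fundamental JSON types:

  • integer - malloc_json_conv_int()
  • floating point - malloc_json_conv_float()
  • string - malloc_json_conv_string()
  • boolean - malloc_json_conv_bool()
  • null - malloc_json_conv_null()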

Each of the above interfaces are given a pointer to the beginning character of the JSON item and a length. This pointer/length interface allows one to point to text within a larger character block. The text does NOT need to be NUL but terminated.

Consider this JSON:

{ "foo" : 123 }

For the sake of this example, assume:

char *foo = "{ \"foo\" : 123 }";
/*                       ^-- bar == foo+10 */

char *bar = foo+10;   /* point at the '1' in the above JSON string */

One may call malloc_json_conv_int() as follows:

struct json *node = NULL;
...
node = malloc_json_conv_int(bar, 3);

These malloc_json_conv_foo() functions return a pointer to a struct json, a node in the JSON parse tree. The JSON parse tree node is NOT linked. I.e., the parent, prev, and next pointers are NULL. This is because it will be the responsibility of the function's caller to link the JSON parse tree node into the overall parse tree.

The malloc_json_conv_foo() functions will never return NULL. In the extreme case of a malloc() (well actually calloc()) failure, these functions will NOT return.

For each malloc_json_conv_foo() function, there is a malloc_json_conv_foo_str() function. The malloc_json_conv_foo_str() function interface is given a pointer to a NUL terminated string:

char *bar = "123";
struct json *node = NULL;
...
node = malloc_json_conv_int_str(bar, NULL);

Like the malloc_json_conv_foo() functions, they will NOT return NULL. And as before, in the extreme case of a malloc() (well actually calloc()) failure, these functions will NOT return.

The malloc_json_conv_foo_str() function interface includes a pointer to a string size, instead of a string length. If that pointer is NULL (as it was in the above example) no length is stored. However, one can be given the length of the string as a side effect of the call, as in:

char *bar = "123";
struct json *node = NULL;
size_t len;
...
node = malloc_json_conv_int_str(bar, &len);

One should use the malloc_json_conv_foo() functions to point to a substring inside a larger JSON block, and use the malloc_json_conv_foo_str() function interface to point to a C string that contains the JSON item in question.

Neither the malloc_json_conv_foo() functions nor the malloc_json_conv_foo_str() functions modify the text they are given. Moreover each one of them mallocs a C string containing a copy of the JSON item. In each of the above examples, node->as_str points to a copy of the JSON item as a NUL terminated C string. Moreover, node->as_str_len contains the length of that NUL terminated C string.

Each JSON parse tree node has an important structure member: node->converted. This is a boolean that indicates if the conversion function was able to convert or not. If node->converted == false, then the conversion was NOT possible.

When node->converted == false, only as_str and as_str_len contain valid values. All other aspects of the JSON element union are invalid. Thus before you use other aspects of the JSON element union, you MUST check the node->converted value first.
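
To make the pointer/length interface and the converted flag concrete, here is a minimal sketch. The struct sketch_int type and the sketch_conv_int() function are hypothetical stand-ins written for this example (the real struct json_integer has more fields); only the field names as_str, as_str_len and converted are taken from the description above.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, cut-down stand-in for struct json_integer. */
struct sketch_int {
    char *as_str;       /* calloc()ed NUL terminated copy of the JSON item */
    size_t as_str_len;  /* length of as_str */
    bool converted;     /* true ==> as_ll is valid */
    long long as_ll;    /* only meaningful when converted == true */
};

/*
 * sketch_conv_int - convert len bytes starting at ptr
 *
 * The text at ptr need not be NUL terminated and is not modified.
 * Does not return on allocation failure or a NULL ptr.
 */
struct sketch_int
sketch_conv_int(char const *ptr, size_t len)
{
    struct sketch_int ret = { NULL, 0, false, 0 };
    char *end = NULL;

    if (ptr == NULL) {
        fprintf(stderr, "sketch_conv_int: ptr is NULL\n");
        exit(EXIT_FAILURE);
    }
    ret.as_str = calloc(len + 1, sizeof(char));  /* +1 for the trailing NUL */
    if (ret.as_str == NULL) {
        perror("calloc");
        exit(EXIT_FAILURE);
    }
    memcpy(ret.as_str, ptr, len);
    ret.as_str_len = len;

    errno = 0;
    ret.as_ll = strtoll(ret.as_str, &end, 10);
    /* converted only if the whole copy was consumed without overflow */
    ret.converted = (errno == 0 && end != ret.as_str && *end == '\0');
    return ret;
}
```

A caller would check converted before touching as_ll; as_str and as_str_len are set either way, which mirrors the rule stated above.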

The malloc_json_conv_string() and malloc_json_conv_string_str() interfaces include a final boolean argument to indicate if the JSON string conversion should be done under "strict" rules (or NOT).

The JSON string conversion does NOT include the JSON double quotes. Thus:

struct json *node = NULL;
char *foo = "foo";
char *bar = "\"bar\"";
...
node = malloc_json_conv_string_str(foo, NULL, false);  /* correct */
...
node = malloc_json_conv_string_str(bar, NULL, false);  /* incorrect */

The JSON integer and floating point conversion routines will ignore leading and trailing whitespace. Normally in a JSON document, there should be NO such leading and trailing whitespace. However, the numerical value functions ignore leading and trailing whitespace in the text they are given. The node->as_str will contain a malloced copy of the string with any whitespace stripped off. The orig_len will give the original length (pre-stripping) of the string, whereas as_str_len contains the length of the duplicated string that might be stripped. Thus if node->orig_len == node->as_str_len then the original JSON string containing the numeric value did NOT contain any leading and trailing whitespace.
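
The relationship between orig_len and as_str_len can be sketched like this. The struct sketch_num_str type and sketch_strip_num() are hypothetical helpers for illustration; how the real conversion routines strip whitespace is not shown in this thread.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical holder for just the three fields discussed above. */
struct sketch_num_str {
    char *as_str;       /* calloc()ed copy with whitespace stripped */
    size_t orig_len;    /* length before stripping */
    size_t as_str_len;  /* length after stripping */
};

/*
 * sketch_strip_num - copy len bytes at ptr without surrounding whitespace
 *
 * The original text is not modified.  Does not return on error.
 */
struct sketch_num_str
sketch_strip_num(char const *ptr, size_t len)
{
    struct sketch_num_str ret = { NULL, len, 0 };
    size_t start = 0;   /* index of first non-whitespace byte */
    size_t end = len;   /* one past the last non-whitespace byte */

    if (ptr == NULL) {
        fprintf(stderr, "sketch_strip_num: ptr is NULL\n");
        exit(EXIT_FAILURE);
    }
    while (start < len && isspace((unsigned char)ptr[start])) {
        ++start;
    }
    while (end > start && isspace((unsigned char)ptr[end - 1])) {
        --end;
    }
    ret.as_str_len = end - start;
    ret.as_str = calloc(ret.as_str_len + 1, sizeof(char));
    if (ret.as_str == NULL) {
        perror("calloc");
        exit(EXIT_FAILURE);
    }
    memcpy(ret.as_str, ptr + start, ret.as_str_len);
    return ret;
}
```

When the two lengths are equal, nothing was stripped, which is the test described in the paragraph above.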

lcn2 commented 2 years ago

We plan to work on functions that form compound JSON nodes such as JSON_OBJECT, JSON_MEMBER, and JSON_ARRAY next.

lcn2 commented 2 years ago

With commit 3179f21b1047602756df1532300caf4506392d47 the dynamic array facility has been added.

TODO: Change read_all() to use dynamic array facility, then finally use dynamic array facility for managing compound JSON parser tree nodes (JSON_OBJECT, JSON_MEMBER, and JSON_ARRAY).

xexyl commented 2 years ago

With commit 8f9e9f9 we have added code to generate JSON parse tree nodes from the fundamental JSON types:

  • integer - malloc_json_conv_int()
  • floating point - malloc_json_conv_float()
  • string - malloc_json_conv_string()
  • boolean - malloc_json_conv_bool()
  • null - malloc_json_conv_null()

Each of the above interfaces are given a pointer to the beginning character of the JSON item and a length. This pointer/length interface allows one to point to text within a larger character block. The text does NOT need to be NUL but terminated.

I guess the but was accidentally added; looking at the code I can only presume so since the length is passed in as a parameter and that's used in the call to malloc. A question for you on the subject of malloc though:

Is there a reason throughout the code you use malloc instead of calloc? I always use calloc as a safety measure and although you always initialise the struct after allocation I wonder if there's a reason you don't use calloc? And in at least one function you use both I see (the string one: the only one I took a look at).

Consider this JSON:

{ "foo" : 123 }

For the sake of this example, assume:

char *foo = "{ \"foo\" : 123 }";
/*                       ^-- bar == foo+10 */

char *bar = foo+10;   /* point at the '1' in the above JSON string */

Right. This makes sense.

One may call malloc_json_conv_int() as follows:

struct json *node = NULL;
...
node = malloc_json_conv_int(bar, 3);

Of course we can't know the length of the number like that. However in the lexer we could assign a size_t the value of strlen(yytext) so that we do have the length. But perhaps you have built it so that this is not actually necessary?

These malloc_json_conv_foo() functions return a pointer to a struct json, a node in the JSON parse tree. The JSON parse tree node is NOT linked. I.e., the parent, prev, and next pointers are NULL. This is because it will be the responsibility of the function's caller to link the JSON parse tree node into the overall parse tree.

Okay. I will have to think on how to do this. It might be obvious but not in my current state: I was awake quite late for me and awake at 0400 so I'm not awake enough. I probably won't fully process this until I start looking at the code and maybe the book. I actually do think chapter 3 will be of value: it does create a tree for the calculator mostly through bison. Of course other chapters will also be of value but I think chapter 3 will be quite valuable.

The malloc_json_conv_foo() functions will never return NULL. In the extreme case of a malloc() (well actually calloc()) failure, these functions will NOT return.

This makes sense.

For each malloc_json_conv_foo() function, there is a malloc_json_conv_foo_str() function. The malloc_json_conv_foo_str() function interface is given a pointer to a NUL terminated string:

Good to know this. I just looked at the string version and I see it's just a simplified form.

char *bar = "123";
struct json *node = NULL;
...
node = malloc_json_conv_int(bar, NULL);

Is there a mistake here? Because the function prototype looks like:

struct json *
malloc_json_conv_int(char const *str, size_t len)

Ah. Perhaps you made a typo for malloc_json_conv_int_str? That seems what it is (and below suggests it too) but want to be sure I am not missing something.

Like the malloc_json_conv_foo() functions, they will NOT return NULL. And as before, in the extreme case of a malloc() (well actually calloc()) failure, these functions will NOT return.

The malloc_json_conv_foo_str() function interface includes a pointer to a string size, instead of a string length. If that pointer is NULL (as it was in the above example) no length is stored. However, one can be given the length of the string as a side effect of the call, as in:

char *bar = "123";
struct json *node = NULL;
size_t len;
...
node = malloc_json_conv_int(bar, &len);

One should use the malloc_json_conv_foo() functions to point to a substring inside a larger JSON block, and use the malloc_json_conv_foo_str() function interface to point to a C string that contains the JSON item in question.

Would you please give an example of each if you have any in mind?

Neither the malloc_json_conv_foo() functions nor the malloc_json_conv_foo_str() functions modify the text they are given. Moreover each one of them mallocs a C string containing a copy of the JSON item. In each of the above examples, node->as_str points to a copy of the JSON item as a NUL terminated C string. Moreover, node->as_str_len contains the length of that NUL terminated C string.

Makes sense.

Each JSON parse tree node has an important structure member: node->converted. This is a boolean that indicates if the conversion function was able to convert or not. If node->converted == false, then the conversion was NOT possible.

In which case we cannot rely on the struct being valid.

When node->converted == false, only as_str and as_str_len contain valid values. All other aspects of the JSON element union are invalid. Thus before you use other aspects of the JSON element union, you MUST check the node->converted value first.

Right.

The malloc_json_conv_string() and malloc_json_conv_string_str() interfaces include a final boolean argument to indicate if the JSON string conversion should be done under "strict" rules (or NOT).

I noticed the boolean but I haven't really looked at the code to see what that means - yet.

The JSON string conversion does NOT include the JSON double quotes. Thus:

struct json *node = NULL;
char *foo = "foo";
char *bar = "\"bar\"";
...
node = malloc_json_conv_string_str(foo, NULL, false);  /* correct */
...
node = malloc_json_conv_string_str(bar, NULL, false);  /* incorrect */

Okay this is an interesting one then. Since JSON strings actually have the quotes (json string type) maybe we need to either have these functions strip off the outer "s or else have a function that does this. What are your thoughts here? This btw is because of the regex:

JTYPE_STRING            "(\"(((?=\\)\\([\"\\\/bfnrt]|u[0-9a-fA-F]{4}))|[^\"\\\0-\x1F\x7F]+)*\")"

Of course we don't want to include the "s always since not all types have quotes so this has to be done specially for strings. Should the function(s) (One or both? Which ones if only one?) strip the outer "s? I guess that might be a good idea but only strip off the outermost ones.

The JSON integer and floating point conversion routines will ignore leading and trailing whitespace. Normally in a JSON document, there should be NO such leading and trailing whitespace. However, the numerical value functions ignore leading and trailing whitespace in the text they are given. The node->as_str will contain a malloced copy of the string with any whitespace stripped off. The orig_len will give the original length (pre-stripping) of the string, whereas as_str_len contains the length of the duplicated string that might be stripped. Thus if node->orig_len == node->as_str_len then the original JSON string containing the numeric value did NOT contain any leading and trailing whitespace.

No leading whitespace because the regex will extract exactly the values? Or you mean something else? The last sentence makes sense.

xexyl commented 2 years ago

With commit 3179f21 the dynamic array facility has been added.

I see there's a lot to process here. If you have any examples in mind that would be great. But I think it might be better to wait on the other helper routines and structs as some of it could change.

TODO: Change read_all() to use dynamic array facility, then finally use dynamic array facility for managing compound JSON parser tree nodes (JSON_OBJECT, JSON_MEMBER, and JSON_ARRAY).

I'll wait on this to be done before I ask anything or make any comments.

What else has to be done btw?

lcn2 commented 2 years ago

With commit 3179f21 the dynamic array facility has been added.

I see there's a lot to process here. If you have any examples in mind that would be great. But I think it might be better to wait on the other helper routines and structs as some of it could change.

TODO: Change read_all() to use dynamic array facility, then finally use dynamic array facility for managing compound JSON parser tree nodes (JSON_OBJECT, JSON_MEMBER, and JSON_ARRAY).

I'll wait on this to be done before I ask anything or make any comments.

What else has to be done btw?

We have more edits along the way in dyn_alloc.c and dyn_alloc.h to better support read_all(). We then will probably expand on dyn_alloc_test to use read_all() to read into memory, the binary of itself.

Once that has been coded and tested, then we will be ready to code the JSON compound functions.

lcn2 commented 2 years ago

Is there a reason throughout the code you use malloc instead of calloc? I always use calloc as a safety measure and although you always initialise the struct after allocation I wonder if there's a reason you don't use calloc? And in at least one function you use both I see (the string one: the only one I took a look at).

The duplication of the JSON integer string was done using malloc() and will be changed to use calloc().

Thanks @xexyl.

xexyl commented 2 years ago

Is there a reason throughout the code you use malloc instead of calloc? I always use calloc as a safety measure and although you always initialise the struct after allocation I wonder if there's a reason you don't use calloc? And in at least one function you use both I see (the string one: the only one I took a look at).

The duplication of the JSON integer string was done using malloc() and will be changed to use calloc().

Thanks @xexyl.

Sure. There are other places that use malloc as well and maybe those should use calloc() also? In the code I've added (as I think I might have said) I use calloc() so those don't need to be converted.

Should the functions be renamed too to reflect that they use calloc() instead of malloc()?

xexyl commented 2 years ago

With commit 3179f21 the dynamic array facility has been added.

I see there's a lot to process here. If you have any examples in mind that would be great. But I think it might be better to wait on the other helper routines and structs as some of it could change.

TODO: Change read_all() to use dynamic array facility, then finally use dynamic array facility for managing compound JSON parser tree nodes (JSON_OBJECT, JSON_MEMBER, and JSON_ARRAY).

I'll wait on this to be done before I ask anything or make any comments. What else has to be done btw?

We have more edits along the way in dyn_alloc.c and dyn_alloc.h to better support read_all(). We then will probably expand on dyn_alloc_test to use read_all() to read into memory, the binary of itself.

Once that has been coded and tested, then we will be ready to code the JSON compound functions.

Sounds good. I eagerly await to see what you come up with! I'll probably be a few days before I can do anything much but I'm sure each day I'll do a bit.

lcn2 commented 2 years ago

Would you please give an example of each if you have any in mind? ... Okay this is an interesting one then. Since JSON strings actually have the quotes (json string type) maybe we need to either have these functions strip off the outer "s or else have a function that does this. What are your thoughts here? This btw is because of the regex:

JTYPE_STRING            "(\"(((?=\\)\\([\"\\\/bfnrt]|u[0-9a-fA-F]{4}))|[^\"\\\0-\x1F\x7F]+)*\")"

Of course we don't want to include the "s always since not all types have quotes so this has to be done specially for strings. Should the function(s) (One or both? Which ones if only one?) strip the outer "s? I guess that might be a good idea but only strip off the outermost ones.

The point of the example was to show that the JSON surrounding "'s should not be given to malloc_json_conv_string() nor to malloc_json_conv_string_str().

Assume this JSON document is in memory and is pointed to by ptr:

{ "foobar" : 12379 }

So you don't have to count, by hand, bytes in the above string, here are a few facts about that string:

In this case you would NOT want to use malloc_json_conv_string_str() because the string in memory is NOT NUL byte terminated at ptr+9.

The conversion call you would want to make (using those above hand counted byte addresses :) ) is:

node = malloc_json_conv_string(ptr+3, 6, strict);

NOTE: The boolean strict may be true or false depending on if strict JSON is in effect.

After the above call:

Had node->element.string.converted been false, then the ABOVE JSON parse tree structure fields would have been the ONLY JSON parse tree structure fields with valid values. However, because node->element.string.converted == true we know that the rest of the struct json_string fields are valid. In particular:

Accordingly you would NOT want to call malloc_json_conv_int_str(ptr+13, NULL) because this would try to integer encode "12379 }", which is not what you want. Instead, the call you would want to make (using those above hand counted byte addresses :) ) is:

node = malloc_json_conv_int(ptr+13, 5);

After the above call:

Had node->element.integer.converted been false, then the ABOVE JSON parse tree structure fields would have been the ONLY JSON parse tree structure fields with valid values. However, because node->element.integer.converted == true we know that the rest of the struct json_integer fields are valid. In particular:

Because node->element.integer.int8_sized == false, we know that node->element.integer.as_int8 does NOT contain a valid value and so we ignore it.

Continuing just for a few more of the struct json_integer fields:

Because node->element.integer.int16_sized == true, we know the node->element.integer.as_int16 has a valid value:

Containing for a bit (pun intended):

Because the above mentioned values are true, their corresponding values are valid:

We hope this helps, @xexyl.

UPDATE: The above example has been expanded on and has undergone minor corrections as there is only so much you can do on an iPad. :-)
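
The int8_sized / int16_sized idea above can be illustrated with a small sketch. The struct sketch_sized type and sketch_size_int() are hypothetical and cover only the two sizes named in this comment; the real struct json_integer and its conversion code may be organized differently.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical subset of the fields named in the comment above. */
struct sketch_sized {
    bool converted;     /* true ==> str was a valid integer */
    bool int8_sized;    /* true ==> as_int8 is valid */
    int8_t as_int8;
    bool int16_sized;   /* true ==> as_int16 is valid */
    int16_t as_int16;
};

/*
 * sketch_size_int - convert a NUL terminated string and record which
 *                   fixed width types can hold the value
 */
struct sketch_sized
sketch_size_int(char const *str)
{
    struct sketch_sized ret = { 0 };    /* all flags false, all values 0 */
    char *end = NULL;
    long long val;

    if (str == NULL) {
        return ret;
    }
    errno = 0;
    val = strtoll(str, &end, 10);
    if (errno != 0 || end == str || *end != '\0') {
        return ret;     /* trailing junk such as " }" lands here */
    }
    ret.converted = true;
    if (val >= INT8_MIN && val <= INT8_MAX) {
        ret.int8_sized = true;
        ret.as_int8 = (int8_t)val;
    }
    if (val >= INT16_MIN && val <= INT16_MAX) {
        ret.int16_sized = true;
        ret.as_int16 = (int16_t)val;
    }
    return ret;
}
```

With the string "12379" this sketch leaves int8_sized false and sets int16_sized true, while "12379 }" is rejected outright, which is why the pointer/length call is the right one for text embedded in a larger document.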

xexyl commented 2 years ago

Would you please give an example of each if you have any in mind? ... Okay this is an interesting one then. Since JSON strings actually have the quotes (json string type) maybe we need to either have these functions strip off the outer "s or else have a function that does this. What are your thoughts here? This btw is because of the regex:

JTYPE_STRING            "(\"(((?=\\)\\([\"\\\/bfnrt]|u[0-9a-fA-F]{4}))|[^\"\\\0-\x1F\x7F]+)*\")"

Of course we don't want to include the "s always since not all types have quotes so this has to be done specially for strings. Should the function(s) (One or both? Which ones if only one?) strip the outer "s? I guess that might be a good idea but only strip off the outermost ones.

The point of the example was to show that the JSON surrounding "'s should not be given to malloc_json_conv_string() nor to malloc_json_conv_string_str().

Right but I was getting at how the regex in the scanner will include the "s so should they be stripped off since the functions should not be passed the outer "s?

We hope this helps, @xexyl.

I've saved the link and put it in a file for me to look at the comment later on when I have more energy. If you actually answered the above concern in the text I removed then no need to address it - though maybe a quick message saying so would be good just in case I happen to miss it when I look at it more later on. Either way I'm sure the detailed comment will be of value so thank you.

lcn2 commented 2 years ago

Right but I was getting at how the regex in the scanner will include the "s so should they be stripped off since the functions should not be passed the outer "s?

Yes the JSON surrounding "'s are lexical.

JSON surrounding "'s are lexical just like {, }, :, [, ], , are lexical.

Of course, the tricky bit is that not every instance of those characters is lexical. If they appear inside a JSON encoded string, they are text to be decoded:

{
"foo" : "{}[]:,"
}

UPDATE: While your JTYPE_STRING does need the surrounding "'s what you provide to, say, malloc_json_conv_string() will be one byte beyond the first surrounding " and the length will be 2 bytes short of the matched string to NOT pass in the final surrounding ".
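To illustrate the point that {, }, [, ], : and , are only lexical outside of JSON strings, here is a small sketch. The function name count_lexical() is made up for this example and is not part of the repo; it walks the text by hand rather than using the flex scanner.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * count_lexical - count {, }, [, ], : and , that occur OUTSIDE JSON strings
 *
 * The same characters inside a "..." string are plain text and are skipped.
 */
static size_t
count_lexical(char const *json)
{
    bool in_string = false;     /* true ==> between surrounding "'s */
    size_t count = 0;
    size_t i;

    assert(json != NULL);
    for (i = 0; json[i] != '\0'; ++i) {
        char c = json[i];

        if (in_string) {
            if (c == '\\' && json[i+1] != '\0') {
                ++i;            /* skip the escaped character, such as \" */
            } else if (c == '"') {
                in_string = false;
            }
        } else if (c == '"') {
            in_string = true;
        } else if (strchr("{}[]:,", c) != NULL) {
            ++count;
        }
    }
    return count;
}
```

For the { "foo" : "{}[]:," } document above, only the outer {, the : and the outer } are counted.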

lcn2 commented 2 years ago

I've saved the link and put it in a file for me to look at the comment later on when I have more energy. If you actually answered the above concern in the text I removed then no need to address it - though maybe a quick message saying so would be good just in case I happen to miss it when I look at it more later on. Either way I'm sure the detailed comment will be of value so thank you.

The above document has been expanded on and has undergone minor corrections as there is only so much you can do on an iPad. :-)

You might want to refetch it @xexyl.

xexyl commented 2 years ago

I've saved the link and put it in a file for me to look at the comment later on when I have more energy. If you actually answered the above concern in the text I removed then no need to address it - though maybe a quick message saying so would be good just in case I happen to miss it when I look at it more later on. Either way I'm sure the detailed comment will be of value so thank you.

The above document has been expanded on and has undergone minor corrections as there is only so much you can do on an iPad. :-)

If you wrote that on the iPad I commend you! That's impressive!

You might want to refetch it @xexyl.

No worries. I just saved the link so I can open it and read it as it stands but thanks for the notice.

xexyl commented 2 years ago

Right but I was getting at how the regex in the scanner will include the "s so should they be stripped off since the functions should not be passed the outer "s?

Yes the JSON surrounding "'s are lexical.

Which means that somewhere they have to be removed so where is the right place? To be clear I'm referring to the fact that yytext will include the outer "s when a string is matched.

I went to test it but I do see there's a problem with the parser syntax so I can't test it. I might work on that next but it won't be today.

JSON surrounding "'s are lexical just like {, }, :, [, ], , are lexical.

Of course, the tricky bit is that not every instance of those characters is lexical. If they appear inside a JSON encoded string, they are text to be decoded:

{
"foo" : "{}[]:,"
}

I'll have to address this later possibly once I have the parser correct in the above mentioned way.

lcn2 commented 2 years ago

Should the functions be renamed too to reflect that they use calloc() instead of malloc()?

Good idea, @xexyl, commit 463ec301f54d80f91144d98e13747d8333dc8a68 just did that.

xexyl commented 2 years ago

Should the functions be renamed too to reflect that they use calloc() instead of malloc()?

Good idea, @xexyl, commit 463ec30 just did that.

Thanks.

xexyl commented 2 years ago

Should the functions be renamed too to reflect that they use calloc() instead of malloc()?

Good idea, @xexyl, commit 463ec30 just did that.

Thanks.

I just pushed in the current pull request a fix - you forgot to update jint.c and jfloat.c.

lcn2 commented 2 years ago

No leading whitespace because the regex will extract exactly the values? Or you mean something else? The last sentence makes sense.

String to numeric conversion functions in C such as strtol() ignore leading and trailing whitespace. So functions such as calloc_json_conv_int_str() that use such functions pay attention to this fact. Not that we expect the JSON parser to pass leading and trailing whitespace around strings with numbers. This processing is there just in case it happens.
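As a rough sketch of that kind of handling: strtol() itself skips leading whitespace and stops at the first character it cannot convert, so a caller that wants to tolerate trailing whitespace (and nothing else) has to check what is left over. The helper name str_to_long() is hypothetical and is not how calloc_json_conv_int_str() is actually written.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * str_to_long - convert str to a long, tolerating surrounding whitespace
 *
 * returns:
 *      true ==> *result was set
 *      false ==> str is not a clean base 10 integer, *result is untouched
 */
static bool
str_to_long(char const *str, long *result)
{
    char *end = NULL;
    long value;

    assert(result != NULL);
    if (str == NULL) {
        return false;
    }
    errno = 0;
    value = strtol(str, &end, 10);      /* skips leading whitespace itself */
    if (errno != 0 || end == str) {
        return false;                   /* out of range or no digits found */
    }
    while (isspace((unsigned char)*end)) {
        ++end;                          /* tolerate trailing whitespace */
    }
    if (*end != '\0') {
        return false;                   /* trailing non-whitespace text */
    }
    *result = value;
    return true;
}
```

Here "  42 \n" converts to 42, while "42 }" is rejected because of the trailing }.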

xexyl commented 2 years ago

Sigh. I just noticed a problem with my pull request possibly depending on how it will merge.

I modified json.h AND json.c it appears so the function prototypes might be changed back. Not sure how good GitHub is with that. It doesn't appear to have any conflicts but not sure if it will be an actual problem or not.

xexyl commented 2 years ago

No leading whitespace because the regex will extract exactly the values? Or you mean something else? The last sentence makes sense.

String to numeric conversion functions in C such as strtol() ignore leading and trailing whitespace. So functions such as calloc_json_conv_int_str() that use such functions pay attention to this fact. Not that we expect the JSON parser to pass leading and trailing whitespace around strings with numbers. This processing is there just in case it happens.

Right. They do and no the parser would not pass it because of the regex as I suggested.

lcn2 commented 2 years ago

Sigh. I just noticed a problem with my pull request possibly depending on how it will merge.

I modified json.h AND json.c it appears so the function prototypes might be changed back. Not sure how good GitHub is with that. It doesn't appear to have any conflicts but not sure if it will be an actual problem or not.

We don't see that the json.h AND json.c function prototypes were impacted.

Moreover GitHub's json.h and json.c seem to be OK.

xexyl commented 2 years ago

Sigh. I just noticed a problem with my pull request possibly depending on how it will merge. I modified json.h AND json.c it appears so the function prototypes might be changed back. Not sure how good GitHub is with that. It doesn't appear to have any conflicts but not sure if it will be an actual problem or not.

We don't see that the json.h AND json.c function prototypes were impacted.

Moreover GitHub's json.h and json.c seem to be OK.

I see the same. I do have another issue to resolve but I'm not sure if that's possible. Won't know until I get the backup drive out tomorrow though it's annoying me enough that I might do it today if I have the time and patience.

xexyl commented 2 years ago

Just a quick update: I'm heading off for some sleep. Tomorrow I don't have anything going on but I'm probably still going to take it fairly easy. I will probably do something here but I'm not sure what yet. I look forward to seeing what you come up with in the meantime!

Hope you have a great rest of your day my friend! Good night!

lcn2 commented 2 years ago

With commit 16ae660339cd3729652ce3221785042c2e2d07aa the dynamic array facility has been upgraded from the 2014 code.

Checkpoint dynamic array v1.4 2022-04-17

Using consistent function and macro names that start with dyn_array_

Always allocate an additional guard chunk at end of array.

Renamed dyn_alloc_test to       dyn_test
Renamed dyn_alloc.c to          dyn_array.c
Renamed dyn_alloc.h to          dyn_array.h
Renamed dyn_alloc_test.c to     dyn_test.c
Renamed dyn_alloc_test.h to     dyn_test.h

Renamed grow_dyn_array() to     dyn_array_grow()
Renamed create_dyn_array() to   dyn_array_create()
Renamed append_value() to       dyn_array_append_value()
Renamed append_array() to       dyn_array_append_array()
Renamed clear_dyn_array() to    dyn_array_clear()
Renamed free_dyn_array() to     dyn_array_free()

Added dyn_array_seek():
        /*
         * dyn_array_seek - set the elements in use on a dynamic array
         *
         * given:
         *      array           - pointer to the dynamic array
         *      offset          - offset in elements
         *      whence          - SEEK_SET ==> offset from the dynamic array beginning
         *                        SEEK_CUR ==> offset from the current elements in use
         *                        SEEK_END ==> offset from the end of allocated elements
         *
         * returns:
         *      true ==> address of the array of elements moved during realloc()
         *      false ==> address of the elements array did not move
         *
         * Attempting to "seek" to or before the beginning of the array will have the effect
         * of calling dyn_array_clear().
         *
         * NOTE: This function does not return on error.
         */
Added dynamic array convenience macros:

    Obtain an element in a dynamic array:
            struct dyn_array *array_p;
            double value;

            value = dyn_array_value(array_p, double, i);
    Obtain the address of an element in a dynamic array:
            struct dyn_array *array_p;
            struct json *addr;

            addr = dyn_array_addr(array_p, struct json, i);
    Current element count of dynamic array:
            struct dyn_array *array_p;
            intmax_t pos;

            pos = dyn_array_tell(array_p);
    Address of the element just beyond the elements in use:
            struct dyn_array *array_p;
            struct json_member *next;

            next = dyn_array_beyond(array_p, struct json_member);
    Number of elements allocated in memory for the dynamic array:
            struct dyn_array *array_p;
            intmax_t size;

            size = dyn_array_alloced(array_p);
    Number of elements available (allocated but not in use) for the dynamic array:
            struct dyn_array *array_p;
            intmax_t avail;

            avail = dyn_array_avail(array_p);
    Rewind a dynamic array back to zero elements:
            struct dyn_array *array_p;

            dyn_array_rewind(array_p);
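
For readers who have not seen the facility, the general idea behind macros like the ones listed above can be sketched as a toy. Everything named toy_ here is hypothetical and is NOT the dyn_array code: the real facility has its own struct layout, returns whether the data moved, and does not return on error, whereas this toy returns an error code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct toy_array {
    size_t elm_size;    /* size of one element in bytes */
    intmax_t count;     /* elements in use */
    intmax_t allocated; /* elements allocated, not counting the guard chunk */
    void *data;         /* storage for allocated + 1 elements */
};

/* typed access, in the spirit of dyn_array_addr() and dyn_array_value() */
#define toy_addr(array_p, type, index) (((type *)((array_p)->data)) + (index))
#define toy_value(array_p, type, index) (*toy_addr(array_p, type, index))
#define toy_tell(array_p) ((array_p)->count)
#define toy_avail(array_p) ((array_p)->allocated - (array_p)->count)

static struct toy_array *
toy_create(size_t elm_size, intmax_t start)
{
    struct toy_array *ret;

    assert(elm_size > 0 && start > 0);
    ret = calloc(1, sizeof(*ret));
    if (ret == NULL) {
        return NULL;
    }
    ret->data = calloc((size_t)start + 1, elm_size);    /* +1: guard chunk */
    if (ret->data == NULL) {
        free(ret);
        return NULL;
    }
    ret->elm_size = elm_size;
    ret->allocated = start;
    return ret;
}

/* returns 0 on success, -1 if memory could not be grown */
static int
toy_append(struct toy_array *array, void const *value)
{
    assert(array != NULL && value != NULL);
    if (array->count >= array->allocated) {
        intmax_t new_alloc = array->allocated * 2;
        void *p = realloc(array->data, ((size_t)new_alloc + 1) * array->elm_size);

        if (p == NULL) {
            return -1;
        }
        array->data = p;        /* NOTE: the data may have moved */
        array->allocated = new_alloc;
    }
    memcpy((char *)array->data + (size_t)array->count * array->elm_size,
           value, array->elm_size);
    array->count++;
    return 0;
}

static void
toy_free(struct toy_array *array)
{
    if (array != NULL) {
        free(array->data);
        free(array);
    }
}
```

Because realloc() may move the data, an address obtained with a toy_addr() style macro should not be kept across an append.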

TODO: Change read_all() to use the dynamic array facility.

TODO: Use dynamic array facility to help build compound JSON parse nodes

lcn2 commented 2 years ago

With commit 987579eabae7688637dc2611fd92e0665cdda338 the read_all() now uses the dynamic array facility.

TODO: Use dynamic array facility to help build compound JSON parse nodes ...

... after sleep and after writing a presentation for the Cal State University math department that is due soon.

xexyl commented 2 years ago

Yes the JSON surrounding "'s are lexical.

Which means that somewhere they have to be removed so where is the right place? To be clear I'm referring to the fact that yytext will include the outer "s when a string is matched.

I went to test it but I do see there's a problem with the parser syntax so I can't test it. I might work on that next but it won't be today.

The problem appears to be the regex for strings and I'm not sure where it goes wrong. I'm too tired to even think about it much and certainly cannot solve it now but it's evident that it's the problem and that's why I'm getting the unexpected tokens.

I have the scanner print each type and value before returning the token type, and the string: ... line is not shown. With the input file:

{
"foo" : "bar"
}

I get:

"foo" 
equals/colon: ':'
...

but "foo" should be prefixed by string:.

Just thought you'd like an update on what is going on. Still not sure why and probably won't work on it today but this should at least be what's addressed next on the way of the scanner/parser.

I did add a bit of output in the scanner and parser but I've not committed that and I certainly won't until I'm more awake (if at all today). I think and hope I fixed the fork but I'm not sure of that until I do a pull after you merge my pull request and I won't be satisfied with it until several are done and all is okay.

EDIT: I'll look at your other comments later on today or in the next day or two.

UPDATE 0: I don't expect that I'll be able to read the comments above today but please do keep me updated on the features you add; I'll read them when I can in the coming days. I also don't plan on doing anything else today with the pull request. It's possible I do but I don't plan on it. Hope you're having a nice sleep and I hope the maths things go well!

UPDATE 1: I don't expect to do any other updates aside from the commit I just pushed with some typo fixes, that is!

xexyl commented 2 years ago

With commit 16ae660 the dynamic array facility has been upgraded from the 2014 code.

Checkpoint dynamic array v1.4 2022-04-17


Using consistent function and macro names that start with dyn_array_

[...]

This looks good. I'll let you know if I have any thoughts/questions on it when I actually look at the code. I did make some typo fixes in the header file which I'll commit next time I work on the repo (also made a typo fix in jtype.l) but that might not be today.

Thanks for writing this and giving the thorough comments and examples.

lcn2 commented 2 years ago

TODO: Use dynamic array facility to help build compound JSON parse nodes

We are getting ready to do this in a day or so.

xexyl commented 2 years ago

TODO: Use dynamic array facility to help build compound JSON parse nodes

We are getting ready to do this in a day or so.

Sounds good.

EDIT: Correction. The pending comment was on something else which I just posted.

xexyl commented 2 years ago

Should the functions be renamed too to reflect that they use calloc() instead of malloc()?

Good idea, @xexyl, commit 463ec30 just did that.

Thanks.

I just pushed in the current pull request a fix - you forgot to update jint.c and jfloat.c.

I believe that there are other functions that now use calloc() that could be renamed to have the prefix calloc_. Not sure of the comments about the malloced though. I added that word to my vim spell file but I've often wondered about changing the word to allocated. Similarly I've thought to change 'zeroise' to something else. In some cases it could be 'clear' or 'cleared' but in other cases the sentence might have to be changed somewhat. For the boolean in the dynamic array facility it could be zeroed or cleared or empty or something like that.

What do you think? These are only cosmetic changes but might be worth considering on grounds of making it cleaner.

xexyl commented 2 years ago

The problem appears to be the regex for strings and I'm not sure where it goes wrong. I'm too tired to even think about it much and certainly cannot solve it now but it's evident that it's the problem and that's why I'm getting the unexpected tokens.

I was hoping to tackle this today but I didn't get very far. Hopefully I can look at it more in the coming days. I'm kind of wondering about changing it to the more simplistic form which doesn't have all the rules of JSON so that at least the parser can be tested for normal cases as right now it won't proceed due to this regex not being correct.

I won't worry about that today but what do you think about this? Or if you have any thoughts on why it might not be correct that would be even better. Your perl grep tool seems to work okay with it but it doesn't with flex.

xexyl commented 2 years ago

UPDATE: While your JTYPE_STRING does need the surrounding "'s what you provide to, say, malloc_json_conv_string() will be one byte beyond the first surrounding " and the length will be 2 bytes short of the matched string to NOT pass in the final surrounding ".

Are you saying that the "s do not need to be stripped prior to passing it to that function? That would be helpful.

EDIT: I'm aware that there are still some comments I've yet to address but those will have to be done another day. I hope that day will be sometime this week.

lcn2 commented 2 years ago

Similarly I've thought to change 'zeroise' to something else. In some cases it could be 'clear' or 'cleared' but in other cases the sentence might have to be changed somewhat. For the boolean in the dynamic array facility it could be zeroed or cleared or empty or something like that.

What do you think? These are only cosmetic changes but might be worth considering on grounds of making it cleaner.

There is a good reason why the dynamic array facility does not automatically zeroize data. It is used in other applications where multi-terabyte arrays are being managed. Running calloc() or using memset() to zeroize sections of memory causes massive churn in the memory / VM system.

So the dynamic array facility won't zeroize by default.

TODO: We need to add macros to dyn_alloc.h for backward compatibility, BTW.

Clearing a dynamic array has a very different meaning than zeroing data. It means to remove accumulated data in a dynamic array, with an optional zeroize if that was how the dynamic array was setup initially with that mode enabled.

We know that zeroize is not in some dictionaries, however zeroize is a perfectly cromulent word. :-)
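
The difference between clearing and zeroizing can be sketched with a toy fixed-size buffer. The struct toy_buf and its functions are hypothetical and only illustrate the idea; they are not the dynamic array facility.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct toy_buf {
    unsigned char data[16];
    size_t in_use;      /* bytes of accumulated data */
    bool zeroize;       /* chosen once, when the buffer is set up */
};

static void
toy_buf_init(struct toy_buf *buf, bool zeroize)
{
    assert(buf != NULL);
    buf->in_use = 0;
    buf->zeroize = zeroize;
    if (zeroize) {
        memset(buf->data, 0, sizeof(buf->data));
    }
}

/* returns true if the byte was stored, false if the buffer is full */
static bool
toy_buf_push(struct toy_buf *buf, unsigned char byte)
{
    assert(buf != NULL);
    if (buf->in_use >= sizeof(buf->data)) {
        return false;
    }
    buf->data[buf->in_use++] = byte;
    return true;
}

/*
 * toy_buf_clear - remove accumulated data
 *
 * Clearing always resets the count.  The old bytes are only overwritten
 * with zeros when the buffer was set up with zeroize enabled.
 */
static void
toy_buf_clear(struct toy_buf *buf)
{
    assert(buf != NULL);
    if (buf->zeroize) {
        memset(buf->data, 0, buf->in_use);
    }
    buf->in_use = 0;
}
```

With zeroize off, a clear is just a counter reset, which is what keeps it cheap for very large arrays.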

lcn2 commented 2 years ago

UPDATE: While your JTYPE_STRING does need the surrounding "'s what you provide to, say, malloc_json_conv_string() will be one byte beyond the first surrounding " and the length will be 2 bytes short of the matched string to NOT pass in the final surrounding ".

Are you saying that the "s do not need to be stripped prior to passing it to that function? That would be helpful.

EDIT: I'm aware that there are still some comments I've yet to address but those will have to be done another day. I hope that day will be sometime this week.

We recommend that a wrapper function be given the "JSON/tencoded/tstring\n" (with a length that includes both "'s): a block of memory that includes the "'s AND that the wrapper function verify that the first and last characters are indeed " AND then call the malloc_json_conv_string() function with an arg that points one byte beyond the 1st " and a length that is -2 of the original length.

This way you don't have to pre-strip "'s in your regexp.
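
A minimal sketch of such a wrapper, under loud assumptions: toy_conv_string() is a hypothetical stand-in that merely copies the bytes (the real malloc_json_conv_string() decodes them and builds a parse node), and toy_conv_quoted() is a made-up name for the wrapper.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * toy_conv_string - stand-in for the real conversion function
 *
 * Copies len bytes into a newly allocated NUL terminated string.
 * The caller must free() the result.
 */
static char *
toy_conv_string(char const *ptr, size_t len)
{
    char *ret;

    assert(ptr != NULL);
    ret = malloc(len + 1);
    if (ret == NULL) {
        return NULL;
    }
    memcpy(ret, ptr, len);
    ret[len] = '\0';
    return ret;
}

/*
 * toy_conv_quoted - wrapper given the match INCLUDING the surrounding "'s
 *
 * Verifies the first and last bytes are ", then converts what is between.
 * Returns NULL if the "'s are missing.
 */
static char *
toy_conv_quoted(char const *ptr, size_t len)
{
    if (ptr == NULL || len < 2 || ptr[0] != '"' || ptr[len-1] != '"') {
        return NULL;
    }
    return toy_conv_string(ptr + 1, len - 2);
}
```

The scanner action would then pass the whole match and its length to the wrapper unchanged.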

xexyl commented 2 years ago

Similarly I've thought to change 'zeroise' to something else. In some cases it could be 'clear' or 'cleared' but in other cases the sentence might have to be changed somewhat. For the boolean in the dynamic array facility it could be zeroed or cleared or empty or something like that. What do you think? These are only cosmetic changes but might be worth considering on grounds of making it cleaner.

There is a good reason why the dynamic array facility does not automatically zeroize data. It is used in other applications where multi-terabyte arrays are being managed. Running calloc() or using memset() to zeroize sections of memory causes massive churn in the memory / VM system.

Sorry. I meant clearer. I was talking about the word only.

So the dynamic array facility won't zeroize by default.

TODO: We need to add macros to dyn_alloc.h for backward compatibility, BTW.

Sounds good.

Clearing a dynamic array has a very different meaning than zeroing data. It means to remove accumulated data in a dynamic array, with an optional zeroize if that was how the dynamic array was setup initially with that mode enabled.

Good point. But still could be zeroed or something like that. I'm only talking about using words that are in the dictionary. Or is zeroi[sz]e actually a computer term that I'm unfamiliar with? I've seen different ways of saying the same thing but never this one.

We know that zeroize is not in some dictionaries, however zeroize is a perfectly cromulent word. :-)

And that's what I'm getting at exactly. As you're sure to know I don't use Merriam Webster - I use OED and if I didn't use that I would use another British English one - so I've never heard of it. More to the point though I was thinking of this for vim spelling more than anything else - though other reasons apply as well.

xexyl commented 2 years ago

UPDATE: While your JTYPE_STRING does need the surrounding "'s what you provide to, say, malloc_json_conv_string() will be one byte beyond the first surrounding " and the length will be 2 bytes short of the matched string to NOT pass in the final surrounding ".

Are you saying that the "s do not need to be stripped prior to passing it to that function? That would be helpful. EDIT: I'm aware that there are still some comments I've yet to address but those will have to be done another day. I hope that day will be sometime this week.

We recommend that a wrapper function be given the "JSON/tencoded/tstring\n" (with a length that includes both "'s): a block of memory that includes the "'s AND that the wrapper function verify that the first and last characters are indeed " AND then call the malloc_json_conv_string() function with an arg that points one byte beyond the 1st " and a length that is -2 of the original length.

Is there a reason to not have the function do it itself though? Or have a boolean that if true does it? That would seem like the cleaner approach to me but that's without knowing the full purpose (and all the uses) of the functions as well as only briefly looking at the function a while back.

To be clear: I mean have the function strip them off.

This way you don't have to pre-strip "'s in your regexp.

Right. Though there still exists the problem of the regex not being correct.

EDIT: Of course the wrapper function you have seems practical as well if there's a reason to not have it built into the other function.

lcn2 commented 2 years ago

UPDATE: While your JTYPE_STRING does need the surrounding "'s what you provide to, say, malloc_json_conv_string() will be one byte beyond the first surrounding " and the length will be 2 bytes short of the matched string to NOT pass in the final surrounding ".

Are you saying that the "s do not need to be stripped prior to passing it to that function? That would be helpful.

EDIT: I'm aware that there are still some comments I've yet to address but those will have to be done another day. I hope that day will be sometime this week.

We recommend that a wrapper function be given the "JSON/tencoded/tstring\n" (with a length that includes both "'s): a block of memory that includes the "'s AND that the wrapper function verify that the first and last characters are indeed " AND then call the malloc_json_conv_string() function with an arg that points one byte beyond the 1st " and a length that is -2 of the original length.

Is there a reason to not have the function do it itself though? Or have a boolean that if true does it? That would seem like the cleaner approach to me but that's without knowing the full purpose (and all the uses) of the functions as well as only briefly looking at the function a while back.

This way you don't have to pre-strip "'s in your regexp.

Right. Though there still exists the problem of the regex not being correct.

Which would you prefer, adding a strip boolean argument to the JSON string conversion functions, or use a wrapper function? Probably adding the strip boolean as a final argument to those functions would be cleaner.

xexyl commented 2 years ago

Which would you prefer, adding a strip boolean argument to the JSON string conversion functions, or use a wrapper function? Probably adding the strip boolean as a final argument to those functions would be cleaner.

I think it would be cleaner to add it to the function as well yes.
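
For comparison with the wrapper idea, the strip boolean approach might look roughly like this. The name toy_conv_str() and its behaviour (a plain copy, no JSON decoding) are assumptions made for illustration; the real string conversion functions do considerably more.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * toy_conv_str - copy a scanner match into a NUL terminated string
 *
 * given:
 *      ptr     - start of the matched text (such as yytext)
 *      len     - length of the matched text in bytes
 *      strip   - true ==> ptr[0] and ptr[len-1] must be " and are removed
 *                false ==> copy the bytes as they are
 *
 * returns:
 *      allocated string the caller must free(), or NULL on error
 */
static char *
toy_conv_str(char const *ptr, size_t len, bool strip)
{
    char *ret;

    if (ptr == NULL) {
        return NULL;
    }
    if (strip) {
        if (len < 2 || ptr[0] != '"' || ptr[len-1] != '"') {
            return NULL;        /* the surrounding "'s are missing */
        }
        ++ptr;                  /* one byte beyond the 1st " */
        len -= 2;               /* do not include either " */
    }
    ret = malloc(len + 1);
    if (ret == NULL) {
        return NULL;
    }
    memcpy(ret, ptr, len);
    ret[len] = '\0';
    assert(strlen(ret) <= len);
    return ret;
}
```

A caller that has already removed the "'s passes strip as false, so one function serves both cases.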

lcn2 commented 2 years ago

Similarly I've thought to change 'zeroise' to something else. In some cases it could be 'clear' or 'cleared' but in other cases the sentence might have to be changed somewhat. For the boolean in the dynamic array facility it could be zeroed or cleared or empty or something like that.

What do you think? These are only cosmetic changes but might be worth considering on grounds of making it cleaner.

There is a good reason why the dynamic array facility does not automatically zeroize data. It is used in other applications where multi-terabyte arrays are being managed. Running calloc() or using memset() to zeroize sections of memory causes massive churn in the memory / VM system.

Sorry. I meant clearer. I was talking about the word only.

So the dynamic array facility won't zeroize by default.

TODO: We need to add macros to dyn_alloc.h for backward compatibility, BTW.

Sounds good.

Clearing a dynamic array has a very different meaning than zeroing data. It means to remove accumulated data in a dynamic array, with an optional zeroize if that was how the dynamic array was setup initially with that mode enabled.

Good point. But still could be zeroed or something like that. I'm only talking about using words that are in the dictionary. Or is zeroi[sz]e actually a computer term that I'm unfamiliar with? I've seen different ways of saying the same thing but never this one.

We know that zeroize is not in some dictionaries, however zeroize is a perfectly cromulent word. :-)

And that's what I'm getting at exactly. As you're sure to know I don't use Merriam Webster - I use OED and if I didn't use that I would use another British English one - so I've never heard of it. More to the point though I was thinking of this for vim spelling more than anything else - though other reasons apply as well.

Most dictionaries lag behind common usage: they attempt to describe (instead of prescribe) the language on the date of their publication.

So a dictionary not having zeroize, i.e., not having a definition for a term of art such as zeroize, is not surprising. :-)