Closed xexyl closed 1 year ago
In support of this issue, we plan to add a dynamic array interface. This will be needed to support building structures for JSON arrays, for example.
In support of this issue, we plan to add a dynamic array interface. This will be needed to support building structures for JSON arrays, for example.
Sounds good. We can then discuss it more here.
Hope you're having a nice sleep my friend!
EDIT: I hope to in the coming days write some information on where I am with this as in what is done so far and what I'm thinking on etc. It might very well not be until the middle of next week but we'll see.
We will assume that compound JSON aspects (such as arrays, objects with members, member) we will presume that their sub-components will be parsed first before the parent object is parsed.
For example the JSON parser will need to parse the zero or more elements of an array before the parent JSON array is finally parsed. So when parsing:
[
"foo" : "bar",
"curds" : 123,
"when" : { "fizz" : "bin" }
]
The array object will be completely parsed after the 3 elements are parsed, so the struct json_array will be finalized after the three struct json_member structures (the last one with a value that points to another struct json_member) are finalized.
CORRECTION: The cut and paste of info was botched, sorry (tm Canada :) ). Reentering this later with more info.
We will assume that compound JSON aspects (such as arrays, objects with members, member) we will presume that their sub-components will be parsed first before the parent object is parsed.
Not sure if I understand this sentence particularly the first part. Mind clarifying?
For example the JSON parser will need to parse the zero or more elements of an array before the parent JSON array is finally parsed. So when parsing:
[ "foo" : "bar", "curds" : 123, "when" : { "fizz" : "bin" } ]
The array object will be completely parsed after the 3 elements are parsed, so the struct json_array will be finalized after the three struct json_member structures (the last one with a value that points to another struct json_member) are finalized.
I'll wait and see what you have in mind with the array code as it'll allow me to better come up with any questions/thoughts (I think).
Okay I'll be off for the day now. I might have a chance to reply to messages later but not sure of that. Tomorrow morning I will have a bit of time but later in the morning until the end of the day I won't be able to interact with you so I hope you have a great Saturday. I am sure I'll do some things or at least reply to some messages though.
Good day!
Not sure if I understand this sentence particularly the first part. Mind clarifying?
Cut and paste of text on an iPhone failed, sorry (tm Canada :) ).
When a compound JSON item is parsed, such as a JSON array, the sub-items of the compound JSON item need to be parsed before the parse of the compound JSON item is completed.
What is not clear, because it depends on how bison / flex generated C code works, is when things happen.
Take this compound JSON item:
"curds" : 123
Somewhere in the timeline, the following JSON:
"curds"
will be identified as a string and then sometime malloc_json_conv_string() will be called with the curds string.
The struct json_string pointer returned by malloc_json_conv_string() will be loaded into a struct json and the type will be set to JTYPE_STRING by a (as yet to be written) function.
And somewhere in the timeline, the following JSON:
123
will be identified as an integer and then sometime malloc_json_conv_int() will be called with the 123 string.
The struct json_integer pointer returned by malloc_json_conv_int() will be loaded into a struct json and the type will be set to JTYPE_INT by a (as yet to be written) function.
So now you have two struct json structures, one for "curds" and one for 123. One is the name and one is the value for a struct json_member.
The JSON parser, having identified a member, will call a (as yet to be written) function that creates a struct json_member and sets the name and value accordingly. The struct json_member pointer returned by that (as yet to be written) function will be loaded into a struct json and the type will be set to JTYPE_MEMBER.
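As a rough sketch of the member step described above: the function takes the two already parsed nodes (name and value) and wraps them in a new member node. Everything named sketch_ here, including the cut-down structs, is an assumption made up for illustration and is not code from the repo.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical, cut-down types for illustration; not the project's definitions. */
enum sketch_type { SKETCH_STRING, SKETCH_INT, SKETCH_MEMBER };

struct sketch_json;

struct sketch_member {
    struct sketch_json *name;   /* node holding the member name (a string) */
    struct sketch_json *value;  /* node holding the member value */
};

struct sketch_json {
    enum sketch_type type;
    union {
        struct sketch_member member;
    } element;
    struct sketch_json *parent;
    struct sketch_json *prev;
    struct sketch_json *next;
};

/* Allocate a zeroed, unlinked node of the given type; exits on allocation failure. */
struct sketch_json *
sketch_new_node(enum sketch_type type)
{
    struct sketch_json *node = calloc(1, sizeof(*node));

    if (node == NULL) {
        perror("calloc");
        exit(EXIT_FAILURE);
    }
    node->type = type;
    return node;
}

/* Build a member node from a name node and a value node that were parsed earlier. */
struct sketch_json *
sketch_new_member(struct sketch_json *name, struct sketch_json *value)
{
    struct sketch_json *ret;

    assert(name != NULL && value != NULL);
    assert(name->type == SKETCH_STRING);    /* JSON member names are strings */

    ret = sketch_new_node(SKETCH_MEMBER);
    ret->element.member.name = name;
    ret->element.member.value = value;
    name->parent = ret;                     /* the sub-items now belong to the member */
    value->parent = ret;
    return ret;                             /* the member itself is still unlinked */
}
```

The sub-items exist before the member does, which matches the "sub-components are parsed first" ordering discussed earlier.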
From the above comment, functions such as the malloc_json_conv_int() function should return a struct json pointer (not a struct json_integer pointer).
The struct json should also be filled out in the malloc_json_conv_int() function, along the following concept:
...
struct json_integer *foo;
struct json *ret;
...
ret = calloc(1, sizeof(struct json));
if (ret == NULL) {
...
}
...
foo = calloc(1, sizeof(struct json_integer));
if (foo == NULL) {
...
}
...
ret->type = JTYPE_INT;
ret->element.integer = foo;
ret->parent = NULL;
ret->prev = NULL;
ret->next = NULL;
...
return ret;
OK, with better variable names and more comments 👍, but hopefully you get the concept.
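For anyone who wants to compile the concept, here is a filled-in sketch under stated assumptions: the sketch_ types are hypothetical stand-ins for struct json and struct json_integer, the value is passed in already converted, and an allocation failure simply exits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical, cut-down types for illustration; not the project's definitions. */
enum sketch_type { SKETCH_UNSET = 0, SKETCH_INT };

struct sketch_integer {
    bool converted;     /* true ==> as_int64 holds a valid value */
    int64_t as_int64;
};

struct sketch_json {
    enum sketch_type type;
    union {
        struct sketch_integer *integer; /* pointer, as in ret->element.integer = foo */
    } element;
    struct sketch_json *parent;
    struct sketch_json *prev;
    struct sketch_json *next;
};

/* Allocate an unlinked node wrapping an integer; never returns NULL. */
struct sketch_json *
sketch_new_int_node(int64_t value)
{
    struct sketch_json *ret = calloc(1, sizeof(*ret));
    struct sketch_integer *item = calloc(1, sizeof(*item));

    if (ret == NULL || item == NULL) {
        perror("calloc");
        exit(EXIT_FAILURE);
    }
    item->converted = true;
    item->as_int64 = value;

    ret->type = SKETCH_INT;
    ret->element.integer = item;
    ret->parent = NULL;     /* calloc already zeroed these; set again for clarity */
    ret->prev = NULL;
    ret->next = NULL;
    return ret;
}

/* Release a node made by sketch_new_int_node(); NULL is accepted. */
void
sketch_free_int_node(struct sketch_json *node)
{
    assert(node == NULL || node->type == SKETCH_INT);
    if (node != NULL) {
        free(node->element.integer);
        free(node);
    }
}
```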
We will start to make the needed changes.
Good morning my friend! I hope you're having a nice sleep though given the time; I cannot sleep so I decided to sit up.
Not sure if I understand this sentence particularly the first part. Mind clarifying?
Cut and paste of text on an iPhone failed, sorry (tm Canada :) ).
I understand that all too well but I think we all do.
When a compound JSON item is parsed, such as a JSON array, the sub-items of the compound JSON item need to be parsed before the parse of the compound JSON item is completed.
What is not clear, because it depends on how bison / flex generated C code works, is when things happen.
That's what I was thinking with the sentence above this one (okay, the sentence above the sentence above this one! :) ): how could I force the order?
On the other hand though it makes sense now that I've read it after a long sleep (though I wish I was still asleep as I'll be awake many hours today though for good reasons).
Take this compound JSON item:
"curds" : 123
Somewhere in the timeline, the following JSON:
"curds"
will be identified as a string and then sometime malloc_json_conv_string() will be called with the curds string.
Right.
The struct json_string pointer returned by malloc_json_conv_string() will be loaded into a struct json and the type will be set to JTYPE_STRING by a (as yet to be written) function.
Okay. That makes sense. Actually though I'm still unsure how to use the:
%union json_type {
struct json *json;
...
}
because in the lexer I'd have e.g.:
yylval->json
but this means it has to be allocated ... and then inside that there has to be the right allocations and assignments.
Thus I think I have to change it to:
%union json_type {
struct json json;
...
}
and that would greatly simplify things. In fact I'll try doing that today (if I remember to do so after I write my replies to your comments).
And somewhere in the timeline, the following JSON:
123
will be identified as an integer and then sometime malloc_json_conv_int() will be called with the 123 string.
The struct json_integer pointer returned by malloc_json_conv_int() will be loaded into a struct json and the type will be set to JTYPE_INT by a (as yet to be written) function.
So now you have two struct json structures, one for "curds" and one for 123. One is the name and one is the value for a struct json_member.
Right. I was thinking of that as I was reading the above as well and not sure how to address that.
The JSON parser, having identified a member, will call a (as yet to be written) function that creates a struct json_member and sets the name and value accordingly. The struct json_member pointer returned by that (as yet to be written) function will be loaded into a struct json and the type will be set to JTYPE_MEMBER.
That sounds reasonable. If you're up to doing that that would be of help: I mean writing the function that sets the name and value in struct json_member *. Is that something you had in mind to do?
From the above comment, functions such as the malloc_json_conv_int() function should return a struct json pointer (not a struct json_integer pointer).
That's an interesting point too because in the book they use a cast as well so that they have a single type (but cast when necessary). On the other hand since struct json will have the union and the type (for the union) it might not need a cast in that case. What will have to change, of course, is code that calls the function(s).
The struct json should also be filled out in the malloc_json_conv_int() function, along the following concept:
...
struct json_integer *foo;
struct json *ret;
...
ret = calloc(1, sizeof(struct json));
if (ret == NULL) {
...
}
...
foo = calloc(1, sizeof(struct json_integer));
if (foo == NULL) {
...
}
...
ret->type = JTYPE_INT;
ret->element.integer = foo;
ret->parent = NULL;
ret->prev = NULL;
ret->next = NULL;
...
return ret;
OK, with better variable names and more comments 👍, but hopefully you get the concept.
Of course.
We will start to make the needed changes.
Thanks! That will be of great help!
Now as for the JTYPE_ versus JSON_ I'm still not sure what to do. I wish you or I had thought of opening a separate issue (this issue) earlier on because I had noted in the other thread an idea I had about this subject but finding it might be difficult: though perhaps in Mail.app I can do a search in that thread for JTYPE_.
The problem is of course that the parser needs token names as well and I figured the prefix JSON_ would be ideal. However I'm starting to wonder if for the parser that JTYPE_ would be better (esp since the code nobody will look at .. without nightmares at least :)) ) and the parse tree related struct/union should have JSON_ prefix. What do you think about that? Maybe I should do that. On the other hand maybe there's a better solution. Any thoughts on that?
UPDATE 0: commit cccbff7 takes care of changing the struct json * to struct json as well as the above prefix changes. I trust that you'll appreciate and hopefully laugh at the commit log wrt the prefixes:
The JTYPE_ enum in jparse.h have been renamed with the prefix JSON_ and
the JSON_ in the lexer and parser have been prefixed instead with JTYPE_
as nobody in their right mind will look at the parser or scanner code
but people will look at the other code. Actually nobody in their left
(wrong) mind would look at the generated code either because anyone who
does will have horrible nightmares about spaghetti code with very long
tentacles attached to the pasta that tries to grab them and pull them
into its mouth before consuming them with insatiable craving for humans
in its eyes. Okay you get the point: JTYPE_ is a better name for the
ugly, offensive code that can't be maintained whereas JSON_ is better
for code that can be maintained and is not ugly or offensive.
UPDATE 1: I don't plan on doing anything else but as it's still early it's possible I do. Tomorrow will be a slow day almost certainly though if nothing else we can discuss some things.
With commit 8f9e9f9621a1d4d2308776caa1609bd712e6f234 we have added code to generate JSON parse tree nodes from the fundamental JSON types:
malloc_json_conv_int()
malloc_json_conv_float()
malloc_json_conv_string()
malloc_json_conv_bool()
malloc_json_conv_null()
Each of the above interfaces are given a pointer to the beginning character of the JSON item and a length. This pointer/length interface allows one to point to text within a larger character block. The text does NOT need to be NUL but terminated.
Consider this JSON:
{ "foo" : 123 }
For the sake of this example, assume:
char *foo = "{ \"foo\" : 123 }";
/* ^-- bar == foo+10 */
char *bar = foo+10; /* point at the '1' in the above JSON string */
One may call malloc_json_conv_int() as follows:
struct json *node = NULL;
...
node = malloc_json_conv_int(bar, 3);
These malloc_json_conv_foo() functions return a pointer to a struct json, a node in the JSON parse tree.
The JSON parse tree node is NOT linked. I.e., the parent, prev, and next pointers are NULL.
This is because it will be the responsibility of the function's caller to link the JSON parse tree node into the overall parse tree.
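To make the caller's side concrete, here is a minimal sketch of linking a freshly returned, unlinked node under a parent and after a sibling. It only uses the parent, prev, and next pointers mentioned above; the struct sketch_node type and the sketch_link_after() name are hypothetical and not part of the code in the repo.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node carrying only the three link pointers discussed above. */
struct sketch_node {
    struct sketch_node *parent;
    struct sketch_node *prev;
    struct sketch_node *next;
};

/*
 * Link an unlinked node under parent, directly after the sibling prev.
 * Pass prev == NULL when the node is the first child.
 */
void
sketch_link_after(struct sketch_node *parent, struct sketch_node *prev,
                  struct sketch_node *node)
{
    assert(node != NULL);
    /* the conversion functions hand back unlinked nodes */
    assert(node->parent == NULL && node->prev == NULL && node->next == NULL);

    node->parent = parent;
    node->prev = prev;
    if (prev != NULL) {
        node->next = prev->next;
        if (prev->next != NULL) {
            prev->next->prev = node;
        }
        prev->next = node;
    }
}
```

A real parent node would probably also need some way to find its first child; that part is left out because the thread has not settled on it yet.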
The malloc_json_conv_foo() functions will never return NULL. In the extreme case of a malloc() (well actually calloc()) failure, these functions will NOT return.
For each malloc_json_conv_foo() function, there is a malloc_json_conv_foo_str() function. The malloc_json_conv_foo_str() function interface is given a pointer to a NUL terminated string:
char *bar = "123";
struct json *node = NULL;
...
node = malloc_json_conv_int(bar, NULL);
Like the malloc_json_conv_foo() functions, they will NOT return NULL. And as before, in the extreme case of a malloc() (well actually calloc()) failure, these functions will NOT return.
The malloc_json_conv_foo_str() function interface includes a pointer to a string size, instead of a string length. If that pointer is NULL (as was in the above example) no length is stored. However one can be given the length of the string as a side effect of the call as in:
char *bar = "123";
struct json *node = NULL;
size_t len;
...
node = malloc_json_conv_int(bar, &len);
One should use the malloc_json_conv_foo() functions to point to a substring inside a larger JSON block, and use the malloc_json_conv_foo_str() function interface to point to a C string that contains the JSON item in question.
Neither the malloc_json_conv_foo() functions nor the malloc_json_conv_foo_str() functions modify the text they are given. Moreover each one of them malloced a C string containing a copy of the JSON item. In each of the above examples, node->as_str points to a copy of the JSON item as a NUL terminated C string. Moreover, node->as_str_len contains the length of that NUL terminated C string.
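The copy step can be pictured with a small sketch: given a pointer into a larger block plus a length, make a NUL terminated duplicate without touching the original text. The helper name sketch_dup_span() is made up for this example, and exiting on allocation failure is an assumption about how "will NOT return" might be implemented.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return a NUL terminated copy of the len bytes starting at ptr.
 * The source text is only read, never modified. Exits on allocation failure.
 */
char *
sketch_dup_span(const char *ptr, size_t len)
{
    char *copy;

    assert(ptr != NULL);
    copy = calloc(len + 1, sizeof(char));   /* +1 for the trailing NUL */
    if (copy == NULL) {
        perror("calloc");
        exit(EXIT_FAILURE);
    }
    memcpy(copy, ptr, len);                 /* calloc zeroed the final byte */
    return copy;
}
```

With the earlier example, sketch_dup_span(foo+10, 3) would hand back a separate "123" string, which is the role node->as_str plays in the real nodes.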
Each JSON parse tree node has an important structure member: node->converted. This is a boolean that indicates if the conversion function was able to convert or not. If node->converted == false, then the conversion was NOT possible.
When node->converted == false, only as_str and as_str_len contain valid values. All other aspects of the JSON element union are invalid. Thus before you use other aspects of the JSON element union, you MUST check the node->converted value first.
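Here is a hedged sketch of the "check converted first" rule. The struct and both functions are hypothetical stand-ins, and strtoll() is used only to have something that can fail; it is more lenient than JSON (it accepts a leading + for example), so this is not how the real conversion works.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical cut-down integer item; the real struct has many more fields. */
struct sketch_integer {
    bool converted;     /* false ==> as_ll must not be used */
    long long as_ll;
};

/* Try to convert a NUL terminated string; converted reports success. */
struct sketch_integer
sketch_conv_int_str(const char *str)
{
    struct sketch_integer item = { false, 0 };
    char *end = NULL;
    long long value;

    assert(str != NULL);
    errno = 0;
    value = strtoll(str, &end, 10);
    if (errno == 0 && end != str && *end == '\0') {
        item.converted = true;
        item.as_ll = value;
    }
    return item;
}

/* Read the value only when the conversion succeeded; *out is untouched otherwise. */
bool
sketch_get_int(const struct sketch_integer *item, long long *out)
{
    assert(item != NULL && out != NULL);
    if (!item->converted) {
        return false;
    }
    *out = item->as_ll;
    return true;
}
```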
The malloc_json_conv_string() and malloc_json_conv_string_str() interfaces include a final boolean argument to indicate if the JSON string conversion should be done under "strict" rules (or NOT).
The JSON string conversion does NOT include the JSON double quotes. Thus:
struct json *node = NULL;
char *foo = "foo";
char *bar = "\"bar\"";
...
node = malloc_json_conv_string_str(foo, NULL, false); /* correct */
...
node = malloc_json_conv_string_str(bar, NULL, false); /* incorrect */
The JSON integer and floating point conversion routines will ignore leading and trailing whitespace. Normally in a JSON document, there should be NO such leading and trailing whitespace. However, given a C string, the numerical value functions ignore leading and trailing whitespace. The node->as_str will contain a malloced copy of the string with any whitespace stripped off. The orig_len will give the original length (pre-stripping) of the string, whereas as_str_len contains the length of the duplicated string that might be stripped. Thus if node->orig_len == node->as_str_len then the original JSON string containing the numeric value did NOT contain any leading and trailing whitespace.
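The relationship between orig_len and as_str_len can be sketched like this. The struct sketch_span type and sketch_strip() are hypothetical; the sketch only computes the stripped span and leaves the copying out.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical result: where the stripped text starts and both lengths. */
struct sketch_span {
    const char *start;      /* first non-whitespace byte (or end of input) */
    size_t orig_len;        /* length before stripping */
    size_t as_str_len;      /* length after stripping */
};

/* Strip leading and trailing whitespace from the len bytes at ptr. */
struct sketch_span
sketch_strip(const char *ptr, size_t len)
{
    struct sketch_span span;

    assert(ptr != NULL);
    span.orig_len = len;
    while (len > 0 && isspace((unsigned char)ptr[0])) {
        ++ptr;              /* drop one leading whitespace byte */
        --len;
    }
    while (len > 0 && isspace((unsigned char)ptr[len - 1])) {
        --len;              /* drop one trailing whitespace byte */
    }
    span.start = ptr;
    span.as_str_len = len;
    return span;
}
```

When the two lengths come back equal, nothing was stripped, which is the test described in the paragraph above.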
We plan to work on functions that form compound JSON nodes such as JSON_OBJECT, JSON_MEMBER, and JSON_ARRAY next.
With commit 3179f21b1047602756df1532300caf4506392d47 the dynamic array facility has been added.
TODO: Change read_all() to use dynamic array facility, then finally use dynamic array facility for managing compound JSON parser tree nodes (JSON_OBJECT, JSON_MEMBER, and JSON_ARRAY).
With commit 8f9e9f9 we have added code to generate JSON parse tree nodes from the fundamental JSON types:
- integer - malloc_json_conv_int()
- floating point - malloc_json_conv_float()
- string - malloc_json_conv_string()
- boolean - malloc_json_conv_bool()
- null - malloc_json_conv_null()
Each of the above interfaces are given a pointer to the beginning character of the JSON item and a length. This pointer/length interface allows one to point to text within a larger character block. The text does NOT need to be NUL but terminated.
I guess the but was accidentally added; looking at the code I can only presume so since the length is passed in as a parameter and that's used in the call to malloc. A question for you on the subject of malloc though:
Is there a reason throughout the code you use malloc instead of calloc? I always use calloc as a safety measure and although you always initialise the struct after allocation I wonder if there's a reason you don't use calloc? And in at least one function you use both I see (the string one: the only one I took a look at).
Consider this JSON:
{ "foo" : 123 }
For the sake of this example, assume:
char *foo = "{ \"foo\" : 123 }";
/* ^-- bar == foo+10 */
char *bar = foo+10; /* point at the '1' in the above JSON string */
Right. This makes sense.
One may call malloc_json_conv_int() as follows:
struct json *node = NULL;
...
node = malloc_json_conv_int(bar, 3);
Of course we can't know the length of the number like that. However in the lexer we could assign a size_t the value of strlen(yytext) so that we do have the length. But perhaps you have built it so that this is not actually necessary?
These malloc_json_conv_foo() functions return a pointer to a struct json, a node in the JSON parse tree. The JSON parse tree node is NOT linked. I.e., the parent, prev, and next pointers are NULL. This is because it will be the responsibility of the function's caller to link the JSON parse tree node into the overall parse tree.
Okay. I will have to think on how to do this. It might be obvious but not in my current state: I was awake quite late for me and awake at 0400 so I'm not awake enough. I probably won't fully process this until I start looking at the code and maybe the book. I actually do think chapter 3 will be of value: it does create a tree for the calculator mostly through bison. Of course other chapters will also be of value but I think chapter 3 will be quite valuable.
The malloc_json_conv_foo() functions will never return NULL. In the extreme case of a malloc() (well actually calloc()) failure, these functions will NOT return.
This makes sense.
For each malloc_json_conv_foo() function, there is a malloc_json_conv_foo_str() function. The malloc_json_conv_foo_str() function interface is given a pointer to a NUL terminated string:
Good to know this. I just looked at the string version and I see it's just a simplified form.
char *bar = "123";
struct json *node = NULL;
...
node = malloc_json_conv_int(bar, NULL);
Is there a mistake here? Because the function prototype looks like:
struct json *
malloc_json_conv_int(char const *str, size_t len)
Ah. Perhaps you made a typo for malloc_json_conv_int_str? That seems what it is (and below suggests it too) but want to be sure I am not missing something.
Like the malloc_json_conv_foo() functions, they will NOT return NULL. And as before, in the extreme case of a malloc() (well actually calloc()) failure, these functions will NOT return.
The malloc_json_conv_foo_str() function interface includes a pointer to a string size, instead of a string length. If that pointer is NULL (as was in the above example) no length is stored. However one can be given the length of the string as a side effect of the call as in:
char *bar = "123";
struct json *node = NULL;
size_t len;
...
node = malloc_json_conv_int(bar, &len);
One should use the malloc_json_conv_foo() functions to point to a substring inside a larger JSON block, and use the malloc_json_conv_foo_str() function interface to point to a C string that contains the JSON item in question.
Would you please give an example of each if you have any in mind?
Neither the malloc_json_conv_foo() functions nor the malloc_json_conv_foo_str() functions modify the text they are given. Moreover each one of them malloced a C string containing a copy of the JSON item. In each of the above examples, node->as_str points to a copy of the JSON item as a NUL terminated C string. Moreover, node->as_str_len contains the length of that NUL terminated C string.
Makes sense.
Each JSON parse tree node has an important structure member: node->converted. This is a boolean that indicates if the conversion function was able to convert or not. If node->converted == false, then the conversion was NOT possible.
In which case we cannot rely on the struct being valid.
When node->converted == false, only as_str and as_str_len contain valid values. All other aspects of the JSON element union are invalid. Thus before you use other aspects of the JSON element union, you MUST check the node->converted value first.
Right.
The malloc_json_conv_string() and malloc_json_conv_string_str() interfaces include a final boolean argument to indicate if the JSON string conversion should be done under "strict" rules (or NOT).
I noticed the boolean but I haven't really looked at the code to see what that means - yet.
The JSON string conversion does NOT include the JSON double quotes. Thus:
struct json *node = NULL;
char *foo = "foo";
char *bar = "\"bar\"";
...
node = malloc_json_conv_string_str(foo, NULL, false); /* correct */
...
node = malloc_json_conv_string_str(bar, NULL, false); /* incorrect */
Okay this is an interesting one then. Since JSON strings actually have the quotes (json string type) maybe we need to either have these functions strip off the outer "s or else have a function that does this. What are your thoughts here? This btw is because of the regex:
JTYPE_STRING "(\"(((?=\\)\\([\"\\\/bfnrt]|u[0-9a-fA-F]{4}))|[^\"\\\0-\x1F\x7F]+)*\")"
Of course we don't want to include the "s always since not all types have quotes so this has to be done specially for strings. Should the function(s) (One or both? Which ones if only one?) strip the outer "s? I guess that might be a good idea but only strip off the outermost ones.
The JSON integer and floating point conversion routines will ignore leading and trailing whitespace. Normally in a JSON document, there should be NO such leading and trailing whitespace. However, given a C string, the numerical value functions ignore leading and trailing whitespace. The node->as_str will contain a malloced copy of the string with any whitespace stripped off. The orig_len will give the original length (pre-stripping) of the string, whereas as_str_len contains the length of the duplicated string that might be stripped. Thus if node->orig_len == node->as_str_len then the original JSON string containing the numeric value did NOT contain any leading and trailing whitespace.
No leading whitespace because the regex will extract exactly the values? Or you mean something else? The last sentence makes sense.
With commit 3179f21 the dynamic array facility has been added.
I see there's a lot to process here. If you have any examples in mind that would be great. But I think it might be better to wait on the other helper routines and structs as some of it could change.
TODO: Change read_all() to use dynamic array facility, then finally use dynamic array facility for managing compound JSON parser tree nodes (JSON_OBJECT, JSON_MEMBER, and JSON_ARRAY).
I'll wait on this to be done before I ask anything or make any comments.
What else has to be done btw?
We have more edits along the way in dyn_alloc.c and dyn_alloc.h to better support read_all(). We then will probably expand on dyn_alloc_test to use read_all() to read into memory, the binary of itself.
Once that has been coded and tested, then we will be ready to code the JSON compound functions.
Is there a reason throughout the code you use malloc instead of calloc? I always use calloc as a safety measure and although you always initialise the struct after allocation I wonder if there's a reason you don't use calloc? And in at least one function you use both I see (the string one: the only one I took a look at).
The duplication of the JSON integer string was done using malloc() and will be changed to use calloc().
Thanks @xexyl.
Sure. There are other places that use malloc as well and maybe those should use calloc() also? In the code I've added (as I think I might have said) I use calloc() so those don't need to be converted.
Should the functions be renamed too to reflect that they use calloc() instead of malloc()?
With commit 3179f21 the dynamic array facility has been added.
I see there's a lot to process here. If you have any examples in mind that would be great. But I think it might be better to wait on the other helper routines and structs as some of it could change.
TODO: Change read_all() to use dynamic array facility, then finally use dynamic array facility for managing compound JSON parser tree nodes (JSON_OBJECT, JSON_MEMBER, and JSON_ARRAY).
I'll wait on this to be done before I ask anything or make any comments. What else has to be done btw?
We have more edits along the way in dyn_alloc.c and dyn_alloc.h to better support read_all(). We then will probably expand on dyn_alloc_test to use read_all() to read into memory, the binary of itself.
Once that has been coded and tested, then we will be ready to code the JSON compound functions.
Sounds good. I eagerly await to see what you come up with! I'll probably be a few days before I can do anything much but I'm sure each day I'll do a bit.
Would you please give an example of each if you have any in mind? ... Okay this is an interesting one then. Since JSON strings actually have the quotes (json string type) maybe we need to either have these functions strip off the outer "s or else have a function that does this. What are your thoughts here? This btw is because of the regex:
JTYPE_STRING "(\"(((?=\\)\\([\"\\\/bfnrt]|u[0-9a-fA-F]{4}))|[^\"\\\0-\x1F\x7F]+)*\")"
Of course we don't want to include the "s always since not all types have quotes so this has to be done specially for strings. Should the function(s) (One or both? Which ones if only one?) strip the outer "s? I guess that might be a good idea but only strip off the outermost ones.
The point of the example was to show that the JSON surrounding "'s should not be given to malloc_json_conv_string() nor to malloc_json_conv_string_str().
Assume this JSON document is in memory and is pointed to by ptr:
{ "foobar" : 12379 }
So you don't have to count, by hand, bytes in the above string, here are a few facts about that string:
ptr+0 refers to the {.
ptr+2 refers to the first ".
ptr+3 refers to the f.
ptr+8 refers to the r.
ptr+9 refers to the second ".
ptr+13 refers to the 1.
ptr+17 refers to the 9.
ptr+19 refers to the }.
ptr+20 refers to the NUL at the end of the JSON document placed there by read_all().
In this case you would NOT want to use malloc_json_conv_string_str() because the string in memory is NOT NUL byte terminated at ptr+9.
The conversion call you would want to make (using those above hand counted byte addresses :) ) is:
node = malloc_json_conv_string(ptr+3, 6, strict);
NOTE: The boolean strict may be true or false depending on if strict JSON is in effect.
After the above call:
node->type == JSON_STRING
node->element.string.as_str - a malloced string containing just foobar followed by a NUL byte
node->element.string.as_str_len == 6
node->element.string.converted == true
node->parent == NULL
node->next == NULL
node->prev == NULL
Had node->element.string.converted been false, then the ABOVE JSON parse tree structure fields would have been the ONLY JSON parse tree structure fields with valid values. However, because node->element.string.converted == true we know that the rest of the struct json_string fields are valid. In particular:
node->element.string.str is the decoded string foobar followed by a NUL byte
node->element.string.str_len == 6 - the length of the decoded string
node->element.string.same == true - the decoded string is the same as the encoded string
node->element.string.has_nul == false - the decoded string does NOT have a NUL byte inside it
node->element.string.slash == false - no / in the decoded string
node->element.string.posix_safe == true - decoded string is "POSIX portable safe plus +"
node->element.string.first_alphanum == true - decoded string starts with an alphanumeric character
node->element.string.upper == false - no UPPER case characters are in the decoded string
Accordingly you would NOT want to call malloc_json_conv_int_str(ptr+13, NULL) because this would try to integer encode "12379 }" which is not what you want. Instead the call you would make (using those above hand counted byte addresses :) ) is:
node = malloc_json_conv_int(ptr+13, 5);
After the above call:
node->type == JSON_INT
node->element.integer.as_str - a malloced string containing just 12379 followed by a NUL byte
node->element.integer.as_str_len == 5
node->element.integer.converted == true
node->parent == NULL
node->next == NULL
node->prev == NULL
Had node->element.integer.converted been false, then the ABOVE JSON parse tree structure fields would have been the ONLY JSON parse tree structure fields with valid values. However, because node->element.integer.converted == true we know that the rest of the struct json_integer fields are valid. In particular:
node->element.integer.orig_len == 5 - original integer string did NOT have leading nor trailing whitespace
node->element.integer.is_negative == false
node->element.integer.int8_sized == false
Because node->element.integer.int8_sized == false, we know that node->element.integer.as_int8 does NOT contain a valid value and so we ignore it.
Continuing just for a few more of the struct json_integer fields:
node->element.integer.int16_sized == true
Because node->element.integer.int16_sized == true, we know the node->element.integer.as_int16 has a valid value:
node->element.integer.as_int16 == 12379
Continuing for a bit (pun intended):
node->element.integer.int32_sized == true
node->element.integer.uint32_sized == true
node->element.integer.int64_sized == true
node->element.integer.uint64_sized == true
node->element.integer.int_sized == true
node->element.integer.uint_sized == true
node->element.integer.umaxint_sized == true
Because the above mentioned values are true, their corresponding values are valid:
node->element.integer.as_int32 == 12379
node->element.integer.as_uint32 == 12379
node->element.integer.as_int64 == 12379
node->element.integer.as_uint64 == 12379
node->element.integer.as_int == 12379
node->element.integer.as_uint == 12379
node->element.integer.as_umaxint == 12379
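One way to picture how the foo_sized flags pair with the as_foo values is the sketch below. It covers only a few widths, takes the value as an already converted intmax_t, and uses a made-up struct sketch_sizes and sketch_classify(); the real struct json_integer and its conversion code may well differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical subset of the size flags, each paired with its value. */
struct sketch_sizes {
    bool int8_sized;    int8_t as_int8;
    bool int16_sized;   int16_t as_int16;
    bool uint16_sized;  uint16_t as_uint16;
    bool int32_sized;   int32_t as_int32;
};

/* Set each flag, and its value, only when value fits in that type. */
void
sketch_classify(intmax_t value, struct sketch_sizes *out)
{
    assert(out != NULL);
    memset(out, 0, sizeof(*out));   /* all flags false to start with */

    if (value >= INT8_MIN && value <= INT8_MAX) {
        out->int8_sized = true;
        out->as_int8 = (int8_t)value;
    }
    if (value >= INT16_MIN && value <= INT16_MAX) {
        out->int16_sized = true;
        out->as_int16 = (int16_t)value;
    }
    if (value >= 0 && value <= UINT16_MAX) {
        out->uint16_sized = true;
        out->as_uint16 = (uint16_t)value;
    }
    if (value >= INT32_MIN && value <= INT32_MAX) {
        out->int32_sized = true;
        out->as_int32 = (int32_t)value;
    }
}
```

For 12379 this leaves int8_sized false and the 16 and 32 bit flags true, in line with the walk-through above.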
We hope this helps, @xexyl.
UPDATE: The above example has been expanded on and has undergone minor corrections as there is only so much you can do on an iPad. :-)
Would you please give an example of each if you have any in mind? ... Okay this is an interesting one then. Since JSON strings actually have the quotes (json string type) maybe we need to either have these functions strip off the outer "s or else have a function that does this. What are your thoughts here? This btw is because of the regex:
JTYPE_STRING "(\"(((?=\\)\\([\"\\\/bfnrt]|u[0-9a-fA-F]{4}))|[^\"\\\0-\x1F\x7F]+)*\")"
Of course we don't want to include the
"
s always since not all types have quotes so this has to be done specially for strings. Should the function(s) (One or both? Which ones if only one?) strip the outer "
s? I guess that might be a good idea but only strip off the outermost ones.
The point of the example was to show that the JSON surrounding
"
's should not be given to malloc_json_conv_string()
nor to malloc_json_conv_string_str()
.
Right but I was getting at how the regex in the scanner will include the "
s so should they be stripped off since the functions should not be passed the outer "
s?
We hope this helps, @xexyl.
I've saved the link and put it in a file for me to look at the comment later on when I have more energy. If you actually answered the above concern in the text I removed then no need to address it - though maybe a quick message saying so would be good just in case I happen to miss it when I look at it more later on. Either way I'm sure the detailed comment will be of value so thank you.
Right but I was getting at how the regex in the scanner will include the "s so should they be stripped off since the functions should not be passed the outer "s?
Yes the JSON surrounding "
's are lexical.
JSON surrounding "
's are lexical just like {
, }
, :
, [
, ]
, ,
are lexical.
Of course, the tricky bit is that not every instance of those lexical characters is lexical. If they appear inside a JSON encoded string, they are text to be decoded:
{
"foo" : "{}[]:,"
}
UPDATE: While your JTYPE_STRING does need the surrounding "
's what you provide to, say, malloc_json_conv_string()
will be one byte beyond the first surrounding "
and the length will be 2 bytes short of the matched string to NOT pass in the final surrounding "
.
I've saved the link and put it in a file for me to look at the comment later on when I have more energy. If you actually answered the above concern in the text I removed then no need to address it - though maybe a quick message saying so would be good just in case I happen to miss it when I look at it more later on. Either way I'm sure the detailed comment will be of value so thank you.
The above document has been expanded on and has undergone minor corrections as there is only so much you can do on an iPad. :-)
You might want to refetch it @xexyl.
I've saved the link and put it in a file for me to look at the comment later on when I have more energy. If you actually answered the above concern in the text I removed then no need to address it - though maybe a quick message saying so would be good just in case I happen to miss it when I look at it more later on. Either way I'm sure the detailed comment will be of value so thank you.
The above document has been expanded on and has undergone minor corrections as there is only so much you can do on an iPad. :-)
If you wrote that on the iPad I commend you! That's impressive!
You might want to refetch it @xexyl.
No worries. I just saved the link so I can open it and read it as it stands but thanks for the notice.
Right but I was getting at how the regex in the scanner will include the "s so should they be stripped off since the functions should not be passed the outer "s?
Yes the JSON surrounding
"
's are lexical.
Which means that somewhere they have to be removed so where is the right place? To be clear I'm referring to the fact that yytext
will include the outer "
s when a string is matched.
I went to test it but I do see there's a problem with the parser syntax so I can't test it. I might work on that next but it won't be today.
JSON surrounding
"
's are lexical just like {
, }
, :
, [
, ]
, ,
are lexical.
Of course, the tricky bit is that not every instance of those lexical characters is lexical. If they appear inside a JSON encoded string, they are text to be decoded:
{ "foo" : "{}[]:," }
I'll have to address this later possibly once I have the parser correct in the above mentioned way.
Should the functions be renamed too to reflect that they use calloc() instead of malloc()?
Good idea, @xexyl, commit 463ec301f54d80f91144d98e13747d8333dc8a68 just did that.
Should the functions be renamed too to reflect that they use calloc() instead of malloc()?
Good idea, @xexyl, commit 463ec30 just did that.
Thanks.
Should the functions be renamed too to reflect that they use calloc() instead of malloc()?
Good idea, @xexyl, commit 463ec30 just did that.
Thanks.
I just pushed in the current pull request a fix - you forgot to update jint.c
and jfloat.c
.
No leading whitespace because the regex will extract exactly the values? Or you mean something else? The last sentence makes sense.
String to numeric conversion functions in C such as strtol()
skip leading whitespace and stop converting at any trailing whitespace. So functions such as calloc_json_conv_int_str()
that use such functions pay attention to this fact. Not that we expect the JSON parser to pass leading and trailing whitespace around strings with numbers. This process is there just in case it happens.
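As a rough illustration of that defensive handling, here is a minimal sketch of a conversion helper that tolerates whitespace on either side of the digits. `toy_str_to_ll()` is a hypothetical name made up for this comment; it is not `calloc_json_conv_int_str()` and does not claim to match how that function is written. Only standard C library calls are used.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * toy_str_to_ll - hypothetical helper (illustration only)
 *
 * Convert a base 10 integer string that may carry leading and/or
 * trailing whitespace.
 *
 * returns:
 *      true ==> *out holds the converted value
 *      false ==> no digits, out of range, or junk after the number
 */
bool
toy_str_to_ll(const char *str, long long *out)
{
    char *end = NULL;
    long long val;

    assert(out != NULL);
    if (str == NULL) {
        return false;
    }
    errno = 0;
    val = strtoll(str, &end, 10);   /* strtoll() skips leading whitespace itself */
    if (end == str || errno == ERANGE) {
        return false;               /* no digits found, or value out of range */
    }
    while (isspace((unsigned char)*end)) {
        ++end;                      /* step over trailing whitespace */
    }
    if (*end != '\0') {
        return false;               /* something other than whitespace follows */
    }
    *out = val;
    return true;
}
```

The `end` pointer check is what separates harmless trailing whitespace from trailing junk such as `12x`.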
Sigh. I just noticed a problem with my pull request possibly depending on how it will merge.
I modified json.h
AND json.c
it appears so the function prototypes might be changed back. Not sure how good GitHub is with that. It doesn't appear to have any conflicts but not sure if it will be an actual problem or not.
No leading whitespace because the regex will extract exactly the values? Or you mean something else? The last sentence makes sense.
String to numeric conversion functions in C such as
strtol()
skip leading whitespace and stop converting at any trailing whitespace. So functions such as calloc_json_conv_int_str()
that use such functions pay attention to this fact. Not that we expect the JSON parser to pass leading and trailing whitespace around strings with numbers. This process is there just in case it happens.
Right. They do and no the parser would not pass it because of the regex as I suggested.
Sigh. I just noticed a problem with my pull request possibly depending on how it will merge.
I modified
json.h
AND json.c
it appears so the function prototypes might be changed back. Not sure how good GitHub is with that. It doesn't appear to have any conflicts but not sure if it will be an actual problem or not.
We don't see that the json.h
AND json.c
function prototypes were impacted.
Sigh. I just noticed a problem with my pull request possibly depending on how it will merge. I modified
json.h
AND json.c
it appears so the function prototypes might be changed back. Not sure how good GitHub is with that. It doesn't appear to have any conflicts but not sure if it will be an actual problem or not.
We don't see that the
json.h
AND json.c
function prototypes were impacted.
I see the same. I do have another issue to resolve but I'm not sure if that's possible. Won't know until I get the backup drive out tomorrow though it's annoying me enough that I might do it today if I have the time and patience.
Just a quick update: I'm heading off for some sleep. Tomorrow I don't have anything going on but I'm probably still going to take it fairly easy. I will probably do something here but I'm not sure what yet. I look forward to seeing what you come up with in the meantime!
Hope you have a great rest of your day my friend! Good night!
With commit 16ae660339cd3729652ce3221785042c2e2d07aa the dynamic array facility has been upgraded from the 2014 code.
Using consistent function and macro names that start with dyn_array_
Always allocate an additional guard chunk at end of array.
Renamed dyn_alloc_test to dyn_test
Renamed dyn_alloc.c to dyn_array.c
Renamed dyn_alloc.h to dyn_array.h
Renamed dyn_alloc_test.c to dyn_test.c
Renamed dyn_alloc_test.h to dyn_test.h
Renamed grow_dyn_array() to dyn_array_grow()
Renamed create_dyn_array() to dyn_array_create()
Renamed append_value() to dyn_array_append_value()
Renamed append_array() to dyn_array_append_array()
Renamed clear_dyn_array() to dyn_array_clear()
Renamed free_dyn_array() to dyn_array_free()
Added dyn_array_seek():
/*
* dyn_array_seek - set the elements in use on a dynamic array
*
* given:
* array - pointer to the dynamic array
* offset - offset in elements
* whence - SEEK_SET ==> offset from the dynamic array beginning
* SEEK_CUR ==> offset from the current elements in use
* SEEK_END ==> offset from the end of allocated elements
*
* returns:
* true ==> address of the array of elements moved during realloc()
* false ==> address of the elements array did not move
*
* Attempting to "seek" to or before the beginning of the array will have the effect
* of calling dyn_array_clear().
*
* NOTE: This function does not return on error.
*/
Added dynamic array convenience macros:
Obtain an element in a dynamic array:
struct dyn_array *array_p;
double value;
value = dyn_array_value(array_p, double, i);
Obtain the address of an element in a dynamic array:
struct dyn_array *array_p;
struct json *addr;
addr = dyn_array_addr(array_p, struct json, i);
Current element count of dynamic array:
struct dyn_array *array_p;
intmax_t pos;
pos = dyn_array_tell(array_p);
Address of the element just beyond the elements in use:
struct dyn_array *array_p;
struct json_member *next;
next = dyn_array_beyond(array_p, struct json_member);
Number of elements allocated in memory for the dynamic array:
struct dyn_array *array_p;
intmax_t size;
size = dyn_array_alloced(array_p);
Number of elements available (allocated but not in use) for the dynamic array:
struct dyn_array *array_p;
intmax_t avail;
avail = dyn_array_avail(array_p);
Rewind a dynamic array back to zero elements:
struct dyn_array *array_p;
dyn_array_rewind(array_p);
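To make the idea behind these convenience macros concrete, here is a toy sketch of how macros of this kind could be written over a simple growable array. Everything in it is an assumption for illustration: `struct toy_array`, its field names, the `toy_array_*` macros and `toy_append_double()` are invented here, and the real `struct dyn_array` and `dyn_array_*` macros in dyn_array.h may well be laid out differently.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical toy layout; the real struct dyn_array may differ. */
struct toy_array {
    size_t elm_size;        /* size of one element in bytes */
    intmax_t count;         /* number of elements in use */
    intmax_t allocated;     /* number of elements allocated */
    void *data;             /* element storage */
};

/* element i as an lvalue of the given type */
#define toy_array_value(ap, type, i) (((type *)((ap)->data))[(i)])
/* address of element i */
#define toy_array_addr(ap, type, i) (((type *)((ap)->data)) + (i))
/* current element count */
#define toy_array_tell(ap) ((ap)->count)
/* address just beyond the elements in use */
#define toy_array_beyond(ap, type) (((type *)((ap)->data)) + (ap)->count)
/* elements allocated in memory */
#define toy_array_alloced(ap) ((ap)->allocated)
/* elements allocated but not in use */
#define toy_array_avail(ap) ((ap)->allocated - (ap)->count)

/*
 * Append one double, doubling the storage when it is full.
 * Returns 0 on success, -1 if realloc() fails.
 */
int
toy_append_double(struct toy_array *ap, double v)
{
    assert(ap != NULL && ap->elm_size == sizeof(double));
    if (ap->count >= ap->allocated) {
        intmax_t new_alloc = (ap->allocated > 0) ? ap->allocated * 2 : 4;
        void *p = realloc(ap->data, (size_t)new_alloc * ap->elm_size);

        if (p == NULL) {
            return -1;
        }
        ap->data = p;       /* NOTE: the data address may have moved */
        ap->allocated = new_alloc;
    }
    toy_array_value(ap, double, ap->count) = v;
    ap->count++;
    return 0;
}
```

Because `realloc()` may move the storage, an address obtained from a macro like `toy_array_addr()` should be refetched after any append, which is presumably why `dyn_array_seek()` reports whether the elements moved.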
TODO: Change read_all() to use the dynamic array facility.
TODO: Use dynamic array facility to help build compound JSON parse nodes
With commit 987579eabae7688637dc2611fd92e0665cdda338 the read_all() now uses the dynamic array facility.
TODO: Use dynamic array facility to help build compound JSON parse nodes ...
... after sleep and after writing a presentation for the Cal State University math department that is due soon.
Yes the JSON surrounding
"
's are lexical.
Which means that somewhere they have to be removed so where is the right place? To be clear I'm referring to the fact that
yytext
will include the outer "
s when a string is matched.
I went to test it but I do see there's a problem with the parser syntax so I can't test it. I might work on that next but it won't be today.
The problem appears to be the regex for strings and I'm not sure where it goes wrong. I'm too tired to even think about it much and certainly cannot solve it now but it's evident that it's the problem and that's why I'm getting the unexpected tokens.
I have in the scanner to print each type and value before returning the token type and string: ...
is not shown. In the input file:
{
"foo" : "bar"
}
I get:
"foo"
equals/colon: ':'
...
but "foo"
should be prefixed by string:
.
Just thought you'd like an update on what is going on. Still not sure why and probably won't work on it today but this should at least be what's addressed next on the way of the scanner/parser.
I did add a bit of output in the scanner and parser but I've not committed that and I certainly won't until I'm more awake (if at all today). I think and hope I fixed the fork but I'm not sure of that until I do a pull after you merge my pull request and I won't be satisfied with it until several are done and all is okay.
EDIT: I'll look at your other comments later on today or in the next day or two.
UPDATE 0: I don't expect that I'll be able to read the comments above today but please do keep me updated on the features you add; I'll read them when I can in the coming days. I also don't plan on doing anything else today with the pull request. It's possible I do but I don't plan on it. Hope you're having a nice sleep and I hope the maths things go well!
UPDATE 1: I don't expect to do any other updates aside from the commit I just pushed with some typo fixes, that is!
With commit 16ae660 the dynamic array facility has been upgraded from the 2014 code.
Checkpoint dynamic array v1.4 2022-04-17
Using consistent function and macro names that start with dyn_array_ [...]
This looks good. I'll let you know if I have any thoughts/questions on it when I actually look at the code. I did make some typo fixes in the header file which I'll commit next time I work on the repo (also made a typo fix in jtype.l
) but that might not be today.
Thanks for writing this and giving the thorough comments and examples.
TODO: Use dynamic array facility to help build compound JSON parse nodes
We are getting ready to do this in a day or so.
TODO: Use dynamic array facility to help build compound JSON parse nodes
We are getting ready to do this in a day or so.
Sounds good.
EDIT: Correction. The pending comment was on something else which I just posted.
Should the functions be renamed too to reflect that they use calloc() instead of malloc()?
Good idea, @xexyl, commit 463ec30 just did that.
Thanks.
I just pushed in the current pull request a fix - you forgot to update
jint.c
and jfloat.c
.
I believe that there are other functions that now use calloc()
that could be renamed to have the prefix calloc_
. Not sure of the comments about the malloced
though. I added that word to my vim spell file but I've often wondered about changing the word to allocated. Similarly I've thought to change 'zeroise' to something else. In some cases it could be 'clear' or 'cleared' but in other cases the sentence might have to be changed somewhat. For the boolean in the dynamic array facility it could be zeroed
or cleared
or empty
or something like that.
What do you think? These are only cosmetic changes but might be worth considering on grounds of making it cleaner.
The problem appears to be the regex for strings and I'm not sure where it goes wrong. I'm too tired to even think about it much and certainly cannot solve it now but it's evident that it's the problem and that's why I'm getting the unexpected tokens.
I was hoping to tackle this today but I didn't get very far. Hopefully I can look at it more in the coming days. I'm kind of wondering about changing it to the more simplistic form which doesn't have all the rules of JSON so that at least the parser can be tested for normal cases as right now it won't proceed due to this regex not being correct.
I won't worry about that today but what do you think about this? Or if you have any thoughts on why it might not be correct that would be even better. Your perl grep tool seems to work okay with it but it doesn't with flex.
UPDATE: While your JTYPE_STRING does need the surrounding
"
's what you provide to, say, malloc_json_conv_string()
will be one byte beyond the first surrounding "
and the length will be 2 bytes short of the matched string to NOT pass in the final surrounding "
.
Are you saying that the "
s do not need to be stripped prior to passing it to that function? That would be helpful.
EDIT: I'm aware that there are still some comments I've yet to address but those will have to be done another day. I hope that day will be sometime this week.
Similarly I've thought to change 'zeroise' to something else. In some cases it could be 'clear' or 'cleared' but in other cases the sentence might have to be changed somewhat. For the boolean in the dynamic array facility it could be
zeroed
or cleared
or empty
or something like that.
What do you think? These are only cosmetic changes but might be worth considering on grounds of making it cleaner.
There is a good reason why the dynamic array facility does not automatically zeroize data. It is used in other applications where multi-terabyte arrays are being managed. Running calloc()
or using memset()
to zeroize sections of memory causes massive churn in the memory / VM system.
So the dynamic array facility won't zeroize by default.
TODO: We need to add macros to dyn_alloc.h
for backward compatibility, BTW.
Clearing a dynamic array has a very different meaning than zeroing data. It means to remove accumulated data in a dynamic array, with an optional zeroize if the dynamic array was initially set up with that mode enabled.
We know that zeroize is not in some dictionaries, however zeroize is a perfectly cromulent word. :-)
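The difference between clearing and zeroizing can be shown with a very small sketch. The names here (`struct toy_buf`, `toy_buf_clear()`) are hypothetical and exist only for this comment; the sketch is not the dynamic array code, it just separates the two ideas: clearing resets the in-use count, while touching the memory happens only when the zeroize mode was requested.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy buffer over caller supplied storage (illustration only). */
struct toy_buf {
    bool zeroize;           /* true ==> wipe the storage when clearing */
    size_t count;           /* bytes in use */
    size_t allocated;       /* bytes available in data */
    unsigned char *data;    /* storage */
};

/*
 * toy_buf_clear - forget the accumulated data
 *
 * The memory is only written to when zeroize mode is set, so clearing
 * a huge buffer without zeroize costs no memory traffic at all.
 */
void
toy_buf_clear(struct toy_buf *bp)
{
    assert(bp != NULL && bp->data != NULL);
    if (bp->zeroize) {
        memset(bp->data, 0, bp->allocated);
    }
    bp->count = 0;
}
```

In the non-zeroize case the old bytes are still sitting in memory after the clear; they are simply no longer counted as in use.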
UPDATE: While your JTYPE_STRING does need the surrounding
"
's what you provide to, say, malloc_json_conv_string()
will be one byte beyond the first surrounding "
and the length will be 2 bytes short of the matched string to NOT pass in the final surrounding "
.
Are you saying that the
"
s do not need to be stripped prior to passing it to that function? That would be helpful.
EDIT: I'm aware that there are still some comments I've yet to address but those will have to be done another day. I hope that day will be sometime this week.
We recommend that a wrapper function be given the "JSON/tencoded/tstring\n"
(with a length that includes both "
's): a block of memory that includes the "
's AND that the wrapper function verify that the first and last characters are indeed "
AND then calls the malloc_json_conv_string()
function with an arg that points one byte beyond the 1st "
and a length that is -2
of the original length.
This way you don't have to pre-strip "
's in your regexp.
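A minimal sketch of the checking half of such a wrapper is below. `toy_strip_quotes()` is a hypothetical name, and the sketch deliberately stops short of calling `malloc_json_conv_string()` since that function's exact signature is not shown in this thread; it only verifies the surrounding quotes and computes the pointer and length that would be handed on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * toy_strip_quotes - hypothetical helper (illustration only)
 *
 * given:
 *      str       - matched token, including both surrounding "s
 *      len       - length of str in bytes, including both "s
 *      inner     - set to point one byte beyond the first "
 *      inner_len - set to len - 2
 *
 * returns:
 *      true ==> both surrounding "s were found, *inner and *inner_len set
 *      false ==> str was NULL, too short, or not surrounded by "s
 */
bool
toy_strip_quotes(const char *str, size_t len, const char **inner, size_t *inner_len)
{
    assert(inner != NULL && inner_len != NULL);
    if (str == NULL || len < 2) {
        return false;       /* no room for two quotes */
    }
    if (str[0] != '"' || str[len - 1] != '"') {
        return false;       /* not a quoted string token */
    }
    *inner = str + 1;       /* skip the opening quote */
    *inner_len = len - 2;   /* drop both quotes from the length */
    return true;
}
```

In a flex action the inputs would presumably be `yytext` and `yyleng`, and nothing is copied or modified: the outputs just describe a window inside the matched text.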
Similarly I've thought to change 'zeroise' to something else. In some cases it could be 'clear' or 'cleared' but in other cases the sentence might have to be changed somewhat. For the boolean in the dynamic array facility it could be
zeroed
or cleared
or empty
or something like that. What do you think? These are only cosmetic changes but might be worth considering on grounds of making it cleaner.
There is a good reason why the dynamic array facility does not automatically zeroize data. It is used in other applications where multi-terabyte arrays are being managed. Running
calloc()
or using memset()
to zeroize sections of memory causes massive churn in the memory / VM system.
Sorry. I meant clearer. I was talking about the word only.
So the dynamic array facility won't zeroize by default.
TODO: We need to add macros to
dyn_alloc.h
for backward compatibility, BTW.
Sounds good.
Clearing a dynamic array has a very different meaning than zeroing data. It means to remove accumulated data in a dynamic array, with an optional zeroize if that was how the dynamic array was setup initially with that mode enabled.
Good point. But still could be zeroed
or something like that. I'm only talking about using words that are in the dictionary. Or is zeroi[sz]e
actually a computer term that I'm unfamiliar with? I've seen different ways of saying the same thing but never this one.
We know that zeroize is not in some dictionaries, however zeroize is a perfectly cromulent word. :-)
And that's what I'm getting at exactly. As you're sure to know I don't use Merriam Webster - I use OED and if I didn't use that I would use another British English one - so I've never heard of it. More to the point though I was thinking of this for vim spelling more than anything else - though other reasons apply as well.
UPDATE: While your JTYPE_STRING does need the surrounding
"
's what you provide to, say, malloc_json_conv_string()
will be one byte beyond the first surrounding "
and the length will be 2 bytes short of the matched string to NOT pass in the final surrounding "
.
Are you saying that the
"
s do not need to be stripped prior to passing it to that function? That would be helpful. EDIT: I'm aware that there are still some comments I've yet to address but those will have to be done another day. I hope that day will be sometime this week.
We recommend that a wrapper function be given the
"JSON/tencoded/tstring\n"
(with a length that includes both "
's): a block of memory that includes the "
's AND that the wrapper function verify that the first and last characters are indeed "
AND then calls the malloc_json_conv_string()
function with an arg that points one byte beyond the 1st "
and a length that is -2
of the original length.
Is there a reason to not have the function do it itself though? Or have a boolean that if true does it? That would seem like the cleaner approach to me but that's without knowing the full purpose (and all the uses) of the functions as well as only briefly looking at the function a while back.
To be clear: I mean have the function strip them off.
This way you don't have to pre-strip
"
's in your regexp.
Right. Though there still exists the problem of the regex not being correct.
EDIT: Of course the wrapper function you have seems practical as well if there's a reason to not have it built into the other function.
Which would you prefer, adding a strip
boolean argument to the JSON string conversion functions, or using a wrapper function? Probably adding the strip
boolean as a final argument to those functions would be cleaner.
Which would you prefer, adding a
strip
boolean argument to the JSON string conversion functions, or use a wrapper function? Probably adding the strip
boolean as a final argument to those functions would be cleaner.
I think it would be cleaner to add it to the function as well yes.
Most dictionaries lag behind common usage: they attempt to describe (instead of prescribe) the language on the date of their publication.
So not having zeroize, i.e., not having a definition for a term of art such as zeroize, is not surprising. :-)
As I said in the other thread I am also just typing on my phone and I am about to go get some sleep but I can answer your questions tomorrow @lcn2.
As I also said I will be gone most of Saturday and probably will take a few days to recover.
Finally feel free to edit the title and this message or else let me know what you think is a good start.
We can then discuss what I have so far and how we should proceed with the parser.
I will say quickly before I go that I have kind of held back a bit to see where the structs (or the functions acting on them) go as I think it will be helpful to have something more to work on. Once the structs are sorted it should be easier to actually write the rules and actions in the lexer and parser.
I am not sure but there very possibly is missing grammar too.
Hope this is a good start for the issue. Have a good rest of your day and look forward to seeing more here my friend!
Have a safe trip home when you go back.
TODO: