Closed lcn2 closed 2 years ago
You will probably want to read the entire file into memory.
Yes seeing the multiple lines made me think of this indeed. That would change the way the parsing is done but that's okay. I also actually started to wonder if it could be done on more than one line - hours after I wrote the above. That complicates it and it will require reworking the parsing loop possibly.
BTW: We need this read whole file into memory function_ to complete the
jstrencode
andjstrdecode
tools, so expect something inutil.c
soon.
I'll check it out when you've added it.
Then use
strtok_r(3)
to break up the buffer into non-whitespace separated blobs and begin to break apart those blocks into JSON tokens, keep in mind that valid JSON can do this:{"confusion":" { \"hello\" : 123 } , "},
Perhaps this is your approach with your proposed
strscmp
function?
'confusion' is definitely the right word! :) I'll have to have a bit more energy and clear-headedness before I even try parsing that one (actually looking at it again I see that you're just escaping the "
but still I can barely see that right now!). As such I cannot answer your question of whether or not that's what I had in mind. But since as you point out the file needs to be read into memory I'll have to rethink it - and better to do it after you finish that part of the code.
Looking forward to seeing what you come up with. I'll go from there.
Added jstrencode
and jstrdecode
in order to test the new read_all()
function as well as to further test the JSON string encoding and decoding process.
The make test
now runs j-test.sh
to perform some tests on JSON encoding and decoding.
Added
jstrencode
andjstrdecode
in order to test the newread_all()
function as well as to further test the JSON string encoding and decoding process.The
make test
now runsj-test.sh
to perform some tests on JSON encoding and decoding.
Great. I was also thinking that maybe there could be a screwy (format wise) .author.json
and .info.json
file in the tests to verify that the tools (when they're completed - for now they would pass) work right in parsing messed up formats.
As for my last commit I'm unsure why it keeps adding the following commit:
* Readd jauthchk.1 and jinfochk.1
When fixing the previous commits I forgot to do a git add for the two
man pages so they were not actually committed.
...so I think I'll delete my fork and re-fork to hopefully solve that problem. I'm a bit more alert today so hopefully I can do some work. I think I'll start out (after making the new fork) with moving the data for txzchk.c
into a header file for it: since there are several structs. That should make it just a little bit cleaner.
After that I'll figure out what I want to do. Hopefully I can work more on the tools today but we shall see.
Added
jstrencode
andjstrdecode
in order to test the newread_all()
function as well as to further test the JSON string encoding and decoding process.The
make test
now runsj-test.sh
to perform some tests on JSON encoding and decoding.
I've not looked at this yet though I did run make test
. Hopefully when I do look at it it'll give me some ideas on how to parse the JSON files. I might do that later this morning but it might not be until tomorrow as I have a doctor appointment late morning and it's a distance away and I'm already tired (I don't drive for the reason of sleep problems so no worries there!). Anyway I thought I said this (and maybe I did in another thread):
I had the thought that it might be a good idea to come up with some messed up (and maybe obfuscated) json files to test the new tools - when it gets to that point. It could then be in the make test
invocation. Perhaps when I've read the specifications a bit I can do this (and maybe come up with some that don't follow it to test those too).
I have one more thing to write about here and then I'll be done here for now.
As for the
json.c
compiling under CentOS triggers a bunch of warnings. Right now I don't feel alert enough to fix them so I'm just going to paste the output of the compiler here so you can see. If you don't fix them I'll hopefully get to it later. I believe though that I'm going to take a break (not that I've done much but I mean a break from thinking) and then maybe I'll return to it. If not I should be able to tomorrow or the next day (after tonight I should get more sleep again).
As far as this the current list of warnings are now:
cc -D_BSD_SOURCE -D_POSIX_C_SOURCE=200809L -D_XOPEN_SOURCE -std=c99 -O3 -g3 -pedantic -Wall -Wextra -c -o json.o json.c
json.c: In function 'jencchk':
json.c:490:2: warning: format '%x' expects argument of type 'unsigned int *', but argument 3 has type 'int *' [-Wformat=]
ret = sscanf(jenc[i].enc, "\\u%04x%c", &hexval, &guard);
^
json.c:607:2: warning: format '%x' expects argument of type 'unsigned int *', but argument 3 has type 'int *' [-Wformat=]
ret = sscanf(jenc[i].enc, "\\u%04x%c", &hexval, &guard);
^
json.c:829:2: warning: format '%x' expects argument of type 'unsigned int *', but argument 3 has type 'int *' [-Wformat=]
ret = sscanf(jenc[i].enc, "\\u%04x%c", &hexval, &guard);
^
json.c: In function 'json_putc':
json.c:983:5: warning: comparison is always false due to limited range of data type [-Wtype-limits]
if (c < 0 || c > 0xff) {
^
json.c:983:5: warning: comparison is always false due to limited range of data type [-Wtype-limits]
json.c: In function 'jencchk':
json.c:852:32: warning: array subscript is above array bounds [-Warray-bounds]
mlen, (unsigned long)jenc[i].len);
On the warning: format '%x' expects argument of type 'unsigned int *', but argument 3 has type 'int *'
I could change it but I'm not sure of the implications: either changing the int to be unsigned or casting it; thus I won't change it.
The one that seems like it might be more of a problem is:
json.c:852:32: warning: array subscript is above array bounds [-Warray-bounds]
mlen, (unsigned long)jenc[i].len);
^
But it's a lot of code to go through to understand so I'll leave that to you to fix too. As for the always false test I'm not sure what to do either:
json.c: In function 'json_putc':
json.c:983:5: warning: comparison is always false due to limited range of data type [-Wtype-limits]
if (c < 0 || c > 0xff) {
^
It's unsigned 8-bit int and so < 0
is impossible and the same applies to > 0xff
so I'm not sure what to change it to: I don't know the implications of changing it to something else.
As for the
json.c
compiling under CentOS triggers a bunch of warnings. Right now I don't feel alert enough to fix them so I'm just going to paste the output of the compiler here so you can see. If you don't fix them I'll hopefully get to it later. I believe though that I'm going to take a break (not that I've done much but I mean a break from thinking) and then maybe I'll return to it. If not I should be able to tomorrow or the next day (after tonight I should get more sleep again).As far as this the current list of warnings are now:
cc -D_BSD_SOURCE -D_POSIX_C_SOURCE=200809L -D_XOPEN_SOURCE -std=c99 -O3 -g3 -pedantic -Wall -Wextra -c -o json.o json.c json.c: In function 'jencchk': json.c:490:2: warning: format '%x' expects argument of type 'unsigned int *', but argument 3 has type 'int *' [-Wformat=] ret = sscanf(jenc[i].enc, "\\u%04x%c", &hexval, &guard); ^ json.c:607:2: warning: format '%x' expects argument of type 'unsigned int *', but argument 3 has type 'int *' [-Wformat=] ret = sscanf(jenc[i].enc, "\\u%04x%c", &hexval, &guard); ^ json.c:829:2: warning: format '%x' expects argument of type 'unsigned int *', but argument 3 has type 'int *' [-Wformat=] ret = sscanf(jenc[i].enc, "\\u%04x%c", &hexval, &guard); ^ json.c: In function 'json_putc': json.c:983:5: warning: comparison is always false due to limited range of data type [-Wtype-limits] if (c < 0 || c > 0xff) { ^ json.c:983:5: warning: comparison is always false due to limited range of data type [-Wtype-limits] json.c: In function 'jencchk': json.c:852:32: warning: array subscript is above array bounds [-Warray-bounds] mlen, (unsigned long)jenc[i].len);
On the
warning: format '%x' expects argument of type 'unsigned int *', but argument 3 has type 'int *'
I could change it but I'm not sure of the implications: either changing the int to be unsigned or casting it; thus I won't change it.The one that seems like it might be more of a problem is:
json.c:852:32: warning: array subscript is above array bounds [-Warray-bounds] mlen, (unsigned long)jenc[i].len); ^
But it's a lot of code to go through to understand so I'll leave that to you to fix too. As for the always false test I'm not sure what to do either:
json.c: In function 'json_putc': json.c:983:5: warning: comparison is always false due to limited range of data type [-Wtype-limits] if (c < 0 || c > 0xff) { ^
It's unsigned 8-bit int and so
< 0
is impossible and the same applies to> 0xff
so I'm not sure what to change it to: I don't know the implications of changing it to something else.
Thanks @xexyl. Fixed in commit 0a6188c6c979f404f96c8b5eaaf7a853f0c91c98.
As for the
json.c
compiling under CentOS triggers a bunch of warnings. Right now I don't feel alert enough to fix them so I'm just going to paste the output of the compiler here so you can see. If you don't fix them I'll hopefully get to it later. I believe though that I'm going to take a break (not that I've done much but I mean a break from thinking) and then maybe I'll return to it. If not I should be able to tomorrow or the next day (after tonight I should get more sleep again).As far as this the current list of warnings are now:
cc -D_BSD_SOURCE -D_POSIX_C_SOURCE=200809L -D_XOPEN_SOURCE -std=c99 -O3 -g3 -pedantic -Wall -Wextra -c -o json.o json.c json.c: In function 'jencchk': json.c:490:2: warning: format '%x' expects argument of type 'unsigned int *', but argument 3 has type 'int *' [-Wformat=] ret = sscanf(jenc[i].enc, "\\u%04x%c", &hexval, &guard); ^ json.c:607:2: warning: format '%x' expects argument of type 'unsigned int *', but argument 3 has type 'int *' [-Wformat=] ret = sscanf(jenc[i].enc, "\\u%04x%c", &hexval, &guard); ^ json.c:829:2: warning: format '%x' expects argument of type 'unsigned int *', but argument 3 has type 'int *' [-Wformat=] ret = sscanf(jenc[i].enc, "\\u%04x%c", &hexval, &guard); ^ json.c: In function 'json_putc': json.c:983:5: warning: comparison is always false due to limited range of data type [-Wtype-limits] if (c < 0 || c > 0xff) { ^ json.c:983:5: warning: comparison is always false due to limited range of data type [-Wtype-limits] json.c: In function 'jencchk': json.c:852:32: warning: array subscript is above array bounds [-Warray-bounds] mlen, (unsigned long)jenc[i].len);
On the
warning: format '%x' expects argument of type 'unsigned int *', but argument 3 has type 'int *'
I could change it but I'm not sure of the implications: either changing the int to be unsigned or casting it; thus I won't change it. The one that seems like it might be more of a problem is:json.c:852:32: warning: array subscript is above array bounds [-Warray-bounds] mlen, (unsigned long)jenc[i].len); ^
But it's a lot of code to go through to understand so I'll leave that to you to fix too. As for the always false test I'm not sure what to do either:
json.c: In function 'json_putc': json.c:983:5: warning: comparison is always false due to limited range of data type [-Wtype-limits] if (c < 0 || c > 0xff) { ^
It's unsigned 8-bit int and so
< 0
is impossible and the same applies to> 0xff
so I'm not sure what to change it to: I don't know the implications of changing it to something else.Thanks @xexyl. Fixed in commit 0a6188c.
Thanks as well.
I am done for the day but I hope tomorrow I can get back into working on the tools. Have a good afternoon!
[...]
Thanks @xexyl. Fixed in commit 0a6188c.
Just pulled under CentOS - all good.
The link to the seqcexit repo now works. For some reason the repo was previously marked as private.
See this important comment about use of strict_mode
and the -s
flag.
See this important comment about use of
strict_mode
and the-s
flag.
Yep. All good on that part - and as you'll see I had the same idea.
Okay so here are my comments for this tool. After that I think I'm done with code for the day.
I just realised that I erroneously removed quotes from null
so if the submitter happened to have it be a string in their file the tool would not pick up on it. This will have to be addressed at a later time as I'm too tired to fix it now; but I'm noting it because I don't want to forget.
On that note: is there any problem with the way I've used strcmp()
in the most recent pull request? I'm not sure how strict you want to be. For example in the below:
" mkiocccentry_version" : " 0.35 2022-02-07 ",
Is there a problem with the space right after the quote in"mkiocccentry_version"
or before and after the version in the quoted string? The requirements above do suggest it should match and by that I think of strcmp()==0
but maybe this is not the case? Looking again (and I thought of it at the time of writing the code) I start to wonder if maybe it has to be done differently; perhaps strstr(3) is better than strcmp(3)? That would be easy enough to fix of course and I will do if you need that (or if you have a better idea then I can do that instead).
You'll see in the other thread and the commit how I currently have it so I'll leave it at that.
Now as for the json decoding I'm trying to remember the purpose of it here. Is it because of the escaped characters? What should I look for in the URLs for example? Ah, perhaps it goes like this:
Is that the idea? I think that's it but I want to be sure as I am a bit vague on this atm.
I'll work out the details of parsing the arrays later on. The other stuff I've not parsed yet I talked about in the other thread so I'll leave it to there.
More tomorrow at the latest!
I just realised that I erroneously removed quotes from
null
so if the submitter happened to have it be a string in their file the tool would not pick up on it.
Yes, if someone were to have an affiliation of:
null
The proper JSON would be:
"affiliation" : "null",
However if someone were to decline to give an affiliation zsay they want their affiliation to be anonymous), then it would be:
"affiliation" : null,
And that is why in JSON "null" != "" != null
. :-)
I just realised that I erroneously removed quotes from
null
so if the submitter happened to have it be a string in their file the tool would not pick up on it.Yes, if someone were to have an affiliation of:
null
The proper JSON would be:
"affiliation" : "null",
However if someone were to decline to give an affiliation zsay they want their affiliation to be anonymous), then it would be:
"affiliation" : null,
And that is why in JSON
"null" != "" != null
. :-)
That makes sense yes. So when these tools are being tested more we definitely have to find all possible cases where quotes were removed and should not be or were not removed and should be.
Maybe there should be some test files for this very detail.
I just realised that I erroneously removed quotes from
null
so if the submitter happened to have it be a string in their file the tool would not pick up on it.Yes, if someone were to have an affiliation of:
null
The proper JSON would be:
"affiliation" : "null",
However if someone were to decline to give an affiliation zsay they want their affiliation to be anonymous), then it would be:
"affiliation" : null,
And that is why in JSON
"null" != "" != null
. :-)That makes sense yes. So when these tools are being tested more we definitely have to find all possible cases where quotes were removed and should not be or were not removed and should be.
Maybe there should be some test files for this very detail.
In strict
mode, things such as:
should never have a value of the empty string ""
. The mkiocccentrty
is supposed to see the user just pressing return as wishing to remain anonymous on that item and the JSON value null
should have been written.
OK:
"email" : null
Not OK in strict mode:
"email" : ""
I just realised that I erroneously removed quotes from
null
so if the submitter happened to have it be a string in their file the tool would not pick up on it.Yes, if someone were to have an affiliation of:
null
The proper JSON would be:
"affiliation" : "null",
However if someone were to decline to give an affiliation zsay they want their affiliation to be anonymous), then it would be:
"affiliation" : null,
And that is why in JSON
"null" != "" != null
. :-)That makes sense yes. So when these tools are being tested more we definitely have to find all possible cases where quotes were removed and should not be or were not removed and should be. Maybe there should be some test files for this very detail.
In
strict
mode, things such as:
- url
- github
- affiliation
should never have a value of the empty string
""
. Themkiocccentrty
is supposed to see the user just pressing return as wishing to remain anonymous on that item and the JSON valuenull
should have been written.
In this case the help for the -s
option should be reworded so that it doesn't just refer to the spaces with he {
and }
. Perhaps it could just say:
-s strict mode: more restrictions on what is allowed (def: in mkiocccentry true; otherwise false)
?
OK:
"email" : null
Not OK in strict mode:
"email" : ""
Thanks for clarifying this detail. But it's also true that:
"email" : ""
Is never okay in any instance, right?
In this case the help for the
-s
option should be reworded so that it doesn't just refer to the spaces with he{
and}
. Perhaps it could just say:-s strict mode: more restrictions on what is allowed (def: in mkiocccentry true; otherwise false)
?
OK:
"email" : null
Not OK in strict mode:
"email" : ""
Thanks for clarifying this detail. But it's also true that:
"email" : ""
Is never okay in any instance, right?
Yes. An Email address must have an @
INSIDE it somewhere, so ""
(which should probably have been null
) is invalid.
There are some JSON test case files that you need to add for this stuff as these are invalid values:
And there is a BUG
in get_author_info()
in mkiocccentry
that needs to be fixed. More than one @
is also invalid:
In this case the help for the
-s
option should be reworded so that it doesn't just refer to the spaces with he{
and}
. Perhaps it could just say:-s strict mode: more restrictions on what is allowed (def: in mkiocccentry true; otherwise false)
?
Perhaps:
-s strict mode: be more strict on what is allowed (def: not strict)
?
In this case the help for the
-s
option should be reworded so that it doesn't just refer to the spaces with he{
and}
. Perhaps it could just say:-s strict mode: more restrictions on what is allowed (def: in mkiocccentry true; otherwise false)
?
OK:
"email" : null
Not OK in strict mode:
"email" : ""
Thanks for clarifying this detail. But it's also true that:
"email" : ""
Is never okay in any instance, right?
Yes. An Email address must have an
@
INSIDE it somewhere, so""
(which should probably have beennull
) is invalid.
I also wondered if the mkiocccentry
should check for a domain: not too detailed but just so that someone can't put in luser@luser
. Then the jauthchk
tool could also check this.
There are some JSON test case files that you need to add for this stuff as these are invalid values:
- ""
- "@"
- "foo@"
- "@bar"
Thanks for the note. This is a good idea. I'd add it now but like I said I'm much too tired for it - been awake far too many hours for me. Maybe it belongs with the above idea and the bug fix?
And there is a
BUG
inget_author_info()
inmkiocccentry
that needs to be fixed. More than one@
is also invalid:
- "foo@@bar"
- "foo@bar@baz"
Is that what the bug is? Because I could have sworn I saw the check that only one @
is allowed.
In this case the help for the
-s
option should be reworded so that it doesn't just refer to the spaces with he{
and}
. Perhaps it could just say:-s strict mode: more restrictions on what is allowed (def: in mkiocccentry true; otherwise false)
?
Perhaps:
-s strict mode: be more strict on what is allowed (def: not strict)
?
That works good yes. I'll add it to the usage message files now so I don't forget. But I won't work on the actual code until later.
And there is a
BUG
inget_author_info()
inmkiocccentry
that needs to be fixed. More than one@
is also invalid:
- "foo@@bar"
- "foo@bar@baz"
Is that what the bug is? Because I could have sworn I saw the check that only one
@
is allowed.
You are correct @xexyl. Well the bug was in the comment that has now been fixed in commit 456b79637587f70090f44585dbe4dfe956da12e7 :-)
And there is a
BUG
inget_author_info()
inmkiocccentry
that needs to be fixed. More than one@
is also invalid:
- "foo@@bar"
- "foo@bar@baz"
Is that what the bug is? Because I could have sworn I saw the check that only one
@
is allowed.You are correct @xexyl. Well the bug was in the comment that has now been fixed in commit 456b796 :-)
Ah okay. Thanks.
Except for the few `all. test cases, we have not (yet) added any 'author.json.' files under test_JSON.
Feel free to add some as we work on 'info.json.*' test cases.
Also: If you encounter a good/*
or strict-good/*
file that seems to be in error,
OR If you encounter a bad/*
or strict-bad/*
file that seems OK, let us know.
It is possible we made mistake, or there is something that didn't show up in a specification.
I thought of some additional checks that maybe should be added to both jauthchk
as well as mkiocccentry
. I'd be happy to do that if you like on the latter too though perhaps after the other tools are finished. Let me know what you think of these:
Should the twitter handles be unique if there's more than one author? Some people do share twitter handles (as of course you know given that you the judges do that as well) but what about submitters?
And although it's highly unlikely it's still possible that two people with the same name could be working together on an entry so I think that probably names should not be restricted to uniqueness too.
But perhaps email should be checked for uniqueness? I'm not sure: people can also share emails after all. And what about GitHub handles? URLs people could also share.
What are your thoughts?
And should URL and emails check for actual domains? See example comment above on that point. To use the technical complete regex for email addresses would be ridiculous (for it is ridiculous!) but more tests could be added. Similarly we could have that for URLs - both in mkiocccentry
and jauthchk
.
What is your preference on these points?
And now I really am shutting the laptop down before anything else comes up! Good night.
Should the twitter handles be unique if there's more than one author? Some people do share twitter handles (as of course you know given that you the judges do that as well) but what about submitters?
How about in strict == true
mode, the should be unique, otherwise for strict == false
they can be the same.
But perhaps email should be checked for uniqueness? I'm not sure: people can also share emails after all. And what about GitHub handles? URLs people could also share.
What are your thoughts?
There isn't a problem if people have an Email address for the object, or use the same 'GitHub' account, so no they should not be checked if they are unique.
And should URL and emails check for actual domains? See example comment above on that point. To use the technical complete regex for email addresses would be ridiculous (for it is ridiculous!) but more tests could be added. Similarly we could have that for URLs - both in
mkiocccentry
andjauthchk
.What is your preference on these points?
Checking for valid Email is well beyond the scope of this tool. We do not recommend checking for valid Email addresses, nor if the twitter account is valid, no if the GitHub account is valid, etc.
Should the twitter handles be unique if there's more than one author? Some people do share twitter handles (as of course you know given that you the judges do that as well) but what about submitters?
How about in
strict == true
mode, the should be unique, otherwise forstrict == false
they can be the same.
Sure. It just crossed my mind as a maybe. But perhaps it’s not really necessary? I will worry about parsing first and then I can do the duplicate checks.
But perhaps email should be checked for uniqueness? I'm not sure: people can also share emails after all. And what about GitHub handles? URLs people could also share. What are your thoughts?
There isn't a problem if people have an Email address for the object, or use the same 'GitHub' account, so no they should not be checked if they are unique.
Yes I rather agree. But I also believe in being complete and since it crossed my mind I figured I would bring it up.
And should URL and emails check for actual domains? See example comment above on that point. To use the technical complete regex for email addresses would be ridiculous (for it is ridiculous!) but more tests could be added. Similarly we could have that for URLs - both in
mkiocccentry
andjauthchk
. What is your preference on these points?Checking for valid Email is well beyond the scope of this tool. We do not recommend checking for valid Email addresses, nor if the twitter account is valid, no if the GitHub account is valid, etc.
Very very very well beyond the scope of the tool I would say! :) But some minor tests could be done too. At the same time though if they’re so silly as to provide invalid information that they lose what can you do and should you bother? A question though.
Since you have to be emailed the contest ID is it possible that this could be used somehow in the tool? I am not sure of any specifics here but bringing it up in case since it seems like it could possibly be of use (though maybe not if only one email is needed for e.g. more than one author in the entry).
Anyway: Once everything is parsed okay we can talk about any strictness and what that might mean with duplicates. For files this has to be done but perhaps for the rest it’s not important.
Before I put the phone down: I have some thoughts on how to parse the arrays but I am open to your thoughts as well. Any ideas on what you would do? It might be that with your additional thoughts it will be easier.
Putting the phone down now. Have a great night!
@lcn2 I guess you didn't mean to at handle me, right?
Before I put the phone down: I have some thoughts on how to parse the arrays but I am open to your thoughts as well. Any ideas on what you would do? It might be that with your additional thoughts it will be easier.
To quote the ECMA-404 standard:
An array structure is a pair of square bracket tokens surrounding zero or more values. The values are separated by commas. The JSON syntax does not define any specific meaning to the ordering of the values. However, the JSON array structure is often used in situations where there is some semantics to the ordering.
So we presume this to mean that an array is an ordered collection of values. I.e., order matters in JSON arrays.
For the authors
array MUST be ordered according each given author's author_number
, starting with 0
and MUST be sequential.
@lcn2 I guess you didn't mean to at handle me, right?
Sorry @bar. We needed to produce examples where @ was followed by a token. The same goes for @baz.
It seems to be a bug where quoted strings still trigger handle notifications. Sorry.
Added the missing winner_handle
to the authors
array of .author.json
.
There MUST be a winner_handle
for each author.
A winner_handle
must be either null:
{
"name" : "author1 middle1a middle1b last1",
"location_code" : "UK",
"location_name" : "United Kingdom",
"email" : null,
"url" : null,
"twitter" : null,
"github" : null,
"affiliation" : null,
"winner_handle" : null,
"author_number" : 1
},
or it must be of the a POSIX Fully portable characters and +:
[A-Za-z0-9][A-Za-z0-9._+-]
I.e.,:
{
"name" : "author2 middle2 last2",
"location_code" : "US",
"location_name" : "United States of America",
"email" : "user2@example.com",
"url" : "http:\/\/host2.example.com\/index.html",
"twitter" : "@twitter2",
"github" : "@github2",
"affiliation" : "an affiliation for #2 author",
"winner_handle" : "author2-last2",
"author_number" : 2
}
Sorry for the late addition. The winner_handle
was something we knew we needed, but didn't know the details.
BTW: The format for winner_handle
is identical to the basename / value of extra_file
in the manifest
.
Before I put the phone down: I have some thoughts on how to parse the arrays but I am open to your thoughts as well. Any ideas on what you would do? It might be that with your additional thoughts it will be easier.
To quote the ECMA-404 standard:
An array structure is a pair of square bracket tokens surrounding zero or more values. The values are separated by commas. The JSON syntax does not define any specific meaning to the ordering of the values. However, the JSON array structure is often used in situations where there is some semantics to the ordering.
So we presume this to mean that an array is an ordered collection of values. I.e., order matters in JSON arrays.
For the
authors
array MUST be ordered according each given author'sauthor_number
, starting with0
and MUST be sequential.
That makes sense. As I said in the other thread I think I have a way to parse the arrays. There are some specifics I have to work out when I get to it (I hope later on today) but for now I have some ideas which will help (I think).
A question though - since the order matters. Is it a problem to enforce the author count to be found prior to reading in the authors array? This would make it easier to parse and test for the right values. For example if the author id is 5 and the author count is 3 I can automatically reject the file.
On the other hand I could do a strstr(3) ahead of time to find the author total. But looking at the file again I see that the total number of authors is not even in the file so I'll add that first. Then because the parser is supposed to parse in any order I'll maybe use the idea I had with strstr(3) to look ahead to find the total number of authors (though a downside with that is I have to make sure to not indicate I have seen it else it would be seen twice according to the parser).
So to recap:
I'm going to add the number of authors to the .author.json
files and that can be used to verify that the number of authors is actually correct: if the number of authors in the authors array is 5 and the total number of authors is 3 then there's a problem (and the reverse is true).
Added the missing
winner_handle
to theauthors
array of.author.json
.There MUST be a
winner_handle
for each author.A
winner_handle
must be either null:{ "name" : "author1 middle1a middle1b last1", "location_code" : "UK", "location_name" : "United Kingdom", "email" : null, "url" : null, "twitter" : null, "github" : null, "affiliation" : null, "winner_handle" : null, "author_number" : 1 },
or it must be of the a POSIX Fully portable characters and +:
[A-Za-z0-9][A-Za-z0-9._+-]
I.e.,:
{ "name" : "author2 middle2 last2", "location_code" : "US", "location_name" : "United States of America", "email" : "user2@example.com", "url" : "http:\/\/host2.example.com\/index.html", "twitter" : "@twitter2", "github" : "@github2", "affiliation" : "an affiliation for #2 author", "winner_handle" : "author2-last2", "author_number" : 2 }
Sorry for the late addition. The
winner_handle
was something we knew we needed, but didn't know the details.
No worries on the delay! I wonder though: if you look at the .author.json
ahead of time how will the judging be anonymous? Or will you rely on the tool to verify that everything is okay and only if the entry wins will you look at the information?
BTW: The format for
winner_handle
is identical to the basename / value ofextra_file
in themanifest
.
In the sense of characters allowed, right? I guess there should be a function that can be used for both (the txzchk
tests are a bit different so they cannot be done in the same way).
But another question that comes up (based on the example you gave which suggests to me it's supposed to be the actual name): is the winner handle supposed to be name or an actual handle? The reason I ask is diacritics. In German if you don't have the umlaut (e.g. ü) you can just add an E e.g. ü becomes ue; but in other languages this might not be the case. How should this be handled?
I'm going to add the number of authors to the .author.json files and that can be used to verify that the number of authors is actually correct: if the number of authors in the authors array is 5 and the total number of authors is 3 then there's a problem (and the reverse is true).
That seems reasonable. We had the number of authors as author_count
in .author.json
at one time (or in our design) in the past. Wanting it back is understandable.
I'm going to add the number of authors to the .author.json files and that can be used to verify that the number of authors is actually correct: if the number of authors in the authors array is 5 and the total number of authors is 3 then there's a problem (and the reverse is true).
That seems reasonable. We had the number of authors as
author_count
in.author.json
at one time (or in our design) in the past. Wanting it back is understandable.
It's already done in one of the commits - a lot of commits this morning! I'm done for the day but I'll continue more tomorrow.
Do you want me to rename it though so it says author_count
? I can do that if that's better.
Edit: Just did it since that's how you had it originally and that's the name of the variable anyway.
No worries on the delay! I wonder though: if you look at the .author.json ahead of time how will the judging be anonymous? Or will you rely on the tool to verify that everything is okay and only if the entry wins will you look at the information?
We will NOT look at .author.json
files of entires that do NOT win.
We have always needed the ability to determine the authorship of a winning entry: in order to announce who won. We have used various methods of separating 'who wrote' from 'what was written' during the judging process.
This is why we have two JSON files. This why author identifiable information NOT appear in .info.json
.
We will have an internal procedure to unpack a compressed tarball in such a way that 'the authorship' will be kept separate from the entry.
This is also why it is important the tools such as jauthchk
(and jinfochk
) not be chatty (without -v level): that they conform well to the Unix tool philosophy. That exit codes be unique for those tools.
We have a plan to sanitize information printed by from calls to err()
, errp()
, warn()
and warnp()
from both jinfochk
, jauthchk
, and txzchk
later on. For example, eventually with those two tools, unless -v level
is used, the error/warning output from those tools won't leak authorship information. But that is a simple pass that will be part of tools (outside the scope of this repo) that is yet to be developed. As of now, we don't want to cramp the development of those tools ...
... apart from asking that those tools be able to say nothing if all is well, to exit 0 when all is well, to exit non-zero in a reasonably unique exit code ...
... OK it would be nice if a call to, say err()
, didn't print as part of the diagnostic, the authorship of the entry. :-)
In the sense of characters allowed, right? I guess there should be a function that can be used for both (the txzchk tests are a bit different so they cannot be done in the same way).
We chose the POSIX safe characters for both things like basename of a file AND the winner_handle
for a reason. There will be a filename created somewhere based on those strings by an automated process. We don't want bad side effects to happen when we do so.
No worries on the delay! I wonder though: if you look at the .author.json ahead of time how will the judging be anonymous? Or will you rely on the tool to verify that everything is okay and only if the entry wins will you look at the information?
We will NOT look at
.author.json
files of entires that do NOT win.
That makes much more sense then.
We have always needed the ability to determine the authorship of a winning entry: in order to announce who won. We have used various methods of separating 'who wrote' from 'what was written' during the judging process.
Interesting. Maybe more so since you don't typically reveal certain information about the judging process.
This is why we have two JSON files. This why author identifiable information NOT appear in
.info.json
.
Makes sense.
We will have an internal procedure to unpack a compressed tarball in such a way that 'the authorship' will be kept separate from the entry.
Just making sure!
This is also why it is important the tools such as
jauthchk
(andjinfochk
) not be chatty (without -v level): that they conform well to the Unix tool philosophy. That exit codes be unique for those tools.
Makes sense. But having an error or warning for each problem is still desirable, right?
We have a plan to sanitize information printed by from calls to
err()
,errp()
,warn()
andwarnp()
from bothjinfochk
,jauthchk
, andtxzchk
later on. For example, eventually with those two tools, unless-v level
is used, the error/warning output from those tools won't leak authorship information. But that is a simple pass that will be part of tools (outside the scope of this repo) that is yet to be developed. As of now, we don't want to cramp the development of those tools ...
Good to know. Let me know if you need me to change anything though!
... apart from asking that those tools be able to say nothing if all is well, to exit 0 when all is well, to exit non-zero in a reasonably unique exit code ...
Including message, right?
... OK it would be nice if a call to, say
err()
, didn't print as part of the diagnostic, the authorship of the entry. :-)
I'll keep that in mind.
In the sense of characters allowed, right? I guess there should be a function that can be used for both (the txzchk tests are a bit different so they cannot be done in the same way).
We chose the POSIX safe characters for both things like basename of a file AND the
winner_handle
for a reason. There will be a filename created somewhere based on those strings by an automated process. We don't want bad side effects to happen when we do so.
This makes sense yes. I was just wondering how it might work if the person does have accents/etc. in their name. Well I guess it's probably not any different from before.
But another question that comes up (based on the example you gave which suggests to me it's supposed to be the actual name): is the winner handle supposed to be name or an actual handle? The reason I ask is diacritics. In German if you don't have the umlaut (e.g. ü) you can just add an E e.g. ü becomes ue; but in other languages this might not be the case. How should this be handled?
You made a very skilled observation, well done!
And for that, we have a request: :-)
Write a util.c
function that converts characters from an author's name, and turn it into a handle that is within the MAX_WINNER_HANDLE
length limit and containing only those POSIX-safe characters:
[A-Za-z0-9][A-Za-z0-9._+-]*
Such a function would need to translate common non-ASCII characters such you pointed out, into reasonable ASCII equivalent.
Then add author_handle
to the authors
array elements in the .author.json
file.
The conversion need NOT need be perfect. All once needs is that the author_handle
be something that is a safe POSIX filename, AND that a human given both the real name AND the winner_handle
(taken from an author_handle
of a winning entry) could see how they are related.
BTW: This is why we require, in mkioccentry
, for someone to provide some form of a name. I.e., you cannot skip providing some form of an author's name.
Yes, it is possible that two people with similar names, such that their author_handle
's are identical. But when it comes to announcing who won, we will have access to the full JSON data for that entry. Moreover, and outside the scope of this repo, we will have access via the IOCCC entry UUID
to information about who entered the contest (and thus how to contact them). Sure, there are pathological cases where two different people call themselves, anonymous, get the same author_handle
and both WIN. But that is a risk we are very willing to take.
For now, the function needs to make a reasonable translation of the name in forming a author_handle
.
Some suggestions:
BTW: We are looking at increasing the length of MAX_WINNER_HANDLE
as the value put into limit_ioccc.h
was just an initial placeholder.
BTW: Perhaps MAX_WINNER_HANDLE
should be changed to just MAX_HANDLE
so that it can serve as a limit to both winner_handle
and author_handle
? If so, feel free to mak that change now.
A question to ponder: the user of mkiocccentry
provides a winner_handle
, what should happen to the computed author_handle
? Should be be allowed to be different? Should it be changed to match?
Consider testing the transition function by writing the following tool:
formhandle [-h] [-v level] [-V] some name here
The output of formhandle
would be a computed author_handle
and exit 0 is all is well.
But another question that comes up (based on the example you gave which suggests to me it's supposed to be the actual name): is the winner handle supposed to be name or an actual handle? The reason I ask is diacritics. In German if you don't have the umlaut (e.g. ü) you can just add an E e.g. ü becomes ue; but in other languages this might not be the case. How should this be handled?
You made a very skilled observation, well done!
Thank you for that! :)
Well I do have a love for languages and words so it's probably expected.
And for that, we have a request: :-)
Okay.
Write a
util.c
function that converts characters from an author's name, and turn it into a handle that is within theMAX_WINNER_HANDLE
length limit and containing only those POSIX-safe characters:[A-Za-z0-9][A-Za-z0-9._+-]*
Such a function would need to translate common non-ASCII characters such you pointed out, into reasonable ASCII equivalent.
I can do some of this but it would also require knowing the language. I'm not by any means fluent in German but I know some and what I said is true; however I don't know about this for other languages. I guess accents in e.g. Spanish could just be left off? I've never played with strings that have non ASCII characters so this will be interesting.
Then add
author_handle
to theauthors
array elements in the.author.json
file.
In other words the mkiocccentry
would use this function prior to writing the .author.json
file.
The conversion need NOT need be perfect. All once needs is that the
author_handle
be something that is a safe POSIX filename, AND that a human given both the real name AND thewinner_handle
(taken from anauthor_handle
of a winning entry) could see how they are related.
Well that's good that it doesn't need to be perfect! :) But as for the requirements I want to make sure I follow it right.
The basic idea is converting non-ASCII characters to ASCII equivalents - but not using the hated toascii()
function. In the case that there's say umlauts in German (I don't know if this would cause a problem with other languages that have umlauts but do not have this convenient option - what to do about that?) it can simply change it to the letter followed by an E. I'm not actually sure how to go about doing that, mind you, unless I can test for those characters exactly but like I said I've never played with this before.
BTW: This is why we require, in
mkioccentry
, for someone to provide some form of a name. I.e., you cannot skip providing some form of an author's name.
For filenames and paths, in other words?
Yes, it is possible that two people with similar names, such that their
author_handle
's are identical. But when it comes to announcing who won, we will have access to the full JSON data for that entry. Moreover, and outside the scope of this repo, we will have access via theIOCCC entry UUID
to information about who entered the contest (and thus how to contact them). Sure, there are pathological cases where two different people call themselves, anonymous, get the sameauthor_handle
and both WIN. But that is a risk we are very willing to take.
I wondered about this when you started requiring that everyone has to provide a name! But then again one could suggest that for all one knows all the previous anonymous winners is actually a single winner (probably not but it could be as you have no way of knowing).
I also had the thought that (as I think I mentioned yesterday) people with the same name who happen to have a love for C might actually enter the contest together simply because they share the same name - it could happen even if highly unlikely.
For now, the function needs to make a reasonable translation of the name in forming a
author_handle
.
I'm going to have to give this quite some thought I think. It'd be better to finish jinfochk
and jauthchk
first, right?
Some suggestions:
- When the name is too long to form a handle, prefer characters in the last name
That sounds reasonable - make it just like how you have the previous winners: surnames. But then the question is how to identify that it's the surname? I guess one can use strrchr(3) on space and anything after that is the surname. Is that what you would suggest?
- When the last name is too long, prefer leading characters of the last name
In other words truncate the name?
- Consider using a _ in place of spaces in the name
That makes sense and probably very important since spaces in filenames are a pain.
BTW: We are looking at increasing the length of
MAX_WINNER_HANDLE
as the value put intolimit_ioccc.h
was just an initial placeholder.
Will keep an eye out for that.
BTW: Perhaps
MAX_WINNER_HANDLE
should be changed to justMAX_HANDLE
so that it can serve as a limit to bothwinner_handle
andauthor_handle
? If so, feel free to mak that change now.
Perhaps that's an idea. I just did the change - hopefully you haven't already merged the pull request (if so I'll have to redo it I guess).
A question to ponder: the user of
mkiocccentry
provides awinner_handle
, what should happen to the computedauthor_handle
? Should be be allowed to be different? Should it be changed to match?
Good questions indeed. I'm afraid my answer right now is: I've no idea! :)
Consider testing the transition function by writing the following tool:
formhandle [-h] [-v level] [-V] some name here
The output of
formhandle
would be a computedauthor_handle
and exit 0 is all is well.
So the final argv
element would be a name which would use the function in util.c
?
I know I already asked but I'll reiterate it again: should the jinfochk
and jauthchk
tools be finished first? I think that for me that would be ideal - though depending on how the parsing of arrays goes (I did think of another idea a bit ago) it might be a way to break away from it for a bit.
Makes sense. But having an error or warning for each problem is still desirable, right?
Yes.
Makes sense. But having an error or warning for each problem is still desirable, right?
Yes.
I thought so but wanted to be sure!
... apart from asking that those tools be able to say nothing if all is well, to exit 0 when all is well, to exit non-zero in a reasonably unique exit code ... Including message, right?
Yes.
Do you want me to rename it though so it says
author_count
? I can do that if that's better.Edit: Just did it since that's how you had it originally and that's the name of the variable anyway.
What you did was fine.
... apart from asking that those tools be able to say nothing if all is well, to exit 0 when all is well, to exit non-zero in a reasonably unique exit code ... Including message, right?
Yes.
Right. In your definition exit code includes message because of the err()
and related functions. I wasn't sure of this at first. Thanks.
We need to create the
jauthchk
tool in order to help verify that contents of an.author.json
file found within an entry directory.This tool will primarily be used by other tools (not humans). As such it should behave like
fnamchk
in that if all is well, it should not print anything and simply exit 0. If there are problems found with the .author.json file, then warning messages should be printed to stderr AND thejauthchk
tool should exit with a non-zero status. The use of a-v level
may be use to assist in debugging.The
jauthchk
tool is primarily a stand alone tool. As a sanity check, themkiocccentry
program should execute thejauthchk
code AFTER .author.json file has been created and before the compressed tarball is formed. Ifmkiocccentry
program sees a 0 exit status, then all is well. For a non-zero exit code, the tool probably should abort because any problems detected byjauthchk
based on whatmkiocccentry
wrote into.author.json
indicates there is a serious mismatch between whatmkiocccentry
is doing and whatjauthchk
expects.The following might be how
mkiocccentry
output is changed with the use of this tool (and the other tool):As a stand alone tool, the
jauthchk
tool will be invoked by other tools as part of the IOCCC submission process. That process is beyond the scope of this repo. Suffice it to sat the the IOCCC judges will use this tool is part of their submission workflow.Here is a possible command line usage message:
NOTE: We mention
file
above even though the canonical filename will be.author.json
. The tool should NOT check, nor object to using a different filename.The
mkiocccentry
tool will need to invoke this tool. As such a similar method used to find and specify the location oftxzchk
should be used. As this tool is one of 2 tools being considered, we recommend the following of added to themkiocccentry
command line:IMPORTANT: While it might be tempting to consider depending on some general JSON checker, we do NOT need nor want that. It is important that the mkiocccentry GitHub repo remain stand alone. I.e., all the code needed by someone wishing to enter the IOCCC (beside a C compiler, make, tar, cp, ls) should found in this GitHub repo alone. As there is NO standard JSON tool in widespread distribution the all of the code for this tool needs to reside in this repo only.
IMPORTANT: We do not need a general JSON format checker. We only need to verify that the
file
contains the JSON needed and only the JSON needed for the judges to process IOCCC entries.While is it NOT recommended, if someone wishes to edit their
.author.json
and re-create the compressed tarball we cannot stop them. As suchmkiocccentry
should be STRICT on what is writes into.author.json
ANDjauthchk
should be permissive (but not to a fault) in what is considers as OK.This tool should neither generate an error, nor warn if someone were to reformat the JSON. And as JSON is not order dependent, of someone wishes to reorder the JSON elements, that is fine. As long as all the requirement JSON elements are present, and no new JSON elements are found, and the version string matches, all is OK.
See the a followup comment for details on the checks needed against an
.author.json
file.