Enhancement: create the jauthchk tool - check on the contents of an .author.json file

lcn2 commented 2 years ago

We need to create the jauthchk tool in order to help verify that contents of an .author.json file found within an entry directory.

This tool will primarily be used by other tools (not humans). As such it should behave like fnamchk in that if all is well, it should not print anything and simply exit 0. If there are problems found with the .author.json file, then warning messages should be printed to stderr AND the jauthchk tool should exit with a non-zero status. The use of a -v level may be use to assist in debugging.

The jauthchk tool is primarily a stand alone tool. As a sanity check, the mkiocccentry program should execute the jauthchk code AFTER .author.json file has been created and before the compressed tarball is formed. If mkiocccentry program sees a 0 exit status, then all is well. For a non-zero exit code, the tool probably should abort because any problems detected by jauthchk based on what mkiocccentry wrote into .author.json indicates there is a serious mismatch between what mkiocccentry is doing and what jauthchk expects.

The following might be how mkiocccentry output is changed with the use of this tool (and the other tool):

Is the above list a correct list of files in your entry? [yn]: y

Checking the format of .info.json ...
... all appears well with the .info.json file.

Checking the format of .author.json ...
... all appears well with the .author.json file.

About to run the tar command to form the compressed tarball ...

As a stand alone tool, the jauthchk tool will be invoked by other tools as part of the IOCCC submission process. That process is beyond the scope of this repo. Suffice it to sat the the IOCCC judges will use this tool is part of their submission workflow.

Here is a possible command line usage message:

jauthchk [-h] [-v level] [-V] file

    -h          print help message and exit 0
    -v level        set verbosity level: (def level: 0)
    -V          print version string and exit

        file                    path to a .author.json file

exit codes:

        0                       no errors or warnings detected
        >0                      some error(s) and/or warning(s) were detected

NOTE: We mention file above even though the canonical filename will be .author.json. The tool should NOT check, nor object to using a different filename.

The mkiocccentry tool will need to invoke this tool. As such a similar method used to find and specify the location of txzchk should be used. As this tool is one of 2 tools being considered, we recommend the following of added to the mkiocccentry command line:

    -j /path/to/jinfochk    path to jinfochk executable used by txzchk (def: ./jinfochk)
    -J /path/to/jauthchk    path to jauthchk executable used by txzchk (def: ./jauthchk)

IMPORTANT: While it might be tempting to consider depending on some general JSON checker, we do NOT need nor want that. It is important that the mkiocccentry GitHub repo remain stand alone. I.e., all the code needed by someone wishing to enter the IOCCC (beside a C compiler, make, tar, cp, ls) should found in this GitHub repo alone. As there is NO standard JSON tool in widespread distribution the all of the code for this tool needs to reside in this repo only.

IMPORTANT: We do not need a general JSON format checker. We only need to verify that the file contains the JSON needed and only the JSON needed for the judges to process IOCCC entries.

While is it NOT recommended, if someone wishes to edit their .author.json and re-create the compressed tarball we cannot stop them. As such mkiocccentry should be STRICT on what is writes into .author.json AND jauthchk should be permissive (but not to a fault) in what is considers as OK.

This tool should neither generate an error, nor warn if someone were to reformat the JSON. And as JSON is not order dependent, of someone wishes to reorder the JSON elements, that is fine. As long as all the requirement JSON elements are present, and no new JSON elements are found, and the version string matches, all is OK.

Should something go wrong and a change to the JSON is required during an open IOCCC, the judges will preserve the older JSON check tools and use those against older JSON formats. This there is no need for a >= version check: a version string match seems good enough.

See the a followup comment for details on the checks needed against an .author.json file.

xexyl commented 2 years ago

You will probably want to read the entire file into memory.

Yes seeing the multiple lines made me think of this indeed. That would change the way the parsing is done but that's okay. I also actually started to wonder if it could be done on more than one line - hours after I wrote the above. That complicates it and it will require reworking the parsing loop possibly.

BTW: We need this read whole file into memory function_ to complete the jstrencode and jstrdecode tools, so expect something in util.c soon.

I'll check it out when you've added it.

Then use strtok_r(3) to break up the buffer into non-whitespace separated blobs and begin to break apart those blocks into JSON tokens, keep in mind that valid JSON can do this:
{"confusion":" { \"hello\" : 123 } , "},
Perhaps this is your approach with your proposed strscmp function?

'confusion' is definitely the right word! :) I'll have to have a bit more energy and clear-headedness before I even try parsing that one (actually looking at it again I see that you're just escaping the " but still I can barely see that right now!). As such I cannot answer your question of whether or not that's what I had in mind. But since as you point out the file needs to be read into memory I'll have to rethink it - and better to do it after you finish that part of the code.

Looking forward to seeing what you come up with. I'll go from there.

lcn2 commented 2 years ago

Added jstrencode and jstrdecode in order to test the new read_all() function as well as to further test the JSON string encoding and decoding process.

The make test now runs j-test.sh to perform some tests on JSON encoding and decoding.

xexyl commented 2 years ago

Added jstrencode and jstrdecode in order to test the new read_all() function as well as to further test the JSON string encoding and decoding process.

The make test now runs j-test.sh to perform some tests on JSON encoding and decoding.

Great. I was also thinking that maybe there could be a screwy (format wise) .author.json and .info.json file in the tests to verify that the tools (when they're completed - for now they would pass) work right in parsing messed up formats.

As for my last commit I'm unsure why it keeps adding the following commit:

* Readd jauthchk.1 and jinfochk.1

When fixing the previous commits I forgot to do a git add for the two
man pages so they were not actually committed.

...so I think I'll delete my fork and re-fork to hopefully solve that problem. I'm a bit more alert today so hopefully I can do some work. I think I'll start out (after making the new fork) with moving the data for txzchk.c into a header file for it: since there are several structs. That should make it just a little bit cleaner.

After that I'll figure out what I want to do. Hopefully I can work more on the tools today but we shall see.

xexyl commented 2 years ago

Added jstrencode and jstrdecode in order to test the new read_all() function as well as to further test the JSON string encoding and decoding process.

The make test now runs j-test.sh to perform some tests on JSON encoding and decoding.

I've not looked at this yet though I did run make test. Hopefully when I do look at it it'll give me some ideas on how to parse the JSON files. I might do that later this morning but it might not be until tomorrow as I have a doctor appointment late morning and it's a distance away and I'm already tired (I don't drive for the reason of sleep problems so no worries there!). Anyway I thought I said this (and maybe I did in another thread):

I had the thought that it might be a good idea to come up with some messed up (and maybe obfuscated) json files to test the new tools - when it gets to that point. It could then be in the make test invocation. Perhaps when I've read the specifications a bit I can do this (and maybe come up with some that don't follow it to test those too).

I have one more thing to write about here and then I'll be done here for now.

xexyl commented 2 years ago

As for the json.c compiling under CentOS triggers a bunch of warnings. Right now I don't feel alert enough to fix them so I'm just going to paste the output of the compiler here so you can see. If you don't fix them I'll hopefully get to it later. I believe though that I'm going to take a break (not that I've done much but I mean a break from thinking) and then maybe I'll return to it. If not I should be able to tomorrow or the next day (after tonight I should get more sleep again).

As far as this the current list of warnings are now:

cc -D_BSD_SOURCE -D_POSIX_C_SOURCE=200809L -D_XOPEN_SOURCE -std=c99 -O3 -g3 -pedantic -Wall -Wextra   -c -o json.o json.c
json.c: In function 'jencchk':
json.c:490:2: warning: format '%x' expects argument of type 'unsigned int *', but argument 3 has type 'int *' [-Wformat=]
  ret = sscanf(jenc[i].enc, "\\u%04x%c", &hexval, &guard);
  ^
json.c:607:2: warning: format '%x' expects argument of type 'unsigned int *', but argument 3 has type 'int *' [-Wformat=]
  ret = sscanf(jenc[i].enc, "\\u%04x%c", &hexval, &guard);
  ^
json.c:829:2: warning: format '%x' expects argument of type 'unsigned int *', but argument 3 has type 'int *' [-Wformat=]
  ret = sscanf(jenc[i].enc, "\\u%04x%c", &hexval, &guard);
  ^
json.c: In function 'json_putc':
json.c:983:5: warning: comparison is always false due to limited range of data type [-Wtype-limits]
     if (c < 0 || c > 0xff) {
     ^
json.c:983:5: warning: comparison is always false due to limited range of data type [-Wtype-limits]
json.c: In function 'jencchk':
json.c:852:32: warning: array subscript is above array bounds [-Warray-bounds]
       mlen, (unsigned long)jenc[i].len);

On the warning: format '%x' expects argument of type 'unsigned int *', but argument 3 has type 'int *' I could change it but I'm not sure of the implications: either changing the int to be unsigned or casting it; thus I won't change it.

The one that seems like it might be more of a problem is:

json.c:852:32: warning: array subscript is above array bounds [-Warray-bounds]
       mlen, (unsigned long)jenc[i].len);
                                ^

But it's a lot of code to go through to understand so I'll leave that to you to fix too. As for the always false test I'm not sure what to do either:

json.c: In function 'json_putc':
json.c:983:5: warning: comparison is always false due to limited range of data type [-Wtype-limits]
     if (c < 0 || c > 0xff) {
     ^

It's unsigned 8-bit int and so < 0 is impossible and the same applies to > 0xff so I'm not sure what to change it to: I don't know the implications of changing it to something else.

lcn2 commented 2 years ago

As for the json.c compiling under CentOS triggers a bunch of warnings. Right now I don't feel alert enough to fix them so I'm just going to paste the output of the compiler here so you can see. If you don't fix them I'll hopefully get to it later. I believe though that I'm going to take a break (not that I've done much but I mean a break from thinking) and then maybe I'll return to it. If not I should be able to tomorrow or the next day (after tonight I should get more sleep again).

As far as this the current list of warnings are now:
cc -D_BSD_SOURCE -D_POSIX_C_SOURCE=200809L -D_XOPEN_SOURCE -std=c99 -O3 -g3 -pedantic -Wall -Wextra   -c -o json.o json.c
json.c: In function 'jencchk':
json.c:490:2: warning: format '%x' expects argument of type 'unsigned int *', but argument 3 has type 'int *' [-Wformat=]
  ret = sscanf(jenc[i].enc, "\\u%04x%c", &hexval, &guard);
  ^
json.c:607:2: warning: format '%x' expects argument of type 'unsigned int *', but argument 3 has type 'int *' [-Wformat=]
  ret = sscanf(jenc[i].enc, "\\u%04x%c", &hexval, &guard);
  ^
json.c:829:2: warning: format '%x' expects argument of type 'unsigned int *', but argument 3 has type 'int *' [-Wformat=]
  ret = sscanf(jenc[i].enc, "\\u%04x%c", &hexval, &guard);
  ^
json.c: In function 'json_putc':
json.c:983:5: warning: comparison is always false due to limited range of data type [-Wtype-limits]
     if (c < 0 || c > 0xff) {
     ^
json.c:983:5: warning: comparison is always false due to limited range of data type [-Wtype-limits]
json.c: In function 'jencchk':
json.c:852:32: warning: array subscript is above array bounds [-Warray-bounds]
       mlen, (unsigned long)jenc[i].len);
On the warning: format '%x' expects argument of type 'unsigned int *', but argument 3 has type 'int *' I could change it but I'm not sure of the implications: either changing the int to be unsigned or casting it; thus I won't change it.

The one that seems like it might be more of a problem is:
json.c:852:32: warning: array subscript is above array bounds [-Warray-bounds]
       mlen, (unsigned long)jenc[i].len);
                                ^
But it's a lot of code to go through to understand so I'll leave that to you to fix too. As for the always false test I'm not sure what to do either:
json.c: In function 'json_putc':
json.c:983:5: warning: comparison is always false due to limited range of data type [-Wtype-limits]
     if (c < 0 || c > 0xff) {
     ^
It's unsigned 8-bit int and so < 0 is impossible and the same applies to > 0xff so I'm not sure what to change it to: I don't know the implications of changing it to something else.

Thanks @xexyl. Fixed in commit 0a6188c6c979f404f96c8b5eaaf7a853f0c91c98.

xexyl commented 2 years ago

As for the json.c compiling under CentOS triggers a bunch of warnings. Right now I don't feel alert enough to fix them so I'm just going to paste the output of the compiler here so you can see. If you don't fix them I'll hopefully get to it later. I believe though that I'm going to take a break (not that I've done much but I mean a break from thinking) and then maybe I'll return to it. If not I should be able to tomorrow or the next day (after tonight I should get more sleep again).

As far as this the current list of warnings are now:
cc -D_BSD_SOURCE -D_POSIX_C_SOURCE=200809L -D_XOPEN_SOURCE -std=c99 -O3 -g3 -pedantic -Wall -Wextra   -c -o json.o json.c
json.c: In function 'jencchk':
json.c:490:2: warning: format '%x' expects argument of type 'unsigned int *', but argument 3 has type 'int *' [-Wformat=]
  ret = sscanf(jenc[i].enc, "\\u%04x%c", &hexval, &guard);
  ^
json.c:607:2: warning: format '%x' expects argument of type 'unsigned int *', but argument 3 has type 'int *' [-Wformat=]
  ret = sscanf(jenc[i].enc, "\\u%04x%c", &hexval, &guard);
  ^
json.c:829:2: warning: format '%x' expects argument of type 'unsigned int *', but argument 3 has type 'int *' [-Wformat=]
  ret = sscanf(jenc[i].enc, "\\u%04x%c", &hexval, &guard);
  ^
json.c: In function 'json_putc':
json.c:983:5: warning: comparison is always false due to limited range of data type [-Wtype-limits]
     if (c < 0 || c > 0xff) {
     ^
json.c:983:5: warning: comparison is always false due to limited range of data type [-Wtype-limits]
json.c: In function 'jencchk':
json.c:852:32: warning: array subscript is above array bounds [-Warray-bounds]
       mlen, (unsigned long)jenc[i].len);
On the warning: format '%x' expects argument of type 'unsigned int *', but argument 3 has type 'int *' I could change it but I'm not sure of the implications: either changing the int to be unsigned or casting it; thus I won't change it. The one that seems like it might be more of a problem is:
json.c:852:32: warning: array subscript is above array bounds [-Warray-bounds]
       mlen, (unsigned long)jenc[i].len);
                                ^
But it's a lot of code to go through to understand so I'll leave that to you to fix too. As for the always false test I'm not sure what to do either:
json.c: In function 'json_putc':
json.c:983:5: warning: comparison is always false due to limited range of data type [-Wtype-limits]
     if (c < 0 || c > 0xff) {
     ^
It's unsigned 8-bit int and so < 0 is impossible and the same applies to > 0xff so I'm not sure what to change it to: I don't know the implications of changing it to something else.
Thanks @xexyl. Fixed in commit 0a6188c.

Thanks as well.

I am done for the day but I hope tomorrow I can get back into working on the tools. Have a good afternoon!

xexyl commented 2 years ago

[...]

Thanks @xexyl. Fixed in commit 0a6188c.

Just pulled under CentOS - all good.

lcn2 commented 2 years ago

The link to the seqcexit repo now works. For some reason the repo was previously marked as private.

lcn2 commented 2 years ago

See this important comment about use of strict_mode and the -s flag.

xexyl commented 2 years ago

See this important comment about use of strict_mode and the -s flag.

Yep. All good on that part - and as you'll see I had the same idea.

xexyl commented 2 years ago

Okay so here are my comments for this tool. After that I think I'm done with code for the day.

I just realised that I erroneously removed quotes from null so if the submitter happened to have it be a string in their file the tool would not pick up on it. This will have to be addressed at a later time as I'm too tired to fix it now; but I'm noting it because I don't want to forget.

On that note: is there any problem with the way I've used strcmp() in the most recent pull request? I'm not sure how strict you want to be. For example in the below:

        " mkiocccentry_version" : " 0.35 2022-02-07 ",

Is there a problem with the space right after the quote in"mkiocccentry_version" or before and after the version in the quoted string? The requirements above do suggest it should match and by that I think of strcmp()==0 but maybe this is not the case? Looking again (and I thought of it at the time of writing the code) I start to wonder if maybe it has to be done differently; perhaps strstr(3) is better than strcmp(3)? That would be easy enough to fix of course and I will do if you need that (or if you have a better idea then I can do that instead).

You'll see in the other thread and the commit how I currently have it so I'll leave it at that.

Now as for the json decoding I'm trying to remember the purpose of it here. Is it because of the escaped characters? What should I look for in the URLs for example? Ah, perhaps it goes like this:

Decode the string to get the version without the escaped characters; and
Then run the tests on the value like the get author function does (for example on the URL).

Is that the idea? I think that's it but I want to be sure as I am a bit vague on this atm.

I'll work out the details of parsing the arrays later on. The other stuff I've not parsed yet I talked about in the other thread so I'll leave it to there.

More tomorrow at the latest!

lcn2 commented 2 years ago

I just realised that I erroneously removed quotes from null so if the submitter happened to have it be a string in their file the tool would not pick up on it.

Yes, if someone were to have an affiliation of:

null

The proper JSON would be:

"affiliation" : "null",

However if someone were to decline to give an affiliation zsay they want their affiliation to be anonymous), then it would be:

"affiliation" : null,

And that is why in JSON "null" != "" != null. :-)

xexyl commented 2 years ago

I just realised that I erroneously removed quotes from null so if the submitter happened to have it be a string in their file the tool would not pick up on it.

Yes, if someone were to have an affiliation of:
null
The proper JSON would be:
"affiliation" : "null",
However if someone were to decline to give an affiliation zsay they want their affiliation to be anonymous), then it would be:
"affiliation" : null,
And that is why in JSON "null" != "" != null. :-)

That makes sense yes. So when these tools are being tested more we definitely have to find all possible cases where quotes were removed and should not be or were not removed and should be.

Maybe there should be some test files for this very detail.

lcn2 commented 2 years ago

I just realised that I erroneously removed quotes from null so if the submitter happened to have it be a string in their file the tool would not pick up on it.

Yes, if someone were to have an affiliation of:
null
The proper JSON would be:
"affiliation" : "null",
However if someone were to decline to give an affiliation zsay they want their affiliation to be anonymous), then it would be:
"affiliation" : null,
And that is why in JSON "null" != "" != null. :-)
That makes sense yes. So when these tools are being tested more we definitely have to find all possible cases where quotes were removed and should not be or were not removed and should be.

Maybe there should be some test files for this very detail.

In strict mode, things such as:

email
url
twitter
github
affiliation

should never have a value of the empty string "". The mkiocccentrty is supposed to see the user just pressing return as wishing to remain anonymous on that item and the JSON value null should have been written.

OK:

"email" : null

Not OK in strict mode:

"email" : ""

xexyl commented 2 years ago

I just realised that I erroneously removed quotes from null so if the submitter happened to have it be a string in their file the tool would not pick up on it.

Yes, if someone were to have an affiliation of:
null
The proper JSON would be:
"affiliation" : "null",
However if someone were to decline to give an affiliation zsay they want their affiliation to be anonymous), then it would be:
"affiliation" : null,
And that is why in JSON "null" != "" != null. :-)
That makes sense yes. So when these tools are being tested more we definitely have to find all possible cases where quotes were removed and should not be or were not removed and should be. Maybe there should be some test files for this very detail.
In strict mode, things such as:

email

url

twitter

github

affiliation

should never have a value of the empty string "". The mkiocccentrty is supposed to see the user just pressing return as wishing to remain anonymous on that item and the JSON value null should have been written.

In this case the help for the -s option should be reworded so that it doesn't just refer to the spaces with he { and }. Perhaps it could just say:

    -s          strict mode: more restrictions on what is allowed (def: in mkiocccentry true; otherwise false)

?

OK:
"email" : null
Not OK in strict mode:
"email" : ""

Thanks for clarifying this detail. But it's also true that:

"email" : ""

Is never okay in any instance, right?

lcn2 commented 2 years ago

In this case the help for the -s option should be reworded so that it doesn't just refer to the spaces with he { and }. Perhaps it could just say:
  -s          strict mode: more restrictions on what is allowed (def: in mkiocccentry true; otherwise false)
?
OK:
"email" : null
Not OK in strict mode:
"email" : ""
Thanks for clarifying this detail. But it's also true that:
"email" : ""
Is never okay in any instance, right?

Yes. An Email address must have an @ INSIDE it somewhere, so "" (which should probably have been null) is invalid.

There are some JSON test case files that you need to add for this stuff as these are invalid values:

""
"@"
"foo@"
"@bar"

And there is a BUG in get_author_info() in mkiocccentry that needs to be fixed. More than one @ is also invalid:

"foo@@bar"
"foo@bar@baz"

lcn2 commented 2 years ago

In this case the help for the -s option should be reworded so that it doesn't just refer to the spaces with he { and }. Perhaps it could just say:
  -s          strict mode: more restrictions on what is allowed (def: in mkiocccentry true; otherwise false)
?

Perhaps:

    -s      strict mode: be more strict on what is allowed (def: not strict)

?

xexyl commented 2 years ago

In this case the help for the -s option should be reworded so that it doesn't just refer to the spaces with he { and }. Perhaps it could just say:
    -s          strict mode: more restrictions on what is allowed (def: in mkiocccentry true; otherwise false)
?
OK:
"email" : null
Not OK in strict mode:
"email" : ""
Thanks for clarifying this detail. But it's also true that:
"email" : ""
Is never okay in any instance, right?
Yes. An Email address must have an @ INSIDE it somewhere, so "" (which should probably have been null) is invalid.

I also wondered if the mkiocccentry should check for a domain: not too detailed but just so that someone can't put in luser@luser. Then the jauthchk tool could also check this.

There are some JSON test case files that you need to add for this stuff as these are invalid values:

""

"@"

"foo@"

"@bar"

Thanks for the note. This is a good idea. I'd add it now but like I said I'm much too tired for it - been awake far too many hours for me. Maybe it belongs with the above idea and the bug fix?

And there is a BUG in get_author_info() in mkiocccentry that needs to be fixed. More than one @ is also invalid:

"foo@@bar"

"foo@bar@baz"

Is that what the bug is? Because I could have sworn I saw the check that only one @ is allowed.

xexyl commented 2 years ago

In this case the help for the -s option should be reworded so that it doesn't just refer to the spaces with he { and }. Perhaps it could just say:
    -s          strict mode: more restrictions on what is allowed (def: in mkiocccentry true; otherwise false)
?
Perhaps:
  -s      strict mode: be more strict on what is allowed (def: not strict)
?

That works good yes. I'll add it to the usage message files now so I don't forget. But I won't work on the actual code until later.

lcn2 commented 2 years ago

And there is a BUG in get_author_info() in mkiocccentry that needs to be fixed. More than one @ is also invalid:

"foo@@bar"

"foo@bar@baz"

Is that what the bug is? Because I could have sworn I saw the check that only one @ is allowed.

You are correct @xexyl. Well the bug was in the comment that has now been fixed in commit 456b79637587f70090f44585dbe4dfe956da12e7 :-)

xexyl commented 2 years ago

And there is a BUG in get_author_info() in mkiocccentry that needs to be fixed. More than one @ is also invalid:

"foo@@bar"

"foo@bar@baz"

Is that what the bug is? Because I could have sworn I saw the check that only one @ is allowed.

You are correct @xexyl. Well the bug was in the comment that has now been fixed in commit 456b796 :-)

Ah okay. Thanks.

lcn2 commented 2 years ago

Except for the few `all. test cases, we have not (yet) added any 'author.json.' files under test_JSON.

Feel free to add some as we work on 'info.json.*' test cases.

Also: If you encounter a good/* or strict-good/* file that seems to be in error, OR If you encounter a bad/* or strict-bad/* file that seems OK, let us know. It is possible we made mistake, or there is something that didn't show up in a specification.

xexyl commented 2 years ago

I thought of some additional checks that maybe should be added to both jauthchk as well as mkiocccentry. I'd be happy to do that if you like on the latter too though perhaps after the other tools are finished. Let me know what you think of these:

Should the twitter handles be unique if there's more than one author? Some people do share twitter handles (as of course you know given that you the judges do that as well) but what about submitters?

And although it's highly unlikely it's still possible that two people with the same name could be working together on an entry so I think that probably names should not be restricted to uniqueness too.

But perhaps email should be checked for uniqueness? I'm not sure: people can also share emails after all. And what about GitHub handles? URLs people could also share.

What are your thoughts?

And should URL and emails check for actual domains? See example comment above on that point. To use the technical complete regex for email addresses would be ridiculous (for it is ridiculous!) but more tests could be added. Similarly we could have that for URLs - both in mkiocccentry and jauthchk.

What is your preference on these points?

And now I really am shutting the laptop down before anything else comes up! Good night.

lcn2 commented 2 years ago

Should the twitter handles be unique if there's more than one author? Some people do share twitter handles (as of course you know given that you the judges do that as well) but what about submitters?

How about in strict == true mode, the should be unique, otherwise for strict == false they can be the same.

lcn2 commented 2 years ago

But perhaps email should be checked for uniqueness? I'm not sure: people can also share emails after all. And what about GitHub handles? URLs people could also share.

What are your thoughts?

There isn't a problem if people have an Email address for the object, or use the same 'GitHub' account, so no they should not be checked if they are unique.

lcn2 commented 2 years ago

And should URL and emails check for actual domains? See example comment above on that point. To use the technical complete regex for email addresses would be ridiculous (for it is ridiculous!) but more tests could be added. Similarly we could have that for URLs - both in mkiocccentry and jauthchk.

What is your preference on these points?

Checking for valid Email is well beyond the scope of this tool. We do not recommend checking for valid Email addresses, nor if the twitter account is valid, no if the GitHub account is valid, etc.

xexyl commented 2 years ago

Should the twitter handles be unique if there's more than one author? Some people do share twitter handles (as of course you know given that you the judges do that as well) but what about submitters?

How about in strict == true mode, the should be unique, otherwise for strict == false they can be the same.

Sure. It just crossed my mind as a maybe. But perhaps it’s not really necessary? I will worry about parsing first and then I can do the duplicate checks.

xexyl commented 2 years ago

But perhaps email should be checked for uniqueness? I'm not sure: people can also share emails after all. And what about GitHub handles? URLs people could also share. What are your thoughts?

There isn't a problem if people have an Email address for the object, or use the same 'GitHub' account, so no they should not be checked if they are unique.

Yes I rather agree. But I also believe in being complete and since it crossed my mind I figured I would bring it up.

xexyl commented 2 years ago

And should URL and emails check for actual domains? See example comment above on that point. To use the technical complete regex for email addresses would be ridiculous (for it is ridiculous!) but more tests could be added. Similarly we could have that for URLs - both in mkiocccentry and jauthchk. What is your preference on these points?

Checking for valid Email is well beyond the scope of this tool. We do not recommend checking for valid Email addresses, nor if the twitter account is valid, no if the GitHub account is valid, etc.

Very very very well beyond the scope of the tool I would say! :) But some minor tests could be done too. At the same time though if they’re so silly as to provide invalid information that they lose what can you do and should you bother? A question though.

Since you have to be emailed the contest ID is it possible that this could be used somehow in the tool? I am not sure of any specifics here but bringing it up in case since it seems like it could possibly be of use (though maybe not if only one email is needed for e.g. more than one author in the entry).

Anyway: Once everything is parsed okay we can talk about any strictness and what that might mean with duplicates. For files this has to be done but perhaps for the rest it’s not important.

Before I put the phone down: I have some thoughts on how to parse the arrays but I am open to your thoughts as well. Any ideas on what you would do? It might be that with your additional thoughts it will be easier.

Putting the phone down now. Have a great night!

bar commented 2 years ago

@lcn2 I guess you didn't mean to at handle me, right?

lcn2 commented 2 years ago

Before I put the phone down: I have some thoughts on how to parse the arrays but I am open to your thoughts as well. Any ideas on what you would do? It might be that with your additional thoughts it will be easier.

To quote the ECMA-404 standard:

An array structure is a pair of square bracket tokens surrounding zero or more values. The values are separated by commas. The JSON syntax does not define any specific meaning to the ordering of the values. However, the JSON array structure is often used in situations where there is some semantics to the ordering.

So we presume this to mean that an array is an ordered collection of values. I.e., order matters in JSON arrays.

For the authors array MUST be ordered according each given author's author_number, starting with 0 and MUST be sequential.

lcn2 commented 2 years ago

@lcn2 I guess you didn't mean to at handle me, right?

Sorry @bar. We needed to produce examples where @ was followed by a token. The same goes for @baz.

It seems to be a bug where quoted strings still trigger handle notifications. Sorry.

lcn2 commented 2 years ago

Added the missing winner_handle to the authors array of .author.json.

There MUST be a winner_handle for each author.

A winner_handle must be either null:

  {
                        "name" : "author1 middle1a middle1b last1",
                        "location_code" : "UK",
                        "location_name" : "United Kingdom",
                        "email" : null,
                        "url" : null,
                        "twitter" : null,
                        "github" : null,
                        "affiliation" : null,
                        "winner_handle" : null,
                        "author_number" : 1
                },

or it must be of the a POSIX Fully portable characters and +:

[A-Za-z0-9][A-Za-z0-9._+-]

I.e.,:

               {
                        "name" : "author2 middle2 last2",
                        "location_code" : "US",
                        "location_name" : "United States of America",
                        "email" : "user2@example.com",
                        "url" : "http:\/\/host2.example.com\/index.html",
                        "twitter" : "@twitter2",
                        "github" : "@github2",
                        "affiliation" : "an affiliation for #2 author",
                        "winner_handle" : "author2-last2",
                        "author_number" : 2
                }

Sorry for the late addition. The winner_handle was something we knew we needed, but didn't know the details.

BTW: The format for winner_handle is identical to the basename / value of extra_file in the manifest.

xexyl commented 2 years ago

Before I put the phone down: I have some thoughts on how to parse the arrays but I am open to your thoughts as well. Any ideas on what you would do? It might be that with your additional thoughts it will be easier.

To quote the ECMA-404 standard:
An array structure is a pair of square bracket tokens surrounding zero or more values. The values are separated by commas. The JSON syntax does not define any specific meaning to the ordering of the values. However, the JSON array structure is often used in situations where there is some semantics to the ordering.
So we presume this to mean that an array is an ordered collection of values. I.e., order matters in JSON arrays.

For the authors array MUST be ordered according each given author's author_number, starting with 0 and MUST be sequential.

That makes sense. As I said in the other thread I think I have a way to parse the arrays. There are some specifics I have to work out when I get to it (I hope later on today) but for now I have some ideas which will help (I think).

A question though - since the order matters. Is it a problem to enforce the author count to be found prior to reading in the authors array? This would make it easier to parse and test for the right values. For example if the author id is 5 and the author count is 3 I can automatically reject the file.

On the other hand I could do a strstr(3) ahead of time to find the author total. But looking at the file again I see that the total number of authors is not even in the file so I'll add that first. Then because the parser is supposed to parse in any order I'll maybe use the idea I had with strstr(3) to look ahead to find the total number of authors (though a downside with that is I have to make sure to not indicate I have seen it else it would be seen twice according to the parser).

So to recap:

I'm going to add the number of authors to the .author.json files and that can be used to verify that the number of authors is actually correct: if the number of authors in the authors array is 5 and the total number of authors is 3 then there's a problem (and the reverse is true).

xexyl commented 2 years ago

Added the missing winner_handle to the authors array of .author.json.

There MUST be a winner_handle for each author.

A winner_handle must be either null:

  {
                        "name" : "author1 middle1a middle1b last1",
                        "location_code" : "UK",
                        "location_name" : "United Kingdom",
                        "email" : null,
                        "url" : null,
                        "twitter" : null,
                        "github" : null,
                        "affiliation" : null,
                        "winner_handle" : null,
                        "author_number" : 1
                },

or it must be of the a POSIX Fully portable characters and +:

[A-Za-z0-9][A-Za-z0-9._+-]

I.e.,:

               {
                        "name" : "author2 middle2 last2",
                        "location_code" : "US",
                        "location_name" : "United States of America",
                        "email" : "user2@example.com",
                        "url" : "http:\/\/host2.example.com\/index.html",
                        "twitter" : "@twitter2",
                        "github" : "@github2",
                        "affiliation" : "an affiliation for #2 author",
                        "winner_handle" : "author2-last2",
                        "author_number" : 2
                }

Sorry for the late addition. The winner_handle was something we knew we needed, but didn't know the details.

No worries on the delay! I wonder though: if you look at the .author.json ahead of time how will the judging be anonymous? Or will you rely on the tool to verify that everything is okay and only if the entry wins will you look at the information?

BTW: The format for winner_handle is identical to the basename / value of extra_file in the manifest.

In the sense of characters allowed, right? I guess there should be a function that can be used for both (the txzchk tests are a bit different so they cannot be done in the same way).

But another question that comes up (based on the example you gave which suggests to me it's supposed to be the actual name): is the winner handle supposed to be name or an actual handle? The reason I ask is diacritics. In German if you don't have the umlaut (e.g. ü) you can just add an E e.g. ü becomes ue; but in other languages this might not be the case. How should this be handled?

lcn2 commented 2 years ago

I'm going to add the number of authors to the .author.json files and that can be used to verify that the number of authors is actually correct: if the number of authors in the authors array is 5 and the total number of authors is 3 then there's a problem (and the reverse is true).

That seems reasonable. We had the number of authors as author_count in .author.json at one time (or in our design) in the past. Wanting it back is understandable.

xexyl commented 2 years ago

I'm going to add the number of authors to the .author.json files and that can be used to verify that the number of authors is actually correct: if the number of authors in the authors array is 5 and the total number of authors is 3 then there's a problem (and the reverse is true).

That seems reasonable. We had the number of authors as author_count in .author.json at one time (or in our design) in the past. Wanting it back is understandable.

It's already done in one of the commits - a lot of commits this morning! I'm done for the day but I'll continue more tomorrow.

xexyl commented 2 years ago

Do you want me to rename it though so it says author_count? I can do that if that's better.

Edit: Just did it since that's how you had it originally and that's the name of the variable anyway.

lcn2 commented 2 years ago

No worries on the delay! I wonder though: if you look at the .author.json ahead of time how will the judging be anonymous? Or will you rely on the tool to verify that everything is okay and only if the entry wins will you look at the information?

We will NOT look at .author.json files of entires that do NOT win.

We have always needed the ability to determine the authorship of a winning entry: in order to announce who won. We have used various methods of separating 'who wrote' from 'what was written' during the judging process.

This is why we have two JSON files. This why author identifiable information NOT appear in .info.json.

We will have an internal procedure to unpack a compressed tarball in such a way that 'the authorship' will be kept separate from the entry.

This is also why it is important the tools such as jauthchk (and jinfochk) not be chatty (without -v level): that they conform well to the Unix tool philosophy. That exit codes be unique for those tools.

We have a plan to sanitize information printed by from calls to err(), errp(), warn() and warnp() from both jinfochk, jauthchk, and txzchk later on. For example, eventually with those two tools, unless -v level is used, the error/warning output from those tools won't leak authorship information. But that is a simple pass that will be part of tools (outside the scope of this repo) that is yet to be developed. As of now, we don't want to cramp the development of those tools ...

... apart from asking that those tools be able to say nothing if all is well, to exit 0 when all is well, to exit non-zero in a reasonably unique exit code ...

... OK it would be nice if a call to, say err(), didn't print as part of the diagnostic, the authorship of the entry. :-)

lcn2 commented 2 years ago

In the sense of characters allowed, right? I guess there should be a function that can be used for both (the txzchk tests are a bit different so they cannot be done in the same way).

We chose the POSIX safe characters for both things like basename of a file AND the winner_handle for a reason. There will be a filename created somewhere based on those strings by an automated process. We don't want bad side effects to happen when we do so.

xexyl commented 2 years ago

No worries on the delay! I wonder though: if you look at the .author.json ahead of time how will the judging be anonymous? Or will you rely on the tool to verify that everything is okay and only if the entry wins will you look at the information?

We will NOT look at .author.json files of entires that do NOT win.

That makes much more sense then.

We have always needed the ability to determine the authorship of a winning entry: in order to announce who won. We have used various methods of separating 'who wrote' from 'what was written' during the judging process.

Interesting. Maybe more so since you don't typically reveal certain information about the judging process.

This is why we have two JSON files. This why author identifiable information NOT appear in .info.json.

Makes sense.

We will have an internal procedure to unpack a compressed tarball in such a way that 'the authorship' will be kept separate from the entry.

Just making sure!

This is also why it is important the tools such as jauthchk (and jinfochk) not be chatty (without -v level): that they conform well to the Unix tool philosophy. That exit codes be unique for those tools.

Makes sense. But having an error or warning for each problem is still desirable, right?

We have a plan to sanitize information printed by from calls to err(), errp(), warn() and warnp() from both jinfochk, jauthchk, and txzchk later on. For example, eventually with those two tools, unless -v level is used, the error/warning output from those tools won't leak authorship information. But that is a simple pass that will be part of tools (outside the scope of this repo) that is yet to be developed. As of now, we don't want to cramp the development of those tools ...

Good to know. Let me know if you need me to change anything though!

... apart from asking that those tools be able to say nothing if all is well, to exit 0 when all is well, to exit non-zero in a reasonably unique exit code ...

Including message, right?

... OK it would be nice if a call to, say err(), didn't print as part of the diagnostic, the authorship of the entry. :-)

I'll keep that in mind.

xexyl commented 2 years ago

In the sense of characters allowed, right? I guess there should be a function that can be used for both (the txzchk tests are a bit different so they cannot be done in the same way).

We chose the POSIX safe characters for both things like basename of a file AND the winner_handle for a reason. There will be a filename created somewhere based on those strings by an automated process. We don't want bad side effects to happen when we do so.

This makes sense yes. I was just wondering how it might work if the person does have accents/etc. in their name. Well I guess it's probably not any different from before.

lcn2 commented 2 years ago

But another question that comes up (based on the example you gave which suggests to me it's supposed to be the actual name): is the winner handle supposed to be name or an actual handle? The reason I ask is diacritics. In German if you don't have the umlaut (e.g. ü) you can just add an E e.g. ü becomes ue; but in other languages this might not be the case. How should this be handled?

You made a very skilled observation, well done!

And for that, we have a request: :-)

Write a util.c function that converts characters from an author's name, and turn it into a handle that is within the MAX_WINNER_HANDLE length limit and containing only those POSIX-safe characters:

[A-Za-z0-9][A-Za-z0-9._+-]*

Such a function would need to translate common non-ASCII characters such you pointed out, into reasonable ASCII equivalent.

Then add author_handle to the authors array elements in the .author.json file.

The conversion need NOT need be perfect. All once needs is that the author_handle be something that is a safe POSIX filename, AND that a human given both the real name AND the winner_handle (taken from an author_handle of a winning entry) could see how they are related.

BTW: This is why we require, in mkioccentry, for someone to provide some form of a name. I.e., you cannot skip providing some form of an author's name.

Yes, it is possible that two people with similar names, such that their author_handle's are identical. But when it comes to announcing who won, we will have access to the full JSON data for that entry. Moreover, and outside the scope of this repo, we will have access via the IOCCC entry UUID to information about who entered the contest (and thus how to contact them). Sure, there are pathological cases where two different people call themselves, anonymous, get the same author_handle and both WIN. But that is a risk we are very willing to take.

For now, the function needs to make a reasonable translation of the name in forming a author_handle.

Some suggestions:

When the name is too long to form a handle, prefer characters in the last name
When the last name is too long, prefer leading characters of the last name
Consider using a _ in place of spaces in the name

BTW: We are looking at increasing the length of MAX_WINNER_HANDLE as the value put into limit_ioccc.h was just an initial placeholder.

BTW: Perhaps MAX_WINNER_HANDLE should be changed to just MAX_HANDLE so that it can serve as a limit to both winner_handle and author_handle? If so, feel free to mak that change now.

A question to ponder: the user of mkiocccentry provides a winner_handle, what should happen to the computed author_handle? Should be be allowed to be different? Should it be changed to match?

Consider testing the transition function by writing the following tool:

formhandle [-h] [-v level] [-V] some name here

The output of formhandle would be a computed author_handle and exit 0 is all is well.

xexyl commented 2 years ago

But another question that comes up (based on the example you gave which suggests to me it's supposed to be the actual name): is the winner handle supposed to be name or an actual handle? The reason I ask is diacritics. In German if you don't have the umlaut (e.g. ü) you can just add an E e.g. ü becomes ue; but in other languages this might not be the case. How should this be handled?

You made a very skilled observation, well done!

Thank you for that! :)

Well I do have a love for languages and words so it's probably expected.

And for that, we have a request: :-)

Okay.

Write a util.c function that converts characters from an author's name, and turn it into a handle that is within the MAX_WINNER_HANDLE length limit and containing only those POSIX-safe characters:

[A-Za-z0-9][A-Za-z0-9._+-]*

Such a function would need to translate common non-ASCII characters such you pointed out, into reasonable ASCII equivalent.

I can do some of this but it would also require knowing the language. I'm not by any means fluent in German but I know some and what I said is true; however I don't know about this for other languages. I guess accents in e.g. Spanish could just be left off? I've never played with strings that have non ASCII characters so this will be interesting.

Then add author_handle to the authors array elements in the .author.json file.

In other words the mkiocccentry would use this function prior to writing the .author.json file.

The conversion need NOT need be perfect. All once needs is that the author_handle be something that is a safe POSIX filename, AND that a human given both the real name AND the winner_handle (taken from an author_handle of a winning entry) could see how they are related.

Well that's good that it doesn't need to be perfect! :) But as for the requirements I want to make sure I follow it right.

The basic idea is converting non-ASCII characters to ASCII equivalents - but not using the hated toascii() function. In the case that there's say umlauts in German (I don't know if this would cause a problem with other languages that have umlauts but do not have this convenient option - what to do about that?) it can simply change it to the letter followed by an E. I'm not actually sure how to go about doing that, mind you, unless I can test for those characters exactly but like I said I've never played with this before.

BTW: This is why we require, in mkioccentry, for someone to provide some form of a name. I.e., you cannot skip providing some form of an author's name.

For filenames and paths, in other words?

Yes, it is possible that two people with similar names, such that their author_handle's are identical. But when it comes to announcing who won, we will have access to the full JSON data for that entry. Moreover, and outside the scope of this repo, we will have access via the IOCCC entry UUID to information about who entered the contest (and thus how to contact them). Sure, there are pathological cases where two different people call themselves, anonymous, get the same author_handle and both WIN. But that is a risk we are very willing to take.

I wondered about this when you started requiring that everyone has to provide a name! But then again one could suggest that for all one knows all the previous anonymous winners is actually a single winner (probably not but it could be as you have no way of knowing).

I also had the thought that (as I think I mentioned yesterday) people with the same name who happen to have a love for C might actually enter the contest together simply because they share the same name - it could happen even if highly unlikely.

For now, the function needs to make a reasonable translation of the name in forming a author_handle.

I'm going to have to give this quite some thought I think. It'd be better to finish jinfochk and jauthchk first, right?

Some suggestions:

When the name is too long to form a handle, prefer characters in the last name

That sounds reasonable - make it just like how you have the previous winners: surnames. But then the question is how to identify that it's the surname? I guess one can use strrchr(3) on space and anything after that is the surname. Is that what you would suggest?

When the last name is too long, prefer leading characters of the last name

In other words truncate the name?

Consider using a _ in place of spaces in the name

That makes sense and probably very important since spaces in filenames are a pain.

BTW: We are looking at increasing the length of MAX_WINNER_HANDLE as the value put into limit_ioccc.h was just an initial placeholder.

Will keep an eye out for that.

BTW: Perhaps MAX_WINNER_HANDLE should be changed to just MAX_HANDLE so that it can serve as a limit to both winner_handle and author_handle? If so, feel free to mak that change now.

Perhaps that's an idea. I just did the change - hopefully you haven't already merged the pull request (if so I'll have to redo it I guess).

A question to ponder: the user of mkiocccentry provides a winner_handle, what should happen to the computed author_handle? Should be be allowed to be different? Should it be changed to match?

Good questions indeed. I'm afraid my answer right now is: I've no idea! :)

Consider testing the transition function by writing the following tool:
formhandle [-h] [-v level] [-V] some name here
The output of formhandle would be a computed author_handle and exit 0 is all is well.

So the final argv element would be a name which would use the function in util.c?

I know I already asked but I'll reiterate it again: should the jinfochk and jauthchk tools be finished first? I think that for me that would be ideal - though depending on how the parsing of arrays goes (I did think of another idea a bit ago) it might be a way to break away from it for a bit.

lcn2 commented 2 years ago

Makes sense. But having an error or warning for each problem is still desirable, right?

Yes.

xexyl commented 2 years ago

Makes sense. But having an error or warning for each problem is still desirable, right?

Yes.

I thought so but wanted to be sure!

lcn2 commented 2 years ago

... apart from asking that those tools be able to say nothing if all is well, to exit 0 when all is well, to exit non-zero in a reasonably unique exit code ... Including message, right?

Yes.

lcn2 commented 2 years ago

Do you want me to rename it though so it says author_count? I can do that if that's better.

Edit: Just did it since that's how you had it originally and that's the name of the variable anyway.

What you did was fine.

xexyl commented 2 years ago

... apart from asking that those tools be able to say nothing if all is well, to exit 0 when all is well, to exit non-zero in a reasonably unique exit code ... Including message, right?

Yes.

Right. In your definition exit code includes message because of the err() and related functions. I wasn't sure of this at first. Thanks.

ioccc-src / mkiocccentry

Enhancement: create the jauthchk tool - check on the contents of an .author.json file #69