ioccc-src / mkiocccentry

Form an IOCCC submission as a compressed tarball file

Enhancement: create the jauthchk tool - check on the contents of an .author.json file #69

Closed lcn2 closed 2 years ago

lcn2 commented 2 years ago

We need to create the jauthchk tool in order to help verify the contents of an .author.json file found within an entry directory.

This tool will primarily be used by other tools (not humans). As such it should behave like fnamchk: if all is well, it should not print anything and simply exit 0. If problems are found with the .author.json file, then warning messages should be printed to stderr AND the jauthchk tool should exit with a non-zero status. A -v level may be used to assist in debugging.

The jauthchk tool is primarily a stand alone tool. As a sanity check, the mkiocccentry program should execute the jauthchk code AFTER the .author.json file has been created and before the compressed tarball is formed. If the mkiocccentry program sees a 0 exit status, then all is well. For a non-zero exit code, mkiocccentry probably should abort, because any problem that jauthchk detects in what mkiocccentry wrote into .author.json indicates a serious mismatch between what mkiocccentry is doing and what jauthchk expects.
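A minimal sketch of how a caller such as mkiocccentry might run the checker and act on its exit status. The helper name run_checker() and the choice of fork()/execl() are assumptions made for illustration; the actual mkiocccentry code may launch its helper tools in a different way.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * run_checker - hypothetical helper: run "checker file" and report the result
 *
 * Returns the exit status of the checker (0 ==> all is well),
 * or -1 if the checker could not be started or did not exit normally.
 */
int
run_checker(char const *checker, char const *file)
{
    pid_t pid;
    int status = 0;

    if (checker == NULL || file == NULL) {
        return -1;
    }
    fflush(NULL);               /* avoid duplicated stdio output in the child */
    pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        /* child: replace ourselves with the checker */
        execl(checker, checker, file, (char *)NULL);
        _exit(127);             /* exec failed */
    }
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status)) {
        return -1;
    }
    return WEXITSTATUS(status);
}
```

A caller would treat any return value other than 0 as a reason to abort before forming the tarball.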

The following might be how mkiocccentry output is changed with the use of this tool (and the other tool):

Is the above list a correct list of files in your entry? [yn]: y

Checking the format of .info.json ...
... all appears well with the .info.json file.

Checking the format of .author.json ...
... all appears well with the .author.json file.

About to run the tar command to form the compressed tarball ...

As a stand alone tool, the jauthchk tool will be invoked by other tools as part of the IOCCC submission process. That process is beyond the scope of this repo. Suffice it to say that the IOCCC judges will use this tool as part of their submission workflow.

Here is a possible command line usage message:

jauthchk [-h] [-v level] [-V] file

    -h              print help message and exit 0
    -v level        set verbosity level: (def level: 0)
    -V              print version string and exit

    file            path to a .author.json file

exit codes:

    0               no errors or warnings detected
    >0              some error(s) and/or warning(s) were detected

NOTE: We mention file above even though the canonical filename will be .author.json. The tool should NOT check, nor object to using a different filename.

The mkiocccentry tool will need to invoke this tool. As such, a method similar to the one used to find and specify the location of txzchk should be used. As this tool is one of 2 tools being considered, we recommend that the following be added to the mkiocccentry command line:

    -j /path/to/jinfochk    path to jinfochk executable used by txzchk (def: ./jinfochk)
    -J /path/to/jauthchk    path to jauthchk executable used by txzchk (def: ./jauthchk)

IMPORTANT: While it might be tempting to consider depending on some general JSON checker, we do NOT need nor want that. It is important that the mkiocccentry GitHub repo remain stand alone. I.e., all the code needed by someone wishing to enter the IOCCC (besides a C compiler, make, tar, cp, ls) should be found in this GitHub repo alone. As there is NO standard JSON tool in widespread distribution, all of the code for this tool needs to reside in this repo only.

IMPORTANT: We do not need a general JSON format checker. We only need to verify that the file contains the JSON needed and only the JSON needed for the judges to process IOCCC entries.

While it is NOT recommended, if someone wishes to edit their .author.json and re-create the compressed tarball we cannot stop them. As such mkiocccentry should be STRICT on what it writes into .author.json AND jauthchk should be permissive (but not to a fault) in what it considers as OK.

This tool should neither generate an error, nor warn if someone were to reformat the JSON. And as JSON is not order dependent, if someone wishes to reorder the JSON elements, that is fine. As long as all the required JSON elements are present, and no new JSON elements are found, and the version string matches, all is OK.

Should something go wrong and a change to the JSON is required during an open IOCCC, the judges will preserve the older JSON check tools and use those against older JSON formats. Thus there is no need for a >= version check: a version string match seems good enough.

See the followup comment for details on the checks needed against an .author.json file.

lcn2 commented 2 years ago

The following is a guide as to what needs to be checked in an .author.json file. The values found below are simply examples.

To recap an important point made above:

IMPORTANT: We do not need a general JSON format checker. We only need to verify that the file contains the JSON needed and only the JSON needed for the judges to process IOCCC entries.

While it is NOT recommended, if someone wishes to edit their .author.json and re-create the compressed tarball we cannot stop them. As such mkiocccentry should be STRICT on what it writes into .author.json AND jauthchk should be permissive (but not to a fault) in what it considers as OK.

This tool should neither generate an error, nor warn if someone were to reformat the JSON. And as JSON is not order dependent, if someone wishes to reorder the JSON elements, that is fine. As long as all the required JSON elements are present, and no new JSON elements are found, and the version string matches, all is OK.

See www.json.org for details of the JSON format.

{

All files must start with a {.

    "IOCCC_info_version" : "1.7 2022-02-04",

The IOCCC_info_version value MUST match INFO_VERSION from limit_ioccc.h.

    "ioccc_contest" : "IOCCC28",

The ioccc_contest value MUST match IOCCC_CONTEST from limit_ioccc.h.

    "ioccc_year" : 2022,

The ioccc_year value MUST match IOCCC_YEAR from limit_ioccc.h.

    "mkiocccentry_version" : "0.35 2022-02-07",

The mkiocccentry_version value MUST match MKIOCCCENTRY_VERSION from limit_ioccc.h.

    "iocccsize_version" : "28.7 2022-02-01",

The iocccsize_version value MUST match IOCCCSIZE_VERSION from limit_ioccc.h.

    "IOCCC_contest_id" : "test",

The IOCCC_contest_id value MUST be either test or a valid Contest ID (a UUID with version UUID_VERSION and variant UUID_VARIANT as defined in limit_ioccc.h).

    "entry_num" : 0,

The entry_num value must be >= 0 and <= MAX_ENTRY_NUM as defined in limit_ioccc.h.

    "authors" : [
        {
            "name" : "author name",
            "location_code" : "CC",
            "location_name" : "Cocos (Keeling) Islands (the)",
            "email" : "test@example.com",
            "url" : "https:\/\/example.com\/index.html",
            "twitter" : "@twitter",
            "github" : "@github",
            "affiliation" : "an affiliation",
            "winner_handle" : "author-last",
            "author_number" : 0
        }
    ],

The authors must be a JSON array. In that array there MUST be exactly one of:

The values conform with the restrictions imposed by the get_author_info() function from mkiocccentry.c.

NOTE: When the user declined to enter a value (where permitted by the get_author_info() function), the value will be the JSON special value null.

NOTE: Due to a design flaw of the JSON spec, the last value of the authors array cannot be followed by a "," (ASCII comma). All values except the last value of the authors array must be followed by a "," (ASCII comma).

    "formed_timestamp" : 1644618833,

The formed_timestamp value must be an integer >= MIN_TIMESTAMP as defined in limit_ioccc.h.

    "formed_timestamp_usec" : 631668,

The formed_timestamp_usec value must be an integer >= 0 and <= 999999.

    "timestamp_epoch" : "Thr Jan  1 00:00:00 1970 UTC",

The timestamp_epoch value must match TIMESTAMP_EPOCH as defined in limit_ioccc.h.

    "min_timestamp" : 1643987926,

The min_timestamp value must match MIN_TIMESTAMP as defined in limit_ioccc.h.

    "formed_UTC" : "Fri Feb 11 23:09:42 2022 UTC"

The formed_UTC value must be a date string in the same format as the output of the following date command:

date "+%a %b %d %H:%M:%S %Y UTC"
}

All files must end with a }.
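As a rough illustration of the outer brace check, the sketch below tests whether a file that has already been read into a NUL terminated buffer starts with { and ends with }. The function name has_outer_braces() is hypothetical, and ignoring leading and trailing whitespace is an assumption based on the statement that reformatting the JSON is not an error.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * has_outer_braces - hypothetical check: buffer starts with { and ends with }
 *
 * Leading and trailing whitespace is skipped (an assumption, see above).
 */
bool
has_outer_braces(char const *buf)
{
    size_t first = 0;
    size_t last;

    if (buf == NULL) {
        return false;
    }
    last = strlen(buf);
    while (first < last && isspace((unsigned char)buf[first])) {
        ++first;
    }
    while (last > first && isspace((unsigned char)buf[last - 1])) {
        --last;
    }
    /* need room for at least the two braces */
    if (last - first < 2) {
        return false;
    }
    return buf[first] == '{' && buf[last - 1] == '}';
}
```

This only looks at the outermost characters; it says nothing about whether the braces in between are balanced.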

To be valid, the .author.json file must have exactly one of each of the above mentioned JSON "string" : "value" elements (or a JSON array in the case of authors). No other JSON elements are allowed.
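One way to enforce the "exactly one of each, and nothing else" rule is to count each top level name as it is parsed. The sketch below assumes the parser has already collected the top level names into an array; the function name names_ok() and that calling convention are hypothetical. The list of names is copied from the walkthrough above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* top level names taken from the comment above */
static char const * const required[] = {
    "IOCCC_info_version", "ioccc_contest", "ioccc_year",
    "mkiocccentry_version", "iocccsize_version", "IOCCC_contest_id",
    "entry_num", "authors", "formed_timestamp", "formed_timestamp_usec",
    "timestamp_epoch", "min_timestamp", "formed_UTC"
};
#define REQUIRED_COUNT (sizeof(required) / sizeof(required[0]))

/*
 * names_ok - hypothetical check of the top level names found in a file
 *
 * Returns true only if every required name was found exactly once and
 * no other name was found.  The order of the names does not matter.
 */
bool
names_ok(char const * const *found, size_t found_len)
{
    unsigned int seen[REQUIRED_COUNT] = {0};
    size_t i, j;

    if (found == NULL) {
        return false;
    }
    for (i = 0; i < found_len; ++i) {
        for (j = 0; j < REQUIRED_COUNT; ++j) {
            if (found[i] != NULL && strcmp(found[i], required[j]) == 0) {
                break;
            }
        }
        if (j >= REQUIRED_COUNT) {
            return false;       /* unknown (or NULL) name */
        }
        if (++seen[j] > 1) {
            return false;       /* duplicate name */
        }
    }
    for (j = 0; j < REQUIRED_COUNT; ++j) {
        if (seen[j] != 1) {
            return false;       /* missing name */
        }
    }
    return true;
}
```

The same counting approach could be applied to the names inside each element of the authors array.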

lcn2 commented 2 years ago

Comments, suggestions, corrections and clarifications for the above long comment are welcome.

We recommend that you copy only the relevant parts of the long comment when you do. :-)

If/where needed, we will attempt to modify the long comments in place above, where and when possible.

xexyl commented 2 years ago

See www.json.org for details of the JSON format.

This will be helpful to me. Thanks.

{

All files must start with a {.

That's easy enough to check.

  "IOCCC_info_version" : "1.7 2022-02-04",

The IOCCC_info_version value MUST match INFO_VERSION from limit_ioccc.h.

  "ioccc_contest" : "IOCCC28",

The ioccc_contest value MUST match IOCCC_CONTEST from limit_ioccc.h.

For the above two we can just use strtok_r() on : and then when finding the right first field do a strcmp() on the defines.

  "ioccc_year" : 2022,

The ioccc_year value MUST match IOCCC_YEAR from limit_ioccc.h.

In this case strtol() can be called and do a simple comparison. Should it be an error if someone were to change it to be:

"ioccc_year" : "2022",

I.e. add quotes or something else. That is a general question for all the other fields and the other file as well: if the user changes it so that it's not the right type (for example - int to string) should this be an error in validation OR should the quotes be ignored?

Since you're only after if the right fields (with the right values) are there is it okay if the correct fields are there at the same time as the format being wrong (by format being wrong I mean not validly JSON)?

  "IOCCC_contest_id" : "test",

The IOCCC_contest_id value MUST be either test or a valid Contest ID (a UUID with version UUID_VERSION and variant UUID_VARIANT as defined in limit_ioccc.h).

As for this one: how do you propose this is tested? "test" is easy enough but what about the rest? Should it do something like fnamchk does?

What might be helpful here: can you provide an actual UUID string that might be valid in the contest so that testing this tool can be easier?

  "authors" : [
      {
          "name" : "author name",
          "location_code" : "CC",
          "location_name" : "Cocos (Keeling) Islands (the)",
          "email" : "test@example.com",
          "url" : "https:\/\/example.com\/index.html",
          "twitter" : "@twitter",
          "github" : "@github",
          "affiliation" : "an affiliation",
          "author_number" : 0
      }
  ],

The authors must be a JSON array. In that array there MUST be exactly one of:

Here this makes me think that the format of the JSON file does in fact matter - since you say it must be an array and the fields must be exactly the below. Does this hold everywhere else too?

  • name
  • location_code
  • location_name
  • email
  • url
  • twitter
  • github
  • affiliation
  • author_number

The values conform with the restrictions imposed by the get_author_info() function from mkiocccentry.c.

That should be helpful.

NOTE: When the user declined to enter a value (where permitted by the get_author_info() function), the value will be the JSON special value null.

Good to know.

NOTE: Due to a design flaw of the JSON spec, the last value of the authors array cannot be followed by a "," (ASCII comma). All values except the last value of the authors array must be followed by a "," (ASCII comma).

That's an interesting bit of information. Thanks for clarifying this.

  "formed_timestamp" : 1644618833,

The formed_timestamp value must be an integer >= MIN_TIMESTAMP as defined in limit_ioccc.h.

Which integer type do you suggest? time_t?

  "formed_UTC" : "Fri Feb 11 23:09:42 2022 UTC"

The formed_UTC value must be a date string in the same format as the output of the following date command:

date "+%a %b %d %H:%M:%S %Y UTC"
}

Do you have a recommended way to go about this? The first thing that popped into my head is strptime() but I'm not sure if that's the best/better way to go about it.

--

These comments apply to this tool as well as what I'll say about the next one (if any comments/questions):

Hopefully these questions can start a good conversation on it and help bring clarity. Although I started this I'm not sure if I can finish this one. I might be able to but I think I'll have to take a break for a few days and maybe return to it in the middle of next week. Really what it comes down to is learning a bit about JSON (after your replies).

That being said I might start writing a parser that simply separates the field name and value into a list (I don't mean a linked list necessarily). I don't think that'll be today though: I'd rather get your feedback first and then process it.

It's possible tomorrow morning I can work on this a bit but if not there's no other time I can tomorrow. Monday I should be able to do a bit of it but I might be away a while since it is after all my 40th birthday (that's also - as I said - why I'll be gone most of tomorrow).

xexyl commented 2 years ago

Something occurred to me that you did not address: should certain characters be verified that they're escaped? For example should the URL have the /s escaped like it is printed out by mkiocccentry?

Also should the mkiocccentry tool check for characters like \ in URLs? (Maybe it already does and I don't recall).

lcn2 commented 2 years ago

In this case strtol() can be called and do a simple comparison. Should it be an error if someone were to change it to be:

"ioccc_year" : "2022", I.e. add quotes or something else. That is a general question for all the other fields and the other file as well: if the user changes it so that it's not the right type (for example - int to string) should this be an error in validation OR should the quotes be ignored?

Since you're only after if the right fields (with the right values) are there is it okay if the correct fields are there at the same time as the format being wrong (by format being wrong I mean not validly JSON)?

The value 2022 without quotes is correct. It is valid JSON (see the spec). The value "2022", a string, would be an error for that numeric value.

We will reply later to your other comments.

xexyl commented 2 years ago

The value 2022 without quotes is correct. It is valid JSON (see the spec). The value "2022", a string, would be an error for that numeric value.

Does that mean that with quotes it should be considered invalid in the context of the tool? I get that impression but want to be sure.

We will reply later to your other comments.

Thank you. I'll probably look at it tomorrow - or else later on today if I get a chance.

lcn2 commented 2 years ago

The value 2022 without quotes is correct. It is valid JSON (see the spec). The value "2022", a string, would be an error for that numeric value.

Does that mean that with quotes it should be considered invalid in the context of the tool? I get that impression but want to be sure.

We will reply later to your other comments.

Thank you. I'll probably look at it tomorrow - or else later on today if I get a chance.

Yes. If the value is numeric, then there MUST be no quotes around the value. In JSON:

"123" != 123

Only strings and names appear in double quotes. JSON values such as these are not double quoted:

  • true
  • false
  • null
  • 12345
  • -123
  • 12.345
  • 12.345e-7

In the JSON used by IOCCC, we do not have use for numeric values that are non-integers. So the last 2 value forms may be safely ignored.
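To make the "123" != 123 distinction concrete, here is a small sketch that classifies a value token that has already been separated from its name. The enum and the function name classify_value() are hypothetical. Non-integer numbers are reported as bad because, as noted above, the IOCCC JSON has no use for them. The string test only looks at the surrounding double quotes and does not validate what is inside them.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum json_kind {
    JSON_BAD, JSON_STRING, JSON_INTEGER, JSON_TRUE, JSON_FALSE, JSON_NULL
};

/*
 * classify_value - hypothetical classifier for a single JSON value token
 *
 * The token must have no leading or trailing whitespace.
 */
enum json_kind
classify_value(char const *tok)
{
    size_t len, i;

    if (tok == NULL || tok[0] == '\0') {
        return JSON_BAD;
    }
    len = strlen(tok);
    if (tok[0] == '"') {
        /* surface check only: opening and closing double quote */
        return (len >= 2 && tok[len - 1] == '"') ? JSON_STRING : JSON_BAD;
    }
    if (strcmp(tok, "true") == 0) {
        return JSON_TRUE;
    }
    if (strcmp(tok, "false") == 0) {
        return JSON_FALSE;
    }
    if (strcmp(tok, "null") == 0) {
        return JSON_NULL;
    }
    /* integer: optional '-', then digits, with no leading zeros */
    i = (tok[0] == '-') ? 1 : 0;
    if (tok[i] == '\0' || (tok[i] == '0' && tok[i + 1] != '\0')) {
        return JSON_BAD;
    }
    for (; tok[i] != '\0'; ++i) {
        if (!isdigit((unsigned char)tok[i])) {
            return JSON_BAD;
        }
    }
    return JSON_INTEGER;
}
```

With this, a checker for ioccc_year would require JSON_INTEGER and would reject the quoted form "2022", which classifies as JSON_STRING.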

lcn2 commented 2 years ago

For the above two we can just use strtok_r() on : and then when finding the right first field do a strcmp() on the defines.

We are sure that you remember that the JSON elements may come in any order, that whitespace can change without impacting the JSON validity, and that strings such as "string string2" can have whitespace within them, such as in a name.

lcn2 commented 2 years ago

What might be helpful here: can you provide an actual UUID string that might be valid in the contest so that testing this tool can be easier?

Here is a sample UUID:

12345678-1234-4321-abcd-1234567890ab

For more info see this other comment.
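A sketch of how such a UUID string might be checked follows. The function name uuid_ok() is hypothetical, and the version and variant are passed in as parameters because the actual values of UUID_VERSION and UUID_VARIANT live in limit_ioccc.h and are not stated in this thread. Treating the variant as a single hex digit that must match exactly is also an assumption.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* convert one hex digit into its value (caller ensures isxdigit) */
static int
hexval(char c)
{
    if (isdigit((unsigned char)c)) {
        return c - '0';
    }
    return tolower((unsigned char)c) - 'a' + 10;
}

/*
 * uuid_ok - hypothetical check of a UUID of the form
 *           xxxxxxxx-xxxx-Vxxx-Nxxx-xxxxxxxxxxxx
 *
 * V (index 14) must equal version and N (index 19) must equal variant.
 */
bool
uuid_ok(char const *s, int version, int variant)
{
    size_t i;

    if (s == NULL || strlen(s) != 36) {
        return false;
    }
    for (i = 0; i < 36; ++i) {
        if (i == 8 || i == 13 || i == 18 || i == 23) {
            if (s[i] != '-') {
                return false;
            }
        } else if (!isxdigit((unsigned char)s[i])) {
            return false;
        }
    }
    return hexval(s[14]) == version && hexval(s[19]) == variant;
}
```

In the sample UUID above, the version digit is 4 and the variant digit is a, so uuid_ok(sample, 4, 0xa) accepts it. The literal string test would be handled separately with strcmp().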

lcn2 commented 2 years ago

Here this makes me think that the format of the JSON file does in fact matter - since you say it must be an array and the fields must be exactly the below. Does this hold everywhere else too?

Probably, yes.

lcn2 commented 2 years ago

    "formed_timestamp" : 1644618833,

The formed_timestamp value must be an integer >= MIN_TIMESTAMP as defined in limit_ioccc.h.

Which integer type do you suggest? time_t?

JSON numbers can be of any length. JSON numbers are typeless. JSON integers are just a string of decimal digits of any length.

See the number section of the JSON spec.

You need not support huge multi-precision numbers. Instead try to form a long long.

You might want to look at the length of the characters of a JSON number. Now LLONG_MAX is:

0x7fffffffffffffff == 9223372036854775807

And in decimal, 9223372036854775807 has 19 digits. So define:

#define LLONG_MAX_BASE10_DIGITS (19)

Then if the length (not counting any leading - sign) of the JSON number exceeds LLONG_MAX_BASE10_DIGITS, reject it as being too large.

Then use strtoll(3) to convert the characters of the JSON number into a long long, taking care (as previously discussed) to detect when errno is changed to non-zero (by presetting it to 0 before the strtoll(3) call), rejecting LLONG_MIN and LLONG_MAX, and finally assigning the result to a long long value.

While checking the length (ignoring any leading - sign) of the JSON number is optional, it might still be a good idea in case strtoll(3) gets really confused with a huge number of digits.
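The steps described above might be sketched as follows. The function name json_int_to_llong() is hypothetical; LLONG_MAX_BASE10_DIGITS is the define suggested above. The digits-only test is an extra precaution that is not spelled out in the comment.

```c
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define LLONG_MAX_BASE10_DIGITS (19)

/*
 * json_int_to_llong - hypothetical conversion of a JSON integer string
 *
 * Returns true and sets *result on success, false otherwise.
 */
bool
json_int_to_llong(char const *str, long long *result)
{
    char const *digits;
    char *end = NULL;
    long long val;
    size_t len;

    if (str == NULL || result == NULL) {
        return false;
    }
    /* length check, not counting any leading - sign */
    digits = (str[0] == '-') ? str + 1 : str;
    len = strlen(digits);
    if (len == 0 || len > LLONG_MAX_BASE10_DIGITS) {
        return false;
    }
    /* extra precaution: only decimal digits may follow the sign */
    if (strspn(digits, "0123456789") != len) {
        return false;
    }
    /* preset errno so that a strtoll(3) failure can be detected */
    errno = 0;
    val = strtoll(str, &end, 10);
    if (errno != 0 || end == str || *end != '\0') {
        return false;
    }
    /* reject the extreme values as described above */
    if (val == LLONG_MIN || val == LLONG_MAX) {
        return false;
    }
    *result = val;
    return true;
}
```

A range check such as formed_timestamp >= MIN_TIMESTAMP would then be an ordinary comparison on the resulting long long.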

lcn2 commented 2 years ago

Do you have a recommended way to go about this? The first thing that popped into my head is strptime() but I'm not sure if that's the best/better way to go about it.

The use of strptime() is a good idea.
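A minimal sketch of a strptime() based check, using the same format string as the date command shown earlier. The function name formed_utc_ok() is hypothetical. This assumes the default "C" locale, so that %a and %b match English day and month abbreviations. Note that strptime() parses the fields but does not verify that the day of the week agrees with the date.

```c
#define _XOPEN_SOURCE 700       /* for strptime() */
#include <stdbool.h>
#include <string.h>
#include <time.h>

/*
 * formed_utc_ok - hypothetical check of a formed_UTC style date string
 *
 * Returns true if the whole string matches "%a %b %d %H:%M:%S %Y UTC".
 */
bool
formed_utc_ok(char const *str)
{
    struct tm tm;
    char const *end;

    if (str == NULL) {
        return false;
    }
    memset(&tm, 0, sizeof(tm));
    end = strptime(str, "%a %b %d %H:%M:%S %Y UTC", &tm);

    /* the entire string must have been consumed */
    return end != NULL && *end == '\0';
}
```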

lcn2 commented 2 years ago

It is hard for us to determine what parts of a long reply need to be responded to and what are just comments. It is easy for us to miss something within such a long reply, sorry.

Perhaps single issue messages might make that easier? Anyway if we missed a question, please ask it again, perhaps as 1 question (or 1 question set) per post?

xexyl commented 2 years ago

Yes. If the value is numeric, then there MUST be no quotes around the value. In JSON:

"123" != 123

This makes sense.

Only strings and names appear in double quotes. JSON values such as these are not double quoted:

  • true
  • false
  • null
  • 12345
  • -123
  • 12.345
  • 12.345e-7

In the JSON used by IOCCC, we do not have use for numeric values that are non-integers. So the last 2 value forms may be safely ignored.

This is helpful, thanks. I'll also take a look at the JSON documentation you provided - but tomorrow. Just quickly going through this with any comments and then turning in for the night.

xexyl commented 2 years ago

For the above two we can just use strtok_r() on : and then when finding the right first field do a strcmp() on the defines.

We are sure that you remember that the JSON elements may come in any order, that whitespace can change without impacting the JSON validity, and that strings such as "string string2" can have whitespace within them, such as in a name.

Yes I do but it's good that you made it clear. Appreciate that. One possibility is stripping the spaces out but I'll worry about the technicalities when working on it.

Edit: Ah but you're saying it because of my reference to strcmp(). Yes this does indeed matter. I haven't actually looked at the strings in question yet so this was just quick thoughts on my part. Thanks for pointing this out.

xexyl commented 2 years ago

What might be helpful here: can you provide an actual UUID string that might be valid in the contest so that testing this tool can be easier?

Here is a sample UUID:

12345678-1234-4321-abcd-1234567890ab

For more info see this other comment.

Thanks. A short bit ago I remembered the tool uuidgen but it's helpful that you provide a specific one I can test. Still because the parsing is already in fnamchk that's probably sufficient - I wasn't sure on the part of variants though so this is helpful.

xexyl commented 2 years ago

Here this makes me think that the format of the JSON file does in fact matter - since you say it must be an array and the fields must be exactly the below. Does this hold everywhere else too?

Probably, yes.

Thanks for confirming this.

xexyl commented 2 years ago

The formed_timestamp value must be an integer >= MIN_TIMESTAMP as defined in limit_ioccc.h. Which integer type do you suggest? time_t?

JSON numbers can be of any length. JSON numbers are typeless. JSON integers are just a string of decimal digits of any length.

In that case (this is just a quick thought): since the values are of limited range in C it might be possible to just use strspn() (to verify that it only has digits) and then (when comparing to a max) use strcmp() (since one can check for <, > etc. - not just == 0). This is just a quick thought though and maybe I'll come up with a different way (or you have a preference and you can say).
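The strspn() idea might look roughly like the sketch below; both function names are hypothetical. One caveat: strcmp() by itself orders digit strings correctly only when they have the same length, so the sketch compares lengths first. It also assumes non-negative values with no leading zeros.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* all_digits - true if s is a non-empty string of decimal digits only */
bool
all_digits(char const *s)
{
    return s != NULL && s[0] != '\0' && s[strspn(s, "0123456789")] == '\0';
}

/*
 * digits_cmp - compare two digit strings as numbers
 *
 * Both strings must pass all_digits() and have no leading zeros.
 * Returns <0, 0 or >0 in the manner of strcmp().
 */
int
digits_cmp(char const *a, char const *b)
{
    size_t la = strlen(a);
    size_t lb = strlen(b);

    /* a longer digit string is the larger number */
    if (la != lb) {
        return (la < lb) ? -1 : 1;
    }
    /* same length: lexical order equals numeric order */
    return strcmp(a, b);
}
```

For example, strcmp("999", "1000") is positive even though 999 < 1000, whereas digits_cmp("999", "1000") is negative.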

See the number section of the JSON spec.

Will do.

You need not support huge multi-precision numbers. Instead try to form a long long.

Okay.

You might want to look at the length of the characters of a JSON number. Now LLONG_MAX is: 0x7fffffffffffffff == 9223372036854775807

Yep. And I actually wondered about this. At least if you mean for each JSON number count the number of digits. Is this what you mean?

And in decimal, 9223372036854775807 has 19 digits. So define:

#define LLONG_MAX_BASE10_DIGITS (19)

Then if the length (not counting any leading - sign) of the JSON number exceeds LLONG_MAX_BASE10_DIGITS, reject it as being too large.

Good idea. I actually wrote a function years ago that counts the number of digits in an int - but since this will be parsed as a string (initially) I can just use strlen().

Then use strtoll(3) to convert the characters of the JSON number into a long long, taking care (as previously discussed) to detect when errno is changed to non-zero (by presetting it to 0 before the strtoll(3) call), rejecting LLONG_MIN and LLONG_MAX, and finally assigning the result to a long long value.

That sounds good and reasonable.

While checking the length (ignoring any leading - sign) of the JSON number is optional, it might still be a good idea in case strtoll(3) gets really confused with a huge number of digits.

And any + too for that matter (I don't know if that's valid in JSON but it certainly is in the strto*() functions).

The way I read this is:

Is that what you're saying? That seems reasonable to me at a quick glance.

xexyl commented 2 years ago

Do you have a recommended way to go about this? The first thing that popped into my head is strptime() but I'm not sure if that's the best/better way to go about it.

The use of strptime() is good idea.

Okay I'll consider that then. Thanks.

xexyl commented 2 years ago

It is hard for us to determine what parts of a long reply need to be responded to and what are just comments. It is easy for us to miss something within such a long reply, sorry.

No need to be sorry. Actually I'm sorry: I thought just quoting the specific parts would be enough. I'll make sure to do a single question in a single comment in the future. I should know this but I tend to write a lot - apologies!

Perhaps single issue messages might make that easier? Anyway if we missed a question, please ask it again, perhaps as 1 question (or 1 question set) per post?

I'll ask anything again - in a single message - if anything else comes up (or I should say when something else comes up). But no worries if you missed anything.

Perhaps you have some things you can add to my above comments and I'll read them in the morning. I'm going to check the other thread and if nothing else was posted there I'll turn in for the night.

Take good care and enjoy the rest of your day! More tomorrow. I'm sure I'll make a pull request in the morning but almost certainly after that nothing until Monday.

xexyl commented 2 years ago

Something occurred to me that you did not address: should certain characters be verified that they're escaped? For example should the URL have the /s escaped like it is printed out by mkiocccentry?

Also should the mkiocccentry tool check for characters like \ in URLs? (Maybe it already does and I don't recall).

Ah, there's this question you did miss. Should the JSON validator detect / without a \ before it?

lcn2 commented 2 years ago

Ah, there's this question you did miss. Should the JSON validator detect / without a \ before it?

We think so. The JSON spec seems to suggest that, for some reason, /'s need to be \-escaped.

See json_putc() in mkiocccentry for the list of chars that need to be escaped:

/*
 * json_putc - print a UTF-8 character with JSON encoding
 *
 * JSON string encoding.
 *
 * These escape characters are required by JSON:
 *
 *     old              new
 *     --------------------
 *      "               \"
 *      /               \/
 *      \               \\
 *      <backspace>     \b      (\x08)
 *      <tab>           \t      (\x09)
 *      <newline>       \n      (\x0a)
 *      <form feed>     \f      (\x0c)
 *      <enter>         \r      (\x0d)
 *
 * These escape characters are implied by JSON due to
 * HTML and XML encoding, although not strictly required:
 *
 *     old              new
 *     --------------------
 *      <               \u003C
 *      >               \u003E
 *      &               \u0026
 *
 * These escape characters are implied by JSON to let humans
 * view JSON without worrying about characters that might
 * not display / might not be printable:
 *
 *     old              new
 *     --------------------
 *      \x00-\x07       \u0000 - \u0007
 *      \x0e-\x1f       \u000e - \u001f
 *      \x7f-\xff       \u007f - \u00ff
 *
 * See:
 *
 *      https://developpaper.com/escape-and-unicode-encoding-in-json-serialization/
 *
 * NOTE: We chose to not escape '%' as was suggested by the above URL
 *       because it is neither required by JSON nor implied by JSON.

xexyl commented 2 years ago

Ah, there's this question you did miss. Should the JSON validator detect / without a \ before it?

We think so. The JSON spec seems to suggest that, for some reason, /'s need to be \-escaped.

See json_putc() in mkiocccentry for the list of chars that need to be escaped:

/*
 * json_putc - print a character with JSON encoding
 *
 * JSON string encoding.  We will encode as follows:
 *
 *     old              new
 *     --------------------
 *      "               \"
 *      /               \/
 *      \               \\
 *      <backspace>     \b
 *      <form feed>     \f
 *      <tab>           \t
 *      <enter>         \r
 *      <newline>       \n
 *      <               \u003C
 *      >               \u003E
 *      &               \u0026
 *      %               \u0025
 *      \x80-\xff       \u0080 - \u00ff
 */

That's very helpful, thanks. I guess that means that when parsing the fields one will have to keep track of the previous character too: when the parser encounters a character that has to be escaped and the previous character was not \, then it's an error. The tool will then return non-zero when it ends, which means that if it is running under mkiocccentry, the latter will abort too.

Perhaps there could be a function that does the checking: it would take the previous character and the current character and if the current character is one of the above and the previous character is not then it's an error. That would make it more modular and cleaner.
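A sketch of such a checking function follows. Instead of taking the previous and current character as a pair, it scans the whole string and consumes the character that follows each \ as part of the escape; this is a design choice made here, because looking only at the previous character would wrongly accept a / that follows an escaped backslash (\\/). The function name json_escapes_ok() is hypothetical. The set of characters allowed after \ comes from the JSON spec, and the set of characters that must be escaped comes from the json_putc() comment above.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * json_escapes_ok - hypothetical check of the body of a JSON string
 *
 * The string is given without its surrounding double quotes.
 */
bool
json_escapes_ok(char const *s)
{
    size_t i;

    if (s == NULL) {
        return false;
    }
    for (i = 0; s[i] != '\0'; ++i) {
        unsigned char c = (unsigned char)s[i];

        if (c == '\\') {
            char e = s[++i];    /* character after the backslash */

            if (e == 'u') {
                int k;

                /* \u must be followed by exactly 4 hex digits */
                for (k = 1; k <= 4; ++k) {
                    if (!isxdigit((unsigned char)s[i + k])) {
                        return false;
                    }
                }
                i += 4;
            } else if (e == '\0' || strchr("\"\\/bfnrt", e) == NULL) {
                return false;   /* unknown or truncated escape */
            }
        } else if (c == '"' || c == '/' || c < 0x20) {
            return false;       /* character that must be escaped */
        }
    }
    return true;
}
```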

Well as you'll see I did a pull request with two checks added to txzchk that I think you'll appreciate (one might be called redundant and I almost had it so that a file called JUST . be only a file called JUST . but I made it that AND an invalidly named dot file). The other one is checking for non-portable characters in the FULL filename (not just the basename).

The latter test can be modified a bit so that it can be used in the json checker (but since this is only basename perhaps the code in mkiocccentry will suffice: move that part of the function that checks extra data files to a new function).

xexyl commented 2 years ago

On the subject of escaping: I checked the mkiocccentry comments like you suggested and I went to the website you referred to (https://developpaper.com/escape-and-unicode-encoding-in-json-serialization/). This in turn made me wonder:

What to do about characters > \u00ff? I guess this is an error and so the tool should return a value > 0?

I haven't started working on any of the parsing yet; for now I'm wanting to get these things clarified. I might work on some of the parsing in a little bit but I'm not sure if I'll have the time and energy (time yes for now but not sure if I have both).

lcn2 commented 2 years ago

We can assume UTF-8 throughout the tool chain. As the encoding article recommends, JSON tools should insist that the \ be followed ONLY by these characters:

The uxxxx is a special case where xxxx are hex characters 0123456789abcdefABCDEF (either lower or upper case).

While tools are insistent in terms of what they produce, they should be generous in terms of what they accept.

Moreover, JSON tools should flag as an error, if any of the following UTF-8 characters are found when NOT preceded by a \:

The following encodings are encouraged (and implied) but not required by the JSON definition:

We suggest that you may wish to create a utility function that converts JSON encoded strings into a malloced un-encoded string. That is:

char * json_decode(char const *json_string)

The json_decode() function should return a malloced UTF-8 string where the JSON \-escaped characters are de-converted, or return NULL if the string was improperly encoded. The dbg() function should be used by json_decode() to inform the user of any encoding problems found. The json_decode() function should not call an error function, but rather return NULL to let the calling function decide what to do.
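The following is only a sketch of what a function with the proposed json_decode() signature might look like, not the implementation that ended up in the repo. It makes several assumptions: a \uXXXX escape is treated as a Unicode code point and written as UTF-8 (which may differ from how json_putc() handles bytes \x80-\xff), \u0000 and surrogate values are rejected, unescaped characters are copied without being checked, and the dbg() calls are left out.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * json_decode - sketch: decode a JSON encoded string body
 *
 * Returns a malloced string, or NULL on a bad escape or malloc failure.
 */
char *
json_decode(char const *json_string)
{
    size_t len, i, out = 0;
    char *ret;

    if (json_string == NULL) {
        return NULL;
    }
    len = strlen(json_string);
    ret = malloc(len + 1);      /* decoding never lengthens the string */
    if (ret == NULL) {
        return NULL;
    }
    for (i = 0; i < len; ++i) {
        char c = json_string[i];

        if (c != '\\') {
            ret[out++] = c;
            continue;
        }
        c = json_string[++i];   /* character after the backslash */
        switch (c) {
        case '"': case '\\': case '/': ret[out++] = c; break;
        case 'b': ret[out++] = '\b'; break;
        case 'f': ret[out++] = '\f'; break;
        case 'n': ret[out++] = '\n'; break;
        case 'r': ret[out++] = '\r'; break;
        case 't': ret[out++] = '\t'; break;
        case 'u': {
            char hex[5];
            unsigned long cp;
            int k;

            for (k = 0; k < 4; ++k) {
                hex[k] = json_string[i + 1 + k];
                if (!isxdigit((unsigned char)hex[k])) {
                    free(ret);
                    return NULL;
                }
            }
            hex[4] = '\0';
            cp = strtoul(hex, NULL, 16);
            i += 4;
            if (cp == 0 || (cp >= 0xD800 && cp <= 0xDFFF)) {
                free(ret);      /* NUL or surrogate: not supported here */
                return NULL;
            }
            if (cp < 0x80) {
                ret[out++] = (char)cp;
            } else if (cp < 0x800) {
                ret[out++] = (char)(0xC0 | (cp >> 6));
                ret[out++] = (char)(0x80 | (cp & 0x3F));
            } else {
                ret[out++] = (char)(0xE0 | (cp >> 12));
                ret[out++] = (char)(0x80 | ((cp >> 6) & 0x3F));
                ret[out++] = (char)(0x80 | (cp & 0x3F));
            }
            break;
        }
        default:                /* unknown or truncated escape */
            free(ret);
            return NULL;
        }
    }
    ret[out] = '\0';
    return ret;
}
```

The caller owns the returned string and is expected to free() it.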

And while you are at it, the following reverse utility function should be written:

char * json_encode(char const *utf8_string)

The json_encode() function should return a malloced JSON encoded string, or NULL if there is a malloc error. The json_encode() function should not call an error function, but rather return NULL to let the calling function decide what to do. Also it should NOT enclose the string in " (ASCII double-quote) but rather let the calling function do that.

The json_encode() and json_decode() functions should be proper inverses of each other, unless they return NULL.
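For the encoding direction, a sketch with the proposed json_encode() signature might look like the following. It applies only the escapes that the json_putc() comment lists as required, plus \u00XX for the remaining control characters and \x7f; the optional escapes for <, > and & and any special handling of bytes \x80-\xff are left out, so UTF-8 bytes are copied unchanged. Those choices are assumptions made to keep the sketch short, not a statement of what the repo does.

```c
#include <stdlib.h>
#include <string.h>

/*
 * json_encode - sketch: JSON encode a UTF-8 string
 *
 * Returns a malloced string without surrounding double quotes,
 * or NULL on a NULL argument or malloc failure.
 */
char *
json_encode(char const *utf8_string)
{
    static char const hex[] = "0123456789abcdef";
    size_t len, i, out = 0;
    char *ret;

    if (utf8_string == NULL) {
        return NULL;
    }
    len = strlen(utf8_string);
    ret = malloc(6 * len + 1);  /* worst case: every byte becomes \u00XX */
    if (ret == NULL) {
        return NULL;
    }
    for (i = 0; i < len; ++i) {
        unsigned char c = (unsigned char)utf8_string[i];
        char esc = '\0';        /* 2 character escape to use, if any */

        switch (c) {
        case '"':  esc = '"';  break;
        case '/':  esc = '/';  break;
        case '\\': esc = '\\'; break;
        case '\b': esc = 'b';  break;
        case '\t': esc = 't';  break;
        case '\n': esc = 'n';  break;
        case '\f': esc = 'f';  break;
        case '\r': esc = 'r';  break;
        default: break;
        }
        if (esc != '\0') {
            ret[out++] = '\\';
            ret[out++] = esc;
        } else if (c < 0x20 || c == 0x7f) {
            /* other control characters become \u00XX */
            ret[out++] = '\\';
            ret[out++] = 'u';
            ret[out++] = '0';
            ret[out++] = '0';
            ret[out++] = hex[c >> 4];
            ret[out++] = hex[c & 0x0f];
        } else {
            ret[out++] = (char)c;
        }
    }
    ret[out] = '\0';
    return ret;
}
```

A round trip test would encode a string, decode the result and compare it with the original using strcmp().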

For testing and "general tool usefulness", two utilities are in order:

jstrencode [-h] [-v level] [-V] [string ...]
jstrdecode [-h] [-v level] [-V] [string ...]

These utility tools write to stdout the JSON string encoding or decoding of their input. If no string args are given, then data should be read from stdin until EOF. They should exit 0 if all is well, or exit non-0 if an error (such as NULL being returned) is encountered.

As with Unix tools, the output of one should be able to be fed into the other, so:

jstrencode < foo | jstrdecode > bar
if ! cmp foo bar; then
    echo "foo and bar differ"
fi 

Then you can add to the test rule, testing for JSON string encoding/decoding as well as tests for detecting improper JSON string encoding.

jstrdecode '\error' >/dev/null
status="$?"
if [[ $status -eq 0 ]]; then
    echo "Improper decoding not detected" 1>&2
    exit 1
fi

Etc.

xexyl commented 2 years ago

Please let me know if this reply is okay or if it should be split off into other messages. I wasn't quite sure of this since it's all one reply I'm replying to. If necessary I'll rewrite it at another time. Just let me know and I'll be happy to do that - then you can disregard the below (I just am about to head off for the day - well I'll be at the computer a bit longer but I'm afraid I'm done with this for the day).

I was thinking one of the next things I'll do is add to the processing of lines identifying which field it is so that the only thing left to be done is to parse it (the arrays will of course be different but I'll worry about that later). As you know I also already detect whether the file starts with a { and ends with a }. I think the tools you suggest below should be done before this though so I'll work on that - but not until tomorrow (probably tomorrow).

Anyway reply below. Have a great rest of your day!

We can assume UTF-8 throughout the tool chain. As the encoding article recommends, JSON tools should insist that the \ be followed ONLY by these characters:

In other words: if any other character has a \ before it, it is an error, right?

While tools are insistent in terms of what they produce, they should be generous in terms of what they accept.

But not when validating the file, right? I mean if it has for example \_ it should fail, right? That's just an example; I'm not sure how you mean generous in what they accept: in what way should they be generous?

We suggest that you may wish to create a utility function that converts JSON encoded strings into a malloced un-encoded string. That is:

char * json_decode(char const *json_string)
char * json_encode(char const *utf8_string)

I guess that the json_putc() function can help with this since it kind of does that already. Then as for the inverse it's just a matter of reversing the logic. This is a good idea you have.

The json_encode() and json_decode() functions should be proper inverses of each other, unless they return NULL.

Makes sense.

For testing and "general tool usefulness", two utilities are in order:

jstrencode [-h] [-v level] [-V] [string ...]
jstrdecode [-h] [-v level] [-V] [string ...]

In other words the above functions can be put in json.c and the two tools you suggest would make use of those functions. That sounds like a good idea and probably the next thing to be done.

As with Unix tools, the output of one should be able to be fed into the other, so:

jstrencode < foo | jstrdecode > bar
if ! cmp foo bar; then
    echo "foo and bar differ"
fi 

Of course. Just like my Enigma machine does. I made sure to design it that way!

Then you can add to the test rule, testing for JSON string encoding/decoding as well as tests for detecting improper JSON string encoding.

jstrdecode '\error' >/dev/null
status="$?"
if [[ $status -eq 0 ]]; then
    echo "Improper decoding not detected" 1>&2
    exit 1
fi

I guess you're referring to make test with new scripts? I thought of making some test scripts to test fnamchk and txzchk but I haven't got to that yet - not sure if it's necessary.

lcn2 commented 2 years ago

We can assume UTF-8 throughout the tool chain. As the encoding article recommends, JSON tools should insist that the \ be followed ONLY by these characters:

In other words: if any other character has a \ before it, it is an error, right?

Yes.

lcn2 commented 2 years ago

While tools are insistent in terms of what they produce, they should be generous in terms of what they accept.

But not when validating the file, right? I mean if it has for example \_ it should fail, right? That's just an example; I'm not sure how you mean generous in what they accept: in what way should they be generous?

The \_ encoding should fail as JSON does not specify what to do with this encoding.

One should not be too permissive. :-)

lcn2 commented 2 years ago

I guess you're referring to make test with new scripts?

Yes.

I thought of making some test scripts to test fnamchk and txzchk but I haven't got to that yet - not sure if it's necessary.

The fnamchk test might be useful in case we (or someone else) break this tool someday.

The txzchk test might be more awkward. We don't think including bad compressed tarballs in this repo is a good idea. Maybe just having txzchk test the result of the compressed tarball produced by mkiocccentry-test.sh is good enough?

xexyl commented 2 years ago

While tools are insistent in terms of what they produce, they should be generous in terms of what they accept.

But not when validating the file, right? I mean if it has for example \_ it should fail, right? That's just an example; I'm not sure how you mean generous in what they accept: in what way should they be generous?

The \_ encoding should fail as JSON does not specify what to do with this encoding.

One should not be too permissive. :-)

Just checking. Thanks!

xexyl commented 2 years ago

I guess you're referring to make test with new scripts?

Yes.

I thought of making some test scripts to test fnamchk and txzchk but I haven't got to that yet - not sure if it's necessary.

The fnamchk test might be useful in case we (or someone else) break this tool someday.

Have an example set of names that should be tested or how the test script might work / what it might test?

The txzchk test might be more awkward. We don't think including bad compressed tarballs in this repo is a good idea. Maybe just having txzchk test the result of the compressed tarball produced by mkiocccentry-test.sh is good enough?

I was thinking that as well.

I'll be going for the day now. Have a great Sunday! More tomorrow.

xexyl commented 2 years ago

Oh and as an aside: tomorrow I'll get to updating the cake file so I'll have more information for you there too. At least that's my plan. Hopefully it'll help you though for when you make the cake later on.

But of course I want to also work on these tools.

Anyway gone now. Have a great day!

lcn2 commented 2 years ago

FYI: The JSON encoding has been improved with commit fc10a493dd6f6dbe8e50256ae57cffb9b769db41:

Improve mkiocccentry JSON encoding

The json_putc() function now encodes:

/*
 * json_putc - print a UTF-8 character with JSON encoding
 *
 * JSON string encoding.
 *
 * These escape characters are required by JSON:
 *
 *     old                      new
 *     ----------------------------
 *      "                       \"
 *      /                       \/
 *      \                       \\
 *      <backspace>             \b      (\x08)
 *      <horizontal_tab>        \t      (\x09)
 *      <newline>               \n      (\x0a)
 *      <form_feed>             \f      (\x0c)
 *      <enter>                 \r      (\x0d)
 *
 * These escape characters are implied by JSON due to
 * HTML and XML encoding, although not strictly required:
 *
 *     old              new
 *     --------------------
 *      &               \u0026
 *      <               \u003c
 *      >               \u003e
 *
 * These escape characters are implied by JSON to let humans
 * view JSON without worrying about characters that might
 * not display / might not be printable:
 *
 *     old                      new
 *     ----------------------------
 *      \x00-\x07               \u0000 - \u0007
 *      \x0b                    \u000b <vertical_tab>
 *      \x0e-\x1f               \u000e - \u001f
 *      \x7f-\xff               \u007f - \u00ff
 *
 * See:
 *
 *      https://developpaper.com/escape-and-unicode-encoding-in-json-serialization/
 *
 * NOTE: We chose to not escape '%' as was suggested by the above URL
 *       because it is neither required by JSON nor implied by JSON.
 *
 * NOTE: While there exist C escapes for characters such as '\v',
 *       due to flaws in the JSON spec, we must encode such characters
 *       using the \uffff notation.
 ...

The % is no longer escaped, as \% is not required by JSON.

Added JSON encoding of non-printable characters to keep JSON strings ASCII printable.

Added firewall on the json_putc() function to catch bogus character values such as <0 or beyond UTF-8 values.

Fixed a few typos.

xexyl commented 2 years ago

Thanks. I just did a git pull before shutting down and saw this in the log. I'm shutting down for the day now.

Will return to this tomorrow!

lcn2 commented 2 years ago

Fixed another JSON encoding issue in c5b72c731c42c611f012bbb5e6b4ab5cc2a7f440 and 032bc4574f67568882db88ed7df1e57f61c81839

lcn2 commented 2 years ago

The JSON spec, as flawed as it is, allows for both:

\uxxxx    (as in \uabcd)
\uXXXX    (as in \uABCD)

The mkiocccentry prints using the lower case hexadecimal characters.

However for purposes of validating JSON files such as .info.json and .author.json we need to allow for both types of HEX encoding.
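
A minimal sketch of a validator-side check that accepts either case of hex digit follows. The helper name parse_u_escape() is hypothetical and is not part of this repo; it assumes its argument points at the backslash of a \uXXXX sequence.

```c
#include <ctype.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a "\uXXXX" escape starting at s.
 *
 * Returns the value 0x0000..0xffff, or -1 if s does not start with
 * a backslash, a 'u' and exactly four hex digits (in any case).
 */
int parse_u_escape(const char *s)
{
    char digits[5];
    int i;

    if (s == NULL || s[0] != '\\' || s[1] != 'u') {
        return -1;
    }
    for (i = 0; i < 4; ++i) {
        /* isxdigit() accepts 0-9, a-f and A-F; it also stops us at a NUL */
        if (!isxdigit((unsigned char)s[2 + i])) {
            return -1;
        }
        digits[i] = s[2 + i];
    }
    digits[4] = '\0';
    return (int)strtol(digits, NULL, 16);   /* base 16 is case-insensitive */
}
```

Because isxdigit() and strtol() in base 16 both ignore case, the lower case, upper case and mixed case spellings of the same escape all yield the same value.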

Even cutesy CaPs are allowed:

\uAbCd

lcn2 commented 2 years ago

Getting tired of making mistakes on JSON encoding rules, we are in the process of adding to util.c tables and encoding and decoding functions. :-)

xexyl commented 2 years ago

Just here for a moment to ask (and then gone again for the night): should I wait for these changes first before I proceed to working on the tools including the encoding/decoding tools?

Or is it okay to continue tomorrow?

I don’t know if I will be able to but I am going to try anyway and I hope to but if I should wait please let me know when you’re ready for me to continue.

I will reply to the other replies tomorrow or if not then Tuesday.

Have a great night!

On Feb 13, 2022, at 16:12, Landon Curt Noll @.***> wrote:

 Getting tired of making mistakes on JSON encoding rules, we are in the process of adding to util.c tables and encoding and decoding functions. :-)


lcn2 commented 2 years ago

We completed the encoding functions and test function.

It turns out there were plenty of places to make mistakes. :-( But with lots of checking we are reasonably confident that the encoding is working. Sorry for the time it took to get the table correct.

We plan to change how mkiocccentry writes JSON encoded characters into a stream to use the new encoding.

lcn2 commented 2 years ago

The JSON functions have been moved to json.c.

The json_putc() function from mkiocccentry was also moved to json.c and now uses the JSON encoding table.

lcn2 commented 2 years ago

We fixed a JSON formatting bug in .author.json with commit 5cdaf5c8d20a1cca44c8b14e28d968038722a0e8.

lcn2 commented 2 years ago

We are writing the malloc_json_decode() function now.

lcn2 commented 2 years ago

The malloc_json_decode() and malloc_json_decode_str() functions are complete and tested with commit 2e2228b5addcd4e3971fa25b364b2f4eaa8e6d79.

Renamed malloc_json_str() to malloc_json_encode_str().

The JSON string encode/decode functions are now code complete and pass the jencchk() tests.

lcn2 commented 2 years ago

We will write the jstrencode and jstrdecode tools tomorrow and create a jstr-test.sh test to add to make test.

As these are independent of your jauthchk and jinfochk tools, the above mentioned work should not impact your work. Sorry for the length of time the JSON string encode/decode took. It was 1535 lines of rather hairy code :-).

xexyl commented 2 years ago

That's quite okay. I'll reply to the other stuff in a bit. Just wanted to say go get some sleep! Sleep is important.

I will hopefully get to looking at the changes soon - I have some other stuff I'd like to do whilst waking up.

Sleep well!

xexyl commented 2 years ago

Btw: I laughed out loud to the comment you have in json.c: thanks for that first birthday laugh (at 3:20 in the morning no less! :) ).

Of course I refer to:

  • "Because JSON embodies a commitment to original design flaws." :-)

Anyway hope you're getting some good sleep now and if not yet then soon.

xexyl commented 2 years ago

Just an update:

I'm not sure if I'll be able to work on any of the code today; I am so exhausted and I know from experience that programming and being exhausted do not mix well. If I can find more energy that'll be another matter (and I hope to find more energy) but that's uncertain.

Tonight I'll be going to bed later again though I'm so tired that it might be I crash before I intend. Tomorrow I'll be going to bed early so if nothing else by Wednesday I should be good.

That being said I did think of a solution to the problem of strcmp() and spaces: a special version of the function that might look like:

int     strscmp(const char *s1, const char *s2, const char *skip_chars);

It would work like strcmp() with this difference:

Go through each string together, comparing character by character. However when comparing if a character is one of the characters in skip_chars AND the same character is in the same position in the other string then skip that character in both strings until the next character that's not that; if the following character is not the same in either string act like normal strcmp(): return the difference. Otherwise keep going until the end of one or both of the strings.

In other words:

(!strscmp("   ", " ", " ") && !strscmp("t  est", "t     est", " ")) == 1

An example where the strings would not be considered equal:

strscmp("t  est", "test", " ") != 0

...because there's no space in the second argument. I'm not really sure what the return value should be in this case but I guess whatever strcmp() would return?

Obviously if more than one character is in skip_chars it would have to do more work but since it would be in the design of the function that's okay.
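
One possible reading of the description above, as a minimal sketch: whenever both strings have the same skip character at the current position, the whole run of that character is skipped in both strings; otherwise the comparison behaves like strcmp(). This is only an interpretation of the proposal, not code from this repo.

```c
#include <assert.h>
#include <string.h>

/*
 * Sketch of the proposed strscmp(): compare like strcmp(), but treat runs
 * of the same skip character (of any length >= 1) as equal in both strings.
 */
int strscmp(const char *s1, const char *s2, const char *skip_chars)
{
    assert(s1 != NULL && s2 != NULL && skip_chars != NULL);

    for (;;) {
        unsigned char c1 = (unsigned char)*s1;
        unsigned char c2 = (unsigned char)*s2;

        if (c1 != c2) {
            return (int)c1 - (int)c2;   /* same sign convention as strcmp() */
        }
        if (c1 == '\0') {
            return 0;                   /* both strings ended together */
        }
        if (strchr(skip_chars, c1) != NULL) {
            /* same skip character in both: consume the full run in each */
            while ((unsigned char)*s1 == c1) {
                ++s1;
            }
            while ((unsigned char)*s2 == c1) {
                ++s2;
            }
        } else {
            ++s1;
            ++s2;
        }
    }
}
```

Under this reading strscmp("t  est", "t     est", " ") is 0, while strscmp("t  est", "test", " ") is non-zero because the second string has no space at that position. Note that a run present in only one string (including trailing whitespace) still counts as a difference.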

--

I hope this makes sense. I'm pretty tired so possibly I typed the above C wrong. Anyway I think this would solve the problem of different number of spaces differing in the two strings. It might be that you were thinking of something else though? In that case I simultaneously might have given up a possibly great idea for the contest (that now you'd know it's me!) and still not solve the problem! :)

xexyl commented 2 years ago

As for json.c: compiling it under CentOS triggers a bunch of warnings. Right now I don't feel alert enough to fix them so I'm just going to paste the output of the compiler here so you can see. If you don't fix them I'll hopefully get to it later. I believe though that I'm going to take a break (not that I've done much but I mean a break from thinking) and then maybe I'll return to it. If not I should be able to tomorrow or the next day (after tonight I should get more sleep again).

To give you an idea of how I'm not feeling alert: I keep thinking the warning of the compiler is the warn() function calls: that is the message passed into those functions are the actual warnings - so obviously it's better I don't work on it right now :(

Anyway here you are:

json.c: In function 'jencchk':
json.c:490:2: warning: format '%x' expects argument of type 'unsigned int *', but argument 3 has type 'int *' [-Wformat=]
  ret = sscanf(jenc[i].enc, "\\u%04x%c", &hexval, &guard);
  ^
json.c:607:2: warning: format '%x' expects argument of type 'unsigned int *', but argument 3 has type 'int *' [-Wformat=]
  ret = sscanf(jenc[i].enc, "\\u%04x%c", &hexval, &guard);
  ^
json.c:829:2: warning: format '%x' expects argument of type 'unsigned int *', but argument 3 has type 'int *' [-Wformat=]
  ret = sscanf(jenc[i].enc, "\\u%04x%c", &hexval, &guard);
  ^
json.c: In function 'json_putc':
json.c:983:5: warning: comparison is always false due to limited range of data type [-Wtype-limits]
     if (c < 0 || c > 0xff) {
     ^
json.c:983:5: warning: comparison is always false due to limited range of data type [-Wtype-limits]
json.c: In function 'malloc_json_decode':
json.c:1423:8: warning: 'd' may be used uninitialized in this function [-Wmaybe-uninitialized]
    warn(__func__, "strict mode: found non-UTF-8 \\u encoding: \\u%c%c%c%c", a,b,c,d);
        ^
json.c:1423:8: warning: 'b' may be used uninitialized in this function [-Wmaybe-uninitialized]
json.c:1423:8: warning: 'a' may be used uninitialized in this function [-Wmaybe-uninitialized]
json.c: In function 'jencchk':
json.c:852:32: warning: array subscript is above array bounds [-Warray-bounds]
       mlen, (unsigned long)jenc[i].len);
                                ^
json.c: In function 'malloc_json_decode_str':
json.c:1423:8: warning: 'a' may be used uninitialized in this function [-Wmaybe-uninitialized]
    warn(__func__, "strict mode: found non-UTF-8 \\u encoding: \\u%c%c%c%c", a,b,c,d);
        ^
json.c:1024:10: note: 'a' was declared here
     char a;      /* 1st hex character after \u */
          ^
json.c:1423:8: warning: 'b' may be used uninitialized in this function [-Wmaybe-uninitialized]
    warn(__func__, "strict mode: found non-UTF-8 \\u encoding: \\u%c%c%c%c", a,b,c,d);
        ^
json.c:1026:10: note: 'b' was declared here
     char b;      /* 2nd hex character after \u */
          ^
json.c:1423:8: warning: 'd' may be used uninitialized in this function [-Wmaybe-uninitialized]
    warn(__func__, "strict mode: found non-UTF-8 \\u encoding: \\u%c%c%c%c", a,b,c,d);
        ^
json.c:1030:10: note: 'd' was declared here
     char d;      /* 4th hex character after \u */
          ^

I'll get to the replies later on. If you can say whether the above message (the idea wrt strcmp() and spaces) is what you had in mind, that would be of help (I think that's what you were getting at but I want to be sure of that).

xexyl commented 2 years ago

I just came up with an idea to determine the field that is on the current line. Based on this it should be easy (for most lines) to know how to parse the line in question. However this will have to be another day I'm afraid. I have a Zoom meeting in about 15 minutes and after that I'm going to take it easy.

I hope tomorrow to work more on it but at the latest it'll be Wednesday that I should be able to work on it - and hopefully quite a bit. I did fix one set of warnings in the json.c: namely the ones where the chars might not be initialised.

Pushed those and some other minor changes but that's all I'll do for the day.

lcn2 commented 2 years ago

I just came up with an idea to determine the field that is on the current line. Based on this it should be easy (for most lines) to know how to parse the line in question. However this will have to be another day I'm afraid. I have a Zoom meeting in about 15 minutes and after that I'm going to take it easy.

I hope tomorrow to work more on it but at the latest it'll be Wednesday that I should be able to work on it - and hopefully quite a bit. I did fix one set of warnings in the json.c: namely the ones where the chars might not be initialised.

Pushed those and some other minor changes but that's all I'll do for the day.

Keep in mind that JSON allows for very flexible whitespace. So:

{ "hello" : 123 },

is the same as:

{
"hello"
 :
123
}
,

and the same as:

{"hello":123},

and the same as:


{                     "hello"           :

                             123

                     },

etc.

Here strtok_r(3) may be of use, in that it treats a single separator AND multiple consecutive separators as equivalent.

You will probably want to read the entire file into memory.

BTW: We need this read-whole-file-into-memory function to complete the jstrencode and jstrdecode tools, so expect something in util.c soon.
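
Until that function exists, a minimal sketch of the idea might look like the following. The name read_all() and its signature are hypothetical; this is not the function that will land in util.c.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: read a stream until EOF into a malloced,
 * NUL-terminated buffer.  Stores the byte count in *len (if len is
 * non-NULL).  Returns NULL on a read or allocation error.
 */
char *read_all(FILE *stream, size_t *len)
{
    size_t cap = 4096;      /* current capacity, excluding the NUL byte */
    size_t used = 0;        /* bytes read so far */
    size_t n;
    char *buf, *tmp;

    if (stream == NULL) return NULL;
    buf = malloc(cap + 1);
    if (buf == NULL) return NULL;

    while ((n = fread(buf + used, 1, cap - used, stream)) > 0) {
        used += n;
        if (used == cap) {
            /* buffer is full: double it before reading more */
            cap *= 2;
            tmp = realloc(buf, cap + 1);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
        }
    }
    if (ferror(stream)) {
        free(buf);
        return NULL;
    }
    buf[used] = '\0';
    if (len != NULL) {
        *len = used;
    }
    return buf;
}
```

Reading with fread() in a growing buffer works for stdin and pipes as well as regular files, which matters if the tools are to read stdin until EOF. The caller frees the returned buffer.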

Then use strtok_r(3) to break up the buffer into whitespace-separated blobs and begin to break apart those blobs into JSON tokens. Keep in mind that valid JSON can do this:

{"confusion":" { \"hello\" : 123 } , "},

Perhaps this is your approach with your proposed strscmp function?
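
To illustrate the whitespace point, here is a minimal sketch that removes JSON whitespace only when it is outside of a string, so that all of the spellings of { "hello" : 123 }, shown earlier collapse to the same text while the "confusion" string is left untouched. The helper name json_squeeze() is hypothetical, and this is only one way to approach the problem; it does not validate the JSON.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: return a malloced copy of json with whitespace
 * (space, tab, newline, carriage return) removed everywhere except
 * inside double-quoted strings.  Returns NULL on allocation failure.
 */
char *json_squeeze(const char *json)
{
    char *out, *p;
    int in_string = 0;      /* non-zero while between an opening and closing " */

    if (json == NULL) return NULL;
    out = malloc(strlen(json) + 1);     /* output is never longer than input */
    if (out == NULL) return NULL;
    p = out;

    for (; *json != '\0'; ++json) {
        if (in_string) {
            *p++ = *json;
            if (*json == '\\' && json[1] != '\0') {
                *p++ = *++json;         /* copy the escaped character as-is */
            } else if (*json == '"') {
                in_string = 0;          /* unescaped quote ends the string */
            }
        } else if (*json == '"') {
            in_string = 1;
            *p++ = *json;
        } else if (strchr(" \t\n\r", *json) == NULL) {
            *p++ = *json;               /* keep anything that is not whitespace */
        }
    }
    *p = '\0';
    return out;
}
```

The key design point is tracking whether the scanner is inside a string, including escaped quotes such as \", which a plain whitespace split with strtok_r(3) cannot do on its own.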