ioccc-src / mkiocccentry

Form an IOCCC submission as a compressed tarball file

Enhancement: discuss and resolve any remaining issues with `txzchk` #334

Closed: xexyl closed this issue 2 years ago

xexyl commented 2 years ago

I'm opening this issue so that any remaining txzchk issues will not be lost in the other thread, which is more OT stuff than anything else (or equally OT). I will copy and paste any issues I'm aware of that I can easily access (which might depend on whether Mail wants to load them properly: unfortunately GitHub makes it hard to find messages in long threads, since comments don't always load and you have to scroll through long pages to find the right links .. I know what to search for, I just don't know if I can get to them easily). Anyway, that's what this issue is for.

If you would please assign it to me that would be great, though maybe you'll want to assign it to yourself too since you discuss it as well: I leave that to you of course. One comment coming up shortly.

xexyl commented 2 years ago

This is your comment in reply to mine:

Yes. Size of tarball but not size of files, right?

OK, probably not MAX_TARBALL_LEN .. perhaps MAX_DIR_KSIZE instead. That was an error.

Yes, make use of the fact that the total unpacked directory must be < MAX_DIR_KSIZE*1024 bytes.

Use that value to bound file sizes and the sum of files.

So for any given file, the tar listing size must be < MAX_DIR_KSIZE*1024 bytes.

When summing file sizes found in the tar listing, round up the size to the next 1K bytes before adding it to the sum (to account for block sizes on disk that are on the order of 1K).

Watch for negative sizes.

Watch for when the sum of file sizes found in the tar listing goes negative.

Watch for when the sum of file sizes, after adding the size of another file, becomes smaller than the previous sum.

Hope that helps .. gotta run!
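
A minimal C sketch of the rounding and overflow watching described above (add_rounded_size is a hypothetical helper, not code from this repo):

#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch: round a file size up to the next 1024 byte boundary and add it
 * to a running sum, watching for negative sizes and for the sum wrapping
 * (going negative or becoming smaller than the previous sum).
 */
static bool
add_rounded_size(intmax_t size, intmax_t *sum)
{
    intmax_t rounded;

    if (sum == NULL || *sum < 0 || size < 0) {
        return false;               /* watch for negative sizes and sums */
    }
    if (size > INTMAX_MAX - 1023) {
        return false;               /* rounding would overflow */
    }
    rounded = ((size + 1023) / 1024) * 1024;    /* round up to the next 1K */
    if (*sum > INTMAX_MAX - rounded) {
        return false;               /* sum would wrap: overflow */
    }
    *sum += rounded;
    return true;
}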

The issue I have here is: should it be MAX_DIR_KSIZE or MAX_DIR_KSIZE * 1024? Originally I thought it should be MAX_DIR_KSIZE, but looking at the value I now wonder. It does say:

#define MAX_DIR_KSIZE (27651)           /* entry directory size limit in kibibyte (1024 byte) blocks */

and the name suggests it should be * 1024, but I thought you originally noted it was not * 1024. However, that could have been a mistake as well.

The other point of interest here is: what values should be compared against this (making sure that the number is <= the max, whether * 1024 or not)?

Right now the following are done:

if (txz_info.file_sizes > MAX_DIR_KSIZE)
/* ... */
else if (txz_info.rounded_file_size > MAX_DIR_KSIZE)

but should it be:

if (txz_info.file_sizes > MAX_DIR_KSIZE * 1024)
/* ... */
else if (txz_info.rounded_file_size > MAX_DIR_KSIZE * 1024)

and if so, what do you think might be a nice name for a macro that is MAX_DIR_KSIZE * 1024? Maybe MAX_DIR_SIZE?

Finally, should any other value be compared against this limit?

Thanks! I'll check if I can find the other pending issue I'm aware of.

xexyl commented 2 years ago

...I found the issue in Mail but GitHub doesn't want to load it well, so for now I'll hold off. It has to do with the test script. I'm close to it in GitHub, but I want to see if I can actually focus on something, which right now I think means documentation. However, I hope to rest as soon as the backup drive has cooled down and I can umount it (it's in use right now, probably from either Spotlight or Sophos).

Hope you're having a nice day my friend!

lcn2 commented 2 years ago

If you would please assign it to me that would be great, though maybe you'll want to assign it to yourself too since you discuss it as well: I leave that to you of course. One comment coming up shortly.

Once we have laptop and landline access, we can consider such administrative actions. On this cell phone we will only make limited comments.

xexyl commented 2 years ago

If you would please assign it to me that would be great, though maybe you'll want to assign it to yourself too since you discuss it as well: I leave that to you of course. One comment coming up shortly.

Once we have laptop and landline access, we can consider such administrative actions. On this cell phone we will only make limited comments.

That makes sense to me. No rush.

xexyl commented 2 years ago

As for the other issue here, I think it had to do with the test script. I gave up looking for it, but I have an idea what to look for; I just had other things come up. Of course there might have been other things as well, but this was to do with the test script - a discussion about it. I have a good idea what was stated, but it might be better to have the actual comments. This will happen at another time.

xexyl commented 2 years ago

The issue I have here is: should it be MAX_DIR_KSIZE or MAX_DIR_KSIZE * 1024? Originally I thought it should be MAX_DIR_KSIZE, but looking at the value I now wonder. It does say:

#define MAX_DIR_KSIZE (27651)           /* entry directory size limit in kibibyte (1024 byte) blocks */

and the name suggests it should be * 1024, but I thought you originally noted it was not * 1024. However, that could have been a mistake as well.

For better or for worse, the idea of the MAX_DIR_KSIZE constant was to state the maximum size in 1024 byte blocks. That was the idea of the K in that constant.

We could see getting rid of this confusion and just have the value be:

#define MAX_DIR_SIZE (27651*1024)           /* entry directory size limit in bytes */

If the use of a constant created confusion, then perhaps the above change is best?

I don't mind either way. I did it as I thought you had said back then. Maybe I misread it, or I was tired, or something else. If I understand you right, I should change the previous define to that and then update the name in txzchk so that it will be fine?

Or would you prefer adding the macro? Either way I won't get to it today but maybe I can tomorrow.

lcn2 commented 2 years ago

The other point of interest here is: what values should be compared against this (making sure that the number is <= the max, whether * 1024 or not)?

Right now the following are done:

if (txz_info.file_sizes > MAX_DIR_KSIZE)
/* ... */
else if (txz_info.rounded_file_size > MAX_DIR_KSIZE)

but should it be:

if (txz_info.file_sizes > MAX_DIR_KSIZE * 1024)
/* ... */
else if (txz_info.rounded_file_size > MAX_DIR_KSIZE * 1024)

and if so, what do you think might be a nice name for a macro that is MAX_DIR_KSIZE * 1024? Maybe MAX_DIR_SIZE?

That is an EXCELLENT question. And perhaps MAX_DIR_KSIZE is the wrong approach?

One might be tempted to use du -k on the unpacked entry directory; however, the du(1) command talks about disk utilization, not file size. So for a given filesystem that blocks data to a certain size, the resulting unpacked IOCCC entry directory might appear larger in some cases (such as on a filesystem that rounds blocks up to the next 8K boundary).

Beyond this we would like the txzchk tool to help us deal with huge decompression expansion BEFORE we "untar".

So the idea of the txzchk tool was conceived. People who submit to the contests could pre-check their compressed tarball, and the IOCCC Judges could pre-check the compressed tarball before unpacking the entry. We want to be fair in that the people forming the entries would be using the same size test tool (i.e., txzchk) as the IOCCC Judges.

Now if someone has a clever hack on the xz compression algorithm (where a HUGE file compresses down to a tiny chunk of data), they might be able to form a compressed tarball that is under the compressed tarball size limit. However, the tar -t listing shows the size of the file that will be created, and so we need to pay attention to the sizes printed by tar -t. I.e., we don't want entries that include decompression exploders that bloat the un-tarred entry directory beyond a reasonable size.

So what algorithm should txzchk use in determining size? Realize that this algorithm needs to be explained in a simple sentence that will go into a rule, and be understood by people for whom English is not their primary language.

To resolve your question and the other issues we have raised above, what we need to do is come up with the English sentence that will go into the next IOCCC rules that controls the maximum size of the un-tarred entry AND works using the tar -t listing so that the compressed tarball can be checked prior to the un-tar.

The rule that covers the maximum size of a compressed tarball is simple. Something like this might do:

Your entry, when uploaded to the IOCCC submit server in the form of an XZ compressed tarball, must not be larger than XXX bytes.

However, the rule that governs what txzchk does is a different question and a different Rule sentence.

The stuff about 1K blocks and trying to avoid filesystem block incompatibilities may be creating more complexity than it's worth. Perhaps we should completely abandon the notion of block sizes and just count bytes?

The sum of the byte lengths of files in your entry (after they have been extracted from the compressed tarball) must not be greater than XXX.

That text avoids mentioning block sizes and it's just a simple sum of file lengths.

However, consider the case of an entry that makes use of zero length files. You could probably put 100,000 or more such zero length files into an entry and compress it down to a tarball that fits under the maximum tarball size.

A zero length file still occupies space in a directory. Moreover it would be ugly to try and build a web page for such an entry with so many files.

Now a zero length file does occupy space on the disk, particularly in the directory that contains it and the inode that references it. A du(1) that is working properly should show that a directory with 100,000 empty files occupies a fair amount of disk space.

So does one put a limit on the number of files in an entry? Perhaps we do.

Your entry must not contain more than XXX files (this includes all directories and mandatory files).

If we did this then the txzchk tool is simplified to mainly do this:

  • Check the file length of the compressed tarball
  • Sum the file lengths as reported by tar -t
  • Count the number of lines as reported by tar -t
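
A minimal C sketch of the middle step, pulling the size column out of each tar -tJvf listing line (tar_line_file_size is a hypothetical helper, and a real parser must be stricter, e.g. sscanf() alone would partially accept bogus lengths such as 4.0):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Sketch: extract the size column from one tar -tJvf listing line such as:
 *
 *   -rw-r--r--  0 501    20       1854 Jun  4 04:52 dir/Makefile
 *
 * Returns true only if the line parses and describes a regular file
 * (first character is '-'), storing the size in *size.
 */
static bool
tar_line_file_size(char const *line, intmax_t *size)
{
    char perms[16];

    if (line == NULL || size == NULL) {
        return false;
    }
    /* permissions, then skip link count, owner and group, then size */
    if (sscanf(line, "%15s %*s %*s %*s %jd", perms, size) != 2) {
        return false;
    }
    return perms[0] == '-';     /* only regular files count toward the sum */
}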

These are just some random thoughts that we came up with at the moment. This idea is subject to change.

Nevertheless, we think the way to answer your question and to establish a proper algorithm for txzchk is to come up with the form of the IOCCC rules (relatively simple English sentences) and then change the tool to check the rule.

Comments, suggestions, and corrections welcome.

xexyl commented 2 years ago

The other point of interest here is: what values should be compared against this (making sure that the number is <= the max, whether * 1024 or not)? Right now the following are done: if (txz_info.file_sizes > MAX_DIR_KSIZE) /* ... */ else if (txz_info.rounded_file_size > MAX_DIR_KSIZE) but should it be: if (txz_info.file_sizes > MAX_DIR_KSIZE * 1024) /* ... */ else if (txz_info.rounded_file_size > MAX_DIR_KSIZE * 1024) and if so, what do you think might be a nice name for a macro that is MAX_DIR_KSIZE * 1024? Maybe MAX_DIR_SIZE?

That is an EXCELLENT question. And perhaps MAX_DIR_KSIZE is the wrong approach?

One might be tempted to use du -k on the unpacked entry directory; however, the du(1) command talks about disk utilization, not file size. So for a given filesystem that blocks data to a certain size, the resulting unpacked IOCCC entry directory might appear larger in some cases (such as on a filesystem that rounds blocks up to the next 8K boundary).

Beyond this we would like the txzchk tool to help us deal with huge decompression expansion BEFORE we "untar".

So the idea of the txzchk tool was conceived. People who submit to the contests could pre-check their compressed tarball, and the IOCCC Judges could pre-check the compressed tarball before unpacking the entry. We want to be fair in that the people forming the entries would be using the same size test tool (i.e., txzchk) as the IOCCC Judges.

Now if someone has a clever hack on the xz compression algorithm (where a HUGE file compresses down to a tiny chunk of data), they might be able to form a compressed tarball that is under the compressed tarball size limit. However, the tar -t listing shows the size of the file that will be created, and so we need to pay attention to the sizes printed by tar -t. I.e., we don't want entries that include decompression exploders that bloat the un-tarred entry directory beyond a reasonable size.

So what algorithm should txzchk use in determining size? Realize that this algorithm needs to be explained in a simple sentence that will go into a rule, and be understood by people for whom English is not their primary language.

To resolve your question and the other issues we have raised above, what we need to do is come up with the English sentence that will go into the next IOCCC rules that controls the maximum size of the un-tarred entry AND works using the tar -t listing so that the compressed tarball can be checked prior to the un-tar.

The rule that covers the maximum size of a compressed tarball is simple. Something like this might do:

Your entry, when uploaded to the IOCCC submit server in the form of an XZ compressed tarball, must not be larger than XXX bytes.

However, the rule that governs what txzchk does is a different question and a different Rule sentence.

The stuff about 1K blocks and trying to avoid filesystem block incompatibilities may be creating more complexity than it's worth. Perhaps we should completely abandon the notion of block sizes and just count bytes?

The sum of the byte lengths of files in your entry (after they have been extracted from the compressed tarball) must not be greater than XXX.

That text avoids mentioning block sizes and it's just a simple sum of file lengths.

However, consider the case of an entry that makes use of zero length files. You could probably put 100,000 or more such zero length files into an entry and compress it down to a tarball that fits under the maximum tarball size.

A zero length file still occupies space in a directory. Moreover it would be ugly to try and build a web page for such an entry with so many files.

Now a zero length file does occupy space on the disk, particularly in the directory that contains it and the inode that references it. A du(1) that is working properly should show that a directory with 100,000 empty files occupies a fair amount of disk space.

So does one put a limit on the number of files in an entry? Perhaps we do.

Your entry must not contain more than XXX files (this includes all directories and mandatory files).

If we did this then the txzchk tool is simplified to mainly do this:

  • Check the file length of the compressed tarball
  • Sum the file lengths as reported by tar -t
  • Count the number of lines as reported by tar -t

These are just some random thoughts that we came up with at the moment. This idea is subject to change.

Nevertheless, we think the way to answer your question and to establish a proper algorithm for txzchk is to come up with the form of the IOCCC rules (relatively simple English sentences) and then change the tool to check the rule.

Comments, suggestions, and corrections welcome.

Great comment as well. I will have to wait until tomorrow to reply though, I'm afraid, as I can barely focus my eyes. I can say, though, that I certainly have some ideas. Also, you probably know I tend to be really good with words, so I might (depending on the circumstances) be able to help come up with ideas for the wording you use.

I wish I could say more now but I'm afraid I cannot do it well enough until tomorrow. I think this will be a good discussion and I'm glad I opened this issue all the more! As long as sleep is okish I should be able to reply before I have to go to the doctor and hopefully I can get some man page stuff done too.

Good day!

xexyl commented 2 years ago

I'm not sure if I have enough energy to address comment https://github.com/ioccc-src/mkiocccentry/issues/334#issuecomment-1239788826 fully right now, so what I'm going to do is rest and have a shower after that. I have some other things I have to do before going to the doctor. I hope to squeeze in a good reply before I have to leave, but if not I'll get to it tomorrow.

Hopefully it'll be today! I'm going to try resting now. I am feeling a bit sleepy though it's highly unlikely I'll get back to sleep. I'll not have the laptop up for a good while though: I suspect not for another three hours or so at the least.

xexyl commented 2 years ago

The other point of interest here is: what values should be compared against this (making sure that the number is <= the max, whether * 1024 or not)? Right now the following are done: if (txz_info.file_sizes > MAX_DIR_KSIZE) /* ... */ else if (txz_info.rounded_file_size > MAX_DIR_KSIZE) but should it be: if (txz_info.file_sizes > MAX_DIR_KSIZE * 1024) /* ... */ else if (txz_info.rounded_file_size > MAX_DIR_KSIZE * 1024) and if so, what do you think might be a nice name for a macro that is MAX_DIR_KSIZE * 1024? Maybe MAX_DIR_SIZE?

Can't sleep so replying. Maybe can do some man pages after that but that depends. I was trying to work something else out first and maybe I shouldn't have bothered. Anyway I'll at least reply to the below and see what else I do later.

That is an EXCELLENT question. And perhaps MAX_DIR_KSIZE is the wrong approach?

With what you bring up below it might very well be.

One might be tempted to use du -k on the unpacked entry directory; however, the du(1) command talks about disk utilization, not file size. So for a given filesystem that blocks data to a certain size, the resulting unpacked IOCCC entry directory might appear larger in some cases (such as on a filesystem that rounds blocks up to the next 8K boundary).

Right. Block size can change how much space a file takes up on disk. I know that an empty directory will still take space too. Also, as you bring up later, inodes come into play.

Beyond this we would like the txzchk tool to help us deal with huge decompression expansion BEFORE we "untar".

That makes sense.

So the idea of the txzchk tool was conceived. People who submit to the contests could pre-check their compressed tarball, and the IOCCC Judges could pre-check the compressed tarball before unpacking the entry. We want to be fair in that the people forming the entries would be using the same size test tool (i.e., txzchk) as the IOCCC Judges.

That makes sense too.

Now if someone has a clever hack on the xz compression algorithm (where a HUGE file compresses down to a tiny chunk of data), they might be able to form a compressed tarball that is under the compressed tarball size limit. However, the tar -t listing shows the size of the file that will be created, and so we need to pay attention to the sizes printed by tar -t. I.e., we don't want entries that include decompression exploders that bloat the un-tarred entry directory beyond a reasonable size.

This all makes sense, though I'm still curious if someone can manipulate a tarball so that tar -t shows the wrong file size as well. I imagine if anyone could do it you could do it. Might be possible with binary editors. Not sure.

So what algorithm should txzchk use in determining size? Realize that this algorithm needs to be explained in a simple sentence that will go into a rule, and be understood by people for whom English is not their primary language.

I was thinking that the file_size() function and the tar output listing (sum) might be it. But now I think on it I wonder if for example the user system had a different block size whether they would be able to submit the same size tarball - same size as compared to those with a different block size?

To resolve your question and the other issues we have raised above, what we need to do is come up with the English sentence that will go into the next IOCCC rules that controls the maximum size of the un-tarred entry AND works using the tar -t listing so that the compressed tarball can be checked prior to the un-tar.

The rule that covers the maximum size of a compressed tarball is simple. Something like this might do:

Your entry, when uploaded to the IOCCC submit server in the form of an XZ compressed tarball, must not be larger than XXX bytes.

A possible thought on the size issue wrt the block size. Could we use:

               blksize_t st_blksize; /* blocksize for file system I/O */

in some way? That's under Linux; macOS says:

         u_long   st_blksize;/* optimal file sys I/O ops blocksize */

and the comment in the macOS one is more descriptive. It's for optimal I/O so not a guarantee. Or maybe we can somehow use the field:

         blkcnt_t        st_blocks;        /* blocks allocated for file */

It might even be that st_size does not care about block size at all? I don't know. However, if block size does impact the file size, we could maybe have two numbers: for example, a secondary number which would be a buffer zone for larger block sizes. This might not be needed though: I don't know.
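
For what it's worth, st_size is the byte length of the file and does not depend on the filesystem block size, while st_blocks does. A quick C sketch to compare the two on any file:

#include <stdint.h>
#include <stdio.h>
#include <sys/stat.h>

/* sketch: print the byte length vs. the allocated disk blocks of a file */
int
main(int argc, char **argv)
{
    struct stat st;

    if (argc != 2 || stat(argv[1], &st) != 0) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }
    printf("st_size:   %jd bytes (block size independent)\n", (intmax_t)st.st_size);
    printf("st_blocks: %jd 512-byte blocks (filesystem dependent)\n", (intmax_t)st.st_blocks);
    return 0;
}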

If this does matter though the rule might have to be reworded. I won't try coming up with something until I have a better idea.

However, the rule that governs what txzchk does is a different question and a different Rule sentence.

The stuff about 1K blocks and trying to avoid filesystem block incompatibilities may be creating more complexity than it's worth. Perhaps we should completely abandon the notion of block sizes and just count bytes?

Ah right. Of course. So this means bytes might not be affected by the block size? I'm not sure. I thought they were, but if not then I think the number of bytes would be the better way. In fact that's how it is now, except that I check against MAX_DIR_KSIZE without multiplying it by 1024. Changing this would greatly simplify the rule too, I would think, as it's a simple value - number of bytes and nothing else (as far as size goes).

The sum of the byte lengths of files in your entry (after they have been extracted from the compressed tarball) must not be greater than XXX.

Does this mean that the unpacked tarball can have a different size limit? Files that compress poorly would add size, files that compress well would shrink it, and the tarball header information would also change the size, but I'm not sure if this is actually considered right now. If it's not, it would also have to be fixed.

That text avoids mentioning block sizes and it's just a simple sum of file lengths.

However, consider the case of an entry that makes use of zero length files. You could probably put 100,000 or more such zero length files into an entry and compress it down to a tarball that fits under the maximum tarball size.

I would think so, or certainly many files. But then the directory listing would be ridiculously long and could be considered abuse (I guess there might be exceptions).

A zero length file still occupies space in a directory. Moreover it would be ugly to try and build a web page for such an entry with so many files.

The web page part especially comes to mind. I wonder if GitHub has a limit on number of files in a directory?

Now a zero length file does occupy space on the disk, particularly in the directory that contains it and the inode that references it. A du(1) that is working properly should show that a directory with 100,000 empty files occupies a fair amount of disk space.

True.

So does one put a limit on the number of files in an entry? Perhaps we do.

Your entry must not contain more than XXX files (this includes all directories and mandatory files).

Question is how many files. The old way was 20, and this was often a burden to me because I had a lot of supplementary files. Now I suppose I could have used a tarball like Dave Burton did in 2018, but I didn't know if this would break the rule so I never risked it. I figured I could add files later, or in one case I had a script generate the other files.

If we did this then the txzchk tool is simplified to mainly do this:

  • Check the file length of the compressed tarball
  • Sum the file lengths as reported by tar -t and check against a maximum
  • Count the number of lines as reported by tar -t and check against a maximum

I think it would actually not be so simple. There are many checks that you had me put in that still would apply. Safe file names, correct dot files etc. Or do you mean this would be simplified in the sense of file size and (a new constant) number of files?

I would think that the directory (and only one allowed so maybe only the correct directory) should not count against the limit since that's required by the rules and not a regular file.

These are just some random thoughts that we came up with at the moment. This idea is subject to change.

Of course.

Nevertheless, we think the way to answer your question and to establish a proper algorithm for txzchk is to come up with the form of the IOCCC rules (relatively simple English sentences) and then change the tool to check the rule.

If you come up with the numbers would you like me to try some wording too? I'd be happy to do so.

Comments, suggestions, and corrections welcome.

Thank you for this! I consider it a real honour and privilege that you care about my opinions even about limits. Well and that you care about me as a person - but this is about the contest.

lcn2 commented 2 years ago

This all makes sense, though I'm still curious if someone can manipulate a tarball so that tar -t shows the wrong file size as well. I imagine if anyone could do it you could do it. Might be possible with binary editors. Not sure.

Well, if someone does something that is both fun and clever (instead of annoying), we might give them an abuse of the rules award and then adjust the rules to close down such a loophole. :-)

lcn2 commented 2 years ago

I was thinking that the file_size() function and the tar output listing (sum) might be it. But now I think on it I wonder if for example the user system had a different block size whether they would be able to submit the same size tarball - same size as compared to those with a different block size?

Sounds like a simple

   total_size += file_length;

is all that is needed, instead of some function that has complexity ... just our opinion.

UPDATE 0:

Given our comment 1242821987 we retract the comment above. A sum function is needed.

lcn2 commented 2 years ago

A possible thought on the size issue wrt the block size. ... This might not be needed though: I don't know.

We don't want to create filesystem-dependent rules. Just sum file lengths (NOT st_blocks, as files can have holes) and be done with it.
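
A C sketch of the holes point: a file that is one big hole reports a large st_size but few or no allocated blocks, so st_blocks would understate the length that tar -t reports (hole.dat is just an example name):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* sketch: create a 1 MiB file that is all hole, then compare the two sizes */
int
main(void)
{
    struct stat st;
    int fd = open("hole.dat", O_CREAT | O_WRONLY | O_TRUNC, 0644);

    if (fd < 0 || ftruncate(fd, 1024*1024) != 0 || fstat(fd, &st) != 0) {
        return 1;
    }
    printf("st_size:  %jd bytes\n", (intmax_t)st.st_size);          /* 1048576 */
    printf("on disk:  %jd bytes\n", (intmax_t)st.st_blocks * 512);  /* likely 0 */
    (void) close(fd);
    return 0;
}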

Handle zero length files and tiny files by placing a rational limit on the number of files. Let those who need more files use a tarball.

lcn2 commented 2 years ago

Does this mean that the unpacked tarball can have a different size limit? Files that compress poorly would add size, files that compress well would shrink it, and the tarball header information would also change the size, but I'm not sure if this is actually considered right now. If it's not, it would also have to be fixed.

The tarball (xz compressed) has a size limit. The sum of the lengths of the files in that tarball will have a limit. The number of files in the tarball will have a limit.

lcn2 commented 2 years ago

I think it would actually not be so simple. There are many checks that you had me put in that still would apply. Safe file names, correct dot files etc. Or do you mean this would be simplified in the sense of file size and (a new constant) number of files?

Well, txzchk needs to be well written and check for libc errors .. as is the case for other code in this repo.

There are rules about filenames and rules about directory paths, etc. Yes, txzchk needs to help check a number of things beyond sizes and number of files.

xexyl commented 2 years ago

This all makes sense, though I'm still curious if someone can manipulate a tarball so that tar -t shows the wrong file size as well. I imagine if anyone could do it you could do it. Might be possible with binary editors. Not sure.

Well, if someone does something that is both fun and clever (instead of annoying), we might give them an abuse of the rules award and then adjust the rules to close down such a loophole. :-)

Certainly. It could theoretically even be me but the problem is that I am like most programmers and don’t like bugs in my code so if I spotted an issue I would probably want to solve it. Still it sounds kind of like a fun and funny idea!

Otoh I suspect that if someone does do this you would like me to fix it and of course I would be honoured!

xexyl commented 2 years ago

I was thinking that the file_size() function and the tar output listing (sum) might be it. But now I think on it I wonder if for example the user system had a different block size whether they would be able to submit the same size tarball - same size as compared to those with a different block size?

Sounds like a simple

   total_size += file_length;

is all that is needed, instead of some function that has complexity ... just our opinion.

I believe it actually does this but there is more than one size to keep track of?

xexyl commented 2 years ago

A possible thought on the size issue wrt the block size. ... This might not be needed though: I don't know.

We don't want to create filesystem-dependent rules. Just sum file lengths (NOT st_blocks, as files can have holes) and be done with it.

Handle zero length files and tiny files by placing a rational limit on the number of files. Let those who need more files use a tarball.

Agree with this. These were just quick thoughts on the problem and not a suggestion one way or another.

And good point on file holes. And what about - well my tired head can’t think of the kind of file but it’s where they can appear really big but actually the content is not that big. What am I thinking of? It’s going to bug me not being able to think of the term though it will very possibly pop into my head when trying to go to sleep.

It would be funny if it’s actually hole but I don’t think it’s that for whatever reason.

xexyl commented 2 years ago

Does this mean that the unpacked tarball can have a different size limit? Files that compress poorly would add size, files that compress well would shrink it, and the tarball header information would also change the size, but I'm not sure if this is actually considered right now. If it's not, it would also have to be fixed.

The tarball (xz compressed) has a size limit. The sum of the lengths of the files in that tarball will have a limit. The number of files in the tarball will have a limit.

Which we need to decide upon and possibly (afterwards) discuss how it might go.

lcn2 commented 2 years ago

I was thinking that the file_size() function and the tar output listing (sum) might be it. But now I think on it I wonder if for example the user system had a different block size whether they would be able to submit the same size tarball - same size as compared to those with a different block size?

Sounds like a simple

   total_size += file_length;

is all that is needed, instead of some function that has complexity ... just our opinion.

I believe it actually does this but there is more than one size to keep track of?

No .. only one total file length size for an entry.

UPDATE 0:

Given our comment 1242821987 we retract the comment above. A sum function is needed.

xexyl commented 2 years ago

I think it would actually not be so simple. There are many checks that you had me put in that still would apply. Safe file names, correct dot files etc. Or do you mean this would be simplified in the sense of file size and (a new constant) number of files?

Well, txzchk needs to be well written and check for libc errors .. as is the case for other code in this repo.

It already does doesn’t it? Typing on the phone so can’t easily check but pretty sure I did. I certainly checked for NULL pointers and free memory etc.

Were you thinking of something specific I missed?

There are rules about filenames and rules about directory paths, etc. Yes, txzchk needs to help check a number of things beyond sizes and number of files.

Right. And it does quite a few checks. It also has a pretty extensive report (depending on verbosity level) at the end though I know I failed to add the most recent checks as I wanted to get the fixes in.

xexyl commented 2 years ago

I was thinking that the file_size() function and the tar output listing (sum) might be it. But now I think on it I wonder if for example the user system had a different block size whether they would be able to submit the same size tarball - same size as compared to those with a different block size?

Sounds like a simple

   total_size += file_length;

is all that is needed, instead of some function that has complexity ... just our opinion.

I believe it actually does this but there is more than one size to keep track of?

No .. only one total file length size for an entry.

Hmm okay. But I thought there is the rounded size, the tarball size and the total size of all files summed from each line in the tar output?

What should it do instead and what macro should be used (max size I mean)?

lcn2 commented 2 years ago

I was thinking that the file_size() function and the tar output listing (sum) might be it. But now I think on it I wonder if for example the user system had a different block size whether they would be able to submit the same size tarball - same size as compared to those with a different block size?

Sounds like a simple

   total_size += file_length;

is all that is needed, instead of some function that has complexity ... just our opinion.

I believe it actually does this but there is more than one size to keep track of?

No .. only one total file length size for an entry.

Hmm okay. But I thought there is the rounded size, the tarball size and the total size of all files summed from each line in the tar output?

What should it do instead and what macro should be used (max size I mean)?

No rounding needed when file blocking is ignored (which it should be) .. just the file length sum, the size of the tarball, and the number of files in the tarball, in terms of the so-called size rules.

xexyl commented 2 years ago

I was thinking that the file_size() function and the tar output listing (sum) might be it. But now I think on it I wonder if for example the user system had a different block size whether they would be able to submit the same size tarball - same size as compared to those with a different block size?

Sounds like a simple

   total_size += file_length;

is all that is needed, instead of some function that has complexity ... just our opinion.

I believe it actually does this but there is more than one size to keep track of?

No .. only one total file length size for an entry.

Hmm okay. But I thought there is the rounded size, the tarball size and the total size of all files summed from each line in the tar output? What should it do instead and what macro should be used (max size I mean)?

No rounding needed when file blocking is ignored (which it should be) .. just the file length sum, the size of the tarball, and the number of files in the tarball, in terms of the so-called size rules.

So should the rounding up to the nearest multiple of 1024 be removed?

lcn2 commented 2 years ago

I was thinking that the file_size() function and the tar output listing (sum) might be it. But now I think on it I wonder if for example the user system had a different block size whether they would be able to submit the same size tarball - same size as compared to those with a different block size?

Sounds like a simple

   total_size += file_length;

is all that is needed, instead of some function that has complexity ... just our opinion.

I believe it actually does this but there is more than one size to keep track of?

No .. only one total file length size for an entry.

Hmm okay. But I thought there is the rounded size, the tarball size and the total size of all files summed from each line in the tar output? What should it do instead and what macro should be used (max size I mean)?

No rounding needed when file blocking is ignored (which it should be) .. just the file length sum, the size of the tarball, and the number of files in the tarball, in terms of the so-called size rules.

So should the rounding up to the nearest multiple of 1024 be removed?

See commit 229c0c1faa455469474e783c8160c3f6d6310cab

lcn2 commented 2 years ago

The use of MAX_DIR_KSIZE should be removed in this repo.

Code that used MAX_DIR_KSIZE (both mkiocccentry and txzchk) should instead use and test against the new MAX_SUM_FILELEN and MAX_FILE_COUNT values.

UPDATE 0

No rounding needed.

The MAX_FILE_COUNT is a file count for all files (including the required 5). Anything that is NOT a file should NOT be counted with respect to this new constant.

This value has NOT been discussed by the IOCCC judges and is thus highly subject to change. Nevertheless we know MAX_FILE_COUNT will be > 5 and < infinity :-)
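
A sketch of what the limit_ioccc.h entries might look like; both values below are illustrative placeholders only, as the actual numbers had not been settled:

#define MAX_SUM_FILELEN (27651*1024)    /* maximum sum of all file lengths in bytes (placeholder value) */
#define MAX_FILE_COUNT (42)             /* maximum number of files, including the required 5 (placeholder value) */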

When MAX_DIR_KSIZE is no longer used, it should be removed from limit_ioccc.h.

xexyl commented 2 years ago

The use of MAX_DIR_KSIZE should be removed in this repo.

Code that used MAX_DIR_KSIZE (both mkiocccentry and txzchk) should instead use and test against the new MAX_SUM_FILELEN and MAX_FILE_COUNT values.

Decided to quickly reply so I have something for tomorrow morning.

Would you please tell me what values should be compared to which macros?

Thank you! I will get to it soon. But first a long sleep. Cheers!

xexyl commented 2 years ago

The use of MAX_DIR_KSIZE should be removed in this repo. Code that used MAX_DIR_KSIZE (both mkiocccentry and txzchk) should instead use and test against the new MAX_SUM_FILELEN and MAX_FILE_COUNT values.

Decided to quickly reply so I have something for tomorrow morning.

Would you please tell me what values should be compared to which macros?

Thank you! I will get to it soon. But first a long sleep. Cheers!

Oh, in this case: should it be <= or <? I believe, because it's a max, it should be the former, but without looking at it I want to be sure.

Sleep time for me though I will be awake a while yet but until I lie down that process won’t start.

Sleep well when you do, and welcome home! BTW, am I correct that NASA had another problem with that rocket? I hope it's worked out soon, and I am sorry you couldn't be there.

Years ago when I was a kid they (not NASA but some rocket company) regularly did tests around here. IIRC it was Friday mornings and it was incredibly annoying. They might have even had a nuclear mess (certainly some company did here, but I am not sure if it was the same company without looking). Anyway I don't mean to write this here but I am trying to hurry off to bed.

Well more from me tomorrow. Good night!

lcn2 commented 2 years ago

The use of MAX_DIR_KSIZE should be removed in this repo. Code that used MAX_DIR_KSIZE (both mkiocccentry and txzchk) should instead use and test against the new MAX_SUM_FILELEN and MAX_FILE_COUNT values.

Decided to quickly reply so I have something for tomorrow morning.

Would you please tell me what values should be compared to which macros?

Thank you! I will get to it soon. But first a long sleep. Cheers!

Does comment 1242819402 answer that?

lcn2 commented 2 years ago

Oh, in this case: should it be <= or <? I believe, because it's a max, it should be the former, but without looking at it I want to be sure.

As these are MAX (i.e., limit) values, <= is OK (assuming, of course, that they are integers AND they are not negative).
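
In code form, a sketch of the inclusive test, using the names from the comments above:

if (sum >= 0 && sum <= MAX_SUM_FILELEN &&
    count > 0 && count <= MAX_FILE_COUNT) {
    /* sum and count are within the size rule limits */
}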

lcn2 commented 2 years ago

Consider the test_txzchk/good/entry.12345678-1234-4321-abcd-1234567890ab-2.1924343546.txt file:

drwxr-xr-x  0 501    20          0 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/
-rw-r--r--  0 501    20       1854 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/Makefile
-rw-r--r--  0 501    20          4 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/extra2
-rw-r--r--  0 501    20       2815 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/foo
-rw-r--r--  0 501    20         61 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/prog.c
-rw-r--r--  0 501    20       2859 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/.author.json
-rw-r--r--  0 501    20       4454 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/remarks.md
-rw-r--r--  0 501    20       5235 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/bar
-rw-r--r--  0 501    20       1550 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/.info.json
-rw-r--r--  0 501    20          4 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/extra1

The file count is 9 (and this is currently <= MAX_FILE_COUNT).

The sum of the file lengths is 18836 (and this is currently <= MAX_SUM_FILELEN).
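
(For the record: 1854 + 4 + 2815 + 61 + 2859 + 4454 + 5235 + 1550 + 4 = 18836.)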

Now let's consider some malformed tar listings ... ignore how such a listing might arise, just assume somehow this happens:

drwxr-xr-x  0 501    20       1000 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/
-rw-r--r--  0 501    20       1854 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/Makefile
-rw-r--r--  0 501    20          4 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/extra2
-rw-r--r--  0 501    20       2815 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/foo
-rw-r--r--  0 501    20         61 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/prog.c
-rw-r--r--  0 501    20       2859 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/.author.json
-rw-r--r--  0 501    20       4454 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/remarks.md
-rw-r--r--  0 501    20       5235 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/bar
-rw-r--r--  0 501    20       1550 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/.info.json
-rw-r--r--  0 501    20          4 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/extra1

The sum of the file lengths is still 18836, even though the directory size is 1000. Only the sum of the file lengths matters.

Assume somehow this happens:

drwxr-xr-x  0 501    20       1000 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/
-rw-r--r--  0 501    20       1854 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/Makefile
-rw-r--r--  0 501    20          4 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/extra2
-rw-r--r--  0 501    20       2815 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/foo
-rw-r--r--  0 501    20         61 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/prog.c
-rw-r--r--  0 501    20       2859 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/.author.json
-rw-r--r--  0 501    20       4454 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/remarks.md
-rw-r--r--  0 501    20       5235 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/bar
-rw-r--r--  0 501    20       1550 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/.info.json
drw-r--r--  0 501    20          4 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/extra1/

Now the sum of the file lengths is just 18832, because the lengths of directories do not count towards the sum.

Yes, this entry would be rejected because of the sub-directory too, but that issue is beyond the scope of this comment.

Assume somehow this happens:

drwxr-xr-x  0 501    20       1000 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/
-rw-r--r--  0 501    20       1854 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/Makefile
-rw-r--r--  0 501    20        4.0 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/extra2
-rw-r--r--  0 501    20       2815 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/-foo
-rw-r--r--  0 501    20        -61 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/prog.c
-rw-r--r--  0 501    20       2859 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/.author.json
-rw-r--r--  0 501    20       4454 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/remarks.md
-rw-r--r--  0 501    20       5235 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/bar
-rw-r--r--  0 501    20       155a Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/.info.json
drw-r--r--  0 501    20       fred Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/extra1/

This entry should be rejected because the length of extra2 is 4.0, which is not an integer; because the length of prog.c is negative and the length of extra1 is not a number; because the length of .info.json is NOT a base 10 integer; and because the -foo filename starts with an invalid character, etc.
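
A C sketch of the kind of strict length check implied here (is_decimal_length is a hypothetical helper that rejects 4.0, -61, 155a and fred alike):

#include <ctype.h>
#include <stdbool.h>

/*
 * Sketch: accept only a non-empty string of base 10 digits, so values
 * such as "4.0", "-61", "155a" and "fred" are all rejected.
 */
static bool
is_decimal_length(char const *str)
{
    if (str == NULL || *str == '\0') {
        return false;
    }
    for (; *str != '\0'; ++str) {
        if (!isdigit((unsigned char)*str)) {
            return false;   /* non-digit: not a non-negative base 10 integer */
        }
    }
    return true;
}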

Nevertheless and focusing on the topic of this comment, only these lengths would be summed with respect to the MAX_SUM_FILELEN value:

-rw-r--r--  0 501    20       1854 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/Makefile
-rw-r--r--  0 501    20       2815 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/-foo
-rw-r--r--  0 501    20       2859 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/.author.json
-rw-r--r--  0 501    20       4454 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/remarks.md
-rw-r--r--  0 501    20       5235 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/bar

because only those files have a base 10 integer length that is not negative.
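
(That sum works out to 1854 + 2815 + 2859 + 4454 + 5235 = 17217.)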

And nevertheless and focusing on the topic of this comment, only these files should be counted with respect to the MAX_FILE_COUNT value:

-rw-r--r--  0 501    20       1854 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/Makefile
-rw-r--r--  0 501    20        4.0 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/extra2
-rw-r--r--  0 501    20       2815 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/-foo
-rw-r--r--  0 501    20        -61 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/prog.c
-rw-r--r--  0 501    20       2859 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/.author.json
-rw-r--r--  0 501    20       4454 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/remarks.md
-rw-r--r--  0 501    20       5235 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/bar
-rw-r--r--  0 501    20       155a Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/.info.json

because only those are files.

lcn2 commented 2 years ago

BTW, consider this fictional listing:

drwxr-xr-x  0 501    20          0 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/
crw-r--r--  0 501    20       1854 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/Makefile
brw-r--r--  0 501    20          4 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/extra2
lrw-r--r--  0 501    20       2815 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/foo
prw-r--r--  0 501    20         61 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/prog.c
srw-r--r--  0 501    20       2859 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/.author.json
wrw-r--r--  0 501    20       4454 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/remarks.md
Srw-r--r--  0 501    20       5235 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/bar
Lrw-r--r--  0 501    20       1550 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/.info.json
-rw-r--r--  0 501    20          4 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/extra1

While the entry should be rejected for a number of reasons, with respect to MAX_SUM_FILELEN the sum is 4 and with respect to MAX_FILE_COUNT the file count is 1.

Yes. txzchk and mkiocccentry should reject such files / reject such a tarball for various reasons. For the purpose of MAX_FILE_COUNT, ONLY files matter. For the sum of file lengths for MAX_SUM_FILELEN, ONLY non-negative integer base 10 lengths of files count towards the sum.
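
A C sketch of the only-files rule, keyed off the first character of each tar -tv line (is_regular_file_entry is a hypothetical helper):

#include <stdbool.h>

/*
 * Sketch: only lines whose type character is '-' (a regular file) count
 * toward MAX_FILE_COUNT and have their lengths summed for MAX_SUM_FILELEN.
 * Directories are ignored for both; other types should be rejected elsewhere.
 */
static bool
is_regular_file_entry(char type)
{
    return type == '-';     /* 'd', 'l', 'c', 'b', 'p', 's', ... do not count */
}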

UPDATE 0a:

Here is some pseudo-C code for a paranoid file length sum:

code example removed in favor of commit 4857137ad46d004ba9e20b0b41dd8820b2c2dc0c

UPDATE 1a:

BTW: Such careful numeric processing comes from years of experience in writing code that searches for new largest known primes, where there is NO tolerance for errors. The key is defense in depth, with a rational level of code paranoia, AND making it intractable for bogus data to fool the count or file length sum into looking valid.

There are several rather subtle aspects in the code committed below: most are very intentional .. except for any typos or bugs. :-) For example, we attempt to make it much harder for a stack smash to allow an invalid count or sum to pass.

p.s. We retract comment 1242812163 and comment 1242817660 based on this comment. A sum function IS needed.

UPDATE 2a:

See commit 4857137ad46d004ba9e20b0b41dd8820b2c2dc0c

Example usage:


/* ... static values private to some .c file (outside of any function) ... */

static intmax_t sum_check;
static intmax_t count_check;

/* ... at start of function that is checking the total file length sum and count ... */

intmax_t sum = 0;
intmax_t count = 0;
intmax_t length = 0;
bool test = false;

/* ... loop the following over ALL files where length_str is the length of the current file ... */

/* 
 * convert tarball file length string into a value to sum
 */
test = string_to_intmax2(length_str, &length);
if (test == false) {
    ... object to a bogus file length string ...
}

/*
 * carefully sum and count this file's length
 */
if (length < 0) {
    ... object to a negative file length ...
}
test = sum_and_count(length, &sum, &count, &sum_check, &count_check);
if (test == false) {
    ... object to internal/computational error ...
}
if (sum < 0) {
    ... object to negative total file length  ...
}
if (sum > MAX_SUM_FILELEN) {
    ... object to sum of all file lengths being too large ...
}
if (count < 0) {
    ... object to a negative file count ...
}
if (count == 0) {
    ... object to a zero file count ...
}
if (count > MAX_FILE_COUNT) {
    ... object to too many files ...
}

Of course, for code such as mkiocccentry, where you have the file length as an integer, the call to string_to_intmax2() can be skipped.

xexyl commented 2 years ago

BTW, consider this fictional listing:

drwxr-xr-x  0 501    20          0 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/
crw-r--r--  0 501    20       1854 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/Makefile
brw-r--r--  0 501    20          4 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/extra2
lrw-r--r--  0 501    20       2815 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/foo
prw-r--r--  0 501    20         61 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/prog.c
srw-r--r--  0 501    20       2859 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/.author.json
wrw-r--r--  0 501    20       4454 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/remarks.md
Srw-r--r--  0 501    20       5235 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/bar
Lrw-r--r--  0 501    20       1550 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/.info.json
-rw-r--r--  0 501    20          4 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/extra1

While the entry should be rejected for a number of reasons, with respect to MAX_SUM_FILELEN the sum is 4 and with respect to MAX_FILE_COUNT the file count is 1.

Yes. txzchk and mkiocccentry should reject such files / reject such a tarball for various reasons. For the purpose of MAX_FILE_COUNT, ONLY files matter. For the sum of file lengths for MAX_SUM_FILELEN, ONLY non-negative integer base 10 lengths of files count towards the sum.

The fact that I cannot see why the tools should reject those files tells me that I should not work on this today - or not now at least. Sorry! I hope I will feel more able to reply tomorrow. I hope you have a good day!

(EDIT: Hours later I see it ... a quick glance while I was still not very awake prevented it, and with another quick glance later I saw it immediately.)

As I said elsewhere, Tuesday I will be unable to do much of anything, but hopefully tomorrow I should be able to do some things (including maybe work on the new issue you opened based on my comment in another thread). Tomorrow I do have a Zoom meeting, but that's all that's scheduled.

--

I should be able to reply to any replies to my email today though .. depending on what time the replies come in.

Going to do something else. Maybe I'll be able to focus more in a bit. I hope so.

xexyl commented 2 years ago

I'll try replying to some of this anyway. Not sure it'll be complete today though.

Consider the test_txzchk/good/entry.12345678-1234-4321-abcd-1234567890ab-2.1924343546.txt file:

drwxr-xr-x  0 501    20          0 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/
-rw-r--r--  0 501    20       1854 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/Makefile
-rw-r--r--  0 501    20          4 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/extra2
-rw-r--r--  0 501    20       2815 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/foo
-rw-r--r--  0 501    20         61 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/prog.c
-rw-r--r--  0 501    20       2859 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/.author.json
-rw-r--r--  0 501    20       4454 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/remarks.md
-rw-r--r--  0 501    20       5235 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/bar
-rw-r--r--  0 501    20       1550 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/.info.json
-rw-r--r--  0 501    20          4 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/extra1

The file count is 9 (and this is currently <= MAX_FILE_COUNT).

The sum of the file lengths is 18836 (and this is currently <= MAX_SUM_FILELEN).

Right.

Now let's consider some malformed tar listings ... ignore how such a listing might arise, just assume somehow this happens:

Good idea to have these examples and perhaps once it's all resolved they should be in the bad subdirectory!

drwxr-xr-x  0 501    20       1000 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/
-rw-r--r--  0 501    20       1854 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/Makefile
-rw-r--r--  0 501    20          4 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/extra2
-rw-r--r--  0 501    20       2815 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/foo
-rw-r--r--  0 501    20         61 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/prog.c
-rw-r--r--  0 501    20       2859 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/.author.json
-rw-r--r--  0 501    20       4454 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/remarks.md
-rw-r--r--  0 501    20       5235 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/bar
-rw-r--r--  0 501    20       1550 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/.info.json
-rw-r--r--  0 501    20          4 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/extra1

The sum of the file lengths is still 18836, even though the directory size is 1000. Only the sum of the file lengths matters.

So about that. What is the size of the directory? With the following files:

$ ls -al test
total 8
drwxr-xr-x    3 cody  staff    96 Sep 11 12:51 ./
drwxr-xr-x  270 cody  staff  8640 Sep 11 12:49 ../
-rw-r--r--    1 cody  staff    10 Sep 11 12:51 test

the total number of blocks used by the files in that directory is 8. But if, for example (under macOS), I tar the directory like so:

$ tar cvf test.tar test
a test
a test/test

and then list the contents:

$ tar fvt test.tar 
drwxr-xr-x  0 cody   staff       0 Sep 11 12:51 test/
-rw-r--r--  0 cody   staff      10 Sep 11 12:51 test/test

I see the directory size of test is 0. So what does that mean? How can it be 0 when there are blocks being used? I know with ls one can change BLOCKSIZE via one or more options and the environment variable itself. But still, why should that be 0?

Assume somehow this happens:

drwxr-xr-x  0 501    20       1000 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/
-rw-r--r--  0 501    20       1854 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/Makefile
-rw-r--r--  0 501    20          4 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/extra2
-rw-r--r--  0 501    20       2815 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/foo
-rw-r--r--  0 501    20         61 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/prog.c
-rw-r--r--  0 501    20       2859 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/.author.json
-rw-r--r--  0 501    20       4454 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/remarks.md
-rw-r--r--  0 501    20       5235 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/bar
-rw-r--r--  0 501    20       1550 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/.info.json
drw-r--r--  0 501    20          4 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/extra1/

Now the sum of the file lengths is just 18832, because the lengths of directories do not count towards the sum.

In other words, because the sum of all the files found in the tar listing results in that value, right? (I haven't tried it - it's hard to focus right now, but I'm trying to get some discussion going.)

Yes, this entry would be rejected because of the sub-directory too, but that issue is beyond the scope of this comment.

Assume somehow this happens:

drwxr-xr-x  0 501    20       1000 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/
-rw-r--r--  0 501    20       1854 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/Makefile
-rw-r--r--  0 501    20        4.0 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/extra2
-rw-r--r--  0 501    20       2815 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/-foo
-rw-r--r--  0 501    20        -61 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/prog.c
-rw-r--r--  0 501    20       2859 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/.author.json
-rw-r--r--  0 501    20       4454 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/remarks.md
-rw-r--r--  0 501    20       5235 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/bar
-rw-r--r--  0 501    20       155a Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/.info.json
drw-r--r--  0 501    20       fred Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/extra1/

This entry should be rejected because the length of extra2 is 4.0, which is not an integer; because the length of prog.c is negative; because the length of extra1 is not a number; because the length of .info.json is NOT a base 10 integer; because the -foo filename starts with an invalid character; etc.

Plus 'fred' being there. But I actually wonder how this output would fare with the tool now. I don't know. I know the negative sizes will change the total size, but it seems like this might be in need of change, based on some of the comments (not sure if it's this one or another or more than one).
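
For reference, here is a minimal sketch of the kind of strict base 10 length check being described (parse_length() is a hypothetical stand-in of my own; the repo's actual function is string_to_intmax2(), which may differ in detail):

#include <ctype.h>
#include <errno.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * parse_length - hypothetical sketch of a strict base 10 length parse
 *
 * Rejects strings such as "fred", "4.0", "155a", "-61" and
 * leading zero forms such as "0123".
 */
static bool
parse_length(char const *str, intmax_t *len)
{
    char *end = NULL;

    if (str == NULL || len == NULL || !isdigit((unsigned char)str[0])) {
        return false;   /* NULL, empty, negative or non-digit start */
    }
    if (str[0] == '0' && str[1] != '\0') {
        return false;   /* reject leading zeros such as 0123 */
    }
    errno = 0;
    *len = strtoimax(str, &end, 10);
    if (errno != 0 || *end != '\0') {
        return false;   /* overflow, or trailing junk such as 4.0 or 155a */
    }
    return true;
}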

Nevertheless and focusing on the topic of this comment, only these lengths would be summed with respect to the MAX_SUM_FILELEN value:

-rw-r--r--  0 501    20       1854 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/Makefile
-rw-r--r--  0 501    20       2815 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/-foo
-rw-r--r--  0 501    20       2859 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/.author.json
-rw-r--r--  0 501    20       4454 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/remarks.md
-rw-r--r--  0 501    20       5235 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/bar

because only those files have a base 10 integer length that is not negative.

So what should be done with the invalid lines? Certainly they should count against the entry, but in what way? I see you wrote a function (that I've not had time to look at), so maybe it will do what I need, but having clarity here would also be good, please.

And nevertheless and focusing on the topic of this comment, only these files should be counted with respect to the MAX_FILE_COUNT value:

-rw-r--r--  0 501    20       1854 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/Makefile
-rw-r--r--  0 501    20        4.0 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/extra2
-rw-r--r--  0 501    20       2815 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/-foo
-rw-r--r--  0 501    20        -61 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/prog.c
-rw-r--r--  0 501    20       2859 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/.author.json
-rw-r--r--  0 501    20       4454 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/remarks.md
-rw-r--r--  0 501    20       5235 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/bar
-rw-r--r--  0 501    20       155a Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/.info.json

because only those are files.

Regular files. Yes. I had thought of that earlier - because I thought you said in some comment that directories would count too. I found that strange, but maybe I misread it or the idea was changed.
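
A minimal sketch of that test, assuming (as in the listings above) that the file type character is the first character of each tar listing line (the helper name is hypothetical):

/*
 * is_regular_file_line - hypothetical sketch: only lines whose mode
 * field starts with '-' describe regular files; 'd', 'c', 'b', 'l',
 * 'p', 's' and friends neither count toward MAX_FILE_COUNT nor sum
 * toward MAX_SUM_FILELEN.
 */
static bool
is_regular_file_line(char const *line)
{
    return line != NULL && line[0] == '-';
}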

xexyl commented 2 years ago

BTW, consider this fictional listing:

drwxr-xr-x  0 501    20          0 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/
crw-r--r--  0 501    20       1854 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/Makefile
brw-r--r--  0 501    20          4 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/extra2
lrw-r--r--  0 501    20       2815 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/foo
prw-r--r--  0 501    20         61 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/prog.c
srw-r--r--  0 501    20       2859 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/.author.json
wrw-r--r--  0 501    20       4454 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/remarks.md
Srw-r--r--  0 501    20       5235 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/bar
Lrw-r--r--  0 501    20       1550 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/.info.json
-rw-r--r--  0 501    20          4 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/extra1

While the entry should be rejected for a number of reasons, with respect to MAX_SUM_FILELEN the sum is 4 and with respect to MAX_FILE_COUNT the file count is 1.

Yes. txzchk and mkiocccentry should reject such files / reject such a tarball for various reasons. For the purpose of MAX_FILE_COUNT, ONLY files matter. For the sum of file lengths for MAX_SUM_FILELEN, ONLY non-negative integer base 10 lengths of files count towards the sum.

What am I missing here? What should be rejected? At a quick glance, anyway, they seem to be the usual. It will probably be obvious later on when looking again tomorrow, or after you point out something that's right in front of me.

(EDIT: No need to answer this .. looking at the actual comment again, it was immediately visible what is wrong with these.)

UPDATE 0a:

Here is some pseudo-C code for a paranoid file length sum:

code example removed in favor of commit 4857137

UPDATE 1a:

BTW: Such careful numeric processing comes from years of experience in writing new largest known prime finding computation code where there is NO tolerance for errors. The key is defense in depth with a rational level of code paranoia AND to make it intractable for bogus data to fool the count or file length sum into looking like it is valid.

I imagine so! (And I hope you picked up on the pun :-) )

There are several rather subtle aspects in the code committed below: most are very intentional .. except for any typos or bugs. :-) For example, we attempt to make it much harder for a stack smash to allow an invalid count or sum to pass.

Just to be clear, since you seem to have rolled back some ideas: the code you refer to in the part of the comment this is replying to - that is still in the repo, right?

p.s. We retract comment 1242812163 and comment 1242817660 based on this comment. A sum function IS needed.

UPDATE 2a:

See commit 4857137

Example usage:

/* ... static values private to some .c file (outside of any function) ... */

static intmax_t sum_check;
static intmax_t count_check;

/* ... at start of function that is checking the total file length sum and count ... */

intmax_t sum = 0;
intmax_t count = 0;
intmax_t length = 0;
bool test = false;

/* ... loop the following over ALL files where length_str is the length of the current file ... */

/*
 * convert tarball file length string into a value to sum
 */
test = string_to_intmax2(length_str, &length);
if (test == false) {
    /* ... object to a bogus file length string ... */
}

/*
 * carefully sum and count this file's length
 */
if (length < 0) {
    /* ... object to a negative file length ... */
}
test = sum_and_count(length, &sum, &count, &sum_check, &count_check);
if (test == false) {
    /* ... object to internal/computational error ... */
}
if (sum < 0) {
    /* ... object to a negative total file length ... */
}
if (sum > MAX_SUM_FILELEN) {
    /* ... object to the sum of all file lengths being too large ... */
}
if (count < 0) {
    /* ... object to a negative file count ... */
}
if (count == 0) {
    /* ... object to a zero file count ... */
}
if (count > MAX_FILE_COUNT) {
    /* ... object to too many files ... */
}

Of course, for code such as mkiocccentry, where you have the file length as an integer, the call to string_to_intmax2() can be skipped.

Of course (on not using that function).

But just to be clear: these checks should be added after all lines have been parsed, right? If so, that should not be a problem, as it'll all be stored in struct txz_line entries on the linked list txz_lines. I could just iterate through them all and flag any issues via struct txz_info.

In order for me to really get into this though I'll have to be in a better state. I'm afraid that's probably all I can do today.

That being said, I will ask you finally: since this has to be done for mkiocccentry - where would it be done? I mean some of it - like counting the files - is done indirectly via txzchk. So, maybe worded better: what parts need to be added to mkiocccentry too, and where?

xexyl commented 2 years ago

BTW, consider this fictional listing:

drwxr-xr-x  0 501    20          0 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/
crw-r--r--  0 501    20       1854 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/Makefile
brw-r--r--  0 501    20          4 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/extra2
lrw-r--r--  0 501    20       2815 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/foo
prw-r--r--  0 501    20         61 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/prog.c
srw-r--r--  0 501    20       2859 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/.author.json
wrw-r--r--  0 501    20       4454 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/remarks.md
Srw-r--r--  0 501    20       5235 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/bar
Lrw-r--r--  0 501    20       1550 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/.info.json
-rw-r--r--  0 501    20          4 Jun  4 04:52 12345678-1234-4321-abcd-1234567890ab-2/extra1

Oh! I see now. Looking at the actual comment makes it easier. It's the non-regular files. These are already checked, so they should be fine.

EDIT 0

Actually, the way it is done even allows for the crazy chance that a new type is created, as it checks for just valid chars via strspn(). So it's safe from some change to POSIX or some bogus implementation of tar / whatever.
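
Something like this strspn() based sketch, where the accepted character set is my assumption (the set txzchk actually checks may differ):

#include <stdbool.h>
#include <string.h>

/*
 * valid_mode_field - hypothetical sketch: accept a mode field made up
 * solely of characters known to appear in tar listings; anything
 * else (say, a brand new file type character) fails the test instead
 * of being silently misclassified.
 */
static bool
valid_mode_field(char const *mode)
{
    static char const known[] = "-bcdlpsSTtwxr";    /* assumed set */

    return mode != NULL && mode[0] != '\0' &&
           strspn(mode, known) == strlen(mode);
}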

lcn2 commented 2 years ago

So about that. What is the size of the directory? With the following files:

From the Rule 2 perspective, we don't care. A proper IOCCC entry will just have a single directory as sub-directories in the source are NOT allowed. This isn't to say that there can never be a sub-directory. An IOCCC entry is free to create sub-directories via their Makefile or via their program, etc. But from the perspective of the XZ compressed tarball, there is ONLY one directory. As Rule 2 will focus on files in that one directory, the space that the one directory occupies is NOT considered in the size.

An advantage of ignoring the size of the one directory is that we avoid filesystem specific directory issues.

So ignore the one directory, both in terms of count and the sum of the file lengths, because it is a directory.

To prevent someone from grossly abusing the one directory and filling it up with zero length files, we limit the number of files to MAX_FILE_COUNT. So again, the one directory can be ignored.

lcn2 commented 2 years ago

So what should be done with the invalid lines? Certainly they should count against the entry, but in what way? I see you wrote a function (that I've not had time to look at), so maybe it will do what I need, but having clarity here would also be good, please.

There is more than one reason to reject an entry. :-)

From the sum_and_count() function perspective, if the tar listing line is a file, sum and count it; otherwise ignore it from that function's perspective.

Of course a bogus filename, 2nd directory, a directory that is NOT a top level directory, duplicate filenames, something that is NOT a file nor the top level directory, malformed tar listing lines, lines that have a username / group name instead of a UID/ GID, etc. All of these are reasons for txzchk to reject the entry. Just perhaps NOT for Rule 2 reasons. :-)

A similar rule applies to mkiocccentry, but here the tool is forming a directory and dealing with files to copy into that directory. True, mkiocccentry will run txzchk on the XZ compressed tarball that it formed (as a sanity check), but similar checking for Rule 2 and similar checking for that other stuff applies.

FYI: Rule 2 used to focus only on the size of prog.c and iocccsize stuff. The txzchk tool doesn't concern itself with the size of prog.c, especially as the user may have requested a rule_2a_override or rule_2b_override. However, there won't be a Rule 2c, nor Rule 2d, nor etc. override, so they will not be able to submit an XZ compressed tarball larger than MAX_TARBALL_LEN, nor a sum of file lengths larger than MAX_SUM_FILELEN, nor more than MAX_FILE_COUNT files.

lcn2 commented 2 years ago

Plus 'fred' being there. But I actually wonder how this output would fare with the tool now. I don't know.

Some of these listing ideas should be put under test_txzchk/bad/ for testing purposes.

lcn2 commented 2 years ago

But just to be clear: these checks should be added after all lines have been parsed, right? If so, that should not be a problem, as it'll all be stored in struct txz_line entries on the linked list txz_lines. I could just iterate through them all and flag any issues via struct txz_info.

It is up to you as far as how you want txzchk to handle it. Just as long as all tarball file listing field strings are processed by string_to_intmax2() to attempt to get a file length (or fail because it is something like fred or 3.0 or 123a or 0123) AND checked that the file length is > 0 AND, for those that are, passed to sum_and_count() for summing and counting. How you do that is up to you, so long as all files in the tar listing are processed.
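
In loop form, that pipeline might look like this sketch (the struct txz_line field names text, length_str and next, the txz_info.invalid_sizes flag, and is_regular_file_line() are all assumptions for illustration; only string_to_intmax2() and sum_and_count() are the repo's names):

/*
 * Sketch only: iterate the tar listing lines, summing and counting
 * every regular file.  Field and flag names are hypothetical.
 */
for (line = txz_lines; line != NULL; line = line->next) {
    intmax_t length = 0;

    if (!is_regular_file_line(line->text)) {
        continue;                       /* non-files neither count nor sum */
    }
    if (!string_to_intmax2(line->length_str, &length) || length < 0) {
        txz_info.invalid_sizes = true;  /* bogus or negative length string */
        continue;
    }
    if (!sum_and_count(length, &sum, &count, &sum_check, &count_check)) {
        txz_info.invalid_sizes = true;  /* internal/computational error */
        break;
    }
}
/* then apply the sum and count bound checks from the usage example above */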

In order for me to really get into this though I'll have to be in a better state. I'm afraid that's probably all I can do today.

Best wishes on your state change for the better!

lcn2 commented 2 years ago

That being said, I will ask you finally: since this has to be done for mkiocccentry - where would it be done? I mean some of it - like counting the files - is done indirectly via txzchk. So, maybe worded better: what parts need to be added to mkiocccentry too, and where?

Well, in the case of mkiocccentry you are NOT dealing with a tar listing, but rather the file length from a stat(2) call. So there isn't a need to call string_to_intmax2(). Just pass the st_size value directly to sum_and_count() when the item is a file (i.e., when (st_mode&S_IFMT) == S_IFREG is true).

Yes, mkiocccentry should sum and count and check the result. If the sum or count is exceeded, issue an error and decline to form a compressed tarball, just as if it was given a bogus filename, or a file that does not exist, or a directory, or some special non-file/non-directory, etc.
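A minimal sketch of that mkiocccentry side, where check_file() is a hypothetical helper of my own and sum_and_count() is assumed to have the signature from the usage example above:

#include <stdbool.h>
#include <stdint.h>
#include <sys/stat.h>

/* assumed signature, per the usage example above */
extern bool sum_and_count(intmax_t length, intmax_t *sum, intmax_t *count,
                          intmax_t *sum_check, intmax_t *count_check);

/*
 * check_file - hypothetical sketch: sum and count one path via stat(2)
 */
static bool
check_file(char const *path, intmax_t *sum, intmax_t *count,
           intmax_t *sum_check, intmax_t *count_check)
{
    struct stat st;

    if (stat(path, &st) != 0) {
        return false;                   /* file does not exist, etc. */
    }
    if ((st.st_mode & S_IFMT) != S_IFREG) {
        return false;                   /* not a regular file: neither sum nor count */
    }
    return sum_and_count((intmax_t)st.st_size, sum, count, sum_check, count_check);
}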

lcn2 commented 2 years ago

Just to be clear, since you seem to have rolled back some ideas: the code you refer to in the part of the comment this is replying to - that is still in the repo, right?

Correct.

xexyl commented 2 years ago

So about that. What is the size of the directory? With the following files:

From the Rule 2 perspective, we don't care. A proper IOCCC entry will just have a single directory as sub-directories in the source are NOT allowed. This isn't to say that there can never be a sub-directory. An IOCCC entry is free to create sub-directories via their Makefile or via their program, etc. But from the perspective of the XZ compressed tarball, there is ONLY one directory. As Rule 2 will focus on files in that one directory, the space that the one directory occupies is NOT considered in the size.

Right. I didn't think it counted against files - it would not even seem fair. But this makes me wonder if I actually do include the directory size as part of the sum. I'll have to check that one.

What if it's a subdirectory? Of course it'll be rejected, but what should the action be as far as size goes? Also, I was more generally asking what the directory size is supposed to mean. I guess it depends on the block size too, but I'm not sure what it's supposed to mean, as clearly the example I gave took more than 0 bytes and yet the directory size was reported as 0.

An advantage of ignoring the size of the one directory is that we avoid filesystem specific directory issues.

Of course.

So ignore the one directory, both in terms of count and the sum of the file lengths, because it is a directory.

Well, that goes back to the other thought I had - if we ignore one directory, what about the others? And how do we decide which one to ignore? I would think the correct directory for the entry based on fnamchk, though of course if that tool fails we don't have that information available.

To prevent someone from grossly abusing the one directory and filling it up with zero length files, we limit the number of files to MAX_FILE_COUNT. So again, the one directory can be ignored.

Right. But what about other directories? It'll be rejected as an invalid entry, but what should be done with it as far as reporting goes? I like to be complete, as you know!

xexyl commented 2 years ago

So what should be done with the invalid lines? Certainly they should count against the entry, but in what way? I see you wrote a function (that I've not had time to look at), so maybe it will do what I need, but having clarity here would also be good, please.

There is more than one reason to reject an entry. :-)

That's true.

From the sum_and_count() function perspective, if the tar listing line is a file, sum and count it; otherwise ignore it from that function's perspective.

As above, what if it's not the expected directory? Do I sum those up? What about files inside those subdirectories?

Of course a bogus filename, 2nd directory, a directory that is NOT a top level directory, duplicate filenames, something that is NOT a file nor the top level directory, malformed tar listing lines, lines that have a username / group name instead of a UID/ GID, etc. All of these are reasons for txzchk to reject the entry. Just perhaps NOT for Rule 2 reasons. :-)

Which of course it does already for all of these, though as far as malformed lines go, I did put a note in the BUGS section of the man page asking to please report it if someone comes across a format we're unfamiliar with. Obviously if they're doing this to abuse the tool, the request will be rejected. But then this brings up an interesting idea I just had:

What if someone, prior to the contest opening (or during it), suggests that they found a new format that's not actually real, and they mean to abuse the adding of it somehow? Is this a concern from the IOCCC judging perspective?

A similar rule applies to mkiocccentry, but here the tool is forming a directory and dealing with files to copy into that directory. True, mkiocccentry will run txzchk on the XZ compressed tarball that it formed (as a sanity check), but similar checking for Rule 2 and similar checking for that other stuff applies.

You mean that mkiocccentry also does these tests? Of course, as far as duplicate files go, I would think that it would not be possible the way mkiocccentry does it, since it would overwrite the files each time. At least I would think so - I haven't looked at how it's done in a long while.

FYI: Rule 2 used to focus only on the size of prog.c and iocccsize stuff. The txzchk tool doesn't concern itself with the size of prog.c, especially as the user may have requested a rule_2a_override or rule_2b_override. However, there won't be a Rule 2c, nor Rule 2d, nor etc. override, so they will not be able to submit an XZ compressed tarball larger than MAX_TARBALL_LEN, nor a sum of file lengths larger than MAX_SUM_FILELEN, nor more than MAX_FILE_COUNT files.

Yes. But by 'used to focus on' .. do you mean that it will focus on other things now too? If so, that would require a change in a number of tools. From an earlier comment you made, I actually wondered if this is happening. Or did I misunderstand?

xexyl commented 2 years ago

Plus 'fred' being there. But I actually wonder how this output would fare with the tool now. I don't know.

Some of these listing ideas should be put under test_txzchk/bad/ for testing purposes.

I had the same idea - but only after anything needing fixing is fixed - else it would break make test.

xexyl commented 2 years ago

But just to be clear: these checks should be added after all lines have been parsed, right? If so, that should not be a problem, as it'll all be stored in struct txz_line entries on the linked list txz_lines. I could just iterate through them all and flag any issues via struct txz_info.

It is up to you as far as how you want txzchk to handle it. Just as long as all tarball file listing field strings are processed by string_to_intmax2() to attempt to get a file length (or fail because it is something like fred or 3.0 or 123a or 0123) AND checked that the file length is > 0 AND, for those that are, passed to sum_and_count() for summing and counting. How you do that is up to you, so long as all files in the tar listing are processed.

Of course. It was kind of a question too, though. Previously (I think) each time a new size is parsed it is added to the total. However, now it should use this new set of code (and perhaps some of it - the tests - should be put in a separate function .. the question is, does it belong in txzchk, or could it be used in other tools and so better fit in util.c?), so I wonder if maybe it has to wait until all lines are parsed. I'm not sure how the function works yet.

Then again I'll have to look at how I have it working now and I am unfortunately too tired to really do anything with it today :(

Okay, looking at that function briefly, I see it takes a pointer to a previous size, so I can just use it each time I encounter a new file. That means I don't have to wait until after all lines are parsed. Much of it can stay the same, but the difference is that I now have to use the new functions and tests in the code rather than the tests I have. Then, depending on the result, I can flag issues. Some of the issues might need to be detected differently though - not sure yet.

In order for me to really get into this though I'll have to be in a better state. I'm afraid that's probably all I can do today.

Best wishes on your state change for the better!

Thank you! Unfortunately I woke up way too early again so I'm not sure what I can do today but at least the discussion should be able to continue.

xexyl commented 2 years ago

That being said, I will ask you finally: since this has to be done for mkiocccentry - where would it be done? I mean some of it - like counting the files - is done indirectly via txzchk. So, maybe worded better: what parts need to be added to mkiocccentry too, and where?

Well, in the case of mkiocccentry you are NOT dealing with a tar listing, but rather the file length from a stat(2) call. So there isn't a need to call string_to_intmax2(). Just pass the st_size value directly to sum_and_count() when the item is a file (i.e., when (st_mode&S_IFMT) == S_IFREG is true).

Of course. But that means for each file processed, it should do much the same thing as txzchk will do, only not from a string but from an actual integer. This suggests to me (if I follow you right) that this should be a new set of functions in util.c. Want me to write these and then integrate them into both tools? I can do that, though I don't know if that'll be today or else (at the earliest, as tomorrow is most likely completely shot) Wednesday.

Yes, mkiocccentry should sum and count and check the result. If the sum or count is exceeded, issue an error and decline to form a compressed tarball, just as if it was given a bogus filename, or a file that does not exist, or a directory, or some special non-file/non-directory, etc.

Right.

xexyl commented 2 years ago

Just to be clear, since you seem to have rolled back some ideas: the code you refer to in the part of the comment this is replying to - that is still in the repo, right?

Correct.

Thanks. Though of course I no longer know what the code is! But I don't think I need to, as the function names are above.

xexyl commented 2 years ago

FYI: I'm waiting to hear back on the above before I make the fixes. I'm not sure if this is necessary or not, but I'd rather only have to do it once.

lcn2 commented 2 years ago

So about that. What is the size of the directory? With the following files:

From the Rule 2 perspective, we don't care. A proper IOCCC entry will just have a single directory as sub-directories in the source are NOT allowed. This isn't to say that there can never be a sub-directory. An IOCCC entry is free to create sub-directories via their Makefile or via their program, etc. But from the perspective of the XZ compressed tarball, there is ONLY one directory. As Rule 2 will focus on files in that one directory, the space that the one directory occupies is NOT considered in the size.

Right. I didn't think it counted against files - it would not even seem fair. But this makes me wonder if I actually do include the directory size as part of the sum. I'll have to check that one.

What if it's a subdirectory? Of course it'll be rejected, but what should the action be as far as size goes? Also, I was more generally asking what the directory size is supposed to mean. I guess it depends on the block size too, but I'm not sure what it's supposed to mean, as clearly the example I gave took more than 0 bytes and yet the directory size was reported as 0.

An advantage of ignoring the size of the one directory is that we avoid filesystem specific directory issues.

Of course.

So ignore the one directory, both in terms of count and the sum of the file lengths, because it is a directory.

Well, that goes back to the other thought I had - if we ignore one directory, what about the others? And how do we decide which one to ignore? I would think the correct directory for the entry based on fnamchk, though of course if that tool fails we don't have that information available.

To prevent someone from grossly abusing the one directory and filling it up with zero length files, we limit the number of files to MAX_FILE_COUNT. So again, the one directory can be ignored.

Right. But what about other directories? It'll be rejected as an invalid entry, but what should be done with it as far as reporting goes? I like to be complete, as you know!

Sub-directories, not being files, do not count nor do they sum.

Files, even those in sub-directories, are files and therefore they count and sum.

Files count and sum. Non-files don't.