Closed xexyl closed 2 years ago
This is your comment in reply to mine:
Yes. Size of tarball but not size of files, right?
OK, probably not MAX_TARBALL_LEN .. perhaps MAX_DIR_KSIZE instead. That was an error. Yes, make use of the fact that the total unpacked directory must be < MAX_DIR_KSIZE*1024 bytes. Use that value to bound file sizes and the sum of files.
So for any given file, the tar listing size must be < MAX_DIR_KSIZE*1024 bytes. When summing file sizes found in the tar listing, round up the size to the next 1K bytes before adding it to the sum (to account for block sizes on disk that are on the order of 1K).
Watch for negative sizes.
Watch for when the sum of file sizes found in the tar listing goes negative.
Watch for when the sum of file sizes, after adding the size of another file, becomes smaller than the previous sum.
Hope that helps .. gotta run!
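A minimal sketch of the checks described in that comment, assuming MAX_DIR_KSIZE from limit_ioccc.h and intmax_t sums (this is an illustration, not the repo's actual code):

#include <stdbool.h>
#include <stdint.h>

/* round size up to the next 1K block and add it to *sum; false on any suspect value */
static bool
add_rounded_size(intmax_t size, intmax_t *sum)
{
    intmax_t rounded;

    if (size < 0 || *sum < 0) {
        return false;                           /* watch for negative sizes and sums */
    }
    if (size > INTMAX_MAX - 1023) {
        return false;                           /* rounding up would overflow */
    }
    rounded = ((size + 1023) / 1024) * 1024;    /* round up to the next 1K bytes */
    if (*sum > INTMAX_MAX - rounded) {
        return false;                           /* sum would wrap and become smaller */
    }
    *sum += rounded;
    return *sum < (intmax_t)MAX_DIR_KSIZE * 1024;   /* total must stay under MAX_DIR_KSIZE*1024 bytes */
}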
The issue I have here is should it be MAX_DIR_KSIZE or MAX_DIR_KSIZE * 1024? Originally I thought it should be MAX_DIR_KSIZE but looking at the value I now wonder. It does say:
#define MAX_DIR_KSIZE (27651)	/* entry directory size limit in kibibyte (1024 byte) blocks */
and the name suggests it should be * 1024, but I thought originally you noted it was not * 1024. However that could have been a mistake as well.
The other point of interest here is: what values should be compared against this (making sure that the number is <= the max, either * 1024 or not)? Right now the following are done:
if (txz_info.file_sizes > MAX_DIR_KSIZE)
    /* ... */
else if (txz_info.rounded_file_size > MAX_DIR_KSIZE)
but should it be:
if (txz_info.file_sizes > MAX_DIR_KSIZE * 1024)
    /* ... */
else if (txz_info.rounded_file_size > MAX_DIR_KSIZE * 1024)
and if so what do you think might be a nice name for a macro that is MAX_DIR_KSIZE * 1024? Maybe MAX_DIR_SIZE?
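For illustration only (the naming is exactly what is in question here), such a macro might read:
#define MAX_DIR_SIZE (MAX_DIR_KSIZE * 1024)	/* entry directory size limit in bytes */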
Finally, should any other value be compared against this limit?
Thanks! I'll check if I can find the other pending issue I'm aware of.
...I found the issue in mail but GitHub doesn't want to load it well, so for now I'll hold off. It has to do with the test script. I'm close to it in GitHub, but I want to see if I can actually focus on something, which right now I think means documentation. However I hope to rest as soon as the backup drive has cooled down once I can umount it (in use right now, probably from either Spotlight or Sophos).
Hope you're having a nice day my friend!
If you would please assign it to me that would be great though maybe you'll want to assign it to you too since you discuss it as well: I leave that to you of course. One comment coming up shortly.
Once we have laptop and landline access, we can consider such administrative actions. On this cell phone we will only make limited comments.
That makes sense to me. No rush.
As for the other issue here, I think it was to do with the test script. I gave up looking for it, but I have an idea what to look for; I just had other things come up. Of course there might have been other things as well, but this was to do with the test script - a discussion about it. I have a good idea what was stated but it might be better to have the actual comments. This will happen at another time.
The issue I have here is should it be MAX_DIR_KSIZE or MAX_DIR_KSIZE * 1024? Originally I thought it should be MAX_DIR_KSIZE but looking at the value I now wonder. It does say:
#define MAX_DIR_KSIZE (27651)	/* entry directory size limit in kibibyte (1024 byte) blocks */
and the name suggests it should be * 1024, but I thought originally you noted it was not * 1024. However that could have been a mistake as well.
For better or for worse, the idea of the MAX_DIR_KSIZE constant was to state the maximum size in 1024 byte blocks. That was the idea of the K in that constant. We could see getting rid of this confusion and just have the value be:
#define MAX_DIR_SIZE (27651*1024) /* entry directory size limit in bytes */
If the use of a constant created confusion, then perhaps the above change is best?
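With such a change, the checks quoted above would presumably collapse to something like this sketch (assuming the txz_info fields hold byte totals):
if (txz_info.file_sizes > MAX_DIR_SIZE) {
    /* ... object: sum of file sizes exceeds the byte limit ... */
} else if (txz_info.rounded_file_size > MAX_DIR_SIZE) {
    /* ... object: rounded size exceeds the byte limit ... */
}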
I don't mind either way. I did it as I thought you had said back then. Maybe I misread it, or I was tired, or something else. If I understand you right, I should change the previous define to that and then update the name in txzchk and it will be fine?
Or would you prefer adding the macro? Either way I won't get to it today but maybe I can tomorrow.
The other point of interest here is: what values should be compared against this (making sure that the number is <= the max, either * 1024 or not)? Right now the following are done:
if (txz_info.file_sizes > MAX_DIR_KSIZE)
    /* ... */
else if (txz_info.rounded_file_size > MAX_DIR_KSIZE)
but should it be:
if (txz_info.file_sizes > MAX_DIR_KSIZE * 1024)
    /* ... */
else if (txz_info.rounded_file_size > MAX_DIR_KSIZE * 1024)
and if so what do you think might be a nice name for a macro that is MAX_DIR_KSIZE * 1024? Maybe MAX_DIR_SIZE?
That is an EXCELLENT question. And perhaps MAX_DIR_KSIZE is the wrong approach?
One might be tempted to use du -k on the unpacked entry directory; however, the du(1) command talks about disk utilization, not file size. So for a given file system that blocks data to a certain size, the resulting unpacked IOCCC entry directory might be larger in some cases (such as on a filesystem that rounds blocks up to the next 8K boundary).
Beyond this we would like the txzchk tool to help us deal with huge decompression expansion BEFORE we "untar".
So the idea of the txzchk tool was conceived. People who submit to the contests could pre-check their compressed tarball and the IOCCC Judges could pre-check the compressed tarball before unpacking the entry. We want to be fair in that the people forming the entries would be using the same size test tool (i.e., txzchk) as the IOCCC Judges.
Now if someone has a clever hack on the xz compression algorithm (where a HUGE file compresses down to a tiny chunk of data), they might be able to form a compressed tarball that is under the compressed tarball size limit. However the tar -t listing shows the size of the file that will be created, and so we need to pay attention to the sizes printed by tar -t. I.e., we don't want entries that include decompression exploders that bloat the un-tarred entry directory beyond a reasonable size.
So what algorithm should txzchk use in determining size? Realize that this algorithm needs to be explained in a simple sentence that will go into a rule, and be understood by people for whom English is not their primary language.
To resolve your question and the other issues we have raised above, what we need to do is come up with the English sentence that will go into the next IOCCC rules that controls the maximum size of the un-tarred entry AND works using the tar -t listing so that the compressed tarball can be checked prior to the un-tar.
The rule that covers the maximum size of a compressed tarball is simple. Something like this might do:
Your entry, when uploaded to the IOCCC submit server in the form of an XZ compressed tarball, must not be larger than XXX bytes.
However the rule that governs what txzchk does is a different question and a different Rule sentence.
The stuff about 1K blocks and trying to avoid filesystem block incompatibilities may be creating more complexity than it's worth. Perhaps we should completely abandon the notion of block sizes and just count bytes?
The sum of the byte lengths of files in your entry (after they have been extracted from the compressed tarball) must not be greater than XXX.
That text avoids mentioning block sizes and it's just a simple sum of file lengths.
However consider the case of an entry that makes use of zero length files. You could probably put 100,000 or more such zero length files into an entry and compress it down to a tarball that fits under the maximum tarball size.
A zero length file still occupies space in a directory. Moreover it would be ugly to try and build a web page for such an entry with so many files.
Now a zero length file does occupy space on the disk, particularly in the directory that contains it and the inode that references it. A du(1) that is working properly should show that a directory with 100,000 empty files occupies a fair amount of disk space.
So does one put a limit on the number of files in an entry? Perhaps we do.
Your entry must not contain more than XXX files (this includes all directories and mandatory files).
If we did this then the txzchk tool is simplified to mainly do this:
- Check the file length of the compressed tarball
- Sum the file lengths as reported by tar -t and check against a maximum
- Count the number of lines as reported by tar -t and check against a maximum
This is just some random thoughts that we came up with at the moment. This idea is subject to change.
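As a rough standalone sketch of those three checks (NOT txzchk's actual implementation; the limits here are made up, and this assumes a BSD-style tar -tJvf listing like the ones later in this thread, where the size is the fifth field):

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

#define MAX_SUM_BYTES ((intmax_t)27651*1024)	/* hypothetical byte limit */
#define MAX_FILES (42)				/* hypothetical file count limit */

int
main(void)
{
    char line[8192];
    intmax_t size;
    intmax_t sum = 0;
    long files = 0;

    /* feed this program the output of: tar -tJvf entry.txz */
    while (fgets(line, sizeof(line), stdin) != NULL) {
        if (line[0] != '-') {
            continue;				/* only regular files count */
        }
        /* skip mode, link count, uid and gid fields, then read the size */
        if (sscanf(line, "%*s %*s %*s %*s %jd", &size) != 1 || size < 0) {
            fprintf(stderr, "bogus size field: %s", line);
            return 1;
        }
        /* a real tool would also guard this sum against overflow, as discussed later in this thread */
        sum += size;
        ++files;
    }
    printf("files: %ld, total file length: %jd\n", files, sum);
    return (files <= MAX_FILES && sum <= MAX_SUM_BYTES) ? 0 : 1;
}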
Nevertheless, we think the way to answer your question and to establish a proper algorithm for txzchk is to come up with the form of the IOCCC rules (relatively simple English sentences) and then change the tool to check the rule.
Comments, suggestions and corrections welcome.
Great comment as well. I will have to wait to reply until tomorrow though, I'm afraid, as I can barely focus my eyes. I can say though that I certainly have some ideas. Also, as you probably know, I tend to be really good with words, so I might (depending on the circumstances) be able to help come up with the wording.
I wish I could say more now but I'm afraid I cannot do it well enough until tomorrow. I think this will be a good discussion and I'm glad I opened this issue all the more! As long as sleep is okish I should be able to reply before I have to go to the doctor and hopefully I can get some man page stuff done too.
Good day!
I'm not sure if I have enough energy to address comment https://github.com/ioccc-src/mkiocccentry/issues/334#issuecomment-1239788826 fully right now, so what I'm going to do is rest and have a shower after that. I have some other things to do before going to the doctor. I hope to squeeze in a good reply to this before I have to leave, but if not I'll get to it tomorrow.
Hopefully it'll be today! I'm going to try resting now. I am feeling a bit sleepy though it's highly unlikely I'll get back to sleep. I'll not have the laptop up for a good while though: I suspect not for another three hours or so at the least.
Can't sleep so replying. Maybe can do some man pages after that but that depends. I was trying to work something else out first and maybe I shouldn't have bothered. Anyway I'll at least reply to the below and see what else I do later.
That is an EXCELLENT question. And perhaps MAX_DIR_KSIZE is the wrong approach?
With what you bring up below it might very well be.
One might be tempted to use du -k on the unpacked entry directory; however, the du(1) command talks about disk utilization, not file size. So for a given file system that blocks data to a certain size, the resulting unpacked IOCCC entry directory might be larger in some cases (such as on a filesystem that rounds blocks up to the next 8K boundary).
Right. Block size can change how much space a file takes on disk. I know that an empty directory will still take space too. Also, as you bring up later, inodes come into play.
Beyond this we would like the txzchk tool to help us deal with huge decompression expansion BEFORE we "untar".
That makes sense.
So the idea of the txzchk tool was conceived. People who submit to the contests could pre-check their compressed tarball and the IOCCC Judges could pre-check the compressed tarball before unpacking the entry. We want to be fair in that the people forming the entries would be using the same size test tool (i.e., txzchk) as the IOCCC Judges.
That makes sense too.
Now if someone has a clever hack on the xz compression algorithm (where a HUGE file compresses down to a tiny chunk of data), they might be able to form a compressed tarball that is under the compressed tarball size limit. However the tar -t listing shows the size of the file that will be created, and so we need to pay attention to the sizes printed by tar -t. I.e., we don't want entries that include decompression exploders that bloat the un-tarred entry directory beyond a reasonable size.
This all makes sense though I'm still curious if someone can manipulate a tarball so that the -t listing shows the wrong file size as well. I imagine if anyone could do it you could do it. Might be possible with binary editors. Not sure.
So what algorithm should txzchk use in determining size? Realize that this algorithm needs to be explained in a simple sentence that will go into a rule, and be understood by people for whom English is not their primary language.
I was thinking that the file_size() function and the tar output listing (sum) might be it. But now I think on it I wonder if, for example, the user's system had a different block size, whether they would be able to submit the same size tarball - same size as compared to those with a different block size?
To resolve your question and the other issues we have raised above, what we need to do is come up with the English sentence that will go into the next IOCCC rules that controls the maximum size of the un-tarred entry AND works using the tar -t listing so that the compressed tarball can be checked prior to the un-tar.
The rule that covers the maximum size of a compressed tarball is simple. Something like this might do:
Your entry, when uploaded to the IOCCC submit server in the form of an XZ compressed tarball, must not be larger than XXX bytes.
A possible thought on the size issue wrt the block size. Could we use:
blksize_t st_blksize; /* blocksize for file system I/O */
in some way? That's under Linux; macOS says:
u_long st_blksize; /* optimal file sys I/O ops blocksize */
and the comment in the macOS one is more descriptive. It's for optimal I/O so not a guarantee. Or maybe we can somehow use the field:
blkcnt_t st_blocks; /* blocks allocated for file */
It might even be that st_size does not even care about block size? I don't know. However as far as block size goes, if it does impact the file size we could maybe have two numbers: for example there might be a secondary number which would be a buffer zone for larger block sizes. This might not be needed though: I don't know.
If this does matter though the rule might have to be reworded. I won't try coming up with something until I have a better idea.
However the rule that governs what txzchk does is a different question and a different Rule sentence.
The stuff about 1K blocks and trying to avoid filesystem block incompatibilities may be creating more complexity than it's worth. Perhaps we should completely abandon the notion of block sizes and just count bytes?
Ah right. Of course. So this means bytes might not be affected by the block size? I'm not sure. I thought they were, but if not then I think the number of bytes would be the better way. In fact that's how it is now, except I check against MAX_DIR_KSIZE without multiplying it by 1024. Changing this would greatly simplify the rule too, I would think, as it's a simple value: a number of bytes and nothing else (as far as size goes).
The sum of the byte lengths of files in your entry (after they have been extracted from the compressed tarball) must not be greater than XXX.
Does this mean that the unpacked tarball can have a different size limit? Compression would make some files bigger and some smaller, and the tarball header information would change the size as well, but I'm not sure if this is actually considered right now. If it's not it would also have to be fixed.
That text avoids mentioning block sizes and it's just a simple sum of file lengths.
However consider the case of an entry that makes use of zero length files. You could probably put 100,000 or more such zero length files into an entry and compress it down to a tarball that fits under the maximum tarball size.
I would think so or certainly many files. But then the directory would be ridiculously long and could be considered (I guess there might be exceptions) abuse.
A zero length file still occupies space in a directory. Moreover it would be ugly to try and build a web page for such an entry with so many files.
The web page part especially comes to mind. I wonder if GitHub has a limit on number of files in a directory?
Now a zero length file does occupy space on the disk, particularly in the directory that contains it and the inode that references it. A du(1) that is working properly should show that a directory with 100,000 empty files occupies a fair amount of disk space.
True.
So does one put a limit on the number of files in an entry? Perhaps we do.
Your entry must not contain more than XXX files (this includes all directories and mandatory files).
Question is how many files. The old way was 20 and this was often a burden to me because I had a lot of supplementary files. Now I guess I could have used a tarball like I guess Dave Burton did in 2018 but I didn't know if this would break the rule so I never risked it. I figured I could add files later or in one case I had a script generate the other files.
If we did this then the txzchk tool is simplified to mainly do this:
- Check the file length of the compressed tarball
- Sum the file lengths as reported by tar -t and check against a maximum
- Count the number of lines as reported by tar -t and check against a maximum
I think it would actually not be so simple. There are many checks that you had me put in that still would apply. Safe file names, correct dot files etc. Or do you mean this would be simplified in the sense of file size and (a new constant) number of files?
I would think that the directory (and only one allowed so maybe only the correct directory) should not count against the limit since that's required by the rules and not a regular file.
This is just some random thoughts that we came up with at the moment. This idea is subject to change.
Of course.
Nevertheless, we think the way to answer your question and to establish a proper algorithm for txzchk is to come up with the form of the IOCCC rules (relatively simple English sentences) and then change the tool to check the rule.
If you come up with the numbers would you like me to try some wording too? I'd be happy to do so.
Comments, suggestions and corrections welcome.
Thank you for this! I consider it a real honour and privilege that you care about my opinions even about limits. Well and that you care about me as a person - but this is about the contest.
This all makes sense though I'm still curious if someone can manipulate a tarball so that the -t shows the wrong file size as well. I imagine if anyone could do it you could do it. Might be possible with binary editors. Not sure.
Well if someone does something that is both fun and clever (instead of annoying), we might give them an abuse of the rules award and then adjust the rules to close down such a loophole. :-)
I was thinking that the file_size() function and the tar output listing (sum) might be it. But now I think on it I wonder if, for example, the user's system had a different block size, whether they would be able to submit the same size tarball - same size as compared to those with a different block size?
Sounds like a simple
total_size += file_length;
is all that is needed, instead of some function that has complexity ... just our opinion.
Given our comment 1242821987 we retract the comment above. A sum function is needed.
A possible thought on the size issue wrt the block size. ... This might not be needed though: I don't know.
We don't want to create filesystem-dependent rules. Just sum file lengths (NOT st_blocks, as files can have holes) and be done with it.
Handle zero length files and tiny files by placing a rational limit on the number of files. Let those who need more files use a tarball.
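To see why st_blocks would mislead, here is a small standalone demo (an illustration, not repo code) that creates a sparse file: st_size counts the hole, while st_blocks only counts allocated 512-byte blocks:

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int
main(void)
{
    struct stat st;
    int fd = open("hole.bin", O_CREAT | O_WRONLY | O_TRUNC, 0644);

    if (fd < 0) {
        return 1;
    }
    /* seek 1M out and write a single byte: everything before it is a hole */
    if (lseek(fd, 1024 * 1024, SEEK_SET) < 0 || write(fd, "x", 1) != 1) {
        close(fd);
        return 1;
    }
    close(fd);
    if (stat("hole.bin", &st) == 0) {
        printf("st_size: %lld st_blocks: %lld\n",
               (long long)st.st_size, (long long)st.st_blocks);
    }
    return 0;
}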
Does this mean that the unpacked tarball can have a different size limit? Compression would make some files bigger and some smaller, and the tarball header information would change the size as well, but I'm not sure if this is actually considered right now. If it's not it would also have to be fixed.
The tarball (xz compressed) has a size limit. The sum of the lengths of the files in that tarball will have a limit. The number of files in the tarball will have a limit.
I think it would actually not be so simple. There are many checks that you had me put in that still would apply. Safe file names, correct dot files etc. Or do you mean this would be simplified in the sense of file size and (a new constant) number of files?
Well, txzchk needs to be well written and check for libc errors .. as is the case for other code in this repo.
There are rules about filenames and rules about directory paths, etc. Yes, txzchk needs to help check a number of things beyond sizes and number of files.
This all makes sense though I'm still curious if someone can manipulate a tarball so that the -t shows the wrong file size as well. I imagine if anyone could do it you could do it. Might be possible with binary editors. Not sure.
Well if someone does something that is both fun and clever (instead of annoying), we might give them an abuse of the rules award and then adjust the rules to close down such a loophole. :-)
Certainly. It could theoretically even be me, but the problem is that I am like most programmers and don't like bugs in my code, so if I spotted an issue I would probably want to solve it. Still, it sounds kind of like a fun and funny idea!
Otoh I suspect that if someone does do this you would like me to fix it and of course I would be honoured!
I was thinking that the file_size() function and the tar output listing (sum) might be it. But now I think on it I wonder if, for example, the user's system had a different block size, whether they would be able to submit the same size tarball - same size as compared to those with a different block size?
Sounds like a simple
total_size += file_length;
is all that is needed, instead of some function that has complexity ... just our opinion.
I believe it actually does this but there is more than one size to keep track of?
A possible thought on the size issue wrt the block size. ... This might not be needed though: I don't know.
We don't want to create filesystem-dependent rules. Just sum file lengths (NOT st_blocks, as files can have holes) and be done with it.
Handle zero length files and tiny files by placing a rational limit on the number of files. Let those who need more files use a tarball.
Agree with this. These were just quick thoughts on the problem and not a suggestion one way or another.
And good point on file holes. And what about - well my tired head can’t think of the kind of file but it’s where they can appear really big but actually the content is not that big. What am I thinking of? It’s going to bug me not being able to think of the term though it will very possibly pop into my head when trying to go to sleep.
It would be funny if it’s actually hole but I don’t think it’s that for whatever reason.
Does this mean that the unpacked tarball can have a different size limit? Compression would make some files bigger and some smaller, and the tarball header information would change the size as well, but I'm not sure if this is actually considered right now. If it's not it would also have to be fixed.
The tarball (xz compressed) has a size limit. The sum of the lengths of the files in that tarball will have a limit. The number of files in the tarball will have a limit.
Which we need to decide upon and possibly (afterwards) discuss how it might go.
I believe it actually does this but there is more than one size to keep track of?
No .. only one total file length size for an entry.
Given our comment 1242821987 we retract the comment above. A sum function is needed.
I think it would actually not be so simple. There are many checks that you had me put in that still would apply. Safe file names, correct dot files etc. Or do you mean this would be simplified in the sense of file size and (a new constant) number of files?
Well, txzchk needs to be well written and check for libc errors .. as is the case for other code in this repo.
It already does doesn’t it? Typing on the phone so can’t easily check but pretty sure I did. I certainly checked for NULL pointers and free memory etc.
Were you thinking of something specific I missed?
There are rules about filenames and rules about directory paths, etc. Yes, txzchk needs to help check a number of things beyond sizes and number of files.
Right. And it does quite a few checks. It also has a pretty extensive report (depending on verbosity level) at the end though I know I failed to add the most recent checks as I wanted to get the fixes in.
I believe it actually does this but there is more than one size to keep track of?
No .. only one total file length size for an entry.
Hmm okay. But I thought there is the rounded size, the tarball size and the total size of all files summed from each line in the tar output?
What should it do instead and what macro should be used (max size I mean)?
Hmm okay. But I thought there is the rounded size, the tarball size and the total size of all files summed from each line in the tar output?
What should it do instead and what macro should be used (max size I mean)?
No rounding needed when file blocking is ignored (which it should be) .. just file length sum, size of the tarball, and number of files in the tarball in terms of the so-called size rules.
No rounding needed when file blocking is ignored (which it should be) .. just file length sum, size of the tarball, and number of files in the tarball in terms of the so-called size rules.
So should the rounding up to the nearest multiple of 1024 be removed?
So should the rounding up to the nearest multiple of 1024 be removed?
See commit 229c0c1faa455469474e783c8160c3f6d6310cab
The use of MAX_DIR_KSIZE should be removed in this repo.
Code that used MAX_DIR_KSIZE (both mkiocccentry and txzchk) should instead use and test against the new MAX_SUM_FILELEN and MAX_FILE_COUNT values.
No rounding needed.
The MAX_FILE_COUNT is a file count for all files (including the required 5). Anything that is NOT a file should NOT be counted with respect to this new constant.
This value has NOT been discussed by the IOCCC judges and is thus highly subject to change. Nevertheless we know MAX_FILE_COUNT will be > 5 and < infinity. :-)
When MAX_DIR_KSIZE is no longer used, it should be removed from limit_ioccc.h.
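One way to honor the "only files count" part, as a sketch keyed off the tar listing's mode field (not txzchk's actual parser):

#include <stdbool.h>

/*
 * A tar -tv line describes a regular file only when the mode field
 * starts with '-'; directories ('d'), symlinks ('l'), devices ('b'/'c'),
 * FIFOs ('p'), sockets ('s'), etc. must NOT count towards MAX_FILE_COUNT.
 */
static bool
is_regular_file_line(char const *line)
{
    return line != NULL && line[0] == '-';
}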
Decided to quickly reply so I have something for tomorrow morning.
Would you please tell me what values should be compared to which macros?
Thank you! I will get to it soon. But first a long sleep. Cheers!
Oh in this case: should it be <= or <? I believe because it’s max it should be the former but without looking at it I want to be sure.
Sleep time for me though I will be awake a while yet but until I lie down that process won’t start.
Sleep well when you do and welcome home! Btw am I correct that NASA had another problem with that rocket ? I hope it’s worked out soon and I am sorry you couldn’t be there.
Years ago when I was a kid they (not NASA but some rocket company) regularly did tests round here. Iirc it was Friday mornings and it was incredibly annoying. They might have even had a nuclear mess (certainly some company did here but I am not sure if it was the same company without looking). Anyway I don’t mean to write this here but I am trying to hurry off to bed.
Well more from me tomorrow. Good night!
Would you please tell me what values should be compared to which macros?
Does comment 1242819402 answer that?
Oh in this case: should it be <= or <? I believe because it’s max it should be the former but without looking at it I want to be sure.
As these are MAX (i.e., limit) values, <= is OK (assuming, of course, they are integers AND they are not negative).
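In code form, the reject test for such a maximum is just a sketch like:
if (sum > MAX_SUM_FILELEN) {
    /* ... object: sum of all file lengths is too large ... */
}
/* i.e., any non-negative sum with sum <= MAX_SUM_FILELEN is accepted */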
Consider the test_txzchk/good/entry.12345678-1234-4321-abcd-1234567890ab-2.1924343546.txt file:
drwxr-xr-x 0 501 20 0 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/
-rw-r--r-- 0 501 20 1854 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/Makefile
-rw-r--r-- 0 501 20 4 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/extra2
-rw-r--r-- 0 501 20 2815 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/foo
-rw-r--r-- 0 501 20 61 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/prog.c
-rw-r--r-- 0 501 20 2859 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/.author.json
-rw-r--r-- 0 501 20 4454 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/remarks.md
-rw-r--r-- 0 501 20 5235 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/bar
-rw-r--r-- 0 501 20 1550 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/.info.json
-rw-r--r-- 0 501 20 4 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/extra1
The file count is 9 (and this is currently <= MAX_FILE_COUNT).
The sum of the file lengths is 18836 (and this is currently <= MAX_SUM_FILELEN).
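(That is, 1854 + 4 + 2815 + 61 + 2859 + 4454 + 5235 + 1550 + 4 = 18836; the directory line contributes 0.)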
Now let's consider some malformed tar listings ... ignore how such a listing might arise, just assume somehow this happens:
drwxr-xr-x 0 501 20 1000 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/
-rw-r--r-- 0 501 20 1854 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/Makefile
-rw-r--r-- 0 501 20 4 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/extra2
-rw-r--r-- 0 501 20 2815 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/foo
-rw-r--r-- 0 501 20 61 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/prog.c
-rw-r--r-- 0 501 20 2859 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/.author.json
-rw-r--r-- 0 501 20 4454 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/remarks.md
-rw-r--r-- 0 501 20 5235 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/bar
-rw-r--r-- 0 501 20 1550 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/.info.json
-rw-r--r-- 0 501 20 4 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/extra1
The sum of the file lengths is still 18836, even though the directory size is 1000. Only the sum of the file lengths matters.
Assume somehow this happens:
drwxr-xr-x 0 501 20 1000 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/
-rw-r--r-- 0 501 20 1854 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/Makefile
-rw-r--r-- 0 501 20 4 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/extra2
-rw-r--r-- 0 501 20 2815 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/foo
-rw-r--r-- 0 501 20 61 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/prog.c
-rw-r--r-- 0 501 20 2859 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/.author.json
-rw-r--r-- 0 501 20 4454 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/remarks.md
-rw-r--r-- 0 501 20 5235 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/bar
-rw-r--r-- 0 501 20 1550 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/.info.json
drw-r--r-- 0 501 20 4 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/extra1/
Now the sum of the file lengths is just 18832, because the lengths of directories do not count towards the sum.
Yes, this entry would be rejected because of the sub-directory too, but that issue is beyond the scope of this comment.
Assume somehow this happens:
drwxr-xr-x 0 501 20 1000 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/
-rw-r--r-- 0 501 20 1854 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/Makefile
-rw-r--r-- 0 501 20 4.0 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/extra2
-rw-r--r-- 0 501 20 2815 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/-foo
-rw-r--r-- 0 501 20 -61 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/prog.c
-rw-r--r-- 0 501 20 2859 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/.author.json
-rw-r--r-- 0 501 20 4454 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/remarks.md
-rw-r--r-- 0 501 20 5235 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/bar
-rw-r--r-- 0 501 20 155a Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/.info.json
drw-r--r-- 0 501 20 fred Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/extra1/
This entry should be rejected because the length of extra2 is 4.0, which is not an integer; because the length of prog.c is negative; because the length of extra1 is not a number; because the length of .info.json is NOT a base 10 integer; and because the -foo filename starts with an invalid character, etc.
Nevertheless, and focusing on the topic of this comment, only these lengths would be summed with respect to the MAX_SUM_FILELEN value:
-rw-r--r-- 0 501 20 1854 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/Makefile
-rw-r--r-- 0 501 20 2815 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/-foo
-rw-r--r-- 0 501 20 2859 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/.author.json
-rw-r--r-- 0 501 20 4454 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/remarks.md
-rw-r--r-- 0 501 20 5235 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/bar
because only those files have a base 10 integer length that is not negative.
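(Those lengths sum to 1854 + 2815 + 2859 + 4454 + 5235 = 17217.)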
And nevertheless, and focusing on the topic of this comment, only these files should be counted with respect to the MAX_FILE_COUNT value:
-rw-r--r-- 0 501 20 1854 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/Makefile
-rw-r--r-- 0 501 20 4.0 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/extra2
-rw-r--r-- 0 501 20 2815 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/-foo
-rw-r--r-- 0 501 20 -61 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/prog.c
-rw-r--r-- 0 501 20 2859 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/.author.json
-rw-r--r-- 0 501 20 4454 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/remarks.md
-rw-r--r-- 0 501 20 5235 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/bar
-rw-r--r-- 0 501 20 155a Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/.info.json
because only those are files.
BTW, consider this fictional listing:
drwxr-xr-x 0 501 20 0 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/
crw-r--r-- 0 501 20 1854 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/Makefile
brw-r--r-- 0 501 20 4 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/extra2
lrw-r--r-- 0 501 20 2815 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/foo
prw-r--r-- 0 501 20 61 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/prog.c
srw-r--r-- 0 501 20 2859 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/.author.json
wrw-r--r-- 0 501 20 4454 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/remarks.md
Srw-r--r-- 0 501 20 5235 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/bar
Lrw-r--r-- 0 501 20 1550 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/.info.json
-rw-r--r-- 0 501 20 4 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/extra1
While the entry should be rejected for a number of reasons, with respect to MAX_SUM_FILELEN the sum is 4 and with respect to MAX_FILE_COUNT the file count is 1.
Yes. txzchk and mkiocccentry should reject such files / reject such a tarball for various reasons. For the purpose of MAX_FILE_COUNT, ONLY files matter. For the sum of file lengths for MAX_SUM_FILELEN, ONLY non-negative base 10 integer lengths of files count towards the sum.
Here is some pseudo-C code for a paranoid file length sum:
code example removed in favor of commit 4857137ad46d004ba9e20b0b41dd8820b2c2dc0c
BTW: Such careful numeric processing comes from years of experience in writing new largest known prime finding computation code where there is NO tolerance for errors. The key is defense in depth with a rational level of code paranoia AND to make it intractable for bogus data to fool the count or file length sum into looking like it is valid.
There are several rather subtle aspects in the code committed below: most are very intentional .. except for any typos or bugs. :-) For example, we attempt to make it much harder for a stack smash to allow an invalid count or sum to pass.
p.s. We retract comment 1242812163 and comment 1242817660 based on this comment. A sum function IS needed.
See commit 4857137ad46d004ba9e20b0b41dd8820b2c2dc0c
Example usage:
/* ... static values private to some .c file (outside of any function) ... */
static intmax_t sum_check;
static intmax_t count_check;
/* ... at start of function that is checking the total file length sum and count ... */
intmax_t sum = 0;
intmax_t count = 0;
intmax_t length = 0;
bool test = false;
/* ... loop the following over ALL files where length_str is the length of the current file ... */
/*
* convert tarball file length string into a value to sum
*/
test = string_to_intmax2(length_str, &length);
if (test == false) {
... object to a bogus file length string ...
}
/*
* carefully sum and count this file's length
*/
if (length < 0) {
... object to a negative file length ...
}
test = sum_and_count(length, &sum, &count, &sum_check, &count_check);
if (test == false) {
... object to internal/computational error ...
}
if (sum < 0) {
... object to negative total file length ...
}
if (sum > MAX_SUM_FILELEN) {
... object to sum of all file lengths being too large ...
}
if (count < 0) {
... object to a negative file count ...
}
if (count == 0) {
... object to a zero file count ...
}
if (count > MAX_FILE_COUNT) {
... object to too many files ...
}
Of course, for code such as mkiocccentry, where you have the file length as an integer, the call to string_to_intmax2() can be skipped.
BTW, consider this fictional listing:
While the entry should be rejected for a number of reasons, with respect to MAX_SUM_FILELEN the sum is 4 and with respect to MAX_FILE_COUNT the file count is 1.
Yes. txzchk and mkiocccentry should reject such files / reject such a tarball for various reasons. For the purpose of MAX_FILE_COUNT, ONLY files matter. For the sum of file lengths for MAX_SUM_FILELEN, ONLY non-negative base 10 integer lengths of files count towards the sum.
The fact that I cannot see why the tools should reject those files is telling me that I should not work on this today - or not now at least. Sorry! I hope I will feel more able to reply tomorrow. I hope you have a good day!
(EDIT: Hours later I see it ... a quick glance prevented it when I was still not very awake and with a quick glance later I saw it immediately.)
As I said elsewhere Tuesday I will be unable to do much of anything but hopefully tomorrow I should be able to do some things (including maybe work on the new issue you opened based on my comment in another thread). Tomorrow I do have a zoom meeting but that's all that's scheduled.
--
I should be able to reply to any replies to my email today though .. depending on what time the replies come in.
Going to do something else. Maybe I'll be able to focus more in a bit. I hope so.
I'll try replying to some of this anyway. Not sure it'll be complete today though.
Consider the test_txzchk/good/entry.12345678-1234-4321-abcd-1234567890ab-2.1924343546.txt file:
The file count is 9 (and this is currently <= MAX_FILE_COUNT). The sum of the file lengths is 18836 (and this is currently <= MAX_SUM_FILELEN).
Right.
Now let's consider some malformed tar listings ... ignore how such a listing might arise, just assume somehow this happens:
Good idea to have these examples and perhaps once it's all resolved they should be in the bad subdirectory!
The sum of the file lengths is still 18836, even though the directory size is 1000. Only the sum of the file lengths matters.
So about that. What is the size of the directory? With the following files:
$ ls -al test
total 8
drwxr-xr-x 3 cody staff 96 Sep 11 12:51 ./
drwxr-xr-x 270 cody staff 8640 Sep 11 12:49 ../
-rw-r--r-- 1 cody staff 10 Sep 11 12:51 test
the number of blocks used by all the files in that directory is 8. But if, for example (under macOS), I tar the directory like so:
$ tar cvf test.tar test
a test
a test/test
and then list the contents:
$ tar fvt test.tar
drwxr-xr-x 0 cody staff 0 Sep 11 12:51 test/
-rw-r--r-- 0 cody staff 10 Sep 11 12:51 test/test
I see the directory size of test is 0. So what does that mean? How can it be 0 when there are blocks being used? I know with ls one can change BLOCKSIZE via one or more options and the environment variable itself. But still, why should that be 0?
Assume somehow this happens:
Now the sum of the file lengths is just 18832 because the lengths of directories do not count towards the sum.
In other words because the sum of all the files found in the tar listing results in that value, right? (I haven't tried it - it's hard to focus right now but trying to get some discussion going).
Yes, this entry would be rejected because of the sub-directory too, but that issue is beyond the scope of this comment.
Assume somehow this happens:
This entry should be rejected because the length of extra2 is 4.0, which is not an integer; because the length of prog.c is negative; because the length of extra1 is not a number; because the length of .info.json is NOT a base 10 integer; and because the -foo filename starts with an invalid character, etc.
Plus 'fred' being there. But I actually wonder how this output would go with the tool now. I don't know. I know the negative sizes will change the total size but it seems like this might be in need of change, based on some of the comments (not sure if it's this one or another or more than one).
Nevertheless, and focusing on the topic of this comment, only these lengths would be summed with respect to the MAX_SUM_FILELEN value:
because only those files have a base 10 integer length that is not negative.
So what should be done with the invalid lines? Certainly they should count against the entry, but in what way? I see you wrote a function (that I've not had time to look at) so maybe this will do what I need, but having clarity here would also be good please.
And nevertheless, and focusing on the topic of this comment, only these files should be counted with respect to the MAX_FILE_COUNT value:
because only those are files.
Regular files. Yes. I had thought of that earlier - because I thought you said in some comment that directories would count too. I found that strange but maybe I misread it or the thought/idea was changed.
BTW, consider this fictional listing:
While the entry should be rejected for a number of reasons, with respect to MAX_SUM_FILELEN the sum is 4 and with respect to MAX_FILE_COUNT the file count is 1.

Yes. txzchk and mkiocccentry should reject such files / reject such a tarball for various reasons. For the purpose of MAX_FILE_COUNT, ONLY files matter. For the sum of file lengths for MAX_SUM_FILELEN, ONLY non-negative base 10 integer lengths of files count towards the sum.
What am I missing here? What should be rejected? At a quick glance, anyway, they seem to be the usual. It will probably be obvious later on when I look again tomorrow, or after you point out something that's right in front of me.
(EDIT: No need to answer this .. looking at the actual comment again, it was immediately visible what is wrong with these.)
UPDATE 0a:
Here is some pseudo-C code for a paranoid file length sum:
code example removed in favor of commit 4857137
UPDATE 1a:
BTW: Such careful numeric processing comes from years of experience in writing new largest known prime finding computation code where there is NO tolerance for errors. The key is defense in depth with a rational level of code paranoia AND to make it intractable for bogus data to fool the count or file length sum into looking like it is valid.
I imagine so! (And I hope you picked up on the pun :-) )
There are several rather subtle aspects in the code committed below: most are very intentional .. except for any typos or bugs. :-) For example, we attempt to make it much harder for a stack smash to allow an invalid count or sum to pass.
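To make the stack smash point concrete, here is a hedged sketch of that double bookkeeping idea using the same interface as the usage example below; the real sum_and_count() is the one committed in 4857137 and is more careful than this:

#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative sketch only: keep private copies of the running sum and
 * count (sum_check, count_check). If a stack smash or stray write
 * corrupts the caller's sum or count, the copies will disagree and we
 * fail closed.
 */
static bool
sum_and_count_sketch(intmax_t length, intmax_t *sum, intmax_t *count,
                     intmax_t *sum_check, intmax_t *count_check)
{
    intmax_t prev_sum = 0;

    if (sum == NULL || count == NULL || sum_check == NULL || count_check == NULL) {
        return false;   /* NULL pointer is an internal error */
    }
    if (length < 0) {
        return false;   /* negative lengths never sum */
    }
    if (*sum != *sum_check || *count != *count_check) {
        return false;   /* caller state and private copies disagree */
    }
    prev_sum = *sum;
    if (length > INTMAX_MAX - prev_sum || *count >= INTMAX_MAX) {
        return false;   /* adding would overflow the sum or the count */
    }
    *sum = prev_sum + length;
    *count = *count + 1;
    if (*sum < prev_sum || *count <= 0) {
        return false;   /* sum shrank or count wrapped: fail closed */
    }
    *sum_check = *sum;  /* keep the private copies in sync */
    *count_check = *count;
    return true;
}

The shape is the point: redundant state that must agree before and after every update, so corrupted values cannot quietly pass.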
Just to be clear. Since you seem to have rolled back some ideas. The code you refer to in the part of the comment this is replying to - that still is in the repo, right?
p.s. We retract comment 1242812163 and comment 1242817660 based on this comment. A sum function IS needed.
UPDATE 2a:
See commit 4857137
Example usage:
/* ... static values private to some .c file (outside of any function) ... */

static intmax_t sum_check;
static intmax_t count_check;

/* ... at start of function that is checking the total file length sum and count ... */

intmax_t sum = 0;
intmax_t count = 0;
intmax_t length = 0;
bool test = false;

/* ... loop the following over ALL files where length_str is the length of the current file ... */

/*
 * convert tarball file length string into a value to sum
 */
test = string_to_intmax2(length_str, &length);
if (test == false) {
    ... object to a bogus file length string ...
}

/*
 * carefully sum and count this file's length
 */
if (length < 0) {
    ... object to a negative file length ...
}
test = sum_and_count(length, &sum, &count, &sum_check, &count_check);
if (test == false) {
    ... object to internal/computational error ...
}
if (sum < 0) {
    ... object to negative total file length ...
}
if (sum > MAX_SUM_FILELEN) {
    ... object to sum of all file lengths being too large ...
}
if (count < 0) {
    ... object to a negative file count ...
}
if (count == 0) {
    ... object to a zero file count ...
}
if (count > MAX_FILE_COUNT) {
    ... object to too many files ...
}
Of course, for code such as mkiocccentry, where you have the file length as an integer, the call to string_to_intmax2() can be skipped.
Of course (on not using that function).
But just to be clear: these checks should be added after all lines have been parsed, right? If so that should not be a problem as it'll all be stored in struct txz_line entries on the linked list txz_lines. I could just iterate through them all and flag any issues via the struct txz_info.
In order for me to really get into this though I'll have to be in a better state. I'm afraid that's probably all I can do today.
That being said I will ask you finally: since this has to be done for mkiocccentry - where would it be done? I mean some of it - like counting the files - is done indirectly via txzchk. So maybe worded better: what parts need to be added to mkiocccentry too, and where?
BTW, consider this fictional listing:
drwxr-xr-x 0 501 20 0 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/
crw-r--r-- 0 501 20 1854 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/Makefile
brw-r--r-- 0 501 20 4 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/extra2
lrw-r--r-- 0 501 20 2815 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/foo
prw-r--r-- 0 501 20 61 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/prog.c
srw-r--r-- 0 501 20 2859 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/.author.json
wrw-r--r-- 0 501 20 4454 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/remarks.md
Srw-r--r-- 0 501 20 5235 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/bar
Lrw-r--r-- 0 501 20 1550 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/.info.json
-rw-r--r-- 0 501 20 4 Jun 4 04:52 12345678-1234-4321-abcd-1234567890ab-2/extra1
Oh! I see now. Looking at the actual comment makes it easier. It's non-regular files. These are already checked so should be fine.
Actually, the way it is done even allows for the crazy chance that a new type is created, as it checks for just valid chars via strspn(). So it's safe from some change to POSIX or some bogus implementation of tar / whatever.
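For example, a hedged sketch of that strspn() idea (the exact mode field layout, character set, and helper name here are my assumptions, not necessarily what txzchk uses):

#include <stdbool.h>
#include <string.h>

/*
 * Illustrative sketch only: a mode field is plausible only if every
 * character is drawn from the known tar type letters and permission
 * characters. A mode containing anything outside that set (say, a
 * hypothetical new type letter) fails the strspn() test instead of
 * being mis-parsed.
 */
static bool
plausible_mode_field(char const *mode)
{
    char const *ok = "-bcdlpsrwxStT";   /* assumed set of valid chars */

    return mode != NULL && mode[0] != '\0' &&
           strspn(mode, ok) == strlen(mode);
}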
So about that. What is the size of the directory? With the following files:
From the Rule 2 perspective, we don't care. A proper IOCCC entry will just have a single directory as sub-directories in the source are NOT allowed. This isn't to say that there can never be a sub-directory. An IOCCC entry is free to create sub-directories via their Makefile or via their program, etc. But from the perspective of the XZ compressed tarball, there is ONLY one directory. As Rule 2 will focus on files in that one directory, the space that the one directory occupies is NOT considered in the size.
An advantage of ignoring the size of the one directory is that we avoid filesystem specific directory issues.
So ignore the one directory, both in terms of count and the sum of the file lengths, because it is a directory.
To prevent someone from grossly abusing the one directory and filling it up with zero length files, we limit the number of files to MAX_FILE_COUNT. So again, the one directory can be ignored.
So what should be done with the invalid lines? Certainly they should count against the entry, but in what way? I see you wrote a function (that I've not had time to look at) so maybe this will do what I need, but having clarity here would also be good, please.
There is more than one reason to reject an entry. :-)
From the sum_and_count() function perspective, if the tar listing line is a file, sum and count it; otherwise ignore it from that function's perspective.
Of course a bogus filename, 2nd directory, a directory that is NOT a top level directory, duplicate filenames, something that is NOT a file nor the top level directory, malformed tar listing lines, lines that have a username / group name instead of a UID / GID, etc. All of these are reasons for txzchk to reject the entry. Just perhaps NOT for Rule 2 reasons. :-)
A similar rule applies to mkiocccentry, but here the tool is forming a directory and dealing with files to copy into that directory. True, mkiocccentry will run txzchk on the XZ compressed tarball that it formed (as a sanity check), but similar checking from Rule 2 and similar checking for that other stuff applies.
FYI: Rule 2 used to focus only on the size of prog.c and iocccsize stuff. The txzchk tool doesn't concern itself with the size of prog.c, especially as the user may have requested a rule_2a_override or rule_2b_override. However there won't be a Rule 2c nor Rule 2d, nor etc. override, so they will not be able to submit an XZ compressed tarball larger than MAX_TARBALL_LEN, nor a sum of file lengths larger than MAX_SUM_FILELEN, nor more than MAX_FILE_COUNT files.
Plus 'fred' being there. But I actually wonder how this output would go with the tool now. I don't know.
Some of these listing ideas should be put under test_txzchk/bad/ for testing purposes.
But just to be clear: these checks should be added after all lines have been parsed, right? If so that should not be a problem as it'll all be stored in struct txz_line entries on the linked list txz_lines. I could just iterate through them all and flag any issues via the struct txz_info.
It is up to you as far as how you want txzchk to handle it. Just as long as all tarball file listing field strings are processed by string_to_intmax2() to attempt to get a file length (or fail because it is something like fred or 3.0 or 123a or 0123), AND checked that the file length is not negative, AND, for those that pass, passed to sum_and_count() for summing and counting. How you do that is up to you so long as all files in the tar listing are processed.
In order for me to really get into this though I'll have to be in a better state. I'm afraid that's probably all I can do today.
Best wishes on your state change for the better!
That being said I will ask you finally: since this has to be done for mkiocccentry - where would it be done? I mean some of it - like counting the files - is done indirectly via txzchk. So maybe worded better: what parts need to be added to mkiocccentry too, and where?
Well in the case of mkiocccentry you are NOT dealing with a tar listing, but rather the file length from a stat(2) call. So there isn't a need to call string_to_intmax2(). Just pass the st_size value directly to sum_and_count() when the item is a file (i.e., when (st_mode & S_IFMT) == S_IFREG is true).
Yes, mkiocccentry should sum and count and check the result. If the sum or count is exceeded, issue an error and decline to form a compressed tarball, just as if it was given a bogus filename, or a file that does not exist, or a directory, or some special non-file/non-directory, etc.
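A hedged sketch of that stat(2) path (the helper name check_entry_file(), its error handling, and the sum_and_count() prototype are illustrative assumptions, not mkiocccentry's actual code):

#include <stdbool.h>
#include <stdint.h>
#include <sys/stat.h>

/* assumed prototype for the sum_and_count() from commit 4857137 */
extern bool sum_and_count(intmax_t length, intmax_t *sum, intmax_t *count,
                          intmax_t *sum_check, intmax_t *count_check);

static intmax_t sum_check;      /* private copies, as in the usage above */
static intmax_t count_check;

/*
 * Illustrative sketch only: stat(2) each path being copied into the entry
 * directory; decline anything that is not a regular file, and pass st_size
 * (already an integer, so no string_to_intmax2() call) to sum_and_count().
 */
static bool
check_entry_file(char const *path, intmax_t *sum, intmax_t *count)
{
    struct stat buf;

    if (stat(path, &buf) != 0) {
        return false;           /* object: cannot stat the file */
    }
    if ((buf.st_mode & S_IFMT) != S_IFREG) {
        return false;           /* object: not a regular file */
    }
    return sum_and_count((intmax_t)buf.st_size, sum, count,
                         &sum_check, &count_check);
}

After each call, the same limit checks from the earlier usage example apply: object if sum > MAX_SUM_FILELEN or count > MAX_FILE_COUNT, etc.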
Just to be clear. Since you seem to have rolled back some ideas. The code you refer to in the part of the comment this is replying to - that still is in the repo, right?
Correct.
So about that. What is the size of the directory? With the following files:
From the Rule 2 perspective, we don't care. A proper IOCCC entry will just have a single directory as sub-directories in the source are NOT allowed. This isn't to say that there can never be a sub-directory. An IOCCC entry is free to create sub-directories via their Makefile or via their program, etc. But from the perspective of the XZ compressed tarball, there is ONLY one directory. As Rule 2 will focus on files in that one directory, the space that the one directory occupies is NOT considered in the size.
Right. I didn't think it counted against files - would not seem fair even. But this makes me wonder if I actually do include the directory size as part of the sum. I'll have to check that one.
What if it's a subdirectory? Of course it'll be rejected, but what should the action be as far as size goes? Also, I was more generally asking what the directory size is supposed to mean. I guess it depends on the block size too, but I'm not sure what it's supposed to mean, as clearly the example I gave took more than 0 bytes and yet the directory size was reported as 0.
An advantage of ignoring the size of the one directory is that we avoid filesystem specific directory issues.
Of course.
So ignore the one directory, both in terms of count and the sum of the file lengths, because it is a directory.
Well that goes back to the other thought I had - if we ignore one directory, what about the others? And how do we decide which one to ignore? I would think the correct directory for the entry based on fnamchk, though of course if that tool fails we don't have that information available.
To prevent someone from grossly abusing the one directory and filling it up with zero length files, we limit the number of files to MAX_FILE_COUNT. So again, the one directory can be ignored.
Right. But what about other directories? It'll be rejected as an invalid entry, but what should be done with it as far as reporting goes? I like to be complete, as you know!
So what should be done with the invalid lines? Certainly they should count against the entry, but in what way? I see you wrote a function (that I've not had time to look at) so maybe this will do what I need, but having clarity here would also be good, please.
There is more than one reason to reject an entry. :-)
That's true.
From the sum_and_count() function perspective, if the tar listing line is a file, sum and count it; otherwise ignore it from that function's perspective.
As above, what if it's not the expected directory? Do I sum those up? What about files inside those subdirectories?
Of course a bogus filename, 2nd directory, a directory that is NOT a top level directory, duplicate filenames, something that is NOT a file nor the top level directory, malformed tar listing lines, lines that have a username / group name instead of a UID / GID, etc. All of these are reasons for txzchk to reject the entry. Just perhaps NOT for Rule 2 reasons. :-)
Which of course it does already for all of these, though as far as malformed lines go, I did put in the BUGS section of the man page a request to please report it if someone comes across a format we're unfamiliar with. Obviously if they're doing this to abuse it, it'll be rejected as a request. But then this brings up an interesting idea I just had:
What if someone, prior to the contest opening (or during it), suggests that they found a new format that's not actually real, and they mean to abuse the adding of it somehow? Is this a concern from the IOCCC judging perspective?
A similar rule applies to mkiocccentry, but here the tool is forming a directory and dealing with files to copy into that directory. True, mkiocccentry will run txzchk on the XZ compressed tarball that it formed (as a sanity check), but similar checking from Rule 2 and similar checking for that other stuff applies.
You mean that mkiocccentry also does these tests? Of course, as far as duplicate files go, I would think that it would not be possible the way mkiocccentry does it, since it would overwrite the files each time. At least I would think so - I haven't looked at how it's done in a long while.
FYI: Rule 2 used to focus only on the size of prog.c and iocccsize stuff. The txzchk tool doesn't concern itself with the size of prog.c, especially as the user may have requested a rule_2a_override or rule_2b_override. However there won't be a Rule 2c nor Rule 2d, nor etc. override, so they will not be able to submit an XZ compressed tarball larger than MAX_TARBALL_LEN, nor a sum of file lengths larger than MAX_SUM_FILELEN, nor more than MAX_FILE_COUNT files.
Yes. But by 'used to focus on' .. do you mean that it will focus on other things now too? If so that would require a change in a number of tools. From an earlier comment you made, I actually wondered if this is happening. Or did I misunderstand?
Plus 'fred' being there. But I actually wonder how this output would go with the tool now. I don't know.
Some of these listing ideas should be put under test_txzchk/bad/ for testing purposes.
I had the same idea - but only after anything needing fixing is fixed - else it would break make test.
But just to be clear: these checks should be added after all lines have been parsed, right? If so that should not be a problem as it'll all be stored in struct txz_line entries on the linked list txz_lines. I could just iterate through them all and flag any issues via the struct txz_info.

It is up to you as far as how you want txzchk to handle it. Just as long as all tarball file listing field strings are processed by string_to_intmax2() to attempt to get a file length (or fail because it is something like fred or 3.0 or 123a or 0123), AND checked that the file length is not negative, AND, for those that pass, passed to sum_and_count() for summing and counting. How you do that is up to you so long as all files in the tar listing are processed.
Of course. It was kind of a question too, though. Previously (I think) each time a new size was parsed it was added to the total. However, now it should use this new set of code (and perhaps some of it - the tests - should be put in a separate function .. the question is: does it belong in txzchk, or could it be used in other tools and so better fit in util.c?), so I wonder if maybe it has to wait until all lines are parsed. I'm not sure how the function works yet.
Then again I'll have to look at how I have it working now and I am unfortunately too tired to really do anything with it today :(
Okay, looking at that function briefly, I see it takes a pointer to a previous size, so I can just use it each time I encounter a new file. That means I don't have to wait until after all lines are parsed. Much of it can stay the same, but the difference is that I now have to use the new functions and tests in the code rather than the tests I have. Then depending on the result I can flag issues. Some of the issues might need to be detected differently though - not sure yet.
In order for me to really get into this though I'll have to be in a better state. I'm afraid that's probably all I can do today.
Best wishes on your state change for the better!
Thank you! Unfortunately I woke up way too early again so I'm not sure what I can do today but at least the discussion should be able to continue.
That being said I will ask you finally: since this has to be done for mkiocccentry - where would it be done? I mean some of it - like counting the files - is done indirectly via txzchk. So maybe worded better: what parts need to be added to mkiocccentry too, and where?
Well in the case of mkiocccentry you are NOT dealing with a tar listing, but rather the file length from a stat(2) call. So there isn't a need to call string_to_intmax2(). Just pass the st_size value directly to sum_and_count() when the item is a file (i.e., when (st_mode & S_IFMT) == S_IFREG is true).
Of course. But that means for each file processed it should do roughly the same thing as txzchk will do, only not from a string but from an actual integer. This suggests to me (if I follow you right) that this should be a new set of functions in util.c. Want me to write these and then integrate them into both tools? I can do that, though I don't know if that'll be today or else Wednesday at the earliest (as most likely tomorrow is completely shot).
Yes, mkiocccentry should sum and count and check the result. If the sum or count is exceeded, issue an error and decline to form a compressed tarball, just as if it was given a bogus filename, or a file that does not exist, or a directory, or some special non-file/non-directory, etc.
Right.
Just to be clear. Since you seem to have rolled back some ideas. The code you refer to in the part of the comment this is replying to - that still is in the repo, right?
Correct.
Thanks. Though of course I no longer know what the code is! But I don't think I need to, as the function names are above.
FYI: I'm waiting on hearing back on the above before I make the fixes. I'm not sure if this is necessary or not but I'd rather only have to do it once.
So about that. What is the size of the directory? With the following files:

From the Rule 2 perspective, we don't care. A proper IOCCC entry will just have a single directory as sub-directories in the source are NOT allowed. This isn't to say that there can never be a sub-directory. An IOCCC entry is free to create sub-directories via their Makefile or via their program, etc. But from the perspective of the XZ compressed tarball, there is ONLY one directory. As Rule 2 will focus on files in that one directory, the space that the one directory occupies is NOT considered in the size.

Right. I didn't think it counted against files - would not seem fair even. But this makes me wonder if I actually do include the directory size as part of the sum. I'll have to check that one.

What if it's a subdirectory? Of course it'll be rejected, but what should the action be as far as size goes? Also, I was more generally asking what the directory size is supposed to mean. I guess it depends on the block size too, but I'm not sure what it's supposed to mean, as clearly the example I gave took more than 0 bytes and yet the directory size was reported as 0.
An advantage of ignoring the size of the one directory is that we avoid filesystem specific directory issues.
Of course.
So ignore the one directory, both in terms of count and the sum of the file lengths, because it is a directory.
Well that goes back to the other thought I had - if we ignore one directory, what about the others? And how do we decide which one to ignore? I would think the correct directory for the entry based on fnamchk, though of course if that tool fails we don't have that information available.

To prevent someone from grossly abusing the one directory and filling it up with zero length files, we limit the number of files to MAX_FILE_COUNT. So again, the one directory can be ignored.

Right. But what about other directories? It'll be rejected as an invalid entry, but what should be done with it as far as reporting goes? I like to be complete, as you know!
Sub-directories, not being files, do not count nor do they sum.
Files, even those in sub-directories, are files and therefore they count and sum.
Files count and sum. Non-files don't.
I'm opening this issue so that any issues that come up will not be lost in the other thread, which is more OT stuff than anything else (or equally OT). I will copy-paste any issues I'm aware of that I can easily access (which might depend on whether Mail wants to load them properly, as unfortunately GitHub makes it hard to find messages in long threads: at different intervals comments don't load, so you have to scroll through long pages to find the right links .. I know what to search for, I just don't know if I can get to them easily). Anyway, that's what this issue is for.