Rewrite common.c: isdec*()

solardiz commented 6 years ago

On invalid input, the isdec*() functions in common.c may cause signed integer overflow, then try to post-check for it, but the overflow itself is already UB. Those functions should be rewritten to manually work on the strings of digits - just loop over them making sure the characters are in fact digits and are not too numerous, and optionally allow a leading minus sign where needed. If we introduce a limit of at most 9 digits in there, then it would become safe for the callers to call atoi() next (like many currently do).

jfoug commented 6 years ago

I will start on this soon. Should not be that difficult. the hacks have served their purpose, and writting correct fully functional logic should not be hard to do.

seekaddo commented 6 years ago

Hi everyone, I am a student at UAS in Vienna, Austria. I think I can take this as my first good issue. I have already look at the function `` in the common.c. I think I can do this. Will it be ok to work on this task? Thanks

kholia commented 6 years ago

@seekaddo If @jfoug hasn't started working on this issue, you are very welcome to tackle it.

seekaddo commented 6 years ago

@kholia Thanks for the quick. Then I will wait on @jfoug.

jfoug commented 6 years ago

This would be a fine issue for you to start on. i was only undertaking it, because I was the one that originally produced the hackish way we are doing it today ;)

I have not started on it at all.

seekaddo commented 6 years ago

@jfoug Thank you. I will start working on it.

seekaddo commented 6 years ago

@jfoug @kholia This is what I have so far and I would like to know if I am on the right path. To my understanding isdec*() and friends should have a simple assert in it to make sure that the string passed is really an integer (with a limit of 9 digits to prevent Integer overflow) and also allow - where needed. I have programmed a simple digit_assert() function that does this. I check for the following sample of string inputs.

7964 --> should pass
-364 --> should pass is integer with size <= 9 - optional
19302989457995 ->> should fail, Integer overflow
9g64 -->> should fail, not an Integer. normally atoi will take only 9, but we fail here.
893029123 --> should pass, is integer with size <= 9.

The digit_assert() function return 0 on success and 1 on fail. And I will use this logic and function for all other methods isdec_negok,isdecu

One last question, should it only passed when the String is a digit (with '-' optional)? Thanks in advance for your feedback.

kholia commented 6 years ago

@seekaddo It sounds good so far. You can paste the code for your digit_assert function right here. That would facilitate a quick code review.

jfoug commented 6 years ago

2) -364

should fail isdec() and should pass isdec_negok() and should also fail isdecu()

3) 19302989457995

2147483647 should succeed 2147483648 should fail

We 'could' revert to what @solardiz was stating and fail anything over 9 decimal digits (10 with a leading negative in isdec_negok), but why not simply compare against 0x7FFFFFFF ok, and 0x80000000 fails

So any pure decimal number from 0 to 2147483647 should pass (all functions). Any number from 0 to 4294967295 should pass isdecu and any number from 0 to 2147483647 with or without a negative sign should pass isdec_negok()

Then you only have to deal with the leading '-' sign (skipping it) in isdec_negok, and then make sure that everything is a decimal digit (use the ctype.h isdigit() macros). Then you ONLY have to check the digits if the number is 10 characters long. If it is 11 character or more, simply fail. For 10 characters (in the billions range), you have to check that the number is equal or under the 31 (32) bit threshold.

That is really how the function should have been written up front. Yes, it does limit us to a 32 bit 'int' type compiler, but I see NO point at all, trying to deal with 32 bit ints and also 64 bit ints on the few machines which have int value > 32 bits. How this code is used (in format valid() function), if there is something over 31 bit long number, then we need to look closely to see if we are using it wrong.

jfoug commented 6 years ago

A first run, simply limiting max to 999999999 is just fine. It does not cover the whole int range, but again, anything we have that goes past this, likely should be treated with 64 bit data types. I am not sure if there is ANYTHING we do that would do this, other than possibly file offsets into a file to seek and read data. But usually we are hammering out the real data now into hashes, and not causing the loader code to read files.

jfoug commented 6 years ago

I would caution on using the word 'assert' in your function. By convention, an assert simply performs a runtime check on assumptions you make in a function, and when data is presented that does NOT fall into your assumptions, the program exits with an error. What you have is not an assertion, you really do not know that the numbers are decimal. you are checking IF they are, but you are assuming nothing.

I really think simply using the standard 'isdigit()' function (found in the ctype.h header), is the right tool for the job, and the compiler provides this tool in an optimized manner, vs a roll your own function. That function checks just 1 character at a time, so you will have to write all the logic to wrap it, calling it against each character. If any of them fail (exception is the leading '-' character), you know it is not a number.

seekaddo commented 6 years ago

@jfoug Thanks for the feedback. I put it in my code. I am using isdigit. I will then compare against 0x7FFFFFFF ok, and 0x80000000 fails. I used one function for all the methods, just to prevent code duplications. Thanks.

solardiz commented 6 years ago

I appreciate Jim's help here, but let me override a few things: let's limit max to 9 digits as I had proposed, and let's not use isdigit(), because we do not want this to be locale-sensitive and because proper use of ctype macros isn't that trivial (they accept int, not char, so it takes two casts, via (int)(unsigned char), to use them properly on a char).

If at a later time we want to allow larger numbers, then we could switch to (also surprisingly non-trivial - I can provide an example) proper use of strtoul() and thus 64-bit numbers via a different wrapper that would replace both isdec*() and atoi() at once (so the callers would need to be modified too).

seekaddo commented 6 years ago

@solardiz I used isdigit with local set to NULL. If not preferred, will do it with a normal if logic syntax. Thanks for the feedback.

seekaddo commented 6 years ago

After going through all the feedback, this is what I have for now. common.c

I have tested this with a series of numbers and they all pass for instance of isdec_negok, isdecu and isdec I am still rewriting and testing to see if I missed anything. I will like to know if I am on the right path, I will appreciate all kinds of feedbacks.

One last thing, I used sscanf (should i know strtoul than) because if it overflow It will still fail "INT_MAX and UINT_MAX" @jfoug @kholia

solardiz commented 6 years ago

Thanks. The uses of sscanf() are unacceptable. The old code with atoi() and sprintf() should be gone, too. The post-checks for > INT_MAX (etc.) won't help, because (1) an int variable can't satisfy this condition, (2) long int might not be larger than an int on a given platform, (3) an overflow might have already occurred inside sscanf(), in which case for unsigned types it'd be wraparound (so the post-checks will fail for that reason as well) and for signed types it'd be UB (so the compiler would have been free e.g. to emit code that wipes the operating system by this point).

We need a trivial loop per each of these functions, without any libc calls at all.

solardiz commented 6 years ago

In practice, I happened to have recently checked that at least for unsigned int at least glibc's sscanf() gives an error rather than triggers wraparound when the integer value is beyond range. But I don't know if this behavior is guaranteed per standards, and checking for errors from sscanf() is really tricky (your code doesn't even try).

solardiz commented 6 years ago

BTW, do we want to retain any difference between isdec() and isdecu()? If we limit them to 9 digits max, then perhaps they can become just one function, and we simply #define one name to the other in the header file. Jim?

Besides simplicity, one reason why I like the 9 digits limit for now is that it accommodates a format multiplying the decoded number of bytes by 2 to get the number of hex digits - I suspect we have cases like this. And to properly support multi-gigabyte sizes in those formats we need to review/revise their code and likely use 64-bit types anyway.

seekaddo commented 6 years ago

@solardiz thanks for the feedback. I wanted sure if long int is guaranteed to be larger enough. With the overflow, I was thinking since we are checking for an int overflow, any overflow in sscanf won't change anything.

Should I change the atoi to something more better like strtoul or I should stick to a 32bit 9 digit range and forget about any libc library.(just with simple loops.) Because if we change this to strtoul then all the function prototypes should always change.

jfoug commented 6 years ago

BTW, do we want to retain any difference between isdec() and isdecu()? I

Nope. The only reason for isdecu was to match/mirror persons using an unsigned atoi() ( built into jumbo). NOTE that function itself may overflow if given wrong data, but if we check, there is no problem. checking for [0 .. 999999999] should be more than adequate for isdec and then isdecu can simply call isdec through a macro, vs changing code in the source files.

solardiz commented 6 years ago

@seekaddo Thank you for explaining your rationale, but as you probably realize it was wrong. long int can be same size as int (it is in fact same size on many systems), and overflow in sscanf() could change things (also you didn't check for errors from sscanf()).

As I said, you should not call any libc functions from these functions at all. Not even strlen() - you could, but it's not helpful here as you can trivially check for the length exceeding 9 while you loop over the characters.

Also, even if you drop the libc function calls, the rest of your code is more complex than it should be (some unneeded flag variable). Please simplify it.

jfoug commented 6 years ago

@seekaddo, I think you should be checking each character of the STRING. No need to worrry about any atoi/strtoul, or any other library call. Simply make sure things are <= 9 CHARACTERS long (with the optional '-' character for the *_negok), and then make sure each of this 9 or less character are a digit. Easy peasy. So for the _negok, you can have a 1 character string (if the - is there, but simply ignore the - from then on) Then you simply make sure it is 9 or less characters, and each are a decimal digit.

seekaddo commented 6 years ago

@jfoug @solardiz Thanks for the feedback. sorry for not doing the error checking for sscanf, I wanted to know that I am on the right path and then clean up everything. Thanks for the support. It is really helping. @jfoug will be implementing that. Simply make sure things are <= 9 decimal digits with the optional '-' character for the *_negok, and then make sure each of this 9 or less character are a digit

jfoug commented 6 years ago

@solardiz once we make this change, we MUST put a comment out there that the function does not test for defined behavior of atoi, but simply makes sure the number is in range of [0 ... 999999999]

The prior function would work for any valid number that would work in atoi, it just was not 100% assured of failing for other numbers (aka undefined behavior).

magnumripper commented 6 years ago

What, should we document stuff now?? 😆

JK

jfoug commented 6 years ago

@seekaddo, that is a clean solution, and does not make things overly complex, BUT works flawlessly within that range.

jfoug commented 6 years ago

As long as you place the documentation (partial only), in 4 different places, each requiring 2 of the other 3 to complete the though. Oh, and be sure to sprinkle in some speeling errors (or is that just me ;)

magnumripper commented 6 years ago

It's best to play safe and document it differently in many places. Then we can point to whatever doc happening to suit our need in any dispute.

Seriously though. Let's document this in whatever header.h defining it, and nowhere else (not even in the corresponding .c file).

seekaddo commented 6 years ago

Good morning everyone.

I updated my implementation. digit limits <= 9,Integer in range of [0 ... 999999999] common.c Once again, I will kindly appreciate all feedback. Thanks guys. @jfoug I hope this is what you've been trying to explain to me.

magnumripper commented 6 years ago

I just had my morning coffee, but:

int isdec(const char *q)
{
    char buf[24];

    if (is32digit(q) != 1 || q[0] == '-') { return 0; }

    int x = atoi(q);
    sprintf(buf, "%d", x);
    return !strcmp(q, buf) && *q != '-';
}

What does the atoi/strcmp add? Wouldn't this do just the same?

int isdec(const char *q)
{
    return is32digit(q) && q[0] != '-';
}

magnumripper commented 6 years ago

int is32digit(const char *q)
{
    size_t counter = 0;
    for (size_t i = 0; q[i] != '\0'; ++i)
    (...)
}

In this tree you're not allowed to declare i inside the for statement. Also, you don't really need both counter and i, just i would suffice.

BTW I don't understand the name is32digit. What's with 32?

seekaddo commented 6 years ago

@magnumripper Thanks for the feedback. I wanted to know if the requirement for the integer checking and the boundaries were ok. Thanks, man. is32digit I used it to mean i am checking for a 32 bit integer limits. I could have the same code inside all of the three functions, it just that I don't want any code duplications. Will it be ok if I change name, or the duplicate will be ok?

seekaddo commented 6 years ago

something like this and make changes for all the others.

int isdec(const char *q)
{
    if (q[0] == '-'){
        return 0;
    }

    size_t i;
    for ( i = 0; q[i] != '\0'; ++i)
    {
        if ((q[i] >= '0' && q[i] <= '9'))
        {
            if(i > 9){
                return 0;
            }
        } else
        {
            return 0;
        }
    }

    return i <= 9;
}

seekaddo commented 6 years ago

Just like this without any is32digit common.c

seekaddo commented 6 years ago

Can we call isdecu inside isdec to simplify the code? something like this

int isdec(const char *q)
{
     return isdecu(q);
}

solardiz commented 6 years ago

@seekaddo Your functions as currently in your tree aren't as bad as they were before, but they're still overly complicated, long, sloppy, and slightly buggy (e.g., it looks like they'll think the empty string is a valid number, and isdec_negok() will allow a minus sign anywhere in the string). While I appreciate this opportunity for you to practice C programming and contributing to a project, and we'd like to gain a new contributor who would then proceed with less trivial issues, I think we got to a point where one of us should just write at least one of those functions to show how it should be.

Regarding aliasing isdec() and isdecu(), yes, both Jim and I actually suggested that in previous comments, although IIRC we were thinking a #define in common.h.

Jim, magnum - BTW, common.c still says it's copyrighted only by me, which in jumbo is not true. Please correct that.

jfoug commented 6 years ago

@solardiz is there some way I can make a blanket copyright statement, claiming that any work I have done anywhere within the project is covered, and simply be done with it, seriously......

seekaddo commented 6 years ago

@solardiz Thanks for the feedback. I don't want to give up yet. Sorry for all the trouble. Can I give a try and take my time to test it against all odds.

solardiz commented 6 years ago

@jfoug I think yes, we may/should invent a file where we'd make such statements, but that would be mostly for the case of inadvertently missed statements in the files themselves. I think it is nevertheless still useful to know who wrote what file. It's not only about copyright, but also about attribution, and about avoiding misattribution (I am moderately unhappy those current isdec*() are misattributed to me). And regarding copyright, it's sometimes helpful to have a shorter list of copyright holders for individual files than for JtR as a whole, as that may enable easier reuse of those files elsewhere (and without unnecessary/wrong attribution to a JtR co-contributor who did not contribute anything to the reused files).

@seekaddo Please feel free to continue. And you wouldn't exactly be giving up by letting us show you how we would have liked one of those functions written. There would be a "next level" in this game for you either way.

jfoug commented 6 years ago

@solardiz I do not mind placing a copyright in an original work (such as what was done with dynamic, or phpass, or later with the pbkdf2hmac*.h files), or when I do pretty much a re-write, like I did with mscash2, where I discovered the insight on 1/2 reduction of pbkdf2 and put that into header form for usage everywhere. But most of the time, simple bug fixes, code enhancements, I really do not care, and would much rather have a blanket statement somewhere.

solardiz commented 6 years ago

@jfoug I understand. I would also be unhappy about needing to add a statement each time I make a minor edit to code written by someone else. But we need to understand that misattribution of such edits to the original author(s) is not ideal. For really trivial edits, we could merely acknowledge them in comments rather than claim copyright in the file, and to satisfy those concerned that the edits might be copyrightable anyway we'd have that blanket statement in a shared text file.

jfoug commented 6 years ago

@solardiz I believe I have touch probably 50 to 75% of the files in the entire jumbo project. But I really do not think there is a need for spraying a copyright in all of them. I will if you absolutely think it is a requirement, but I just do not see the point on simple maint issues.

Now, if there are logic changes, or significant enhancements, say moving a format to SIMD, changing it from fat to thin format (using dynamic), then yes, I can see you wanting attribution (or a blame link, lol). But for a lot of run of the mill stuff, I personally would like a common place to relinquish ownership rights to the code.

seekaddo commented 6 years ago

@solardiz Thank you. I will give up my best. I will rewrite it again taking all the comments here into consideration and test it with a sample test cases. A sample wouldn't be a bad idea though

Sorry once again for all the misunderstanding.

jfoug commented 6 years ago

I could also put things like figuring out how to completing detect a password, vs a format that has false positives (I have done a few of those) That is absolutely something that should carry attribution. But simply moving some test to a different line in crypt_all, to avoid bad side effect (making this up, as this is a normal maintenance type job), is not really worth having comment on, even if it 'fixes' a format. If the code WAS there, just buggy, then the bug fix is not the issue, IMHO.

OK, SORRY @seekaddo for hijacking your topic. Lets get back on track ;)

Btw, @seekaddo solar is right, you are over complicating this. The whole thing should simply be a couple line function. Check the length. If >9 return 0. If length==0 return 0. If length == 1 and string is '-', return 0. Then after length checks, check if each character is >= '0' and <= '9' That is about it.

solardiz commented 6 years ago

@jfoug Would you be OK with spraying "with minor edits by JimF" comments that would not imply copyright (as long as we also add that "common place to relinquish ownership rights" to such minor edits, to avoid any concerns of third-parties)? Anyway, yes, this is off-topic for this issue, except maybe as it relates to these functions in common.c.

@seekaddo Thank you for your persistence.

seekaddo commented 6 years ago

@jfoug lol, no problem. Is also important.

jfoug commented 6 years ago

@solardiz I would be ok with that. This change to base64 is pretty large, and I am sure that there will be another time where I touch pretty much every format, and I can add the 'touched' signature at that time.

@jfoug lol, no problem. Is also important.

yes, but it probably shoulda been handled off list. I just happened to be reading this topic when I watch a new entry be posted by solar, so I knew he was watching ;)

seekaddo commented 6 years ago

Good evening everyone! I am back again, this time without troubles. I refined my implementations, maximal 7 lines of code. Your feedback, please. isdec* As jfoug and solardiz agreed/suggested #define will be used for isdec

jfoug commented 6 years ago

@seekaddo Looks very good.

the only thing I would say, is you do not need the 2 ((...)) in this expression (you have 2 of these expressions)

if (!((q[i] >= '0' && q[i] <= '9')) || ++i > 9)

if (!(q[i] >= '0' && q[i] <= '9') || ++i > 9)

is the exact same thing. No biggie, just a small cleanup

Wait a minute. Is there an i=i++ undefined usage lurking in there??? @solardiz what is your take on this? If so, then that WILL have to change, increment the i outside of this expression. My reading is that it is undefined to use and make changes to the same variable within the same sequence point. I am not 100% sure adding the parenthesis changes or makes a new sequence point (I do not think it does).

But now I would strongly recommend the code be changed from this:

while (q[i] != '\0')
    {
        if (!((q[i] >= '0' && q[i] <= '9')) || ++i > 9)
            return 0;
}

to this

while (q[i] != '\0')
    {
        if (!(q[i] >= '0' && q[i] <= '9') || i == 9)
            return 0;
        ++i;
}

Even if the code is 'not' undefined, it will make people think it is and start to research it. @seekaddo NOTE, just because this usage happens to work, does not mean it always will, or worse yet, it will fail in different ways on different compilers, or with different optimization flags on the same compiler.

solardiz commented 6 years ago

@seekaddo This is much better than before. Still overly complicated, though. The checks on the final return statements are redundant - by this point we already know we're OK. The check for *q == '-' at the start of isdecu() is unneeded. And I think I'd write the whole thing slightly simpler, with only one local variable (and it'd be a pointer, but that's a matter of taste).

Also, I don't recall what the current approach to this in jumbo is, but previously we avoided mixing declarations and code for ISO C90 compatibility. We even used a gcc option by default to disallow those, so we'd catch them early. If we went fully C99+ since then, that's OK.

@jfoug No problem there since || introduces a sequence point. Moreover, C guarantees short-circuit evaluation, and we probably already rely on that elsewhere in JtR.

openwall / john

Rewrite common.c: isdec*() #2951