Input validation missing for comment body edge cases.

jgreenemi commented 4 years ago

Just found this project from the reddit thread! I like what I see - the whole project appears well-structured and clean. This issue is to help in validating user inputs.

Example of the behaviour this issue seeks to address: https://www.reddit.com/r/xkcd/comments/erydbl/introducing_ubobbytablesbot/ff7a610/

This is probably pretty obvious but pointing out all the same - validation on the inputs looks to be necessary at either 1 where the the body of the comment is pulled out, or 2 where the find_numbers actually makes use of the input. Looks like the one case where user input has been shown to cause a problem is when the input is simply too large, in which case you could curb this from breaking by doing one or multiples of the following:

Introduce a try/except block so a single user's invalid input doesn't cause the whole bot to stop for everyone, just for the one request.
Try to prevent a large input (like when a user writes an essay into a comment body, which could be unlikely to ever invoke the bot anyway) by truncating the body string to a manageable size. One could introduce a simple heuristic like...
```
body = comment.body[:1_000]
```
This heuristic ensures your comments can only ever be a certain size, and doesn't fail in the case that the input is an empty string. This of course relies on the input actually being a list-like object (as strings are in Python) so if for some reason the comment object were malformed this would fail. In that case, a try/except block within handle_comment() would help here.
Add further logic to valid_comment() to handle more of these cases. The truncation approach is likely a good addition here.

joeyvanlierop commented 4 years ago

I agree that truncating the comment length would be a good idea. On a similar note, it might be worthwhile to limit the amount of digits that could be matched per group. By changing the regex from "\d+" to "\d{0,10}", it would limit the amount of digits per group to 10 digits. Although this might not be directly related to your issue, it is probably a good change considering that there are no xkcd comics ids with more than 4 digits.

jhonnystene commented 4 years ago

I'm working on a fix.

jhonnystene commented 4 years ago

@aamirrasheed No, I'm not. Go ahead.

yao-charlie commented 4 years ago

I'm assuming this is still open? Seems quite approachable for 'my first issue' :)

Is deshy-fs' solution insufficient? I suppose technically you could run into a 'y2k' problem but assuming 1 comic a week (I can't remember the posting frequency, it's been ages since I've regularly browsed xkcd), that's about 20 thousand years....

Are there other specific edge cases you want to handle? Ones that aren't solved by just ignoring what the user is inputting?

joeyvanlierop commented 4 years ago

deshy-fs's solution seems to be sufficient, although he never added tests which are nearly essential at this point. If you are wondering why the pull request never got merged, it is because he closed it himself.

Feel free to fork this repository and utilize some of his code to implement your change. If you get to a point where you would like to merge your changes back into this repository, please make sure you have added some tests before opening a pull request. It gives me peace of mind and is also great practice for your future in software development!

Regarding edge cases, I can't think of any specific examples. I would try implementing your change and then running the test suite and seeing if anything is broken. You might find that there are some unexpected results that you have to accommodate for.

yao-charlie commented 4 years ago

Sounds reasonable. I'll take a look. I'm probably going to look into generalizing it - e.g. scrape the latest comic # and set that as an upper bound - that way it's theoretically indefinitely scalable, as long as URL formatting doesn't change. Let me know if you think that's already a bad path to go down :)

joeyvanlierop commented 4 years ago

There is already a function to retrieve the latest comic which you might find useful.

I will leave it up to you to decide how you would like to implement your solution. I think that I myself would just truncate the number to a relatively small amount of digits, but in no way am I claiming that it is the best solution. No pressure either way though, just try to learn something and have fun!

yao-charlie commented 4 years ago

I think I've mostly got it working but trying to write the test cases... I just want to check that whenever I call get_comic(self, number), that 'number' is already stripped to be just an int being passed? Your unit tests suggest that's the case since you're not testing anything but ints for numbers but want to double check my assumptions.

As an aside, is it bad form to ask about someone else's code in a production environment where the 'answer' is in the code? For this, python seems a little less careful since you don't explicitly specify the argument type.

Also, you suggestion re: function was very helpful!

Edit: speaking of it... I'm assuming the crash happens with the function call to get_comic() - is it happening there? Else, is it happening in handle_comment() or somewhere else?

stefanosporiazis commented 4 years ago

@yao-charlie I too noticed that we are passing integers to get_comic in the tests, even though in bot.py we're actually passing strings.

But that isn't an issue since firstly, Python sets the variable type based on the value that we assign to it and secondly, get_comic uses f strings, to form the comic's URL and the logging messages, which evaluate the provided expression and always produce a string.

So passing int or str won't make a difference except if we pass 404 as an int, because get_comic compares number with the string "404" which will evaluate to False.

Still, I think we should always pass strings just to be safe. @joeyvanlierop Please correct me if I'm wrong. :+1:

yao-charlie commented 4 years ago

@deshy-fs - thanks for the feedback! I understand how Python's EAFP model works with variables and it can be mostly easier to work with. I get that most of the tests don't need to compare 'numbers', I just wanted to double check that when I was about to compare numbers as I'm doing a greater than check vs. the latest comic number, that I was intentionally converting it to an int and won't expect a 'real' string ever in the future. Of course, as I'm not familiar with all of Python's functionality, maybe I'm missing something that can be done with such operations with f strings.

That said, as I've explored more of the code, I'm thinking the crash was not really due to lacking the comparison function within get_comic(), but perhaps more in the saving the comment? I'm assuming requests doesn't have a weird upper bound and the xkcd site doesn't have a problem returning 404 for what seems like that arbitrarily large number, so the error must happen before. Just want to check as I don't think the unit tests cover this well - but I could be wrong being a novice and only looking over the code for a few hours.

stefanosporiazis commented 4 years ago

@yao-charlie I'm sorry, I don't understand. What crash are you talking about? Are tests failing or is the bot crashing? And what error messages are you getting? It would help if you could post relevant parts of your code. Thanks.

For future reference: I looked a bit more into the comic_ids list and it seems that it's being filled both with strings and integers. But again, this doesn't break get_comic since we use f strings.

match_token returns a list of strings
get_latest_comic returns an integer
match_random returns a list of integers

yao-charlie commented 4 years ago

@deshy-fs re:crash, just what was stated as the reason for reviewing this item - that reddit comment which crashed the bot. I don't think there are actually unit tests which target it right now (unsurprisingly). I'll check into your other statements.

stefanosporiazis commented 4 years ago

@yao-charlie Does the bot actually crash? Or does it just hang / not respond? I don't think there's a problem with http requests or python strings when it comes to processing large input (I'm sure there's a limit but it must be much higher).

yao-charlie commented 4 years ago

@deshy-fs I don't know. I've not run the bot myself so I haven't replicated the exact crash. I've been working on some other stuff since so I haven't looked more in depth yet. Will advise when I do.

stefanosporiazis commented 4 years ago

@yao-charlie @joeyvanlierop It seems that for the input in question the bot raises an exception, which is caught here, so it just skips the comment instead of crashing.

The exception message is "Expecting value: line 1 column 1 (char 0)" which after a bit of digging I found out that it is raised by the json method on the statement config = response.json() below: (link to code here)

        if response.status_code == 404:
            logger.warning(f"Comic {number} returned a 404 status code")
            return None
        elif response is None:
            logger.warning(f"Comic {number} returned none")
            return None
        else:
            logger.info(f"Got comic with number {number}")
            config = response.json()
            return config

So basically, because the comic doesn't exist, the response is not in json format and the json method raises an exception. I thought the first if statement would prevent this but it seems in this case the response code is 503 instead of 404. Which is probably sent as a response when the input is too big? I'm not sure.

So a simple solution would be to check if the response code is 200 before proceeding to set config, to make sure the server returned the document that was requested which should be in json format. I still think we should truncate the comment body down to let's say 10_000 characters and perhaps modify the regex to only match ids up to 5-6 digits long.

Let me know if you want me to work on this.

yao-charlie commented 4 years ago

@deshy-fs Sorry for the late response - as noted some other items I was focusing on.

I'm certainly a newbie for response code behaviour but would a web browser return a different value from response.status_code? I would assume not but as noted, no expert. I get '404' when typing 'pi' less the decimal from the reddit attempt. Of course, if that were a simple case, I would assume the test cases in get_comic() would be sufficient to catch this.

As for your note, again from my limited understanding, technically there are ranges response codes which could also throw similar errors so perhaps it's better to check for a range for success?

Else, re: truncating, and matching ID lengths, I think it's a fine solution. I was trying to be more generic and set a variable to check against for the latest comic # so that it's infinitely scalable - but that only works if the error was in the code rather than due to a dependency, which is why I asked.

'Edit': hilariously enough, for some reason, I tried going to xkcd again with 'pi' but this time I got a "header overflow" error. Limited knowledge but seems to be a CSS-related error?

stefanosporiazis commented 4 years ago

I get '404' when typing 'pi' less the decimal from the reddit attempt. Of course, if that were a simple case, I would assume the test cases in get_comic() would be sufficient to catch this.

I always get a 503 response.

As for your note, again from my limited understanding, technically there are ranges response codes which could also throw similar errors so perhaps it's better to check for a range for success?

That's what I think. Although from my knowledge, the requested file is only returned when the status code is 200. 201 creates the resource but doesn't send it, and 202 accepts the request but creates the resource later(?). I need to look more into "success" codes. But from my understanding, we only receive the JSON file when the code is 200.

I was trying to be more generic and set a variable to check against for the latest comic # so that it's infinitely scalable

If you can make it work indefinitely, that would be even better. Although requesting the latest comic every time could slow down the bot a bit. If I'm not mistaken, it takes about 1-3 seconds for each request (at least for me), whereas the truncation and regex solution probably costs a couple of ms. Feel free to do it however you think is best though!

I tried going to xkcd again with 'pi' but this time I got a "header overflow" error. Limited knowledge but seems to be a CSS-related error?

"Header overflow" probably means that an HTTP header overflowed. I don't think it's related to CSS in any way.

Let me know if you need anything else :smiley_cat:

yao-charlie commented 4 years ago

@deshy-fs I'll check more on what response I'm getting - let you know if I am getting differences but perhaps it's just the way a web browser communicates incomplete/failed requests on the front end.

I added a case to the code to return if we're not getting 200. It might be too restrictive but take it as a placeholder for now if we get to understanding more server response code behaviour.

Although requesting the latest comic every time could slow down the bot a bit. If I'm not mistaken, it takes about 1-3 seconds for each request (at least for me), whereas the truncation and regex solution probably costs a couple of ms.

You're right that it's a trade off. I've tried to limit its scope by carrying a variable to check against before making the request for a latest comic - thus it should only experience the slowdown if we're requesting a comic out of range. I think that's a fair trade-off as it should be pretty rare as is since mostly we're looking at typos or people doing this maliciously.

I do think both solutions are fine - it's more like 'what would you do in the real world to solve this' vs. 'what could theoretically happen to limit this' types of solution. An engineer would say the approximation is sufficient, whereas a mathematician wants it to be true in general. I've always preferred the latter but in a business and real life context, the former is generally better for being sufficient.

I guess you can check out my revisions here.

stefanosporiazis commented 4 years ago

@yao-charlie Any update on this?

yao-charlie commented 4 years ago

@deshy-fs I linked my suggested revisions and haven't heard anything more about what's causing the error. Not sure of the protocol otherwise - didn't feel appropriate to issue a pull request unless the behaviour was better defined but linked my revision in case it was sufficient.

stefanosporiazis commented 4 years ago

@yao-charlie

I think the "Does not exist yet!" response for requests beyond the latest comic is a nice touch although I would personally skip the requested comic instead of responding, to make the bot as fast as possible but maintainers should have the final word on this.

Always checking if the response code is 200 should be sufficient to solve this issue but I think it's best to also truncate the comment's body down to 5-10k characters in order to avoid any future bugs/unwanted behavior when we receive unusually large input. Also, I think it's best to first check if response is None before proceeding to check what the status code is, to avoid any exceptions.

Lastly, we could also modify the regex to only match up to 5-6 digit numbers, although I don't think this is necessary as long as we truncate the comment's body. I guess it comes down to personal choice.

When you feel your changes are sufficient to solve this issue, feel free to create a PR.

joeyvanlierop / xkcdbot

Input validation missing for comment body edge cases. #1