joeyvanlierop / xkcdbot

A reddit bot that automatically links xkcd comics in the /r/xkcd subreddit 🤖
MIT License

Increase specificity of regex matching #15

Closed. Nyhilo closed this 4 years ago

Nyhilo commented 4 years ago

(?<= ^! | \s! | (! ^# | \s# | (! ) This ensures that the code character (! or #) is either at the start (^) or preceded by whitespace or an open parenthesis. The match only happens if the pattern isn't embedded inside other text (being inside parentheses is fine).

\d{1,4} This ensures that we only match if the number after the bang is 1 to 4 digits long. If Randall keeps updating weekly, it will take about 148 years before we reach comic #10000. Limiting the number of digits being matched will greatly reduce false positives.

(?= \b ) This ensures that the match ends on a word boundary. That allows punctuation, whitespace, etc. after the number, but prevents matching when the digits run straight into other word characters, as in junk like !123a*b@3.

The regex in this PR matches correctly on the following:

!1234
#1234
Let's look at !1234
Let's look at #1234.
...looking at the comic in question (!1234) we can say that...

But not:

I spent $123.45 today
Comic #12345 doesn't exist
website.com/?=!1234
website.com/?=#1234
I have made a mistake referencing this xkcd#1234
some junk -> a^698f7%s@ !123a*b@3
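
The two lists above can be checked mechanically. What follows is a minimal sketch in Python, not the code from this PR: the pattern is an assumed reassembly of the three pieces described above, the name find_comic_ids is hypothetical, and the leading check is written as a one-character negative look-behind so that Python's re module will compile it.

```python
import re

# Assumed reassembly of the pieces described above; not the PR's exact pattern.
#   (?<![^\s(])  the marker is at the start, or follows whitespace or "("
#   [!#]         the marker character
#   (\d{1,4})    one to four digits, captured
#   \b           the digits must end on a word boundary
COMIC_REF = re.compile(r"(?<![^\s(])[!#](\d{1,4})\b")


def find_comic_ids(text):
    """Return every comic number referenced in text, in order of appearance."""
    return [int(number) for number in COMIC_REF.findall(text)]


should_match = [
    "!1234",
    "#1234",
    "Let's look at !1234",
    "Let's look at #1234.",
    "...looking at the comic in question (!1234) we can say that...",
]

should_not_match = [
    "I spent $123.45 today",
    "Comic #12345 doesn't exist",
    "website.com/?=!1234",
    "website.com/?=#1234",
    "I have made a mistake referencing this xkcd#1234",
    "some junk -> a^698f7%s@ !123a*b@3",
]

for line in should_match:
    assert find_comic_ids(line) == [1234], line

for line in should_not_match:
    assert find_comic_ids(line) == [], line
```

The negative look-behind reads as "the previous character, if there is one, is whitespace or an open parenthesis", which covers the start-of-text case without needing a separate ^ alternative.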

Nyhilo commented 4 years ago

@joeyvanlierop Had to remake this PR because I screwed up my git configuration in the last one.

Full disclosure, I'm still pretty new to this whole "contributing to FOSS" thing. I'm not actually sure how to run the test suite here, but I included some tests in case someone else wanted to pick up that mantle.

If you have any advice for me, it would be greatly appreciated.

joeyvanlierop commented 4 years ago

After poking around with the tests for a few minutes, I ran into some issues. I believe there are some regex nuances in Python regarding look-behind. The tests are throwing this error:

re.error: look-behind requires fixed-width pattern.
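
For anyone picking this up, here is a minimal sketch of what seems to trigger the error and one possible way around it. It assumes the cause is that the alternatives inside the look-behind have different widths ("^!" is one character wide, "\s!" is two), which Python's re module rejects. The names VARIABLE_WIDTH and FIXED_WIDTH are hypothetical, the first pattern is a cut-down stand-in for the one in the PR, and the rewrite is only a suggestion, not a fix that was adopted in this repo.

```python
import re

# Cut-down stand-in for the PR's look-behind:
# "^!" is one character wide, "\s!" is two, so the widths differ.
VARIABLE_WIDTH = r"(?<=^!|\s!)\d{1,4}(?=\b)"

error_message = None
try:
    re.compile(VARIABLE_WIDTH)
except re.error as exc:
    error_message = str(exc)

# One possible rewrite: give each width its own look-behind and
# combine the look-behinds with "|" inside a non-capturing group.
FIXED_WIDTH = re.compile(r"(?:(?<=^[!#])|(?<=[\s(][!#]))\d{1,4}(?=\b)")

print(error_message)
print(FIXED_WIDTH.findall("Let's look at #1234. See also (!327) and xkcd#1234"))
```

Each look-behind on its own has a single fixed width, so the compiler accepts it, and the overall match is still only the digits, as in the original design.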

I wish I could figure it out, but I am pressed for time right now.

There are a few syntax errors in the test file, but overall, great work!