j-andrews7 / kenpompy

A simple yet comprehensive web scraper for kenpom.com.
https://kenpompy.readthedocs.io/en/latest/?badge=latest
GNU General Public License v3.0

Login failure due to Cloudflare intercepting requests #33

Closed steveroks closed 1 year ago

steveroks commented 2 years ago

Hello,

I am getting the following error when trying to use the login function: LinkNotFoundError()

I am using the following code:

browser = login(email, password)


This used to work fine last season

Appreciate the help!

steveroks commented 2 years ago

Here is the response I'm getting. It seems the request is being blocked by Cloudflare...

Sorry, you have been blocked

You are unable to access kenpom.com

Why have I been blocked?

This website is using a security service to protect itself from online attacks. The action you just performed triggered the security solution. There are several actions that could trigger this block including submitting a certain word or phrase, a SQL command or malformed data.

What can I do to resolve this?

You can email the site owner to let them know you were blocked. Please include what you were doing when this page came up and the Cloudflare Ray ID found at the bottom of this page.

esqew commented 2 years ago

Thanks for the report! This was previously discovered in #24. A patch was subsequently issued and merged into master as part of #25, however the release on PyPi hasn't caught up to what we've got here.

If you need this functionality in the immediate term, you could install from the GitHub source but I imagine @j-andrews7 will push a new release with all the bug fixes we've got in the hopper before the season starts.

j-andrews7 commented 2 years ago

Yeah, I'll push a release to PyPi. And probably try to get stuff set up so Sean can push them too. Real work is making it really hard to carve out any time for this lately.

On Thu, Oct 27, 2022, 10:37 AM Sean Quinn wrote:

Thanks for the report! This was previously discovered in #24 https://github.com/j-andrews7/kenpompy/issues/24. A patch was subsequently issued and merged into master as part of #25 https://github.com/j-andrews7/kenpompy/pull/25, however the release on PyPi hasn't caught up to what we've got here.

If you need this functionality in the immediate term, you can install from the GitHub source but I imagine @j-andrews7 https://github.com/j-andrews7 will push a new release with all the bug fixes we've got in the hopper before the season starts.


steveroks commented 2 years ago

Thanks guys!!

trludt commented 2 years ago

Any update on this issue?

j-andrews7 commented 2 years ago

Will try to push a new release tonight.

On Mon, Nov 7, 2022, 8:11 AM trludt wrote:

Any update on this issue?


trludt commented 2 years ago

Thanks boss! Love what you've done here

j-andrews7 commented 2 years ago

This should be resolved in the latest release. Or at least the tests pass. If everything's still broken, blame @esqew.

Do not actually do that - he's the only reason this ever got fixed. Thanks Sean!

pip install --upgrade kenpompy should get you the latest version with this fix and a few others.

trludt commented 2 years ago

Still getting a Cloudflare error?

j-andrews7 commented 2 years ago

Working fine for me.

What's pip freeze | grep kenpom give ya?

trludt commented 2 years ago

kenpompy==0.3.3

j-andrews7 commented 2 years ago

Hm. @esqew, have you noticed any inconsistencies with this? Working fine with python 3.8-3.10.

@trludt can you try a few more times and see if you can get through? We are somewhat at Ken's mercy here, and if for whatever reason your IP is subject to more stringent measures, there's likely not a ton we can do.

trludt commented 2 years ago

I've tried a bunch still no luck... This is wild. I probably pulled data once per day of the season last year. Could that have restricted my IP? And does a VPN change anything?

j-andrews7 commented 2 years ago

I asked Ken about this before I published it, and he basically said as long as people didn't abuse it, he was totally cool with it. I can't imagine once a day is a problem.

Not sure about the VPN, though it could make sense if it's changing your IP geolocation to an area identified as more likely to be an attack.

trludt commented 2 years ago

I just shot him an email. Hopefully he gets back to me. I've tried the browser = login(email, password) line at least 50 times today and it still errors every time.

trludt commented 2 years ago

FYI I'm using Spyder v5.1.5 and Python 3.9.12 in my script if that helps towards anything

esqew commented 2 years ago

I unfortunately cannot reproduce this on my end, and the CI/CD tests checked out fine last night.

@trludt Where does your VPN reside geographically? Cloudflare will more closely scrutinize connections from IPs in higher-risk areas. Separately, can you provide a stripped down example that demonstrates this? Is it possible you're also trying to log in a bunch of times or scrape a ton of data in a short period?

trludt commented 2 years ago

I am always running into the Cloudflare error on that login step. I'm not always using the VPN (ExpressVPN) but when I am using it, my connection runs through primarily big US cities (Atlanta, Dallas, Phoenix)

esqew commented 2 years ago

Thanks for the context. Without being able to reproduce this issue, it's likely this is something relating to how Cloudflare is handling your specific connection or sessions.

It's possible (among other explanations) that Cloudflare's heuristics have detected either or both of the following: (a) your account has logged in from multiple geographies in too short a time span ("impossible travel" anomaly detection), or (b) your script, at one point or another, launched too many requests and Cloudflare is now blocking further requests for an indefinite amount of time.

If you're able, can you set your debugger to break on this exception and dump the contents of str(browser.page.contents) here to ensure that this exception is firing correctly at the very least?
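
On point (b), a minimal sketch of how a script could space out its requests so it never fires them in a burst. This is not part of kenpompy; the `Throttle` class and its parameters are hypothetical names used only for illustration, and the right interval is a guess since Cloudflare's thresholds are not known.

```python
import time


class Throttle:
    """Enforce a minimum gap (in seconds) between calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested
        self._sleep = sleep
        self._last = None        # time of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Too soon since the last request: pause for the difference.
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` immediately before each scraping call would keep consecutive requests at least `min_interval` seconds apart.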

trludt commented 2 years ago

Would that be in my script or altering the utils.py file under the Kenpompy code? @esqew

esqew commented 2 years ago

Neither. Now that I think about it, there may be an easier way: this simplified script should print the contents of the page Cloudflare is returning, so we can see whether anything specific to your case stands out.

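import mechanicalsoup

# login() creates its browser internally, so `browser` is never assigned if
# login() raises. Open the page directly to inspect what Cloudflare returns.
browser = mechanicalsoup.StatefulBrowser()
browser.set_user_agent('Mozilla/7.0')  # the User-Agent value currently in main
browser.open('https://kenpom.com')
print(str(browser.page.contents))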

Ultimately what I would really like to understand is what HTTP status code Cloudflare is throwing to see if that might be more illuminating in this situation, but the way we've got MechanicalSoup set up doesn't currently capture the right details to enable this. Hopefully the HTML Cloudflare is sending back will indicate something helpful to preclude this next level of debugging.
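
For anyone who wants to check the status code themselves in the meantime, here is a rough sketch. MechanicalSoup's `open()` returns the underlying `requests` response, so the status code and Cloudflare's `CF-RAY` header can be read from it. The helper names `summarize_response` and `fetch_summary` are made up for this example, and the snippet bypasses kenpompy's `login()` entirely.

```python
def summarize_response(resp):
    """Pull out the fields most useful for debugging a Cloudflare block.

    `resp` is anything shaped like a requests.Response
    (it needs `status_code` and `headers`).
    """
    headers = {k.lower(): v for k, v in resp.headers.items()}
    return {
        "status": resp.status_code,
        "server": headers.get("server"),
        "cf_ray": headers.get("cf-ray"),
    }


def fetch_summary(url="https://kenpom.com", user_agent="Mozilla/7.0"):
    # Imported lazily so summarize_response() works without MechanicalSoup.
    import mechanicalsoup

    browser = mechanicalsoup.StatefulBrowser()
    browser.set_user_agent(user_agent)
    resp = browser.open(url)  # the requests.Response for the page
    return summarize_response(resp)
```

Running `print(fetch_summary())` from an affected machine would show whether the block arrives as a 403 and which Ray ID goes with it.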

trludt commented 2 years ago

`['html', '[if lt IE 7]> <![endif]', '[if IE 7]> <![endif]', '[if IE 8]> <![endif]', '[if gt IE 8]><!',

Attention Required! | Cloudflare

Sorry, you have been blocked

You are unable to access kenpom.com

Why have I been blocked?

This website is using a security service to protect itself from online attacks. The action you just performed triggered the security solution. There are several actions that could trigger this block including submitting a certain word or phrase, a SQL command or malformed data.

What can I do to resolve this?

You can email the site owner to let them know you were blocked. Please include what you were doing when this page came up and the Cloudflare Ray ID found at the bottom of this page.

, '\n']`

trludt commented 2 years ago

Sorry that was when I was connected to VPN

trludt commented 2 years ago

['html', '[if lt IE 7]> <![endif]', '[if IE 7]> <![endif]', '[if IE 8]> <![endif]', '[if gt IE 8]><!',

Attention Required! | Cloudflare

Sorry, you have been blocked

You are unable to access kenpom.com

Why have I been blocked?

This website is using a security service to protect itself from online attacks. The action you just performed triggered the security solution. There are several actions that could trigger this block including submitting a certain word or phrase, a SQL command or malformed data.

What can I do to resolve this?

You can email the site owner to let them know you were blocked. Please include what you were doing when this page came up and the Cloudflare Ray ID found at the bottom of this page.

, '\n']

esqew commented 2 years ago

So it's definitely a Cloudflare interstitial (so the exception itself is working as intended). Unfortunately without a reliable way to reproduce this behavior outside your specific environment there's not much else we can advise you to do. Cloudflare deliberately doesn't share a ton of details to make it more difficult to circumvent their protections. Unless you're able to get in touch with KenPom himself (or someone who handles his Cloudflare account) and can look at the logs specific to the Cloudflare Ray ID values from the markup you posted, we're just shooting in the dark.
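
As an illustration of how a script could recognize this interstitial on its own, here is a small sketch using only the standard library. It assumes the page title and the "Sorry, you have been blocked" text look like the output pasted above; `looks_like_cloudflare_block` is a hypothetical helper, not something kenpompy exposes.

```python
from html.parser import HTMLParser


class _TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def looks_like_cloudflare_block(html):
    """Heuristic: True if the HTML resembles the block page quoted above."""
    parser = _TitleParser()
    parser.feed(html)
    title = parser.title.strip()
    return "Cloudflare" in title or "Sorry, you have been blocked" in html
```

This is only a heuristic: Cloudflare can change the wording of the page at any time.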

If you have the resources available, you may consider attempting to try this same code from another machine and/or using another KenPom account to see if it's reproducible across environments/accounts. The only other course of action I could recommend at this point would be to drop this altogether for at least a couple days to let any rate limiting that may be at play get fully reset.

trludt commented 2 years ago

I tried a different KenPom account and that didn't work. Also tried from my PC instead of my Macbook, no luck. I shot Ken an email with the Cloudflare Ray ID and my IP... hopefully I can get unbanned 🙏

This tool rocks and would hate to lose access to it. Truly had no intention of abusing it or overloading his site.

I appreciate @esqew @j-andrews7 for the assistance, much appreciated!

AVA-27 commented 2 years ago

@trludt Please let me know if he reaches out to you on this issue. I am having similar results being blocked by Cloudflare. I originally had my own scraping pipeline developed through beautifulsoup that I used last year (about daily similar to you). Recently I ran my script to see if it still worked for this season and I received the 403 forbidden code. I then tried this python library as an alternative and received the same block. In my personal pipeline I also tried modifying User-Agent status and several other request headers and nothing resolved the issue.

RobMepham commented 2 years ago

@trludt and @AVA-27 I have also received the same LinkNotFoundError (now replaced by Exception: Opening kenpom.com failed - request was intercepted by Cloudflare protection, after forcing a kenpompy update via pip).

I have also reached out to Ken via email and will let you know if I hear anything.

For context, I use a VPN (NordVPN) out of Atlanta (physically located in TN). Last night at about 2:00 AM CST I received the first error (on VPN), and tried again disconnected (still received LinkNotFound error).

Today, after updating, I explicitly tried to re-run while being disconnected from VPN to see if there was a difference, but I'm afraid I might already be blacklisted b/c of the re-geotagging of my first request via VPN.

AVA-27 commented 2 years ago

@RobMepham I never utilized a VPN to access the website and still cannot gain access so I am not sure if that is playing a role or not. Edit: I also reached out to Ken via email; if I receive a response I will update as well.

trludt commented 2 years ago

Good to know I'm not alone! Currently working on migrating my script over to a lambda function on AWS, which was ultimately my plan all along. Maybe this is a good thing my hand is being forced 🤣

jacksonw765 commented 2 years ago

Can confirm I am experiencing this issue as well.

esqew commented 2 years ago

So it seems Cloudflare interception may be more prevalent than I had initially anticipated. At this point, we know (or can reasonably infer) that the following factors play at least something of a role in determining whether or not to intercept requests, likely among many, many others:

  • Requesting against KenPom.com without including a User-Agent header
  • Using a "browser" with JavaScript support missing or disabled (curl kenpom.com will, more often than not, result in the same Cloudflare interstitial)
  • Using while connected to a public VPN provider
  • Connecting from an IP block associated with a "high risk" geolocation

The latter two are a bit out of our scope of control in this instance, but I will be carving time in the coming days to tee up some experimentation where we can try to reconfigure how the requests are sent for those affected by this. Watch this space!

trludt commented 2 years ago

This has been incredibly frustrating. I've been having my friend run this on his computer. Sometimes it works, sometimes it doesn't. It hasn't been working in an AWS lambda function either, triggering the Cloudflare exception.

Considering we all already pay for the access to the data, I wish we could at least get a response back from him why he's doing this

j-andrews7 commented 2 years ago

It's a busy time of year for him, I expect he's travelling a lot and such. He has responded to me in the past via e-mail, so I'd just advise some patience. It is an off-label use and all.

Like @esqew said, there are a few kind of hack-y workarounds we can try, but there's no guarantee. Given that neither of us is able to reliably replicate the issue, we may need folks to test, so stay tuned.

esqew commented 2 years ago

I had a few hours this morning to start experimenting specifically with how the User-Agent value actually affects what comes back from the server.

In short, using valid User-Agent values from modern browsers (h/t to @pzb's list) actually triggers the Cloudflare interstitial, and flipping back to the standard Mozilla/7.0 we've got in main right now works as intended (in my environment, anyway). On this same tested-working connection path to the target site, curl actually works as well when given the explicit User-Agent value to use: curl "https://kenpom.com" -A "Mozilla/7.0" As such, I think it's fair to surmise the User-Agent header as it's currently configured in utils.py is unlikely the culprit causing these interstitials to pop.
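
A rough sketch of this kind of comparison, for anyone who wants to repeat it from their own connection. It uses the standard library's `urllib` rather than the `requests` stack that MechanicalSoup sits on, so results may not match exactly; `fetch_status` and `compare_user_agents` are names invented for this example.

```python
import urllib.error
import urllib.request


def fetch_status(url, user_agent):
    """Return the HTTP status code for `url` when sent with `user_agent`."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses (e.g. a Cloudflare 403) are raised as HTTPError.
        return err.code


def compare_user_agents(url, user_agents, fetch=fetch_status):
    """Map each User-Agent string to the status code it produced."""
    return {ua: fetch(url, ua) for ua in user_agents}
```

For example, `compare_user_agents("https://kenpom.com", ["Mozilla/7.0", "<some modern browser UA string>"])` returns one status code per User-Agent value tried.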

Knowing this, I would deduce that the blocks that have been reported are moreso driven by the IP ranges from which folks are running this code. Since popular VPN and cloud providers' IP blocks are well known, it is quite trivial for Cloudflare to block traffic originating from these sources. Since I haven't used Cloudflare in many years I had to do some research on how this is typically configured - turns out this blocking is fine-tuned by site owners/administrators. From a thread describing a similar issue on Cloudflare Community:

At Cloudflare you have many options to block. You can filter for ASN, IP, IP ranges, country etc… and many more. But if you are blocked, then the website owners configured it to do so. Feel free to contact the website owner and ask him these questions

What I am curious about is if this protection would also apply to a setup which would present itself as a modern browser with full-fledged support for JavaScript (a la Selenium, pyppeteer) that's running from within one of these IP ranges, but I dread the idea of bringing on something that has such a huge computational overhead compared to what we have working right now.

In any event, I'm going to start experimenting down this route running in an Azure/Colab/other cloud environment and report back. If this does seem to alleviate some of this blockage we'll have to see what we can do in terms of marking these more resource-heavy libraries as extras to be used in the event that all else fails.

I would also really like to see if Ken does return anyone's email and what his opinion on the whole thing is, but I'm not holding my breath - he is a pretty busy guy.

j-andrews7 commented 2 years ago

I also played with the user-agent settings and found much the same.

I originally tried the selenium route previously, but it was...not trivial and definitely had way more overhead.

I have managed to get a response from Ken before, but he's likely very busy and potentially travelling right now.

trludt commented 2 years ago

I'm curious if he's open to someone helping him build an API that people can pay a little extra for access to the data. I brought that up in my last email to him

jkiddUA commented 2 years ago

Wanted to chime in here for anyone still having issues. We were able to work around the 403 error by adding this line to our code: browser.set_user_agent('any-random-thing')

Added that line right below "browser = mechanicalsoup.StatefulBrowser()"

This was following advice from this post https://stackoverflow.com/questions/48506614/403-error-with-mechanicalsoup.

Edit: Looks like there might be something else involved. Our data analyst was using JupyterLab to run his code and even with the change he was still getting 403'd. Maybe an issue with module versions or how Jupyter formats the traffic. I installed Python fresh with all updated modules and it worked running from the Python shell on two computers, both with different outgoing IPs.

Works in:

  • Python 3.11.0
  • kenpompy-0.3.3
  • mechanicalsoup-1.2.0
  • pandas-1.5.1
  • bs4-0.0.1
  • beautifulsoup4-4.11.1
  • requests-2.28.1
  • lxml-4.9.0 (manually installed because Windows)
  • python-dateutil-2.8.2
  • certifi-2022.9.24
  • charset-normalizer-2.1.1
  • idna-3.4

jacksonw765 commented 2 years ago

@jkiddUA kudos to you, I was able to find a workaround. Updating to python 3.11 has fixed the Cloudflare issue, and I can now load data successfully.

Harrisoneller commented 2 years ago

I updated to python 3.11, however I tried to reinstall kenpompy, and receive an error "Error: failed to build wheel for lxml". How did you get around this once you upgraded?

jkiddUA commented 2 years ago


So I had to download the lxml wheel (.whl) from here https://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml. Just download the correct one for your machine and place it in your working directory. In my case it was lxml-4.9.0-cp311-cp311-win_amd64.whl because I'm using a 64-bit Windows machine. And then from my virtual environment I ran the command pip install lxml-4.9.0-cp311-cp311-win_amd64.whl.

Once that's successfully installed you should be able to run the kenpompy install.

esqew commented 2 years ago

Hi all, my testing the use of browser control libraries to overcome the reported Cloudflare issues remains ongoing as my availability permits. To recap what we know for sure:

  • Connections originating from infrastructure with outbound IPs assigned to popular public cloud providers (Azure/AWS/etc.) will be blocked/filtered
  • Connections originating with outbound IPs known to be associated with popular public VPN providers will be blocked/filtered
  • Connections without a "valid" User-Agent HTTP header will be blocked/filtered

Potential mitigations:

  • Update to the latest version of kenpompy that's been published to pypi, if you haven't already:

    python3 -m pip install --upgrade kenpompy

    This will ensure that the latest patch with an appropriate User-Agent header setting is in place.

  • Some users indicate that updating to Python 3.11 may help, but it's unclear if this will work for 100% of users experiencing Cloudflare-related issues, or even why it works in the first place
  • Using a browser automation framework instead of the MechanicalSoup stateful browser (currently being developed and tested locally)
  • For those who must run their workload in a public cloud, it may be worth exploring whether an HTTP proxy might help to mitigate this. I'm looking to see if it might be helpful to allow passing a preconfigured MechanicalSoup instance to login() to allow for this type of configuration up front. UPDATE: this is not likely to work. In quick tests with cURL and some publicly available proxies, KenPom.com returned 400 Bad Request for each. It may be worthwhile to continue more in-depth testing, but for now I'm considering this a non-viable option
  • Write to KenPom and ask how we might be able to sidestep this protection for scraping purposes

esqew commented 1 year ago

Hi all, I haven't lost sight of this! My local testing continues as my availability allows with a selenium-based retrofit for the MechanicalSoup functionality we use today. I know progress has been excruciatingly slow, and that's simply a byproduct of my full-time role (and my life more generally) being as hectic as ever with the end of the year fast approaching.

Interestingly, using selenium with a "headless" WebDriver is filtered by Cloudflare 100% of the time. This is slightly surprising (at least to me) because primitive/headless-only clients (like cURL), with the proper User-Agent header, are not normally filtered in the same way.

As a result, those who do require headless functionality are likely going to find this solution untenable, at least if/until we find a workaround. I'm aware of projects like selenium-stealth which have in the past been able to overcome this detection of headless webdrivers, but are now largely unmaintained (ostensibly due to the strides these providers have made in detecting them anyway). I'm open to any and all suggestions from the folks here if someone is aware of something more modern that might have a better success rate.

(Somewhat separately but for complete transparency, I had attempted to do this retrofit using a pyppeteer base a few weeks ago, but its asynchronous-first design leads to quite a bit of complexity trying to implement it in a reliable, robust way that also supports everyone who's using it in a Jupyter environment. I had initially assumed it offered no mitigations for the Cloudflare detections beyond what selenium has, considering it's the same Chrome/Chromium browser at the core, but I take that back: there seems to be a pyppeteer-stealth package that's actively maintained, so this thread may be worth re-exploring as a result.)

The next test I'm taking on is running non-headlessly from within an Azure environment to see if our hypothesis holds true that this should bypass Cloudflare's heuristics. If successful, I'll publish an experimental branch for others to test in the near future. More to come!

mbrundige commented 1 year ago

I was trying to figure this out yesterday and I can't find anything that would indicate it is an issue specific to MechanicalSoup. In fact, the problem appeared to originate in the Python requests library. There is clearly something that is not "liking" the way the library is sending the requests (Cloudflare most likely has a ruleset in place, since the majority of scraping is done through Python). I can't quite put my finger on it.

I tested with a popular ruby library, HTTParty while on my VPN (through NordVPN) and got a 200 and like you mentioned cURL with the same exact headers works.

I am going to continue digging when I get a bit more time.

johnfeldhausen commented 1 year ago

Hello,

I am new to this package, and I just purchased my kenpom account a few days ago. I am also having the same issue:

Exception: Opening kenpom.com failed - request was intercepted by Cloudflare protection

I am wondering if anyone has had any luck outside of this website to scrape data for NCAA mens while these issues with Cloudflare are being investigated further.

Thanks and looking forward to using this package when this is resolved

esqew commented 1 year ago

@johnfeldhausen Thanks for the report. Can you give us a bit more detail around your environment? Are you running on cloud infrastructure or through a VPN?

mbrundige commented 1 year ago

@esqew - I did get this to work with the following version (upgraded from 3.10 to 3.11):

Python 3.11.0 (v3.11.0:deaf509e8f, Oct 24 2022, 14:43:23) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin.

I deduce that this may be something with TLS fingerprinting in older versions of the ssl wrapper library. Here is the version of openssl (which is pertinent for the TLS fingerprint issue):

OpenSSL 1.1.1q 5 Jul 2022
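
If it helps others compare environments, here is a small sketch that prints the same details using only the standard library. The function name `environment_report` and the default package list are just illustrative choices; the list can be swapped for whatever is relevant.

```python
import platform
import ssl
from importlib import metadata


def environment_report(packages=("kenpompy", "mechanicalsoup", "requests", "lxml")):
    """Collect Python, OpenSSL, and package versions as a plain dict."""
    report = {
        "python": platform.python_version(),
        "openssl": ssl.OPENSSL_VERSION,  # the OpenSSL build Python links against
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None  # not installed in this environment
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Posting this output alongside a report would make it easier to see whether the Python or OpenSSL version lines up with who is and isn't being blocked.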

esqew commented 1 year ago

@mbrundige Fascinating. I definitely missed an OpenSSL upgrade in the 3.11 RC2 release notes, but sure enough OpenSSL 1.1.1q is mentioned. I don’t doubt TLS fingerprinting plays a major role in Cloudflare heuristics, so this could indeed make a lot of sense if the changes in OpenSSL materially affect how this is carried out.

This would also track with earlier reports that a Ruby-based HTTP library didn't have issues being filtered when MechanicalSoup's requests baseline did.

In any event, this would certainly preclude the need for any browser-based workarounds.

@steveroks @trludt @AVA-27 @RobMepham @Harrisoneller If possible, would each of you mind updating to the latest kenpompy and Python 3.11.0 and report back if this resolves your Cloudflare-related issues?

esqew commented 1 year ago

I will assume that if we don't hear back from anyone in the next week or so that the proposed fix (updating to Python 3.11) is working for those still otherwise experiencing this issue, and will close this issue accordingly.

I'll also add a small section to README or a Wiki page to summarize the issue & fix as a place to point future users to.

esqew commented 1 year ago

Closing due to inactivity per my previous comment. We will consider the guidance to update to Python 3.11.x as the best solution. On the flipside, I am open to more reports in the future that might contradict that, in which case we'll re-open this and do further investigation on a case-by-case basis.

Thanks to everyone for their assistance to date!