j-andrews7 / kenpompy

A simple yet comprehensive web scraper for kenpom.com.
https://kenpompy.readthedocs.io/en/latest/?badge=latest
GNU General Public License v3.0
73 stars 21 forks

Refactor for new cloudflare requirements #95

Closed: seankim658 closed this 3 weeks ago

seankim658 commented 4 weeks ago

Refactors to use the cloudscraper library instead of mechanicalsoup. Fixes issue #93.
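
For readers unfamiliar with the swap, here is a minimal sketch of what a cloudscraper-based login could look like. It is not the code from this PR: the full login URL is inferred from the form action in the traceback further down, and the `email`/`password` field names and the `build_login_payload` helper are assumptions for illustration.

```python
# Hypothetical sketch, not the PR's actual implementation.
# Assumed: the absolute login URL and the form field names.
LOGIN_URL = "https://kenpom.com/handlers/login_handler.php"


def build_login_payload(email, password):
    """Return the form fields for the login POST (field names are assumed)."""
    return {"email": email, "password": password}


def login(email, password):
    """Log in and return a session-like scraper that keeps the auth cookies."""
    # Imported lazily so the helper above is usable without the package installed.
    import cloudscraper

    scraper = cloudscraper.create_scraper()  # a requests.Session subclass
    scraper.get("https://kenpom.com/")       # pick up initial cookies first
    response = scraper.post(LOGIN_URL, data=build_login_payload(email, password))
    response.raise_for_status()
    return scraper
```

Unlike the mechanicalsoup version, nothing here selects a form from the page, so a missing form can no longer raise `LinkNotFoundError`; whether the login actually succeeded would still need to be checked separately.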

seankim658 commented 4 weeks ago

I just ran the tests and got 3 failures out of the 24 test cases. Let me look into that quickly and I'll make a new commit to correct whatever went wrong.

seankim658 commented 4 weeks ago

Ok passes all test cases now.

j-andrews7 commented 4 weeks ago

Generally looks pretty good to me, I'll kick the tires a bit further when I get a chance. My one worry is that it (unsurprisingly) does away with some of the Cloudflare interception handling that lets the user know what's going wrong.

...Of course all of that does nobody any good if there's no way past them anyway, which seems like it might currently be the case. And we can always add some of that back in the future.

@esqew may have some additional thoughts.

seankim658 commented 4 weeks ago

I realized I didn't include this info in the issue but the Cloudflare JS challenge I was running into in #93 wasn't being caught by the interception handling. The login function was failing here:

Traceback (most recent call last):
  File "/home/seank/projects/personal/kenpom-upstream/kenpompy/main.py", line 8, in <module>
    scraper = login(username, password)
  File "/home/seank/projects/personal/kenpom-upstream/kenpompy/kenpompy/utils.py", line 36, in login
    browser.select_form('form[action="handlers/login_handler.php"]')
  File "/home/seank/.local/lib/python3.10/site-packages/mechanicalsoup/stateful_browser.py", line 241, in select_form
    raise LinkNotFoundError()
mechanicalsoup.utils.LinkNotFoundError

So the Cloudflare JS detection HTML was being returned instead of the kenpom home page. That page has no login form, so the select_form call was failing.

The returned HTML contained the challenge-error-text element asking to enable JavaScript:

<!DOCTYPE html>
<html lang="en-US">
    <head>
        <title>Just a moment...</title>
        <meta content="text/html; charset=utf-8" http-equiv="Content-Type" />
        <meta content="IE=Edge" http-equiv="X-UA-Compatible" />
        <meta content="noindex,nofollow" name="robots" />
        <meta content="width=device-width,initial-scale=1" name="viewport" />
        <style>
            * {
                box-sizing: border-box;
                margin: 0;
                padding: 0;
            }
            html {
                line-height: 1.15;
                -webkit-text-size-adjust: 100%;
                color: #313131;
                font-family: system-ui, -apple-system, BlinkMacSystemFont, Segoe UI, Roboto, Helvetica Neue, Arial, Noto Sans, sans-serif, Apple Color Emoji, Segoe UI Emoji, Segoe UI Symbol, Noto Color Emoji;
            }
            body {
                display: flex;
                flex-direction: column;
                height: 100vh;
                min-height: 100vh;
            }
            .main-content {
                margin: 8rem auto;
                max-width: 60rem;
                padding-left: 1.5rem;
            }
            @media (width <= 720px) {
                .main-content {
                    margin-top: 4rem;
                }
            }
            .h2 {
                font-size: 1.5rem;
                font-weight: 500;
                line-height: 2.25rem;
            }
            @media (width <= 720px) {
                .h2 {
                    font-size: 1.25rem;
                    line-height: 1.5rem;
                }
            }
            #challenge-error-text {
                background-image: url(data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHdpZHRoPSIzMiIgaGVpZ2h0PSIzMiIgZmlsbD0ibm9uZSI+PHBhdGggZmlsbD0iI0IyMEYwMyIgZD0iTTE2IDNhMTMgMTMgMCAxIDAgMTMgMTNBMTMuMDE1IDEzLjAxNSAwIDAgMCAxNiAzbTAgMjRhMTEgMTEgMCAxIDEgMTEtMTEgMTEuMDEgMTEuMDEgMCAwIDEtMTEgMTEiLz48cGF0aCBmaWxsPSIjQjIwRjAzIiBkPSJNMTcuMDM4IDE4LjYxNUgxNC44N0wxNC41NjMgOS41aDIuNzgzem0tMS4wODQgMS40MjdxLjY2IDAgMS4wNTcuMzg4LjQwNy4zODkuNDA3Ljk5NCAwIC41OTYtLjQwNy45ODQtLjM5Ny4zOS0xLjA1Ny4zODktLjY1IDAtMS4wNTYtLjM4OS0uMzk4LS4zODktLjM5OC0uOTg0IDAtLjU5Ny4zOTgtLjk4NS40MDYtLjM5NyAxLjA1Ni0uMzk3Ii8+PC9zdmc+);
                background-repeat: no-repeat;
                background-size: contain;
                padding-left: 34px;
            }
            @media (prefers-color-scheme: dark) {
                body {
                    background-color: #222;
                    color: #d9d9d9;
                }
            }
        </style>
        <meta content="390" http-equiv="refresh" />
    </head>
    <body class="no-js">
        <div class="main-wrapper" role="main">
            <div class="main-content">
                <noscript>
                    <div class="h2"><span id="challenge-error-text">Enable JavaScript and cookies to continue</span></div>
                </noscript>
            </div>
        </div>
        <script>
            (function () {
                window._cf_chl_opt = {
                    cvId: "3",
                    cZone: "kenpom.com",
                    cType: "managed",
                    cRay: "8d841b98afcc0650",
                    cH: "T_1E.n3BlTUPaWeR77C4iA5VLxJW_GzGiNYj8PkU2RE-1729879243-1.2.1.1-Eh9hYE8REn2iGpR7aKdPJOAjd9tkzHyN8ZvqPi2XJM1iO9vzKWs6tHY9pEdP.0Gk",
                    cUPMDTk: "\/index.php?__cf_chl_tk=BUZVra4deaH_JcZ4B0YBDzMgl82Cy1pgw0rcIGkKlFI-1729879243-1.0.1.1-7DuWua2qk_4BRNqhajYekpxfhY3qrdz9GTCphNZww3E",
                    cFPWv: "b",
                    cITimeS: "1729879243",
                    cTTimeMs: "1000",
                    cMTimeMs: "390000",
                    cTplV: 5,
                    cTplB: "cf",
                    cK: "",
                    fa: "\/index.php?__cf_chl_f_tk=BUZVra4deaH_JcZ4B0YBDzMgl82Cy1pgw0rcIGkKlFI-1729879243-1.0.1.1-7DuWua2qk_4BRNqhajYekpxfhY3qrdz9GTCphNZww3E",
                    md:
                        "bWf7BwCOJJsjIpVmb3yoY5GNweGi17ilxtGZxhT8zis-1729879243-1.2.1.1-bzn3vH.0VTcwhxQEHn.3hbBNmFhQ6d.9eeG2YooSc4vQWBRULQQY2iqWLZJGH7bLNjcPLSmbX8oM0gyROGCvui26J.CeivFroIA9xLxzwHUqcslRCKmHhsVAD_UiZpQDctv5va0JKh8h6JfkcTMyk7c9u4P5bapI78_8u1DdnGN6ZpiLnqg7oti8ORdHPRLoat63Fc77KlSHJrP4eVYL3wtk8sVxYA983fk9hsOM5f1_hg.z4TkuyZ2bWE9esM24ouV_IV.V7k7vxGfX0BsY8SFVBfpvFyzkDCuae3j9mbwzDG0ZQlN.h3HqoE3h56Ud.XYS8xlwML3lVQ_sDKof2YeekrEOTAeWakBtDCDm5JaYhZi9RlHfealM5uKqgxfs2.oV3DIbaBN6L6IMlrzm46v9XdL1bkqtTftK_lv4HLp7ONyBxMKNm36BBVQc_KgHMQqbsZYL27uBWbqOU4a7Ozi.TtakdFpb02aImifPkrxrzHruE23B5thSnOEWcb2fQauW8EQvX1_CTBwipW6KfsoOxhMVWp.1dYhs922ohFYK_FLgKVXl6csZ5M.kEP4xNt0bFZ5kgmeNkFWVqJuUbqk0WKDFk3UymA4xYcJGxag34Wmx7Jz2ZMLZEVwtgJ26mjSWZBBGnMCO6bpiuAOSPXlF4UumStfdZBQ3fyd4PxtLNSctGQPgAF6otQQJkG_7UJ0RM3bvbn.czTx_KMHMox2uYHX4H0Ue_9a3OKjjznEAHmn0BA91UkqQg5SNYS2752Yed0_TkeEBlAk4.zhePOtK999ZlH8QSJcDWJDOtVdsLaB6eT_q.DpDION1wb1yD4bthm_At.iWWS.Ij8o549AlhZ1xVG4v6mqnCCDe.WwDnG69N7RQXD1s1m2ApSEqvFmFJ27U_6YEk.opnOd_3H_YjLbNCHXoOhnrL4Foh7.XCPHnGxqRoysXb7ABS95qL94h.w7DbJsKlKj1PT2f1FqaRGEf1.RM9YYjnJoZEYE98ufvpQqW4ncAKrvo2QKY1S_EH1af30pIXRZbtC6NRuAjqNP7TMM.L_QOYDkZ56JVqygZFUS_9AV9pIZp6mMcUcHLwlalbKwPSvo77YJO8w3s3GGZ05VMcBEwea1Mc_eYIUey57MyZNoskjs84i7xwKqBdnOn.uk3bcEORssrMiLLu8z2qO2R98Sq9E20VdBM52mDKete_Ve93tH7E4FQ048C1MogrY6FckQTFwZi0yc0VumLU_FD_sCuMgDyRglTjckQ3oEroenDhzz3rr9C.Io3PGZwHAhpG75v1YYL0NOoPuWQn86w71UaNQ_1kCQpgz4nlHOacwQR5oZuypv1eXMOmJyJjHL12Z.X7RGKeaBHsnGGd4bZpThoW47cmefJM4_BAggs4gZnP3KQTjH.hhMo9PlN4iEwNgzTkcAUVZ9Ho_DccAjHf1kmhbFqtqodOcw5de5Me1N2m8JLIOdnjyP__HE5RQ3mKdO0Wnl0W08Kz.cKGFQebXyJ_h2Dt4jQ41ETjFPzW5Q0c2BmM7NSxSaWNFL5FA01xR07X3zs5zLYyarvrPPATl0rMfXtuJLBqu3NNTmvqZjAp5GdZTGPOTQvLHckMrePh9fyJAccIrCQJkS1yEVgzaVRC9E0T6wDdGHCk8ioRKbh_0owOkMiPAqH29qt5afV2xLY7naUJAJPYZjtTckEmli1NB2pdkjM5mLyHorlnRRHOn7PY_1n3_O2zzpuGByi.v0GqsZIcr6lYfV8mw3nWQQa5tPtJP5asDpj_O.I7pT5osuPU9o2eU_wos1wq10okYEVMt4CW8I0hoZxnGlFiMKdVYS1_OThKTiRagRZ3Fippm7pBgZQMN2xg_uTxnuWzeRcmxm0NFESZ0Hp.sERjPet9_zWbsZOUN_NA8Vn_p8Y
zLix9lAzlIJjneFYckwsz_KaANqbSfXgHW7nncNjCrgFsIkD8HyY6Fnxa74E7GJIzz5R8LVXaZqOzozci_x1QA4C3pX9ckwDpoRaFAp_ZtObR0t5CfYR6PR.LM9WpKNgwtjQb2z1KVHWnUejCIeN0glZFdAGHmlahzTsJrxooy4fBZ2Xz106ObyePUkPrOckky5t3Rwe9Osx6ESBHSCNdjxW.gf9pcQSlTyJyioVOi82tBRSNesuXgY3qXOlTaTZtsMhBYoNIl7hwuUG9Vp39XL1AMaYRA",
                    mdrd:
                        "thZe5pKCvBkMZq9Ow_naSE..nky8ZGAFvygi9IO4Yvo-1729879243-1.2.1.1-hPCirwZKUhXnaWj5lH43K8h5sU4w5RBMjt5h9igoG2Qst_q6l8K4b2RqLa7v8GypM9N0S.FRZA1a5vXdRBuZJ2yaVGWA7jhDnbA.9.lacBe0_qGoNQYk1aSjOTmaTsHZrgcuDaPfBWstsCeRfaH_wRDoSDcCv00N4WwOSWTbJU4tMoOGYkel1wk._ax4rGtQv5quvH5cg0UPOhE7A7NFMaBjTE9FS6iTlTr9U9wrONpF7JhBbXrT4Yr940eqZRL.sqqticT4yizKONpULzODe4RAlWvo70yyMlGnyNblBHTgt6312OAzw8g671PUVBDmdAAUS1ystjor9KeH7liUqYSfwO8RcPzszjYJZLQfTCA57Nj.1XVFExqw.fBnomeCcg6KFX97oxYe1R2tc8qDaJYA3WKeGWCoT3V_qmc4JDfOeRdeP6hJYhZIJSVeT8t._tcBmYQPbnY0N4to4mc3G4_T7qK.qu6DyogYfViFT9evSAyk57._2vEEvi0wsP.ECCJrtpiNspuOyxS4jepzjUTY.HgTBI842mN_jX_06Oauag_8AylYFh63jypUbritD9gwKvnPmWtfjSLIlZIAiOKlA_mMZeF5imdL.GepxB7FCmA7XgKkTaEWfwZN3qCg6exb9rTFp4ey2Ia1LDNUC2wCHpXsk3lLIO9ZJqYG7VA0GTKVH0glBVdcp8GoyAfr8qSeqk0B86HP9sPCsiqcRteBxYYnopUrAiR54XdvSi5pyji4mfaOem.YuftpKi1AW3O93p7J_IOC.YFLJ5DdIJ4dZJskyfZNiT2hwjtJl7ujO1h71lZGR8jKXNJFuripZDPE198kYQNhpy5CMrNRbB6TuRBYujAlBVjDYrpBwT4fSeomC3Cfneywhh0aVWtx6EsHbFLpAm0UxiuWOHBAEAUFghUU9.JYNUPWKdORqBUrU5AP_hccl1shEnGSFXg.RqOvtX5yoZu828Pn3YKQxVU2QBm0LUrdwu4sYKDlodoT65moJeRQ2ynqST8aGrpgAaun9lHvkxJ85Cg5a.aic3J3JZLqv3K3QpkuQ9dxmAsj8PDJBEotayzTi44txPJOAUOyqIV_1vNmXTVSVHisQuY8hZB4ip4jNBJvWTz7_5NEXQTjmpWy6omTkBxBCoBWVLABYKXmoD21dKEI55pF7atRF3CuV7kDEG9PCyxQbhdgLRlSM3fbMtNlCXo0jcNTEDyP9NzAz0fjjmQVuwzp.b5b55W3ZUOS6W6ZIQ4zHypoMRDe6BGGS_XfeUCpfq70iwSEbk_xmtOjQkKt5mgUwVRDzOjLSzDl4gG2rVrQEWdaRybwhK8SoVOFbVJ00bkgPpxGI56lqNnxqqgkpcbf0W_kZRQghYBIPvgAT9HW9AuFOIkVe03gYCrE3gv.LzCuV8tyOrU3h4MPf5pRmIIP_wFyoTvDoNSWuoL5Vz.3HHRJLw6sSlXYJnuWUYgmzy7JzVhY9MvCILIpU61n.VIYMIj5M2sO1aJ0J5ZxdURUcCzfWQWQdeC.AjMUKFk_ahPC7griNrZg6WT5L3bk7I7xYEpJrqMh35xXUJhFjA3eoOlBDBmNhv8WUInOuOLSprpaj4ywyUIdBe5prmm2MbZOt.YXtBc2TZpAVdDa_VGzZ817CkENLU9qaxACZ0e9HxeaZTg_m6dW7UuXeqmUG5r4mqT_wgvEGMoaw2DNMXgumt5BjCgFPjWGfBC8GueqnfwyYked6yfG7OFwVl3dJduV3yEv894rL_UKxZRVSBQrx4ciwH1YGWLn8YfsM99MDx0u8tWTbKeIqa1sSY2x4Zjda4MPPJJnfbkOkcKsBByY6dduLQe_KituEJsidxMV.SJMmwz9QLVBIIHi2NbIb.7hj6qY.g0TEEOtUb0dA0EsTLocchmIQnr4zFx4
VGFsNZ3TjYy.WLB14qqnK_mpO6pVnh4v7wZYqhY3T.ERm3jtXimja.EEtf_f0Ba3.7zQywdV6BaZF50OK0QrtZ_DfEQiaGAWqG401IXm_42eo5.BfWCxz2nhAjqgsyWOnh29jLAi7U2ubcstveSVRBDlCac6f5A1_H4pTVH8p7eckui3iEVbFyvWcs3I8teq9QCKD1Cs5P2wMCJ0u4EWILAVnUCxC8MZKpDZAhZ9kZMpKhLRpbs62XB9iFEzGcZknuS0p33iupXcjtueui5fs9qmskzjyg",
                };
                var cpo = document.createElement("script");
                cpo.src = "/cdn-cgi/challenge-platform/h/b/orchestrate/chl_page/v1?ray=8d841b98afcc0650";
                window._cf_chl_opt.cOgUHash = location.hash === "" && location.href.indexOf("#") !== -1 ? "#" : location.hash;
                window._cf_chl_opt.cOgUQuery = location.search === "" && location.href.slice(0, location.href.length - window._cf_chl_opt.cOgUHash.length).indexOf("?") !== -1 ? "?" : location.search;
                if (window.history && window.history.replaceState) {
                    var ogU = location.pathname + window._cf_chl_opt.cOgUQuery + window._cf_chl_opt.cOgUHash;
                    history.replaceState(null, null, "\/index.php?__cf_chl_rt_tk=BUZVra4deaH_JcZ4B0YBDzMgl82Cy1pgw0rcIGkKlFI-1729879243-1.0.1.1-7DuWua2qk_4BRNqhajYekpxfhY3qrdz9GTCphNZww3E" + window._cf_chl_opt.cOgUHash);
                    cpo.onload = function () {
                        history.replaceState(null, null, ogU);
                    };
                }
                document.getElementsByTagName("head")[0].appendChild(cpo);
            })();
        </script>
    </body>
</html>
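
As a rough illustration of how a page like the one above could be recognized before the form lookup fails, here is a small standard-library sketch. The function names are hypothetical and the markers are taken only from this one dump, so they may not match every Cloudflare challenge page.

```python
import re

# Markers copied from the challenge page above; other challenge variants may differ.
CHALLENGE_MARKERS = (
    "window._cf_chl_opt",
    'id="challenge-error-text"',
    "<title>Just a moment...</title>",
)


def is_cloudflare_challenge(html):
    """Return True if the HTML looks like the Cloudflare challenge page."""
    return any(marker in html for marker in CHALLENGE_MARKERS)


def challenge_details(html):
    """Pull a few diagnostic fields (zone, challenge type, ray ID) out of _cf_chl_opt."""
    details = {}
    for key in ("cZone", "cType", "cRay"):
        match = re.search(rf'\b{key}:\s*"([^"]*)"', html)
        if match:
            details[key] = match.group(1)
    return details
```

Run against the dump above, `challenge_details` would pick up the `cZone`, `cType`, and `cRay` values, which could go into an error message or a bug report.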

esqew commented 4 weeks ago

Thanks for this PR! I'd like to take a deeper dive into the issue and this PR before approving, but preliminarily I don't have an issue with this.

j-andrews7 commented 3 weeks ago

Went ahead and rebased this to merge into a new v0.4.0 branch since it's a bit more of a fundamental change. If we get this rolled in and fix #92, #94, and #90, I'll be pretty happy with it and push a new version to PyPI.

seankim658 commented 3 weeks ago

Once this is rolled into the v0.4.0 branch, I'm happy to open a PR for #92. I can also take care of #90.

Building the Sphinx docs locally worked fine, so I'm not sure how to debug that without more info on what is going wrong there.

seankim658 commented 3 weeks ago

Just tried re-running the test cases locally and they passed. One thing I remember is that over the summer I noticed I would get blocked when running inside any type of non-local environment. I know for sure that running the login function from inside a Docker container gets blocked, so I'm wondering if this is a similar issue. I can look into it this week and, at the very least, add some interception handling for a more descriptive error message.
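
A minimal sketch of what that interception handling might look like, assuming the check runs on the status code and body of the response to the first request. `CloudflareBlockedError` and `ensure_not_blocked` are hypothetical names, not part of kenpompy.

```python
class CloudflareBlockedError(RuntimeError):
    """Raised when the response is a Cloudflare challenge instead of the real page."""


def ensure_not_blocked(status_code, html):
    """Return the HTML unchanged, or raise a descriptive error if it is a challenge page."""
    # Markers taken from the challenge page quoted earlier in this thread.
    blocked = "window._cf_chl_opt" in html or "challenge-error-text" in html
    if blocked:
        raise CloudflareBlockedError(
            f"Cloudflare returned a JavaScript challenge (HTTP {status_code}) "
            "instead of the requested page. This has been seen when running "
            "from non-local environments such as Docker containers."
        )
    return html
```

Calling this right after the initial request would turn an opaque downstream failure into a message that names the likely cause.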

j-andrews7 commented 3 weeks ago

I believe the issue in this case is that the repo secrets for Actions are unavailable because this is being run from a fork, and GitHub does not expose secrets to workflows triggered from forks for security reasons.

Regardless, I am going to merge this for now so we can deal with the other issues.
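
One way to keep fork-triggered runs from failing outright would be to skip the login-dependent tests when the credentials are missing. The sketch below uses `unittest` from the standard library; the environment variable names `KENPOM_EMAIL` and `KENPOM_PASSWORD` are assumptions, not necessarily what this repo's workflow uses.

```python
import os
import unittest


def credentials_from_env(env):
    """Return (email, password) if both are set and non-empty, else None."""
    email = env.get("KENPOM_EMAIL")        # assumed variable name
    password = env.get("KENPOM_PASSWORD")  # assumed variable name
    if email and password:
        return email, password
    return None


CREDENTIALS = credentials_from_env(os.environ)


@unittest.skipUnless(CREDENTIALS, "kenpom credentials not available (e.g. PR from a fork)")
class LoginTests(unittest.TestCase):
    def test_credentials_are_usable(self):
        email, password = CREDENTIALS
        self.assertTrue(email)
        self.assertTrue(password)
```

With this in place a run from a fork would report the tests as skipped rather than failed, which makes it easier to tell a secrets problem apart from a real regression.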