nrsyed / proboards-scraper

Package demonstrating how to web scrape a ProBoards forum
https://nrsyed.github.io/proboards-scraper
MIT License

AttributeError when attempting to scrape forum on the boards.net domain #43

Closed wertercatt closed 1 year ago

wertercatt commented 2 years ago

    [wertercatt@wertserv proboards-scraper]$ pbs https://letssosl.boards.net
    Traceback (most recent call last):
      File "/home/wertercatt/.local/bin/pbs", line 8, in <module>
        sys.exit(pbs_cli())
      File "/home/wertercatt/.local/lib/python3.10/site-packages/proboards_scraper/__main__.py", line 115, in pbs_cli
        proboards_scraper.run_scraper(
      File "/home/wertercatt/.local/lib/python3.10/site-packages/proboards_scraper/core.py", line 102, in run_scraper
        base_url, url_path = split_url(url)
      File "/home/wertercatt/.local/lib/python3.10/site-packages/proboards_scraper/scraper/utils.py", line 46, in split_url
        base_url, path = match.groups()
    AttributeError: 'NoneType' object has no attribute 'groups'

nrsyed commented 2 years ago

Thanks. The regex in split_url currently assumes a .com TLD, so the URL fails to match and `match` is None. It would probably make sense to take the TLD from the URL string instead of hardcoding it, as is currently done in https://github.com/nrsyed/proboards-scraper/blob/main/proboards_scraper/scraper/utils.py#L44:

    expr = r"(^.*\.com)(/.*)?$"

psbdmp commented 2 years ago

I am encountering the same error. How do I fix it? Thank you.

nrsyed commented 2 years ago

I've pushed a branch, dev/url_domain_fix (#44), that should address the issue. Please test it and let me know if it works.
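
For anyone reading along, here is a rough sketch of the kind of change discussed above. It is not the code from dev/url_domain_fix. It only illustrates replacing the hardcoded `.com` with a pattern that accepts any host. The name `ANY_TLD_EXPR`, the requirement of an `http://` or `https://` scheme, and the `ValueError` are assumptions made for this example.

```python
import re
from typing import Optional, Tuple

# Pattern quoted earlier in this thread; it only matches hosts ending in ".com".
COM_ONLY_EXPR = r"(^.*\.com)(/.*)?$"

# Hypothetical replacement: a scheme, then a host (anything up to the first
# "/"), then an optional path. No TLD is hardcoded.
ANY_TLD_EXPR = r"^(https?://[^/]+)(/.*)?$"


def split_url(url: str) -> Tuple[str, Optional[str]]:
    """Split a forum URL into (base_url, path); path is None if absent."""
    match = re.match(ANY_TLD_EXPR, url)
    if match is None:
        # Fail with a readable message instead of the AttributeError
        # raised by calling .groups() on None.
        raise ValueError(f"Could not parse forum URL: {url!r}")
    base_url, path = match.groups()
    return base_url, path


if __name__ == "__main__":
    # The old pattern does not match a boards.net URL at all.
    print(re.match(COM_ONLY_EXPR, "https://letssosl.boards.net"))
    # The sketch above splits it into a base URL and no path.
    print(split_url("https://letssosl.boards.net"))
```

With this sketch, `split_url("https://letssosl.boards.net")` returns `("https://letssosl.boards.net", None)`, while the original pattern returns no match for the same URL, which is what leads to the AttributeError in the first traceback.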

psbdmp commented 2 years ago

> I've pushed a branch, dev/url_domain_fix (#44), that should address the issue. Please test it and let me know if it works.

Thanks. I'm now receiving the following error:

    [05:37:16][INFO][proboards_scraper.core] Logging in to https://kittenswork.boards.net
    [1025/053726.514:INFO:CONSOLE(93)] "Uncaught ReferenceError: proboards is not defined", source: https://kittenswork.boards.net/ (93)
    [1025/053726.555:INFO:CONSOLE(55)] "Uncaught ReferenceError: $ is not defined", source: https://kittenswork.boards.net/ (55)
    [1025/053727.878:INFO:CONSOLE(0)] "Error with Permissions-Policy header: Origin trial controlled feature not enabled: 'interest-cohort'.", source: (0)
    [1025/053728.146:INFO:CONSOLE(3)] "recaptchacompat disabled", source: https://cloudflare.hcaptcha.com/1/api.js?endpoint=https%3A%2F%2Fcloudflare.hcaptcha.com&assethost=https%3A%2F%2Fcf-assets.hcaptcha.com&imghost=https%3A%2F%2Fcf-imgs.hcaptcha.com&render=explicit&recaptchacompat=off&onload=_cf_chl_hload (3)
    Traceback (most recent call last):
      File "C:\Python310\lib\runpy.py", line 196, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "C:\Python310\lib\runpy.py", line 86, in _run_code
        exec(code, run_globals)
      File "C:\Python310\Scripts\pbs.exe\__main__.py", line 7, in <module>
      File "C:\Python310\lib\site-packages\proboards_scraper\__main__.py", line 115, in pbs_cli
        proboards_scraper.run_scraper(
      File "C:\Python310\lib\site-packages\proboards_scraper\core.py", line 107, in run_scraper
        cookies = get_login_cookies(
      File "C:\Python310\lib\site-packages\proboards_scraper\http_requests.py", line 101, in get_login_cookies
        email_input.send_keys(username)
    AttributeError: 'NoneType' object has no attribute 'send_keys'

    C:\Users\jack_\proboards-scraper>[1025/053729.869:INFO:CONSOLE(3)] "Request for the Private Access Token challenge.", source: (3)
    [1025/053729.870:INFO:CONSOLE(3)] "The next request for the Private Access Token challenge may return a 401 and show a warning in console.", source: (3)
    [1025/053729.894:INFO:CONSOLE(3)] "console.groupEnd", source: (3)

psbdmp commented 2 years ago

I'm sorry, this is my first GitHub issue. The output continues further:

    C:\Users\jack_\proboards-scraper>[1025/054120.126:INFO:CONSOLE(0)] "Error with Permissions-Policy header: Origin trial controlled feature not enabled: 'interest-cohort'.", source: (0)
    [1025/054120.269:INFO:CONSOLE(3)] "recaptchacompat disabled", source: https://cloudflare.hcaptcha.com/1/api.js?endpoint=https%3A%2F%2Fcloudflare.hcaptcha.com&assethost=https%3A%2F%2Fcf-assets.hcaptcha.com&imghost=https%3A%2F%2Fcf-imgs.hcaptcha.com&render=explicit&recaptchacompat=off&onload=_cf_chl_hload (3)
    [1025/054122.107:INFO:CONSOLE(3)] "Request for the Private Access Token challenge.", source: (3)
    [1025/054122.107:INFO:CONSOLE(3)] "The next request for the Private Access Token challenge may return a 401 and show a warning in console.", source: (3)
    [1025/054122.138:INFO:CONSOLE(3)] "console.groupEnd", source: (3)
    [1025/054123.588:INFO:CONSOLE(0)] "[.WebGL-000033280328A200]GL Driver Message (OpenGL, Performance, GL_CLOSE_PATH_NV, High): GPU stall due to ReadPixels", source: https://cf-assets.hcaptcha.com/captcha/v1/1f7dc62/static/hcaptcha.html#frame=challenge&id=08788u458bne&host=login.proboards.com&sentry=true&reportapi=https%3A%2F%2Faccounts.hcaptcha.com&recaptchacompat=off&custom=false&endpoint=https%3A%2F%2Fcloudflare.hcaptcha.com&hl=en&assethost=https%3A%2F%2Fcf-assets.hcaptcha.com&imghost=https%3A%2F%2Fcf-imgs.hcaptcha.com&tplinks=on&sitekey=33f96e6a-38cd-421b-bb68-7806e1764460&theme=light (0)
    [1025/054123.594:INFO:CONSOLE(0)] "[.WebGL-00003328044F0000]GL Driver Message (OpenGL, Performance, GL_CLOSE_PATH_NV, High): GPU stall due to ReadPixels", source: https://cf-assets.hcaptcha.com/captcha/v1/1f7dc62/static/hcaptcha.html#frame=challenge&id=16qga3x0cfxk&host=login.proboards.com&sentry=true&reportapi=https%3A%2F%2Faccounts.hcaptcha.com&recaptchacompat=off&custom=false&endpoint=https%3A%2F%2Fcloudflare.hcaptcha.com&hl=en&assethost=https%3A%2F%2Fcf-assets.hcaptcha.com&imghost=https%3A%2F%2Fcf-imgs.hcaptcha.com&tplinks=on&sitekey=33f96e6a-38cd-421b-bb68-7806e1764460&theme=light (0)

wertercatt commented 1 year ago

Works for me if I don't try to log in. I think that's a separate issue though. Might need a way to import cookies?
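
To make the cookie-import idea a bit more concrete, here is a minimal sketch of reading cookies exported from a browser in the Netscape `cookies.txt` format, using only the Python standard library. Nothing in this thread says the scraper supports this. The helper name `load_cookie_file` and the idea of handing the scraper a plain name-to-value dict are assumptions for illustration.

```python
from http.cookiejar import MozillaCookieJar
from typing import Dict


def load_cookie_file(path: str) -> Dict[str, str]:
    """Read a Netscape-format cookies.txt file into a name -> value dict.

    Hypothetical helper; not part of proboards-scraper.
    """
    jar = MozillaCookieJar(path)
    # Keep session cookies and already-expired cookies so the caller can
    # decide what to do with them.
    jar.load(ignore_discard=True, ignore_expires=True)
    return {cookie.name: cookie.value for cookie in jar}
```

The file has to start with the `# Netscape HTTP Cookie File` header line, which is what common browser cookie-export extensions write. How the resulting cookies would be passed into the scraper in place of the Selenium login step is left open here.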

wertercatt commented 1 year ago

Closing this issue; the other person's problems aren't related.