jkreucher / ReicheltAPI

A very simple Python web scraping module for Reichelt Elektronik
GNU General Public License v3.0

:bug: Search requests blocked by Reichelt? #1

Open penguineer opened 1 year ago

penguineer commented 1 year ago

When using the tool, I get the following error:

Traceback (most recent call last):
File "/home/tux/tmp/ReicheltAPI/reichelt.py", line 105, in <module>
result = app.search_part(sys.argv[1])
File "/home/tux/tmp/ReicheltAPI/reichelt.py", line 79, in search_part
results = self.get_search_results(part)
File "/home/tux/tmp/ReicheltAPI/reichelt.py", line 15, in get_search_results
website = urllib.request.urlopen(link).read().decode('utf-8')
File "/usr/lib/python3.9/urllib/request.py", line 214, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.9/urllib/request.py", line 523, in open
response = meth(req, response)
File "/usr/lib/python3.9/urllib/request.py", line 632, in http_response
response = self.parent.error(
File "/usr/lib/python3.9/urllib/request.py", line 561, in error
return self._call_chain(*args)
File "/usr/lib/python3.9/urllib/request.py", line 494, in _call_chain
result = func(*args)
File "/usr/lib/python3.9/urllib/request.py", line 641, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 503: Service Temporarily Unavailable

This is unfortunate, since I have the same problem in my own tool and was hoping to find a solution here.

From another debug session, I was able to extract these response headers:

Server: myracloud
Date: Thu, 05 Jan 2023 12:49:36 GMT
Content-Type: text/html
Transfer-Encoding: chunked
Connection: close
Cache-Control: no-cache, no-store, max-age=0
X-Frame-Options: SAMEORIGIN
X-Content-Type-Options: nosniff
X-Xss-Protection: 1; mode=block
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload

Seems that there is some additional protection now?

Curl, on the other hand, works:

curl -X POST -d "SEARCH: 1N4148" "https://www.reichelt.de/index.html?ACTION=446&LA=0"
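
For comparison, a minimal sketch of the same POST made from Python (assuming the requests library is available; the payload "SEARCH: 1N4148" and the URL parameters are copied verbatim from the curl call above and are not verified against what the site actually expects):

    import requests

    # Replicate the working curl call: POST the raw search string to the
    # search endpoint. Payload and parameters are taken as-is from curl.
    response = requests.post(
        "https://www.reichelt.de/index.html",
        params={"ACTION": "446", "LA": "0"},
        data="SEARCH: 1N4148",
    )
    print(response.status_code)
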
Hu1buerger commented 1 year ago

Confirming the behavior

Radyl commented 5 months ago

It can be fixed by modifying the request headers:

    req_url = f"https://www.reichelt.de/index.html?ACTION=446&LA=0&nbc=1&q={ keyword.lower().replace(' ', '%20') }"
    req_headers = {
        'Accept-Language': 'de,en-US;q=0.7,en;q=0.3'
    }
    website = requests.get(req_url, headers=req_headers).content.decode('utf-8')

I used Python requests here, but there is surely a way to set the headers with urlopen() as well.
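
For reference, roughly the same header change can be made with only the standard library (untested sketch, reusing the req_url and header value from above):

    import urllib.request

    # Build a Request object so the Accept-Language header can be attached,
    # then open it the same way the module already does.
    req = urllib.request.Request(
        req_url,
        headers={'Accept-Language': 'de,en-US;q=0.7,en;q=0.3'}
    )
    website = urllib.request.urlopen(req).read().decode('utf-8')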