juliomalegria / python-craigslist

Simple Craigslist wrapper
MIT No Attribution

Error 403 - Forbidden for url: https://www.craigslist.org/about/sites #105

Open luisandrecunha opened 3 years ago

luisandrecunha commented 3 years ago

Hi Julio,

I have used your code before (early 2020), but now I'm getting the error below when trying to import CraigslistHousing, using "from craigslist import CraigslistHousing":

HTTPError: 403 Client Error: Forbidden for url: https://www.craigslist.org/about/sites

I'm not sure why, but it seems it could be related to this issue: https://stackoverflow.com/questions/16627227/http-error-403-in-python-3-web-scraping.

Do you happen to know why this is happening?

Thanks,

irahorecka commented 3 years ago

Seems like this works on my end. Did you upgrade python-craigslist to the latest version? I have a feeling this issue is unrelated to the package version, but upgrading doesn't hurt.

luisandrecunha commented 3 years ago

Yep, I did the upgrade and continue to have the same issue. I'm using v1.1.0 and Python 3.6 in a Google Colab notebook.

irahorecka commented 3 years ago

Ah, this looks to be a problem with the requests library in your environment, not python-craigslist, per se. I'm guessing the same exception would be thrown if you executed this:

import requests
requests.get("https://www.craigslist.org/about/sites")

luisandrecunha commented 3 years ago

You are completely right, I also tried in a new colab and got "<Response [403]>"

If I run the code below, I get a successful response and the page source. I believe it's related to the web scraping issue described on that page.

from urllib.request import Request, urlopen
req = Request('https://www.craigslist.org/about/sites', headers={'User-Agent': 'XYZ/3.0'})
webpage = urlopen(req, timeout=10).read()

print(webpage)

juliomalegria commented 3 years ago

Thanks for reporting @luisandrecunha.

Interesting. Seems like Craigslist is blocking requests coming from your IP (or Google's Colab IPs). I'm guessing the IP hit a max number of requests per day/hour/minute.

Do you mind running the code suggested by @irahorecka but setting a User-Agent like you did with urllib:

import requests
requests.get("https://www.craigslist.org/about/sites", headers={'User-Agent': 'python-craigslist/1.1.0'})

If this works fine, I'll add a default User-Agent to all requests to prevent this from happening in the future.

Thanks!

luisandrecunha commented 3 years ago

Hi @juliomalegria ,

It seems that Google's Colab IPs are blocked by Craigslist... I successfully ran the code in a local Jupyter notebook and it worked like a charm.

I tried the code you suggested in Colab and continued to get the 403 response... However, I receive the right page if I use the code below. I'm not sure whether the code could somehow be adapted.

from urllib.request import Request, urlopen
req = Request('https://www.craigslist.org/about/sites', headers={'User-Agent': 'XYZ/3.0'})
webpage = urlopen(req, timeout=10).read()

print(webpage)

Thank you again,

jraVette commented 3 years ago

Just a heads up, I've got the exact same issue. I've been running my code for more than a year and this just happened this week. So, something must have changed on the craigslist side? I'll have to dig into the code. I can cut and paste the url into a browser and it works fine. Just wanted to let you know of another user with the same issues.

>>> import requests
>>> requests.get('https://boston.craigslist.org')
<Response [200]>
>>> requests.get('https://boston.craigslist.org/search')
<Response [403]>
>>> requests.get('https://boston.craigslist.org/search',headers={'User-Agent': 'XYZ/3.0'})
<Response [403]>

I tried it on a couple of computers, so I don't think it's IP related. My guess is that it has to do with how the servers are seeing the 'requests' library versus a regular library.
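The comparison above can be repeated with the standard library to see whether the client or the User-Agent changes the result. This is only a rough sketch: `status_for` and `compare_user_agents` are hypothetical helper names, and the `opener` parameter is an assumption added so the logic can be exercised without hitting Craigslist.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def status_for(url, user_agent=None, opener=urlopen, timeout=10):
    """Return the HTTP status code for url.

    With user_agent=None no User-Agent is set on the Request, so urllib
    falls back to its own default when the request is actually sent.
    """
    headers = {"User-Agent": user_agent} if user_agent else {}
    try:
        with opener(Request(url, headers=headers), timeout=timeout) as response:
            return response.status
    except HTTPError as error:
        # urllib raises on 4xx/5xx responses; the code is what we compare.
        return error.code


def compare_user_agents(url, user_agents, opener=urlopen):
    """Map each User-Agent value to the status code it received."""
    return {agent: status_for(url, agent, opener) for agent in user_agents}
```

For example, `compare_user_agents('https://boston.craigslist.org/search', [None, 'XYZ/3.0'])` would make two live requests and return one status code per User-Agent.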

Thanks!

juliomalegria commented 3 years ago

Hey everyone! Sorry for the inactivity. I've released a new version (1.1.1) adding a User-Agent to requests.get. Hopefully that will solve the issue, please report back if it does or doesn't. If it doesn't I'll have to change libraries to urllib. Thanks!
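For anyone curious what the urllib fallback mentioned above might look like, here is a minimal sketch based on the snippet posted earlier in this thread. It is not the library's actual code: `build_request`, `fetch`, and the `DEFAULT_USER_AGENT` value are hypothetical names chosen for illustration.

```python
from urllib.request import Request, urlopen

# Hypothetical default value; not necessarily what the library sends.
DEFAULT_USER_AGENT = "python-craigslist/1.1.1"


def build_request(url, user_agent=DEFAULT_USER_AGENT, extra_headers=None):
    """Return a urllib Request carrying a User-Agent plus any extra headers."""
    headers = {"User-Agent": user_agent}
    if extra_headers:
        headers.update(extra_headers)
    return Request(url, headers=headers)


def fetch(url, timeout=10, **kwargs):
    """Fetch a page body as bytes; urllib raises HTTPError on a 403."""
    with urlopen(build_request(url, **kwargs), timeout=timeout) as response:
        return response.read()
```

Keeping the request construction separate from the network call makes it possible to inspect the headers without sending anything.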

cwittwer commented 3 years ago

I am still getting the 403 error with the updated utils.py.

KeeonTabrizi commented 3 years ago

+1, seeing the same behavior: 403s on /search paths from a plain requests.get() call, so the library/class is not functioning either.

Also note that I copied the headers from the cURL request to /search (which loads in a regular browser) and used them for the requests call, which was also blocked.

I used a selenium driver I had with some mods I've used in the past and I was able to load /search just fine so I don't suspect they are doing something super sophisticated to block the request.

KeeonTabrizi commented 3 years ago

Okay, I've dug into it a bit more. I don't think this has anything to do with user agents or anything else they are blocking like that. I recommend upgrading both the requests and urllib3 libraries: pip install urllib3 --upgrade, then pip install requests --upgrade. Once I did that, things started working again. I'm not sure what the actual issue is, as older versions of those libraries were working, but with the updates it looks fine to me.

After I did that, I tested the library's request function (which is effectively requests.get()) and it works:

import requests
import urllib3
from craigslist import utils

>>> requests.__version__
'2.25.1'

>>> urllib3.__version__
'1.26.3'

>>> utils.requests_get('https://boston.craigslist.org/search')
<Response [200]>

juliomalegria commented 3 years ago

Thanks @KeeonTabrizi! That's a very good point. I've updated the requirements to include minimum versions for the dependencies (requests and beautifulsoup4). Can anyone having issues try updating their library (pip install python-craigslist --upgrade) and let me know if this fixes the issue? Thanks again!

usctzen commented 3 years ago

Hey guys.

I am not a power user, but I have found that the latest idna version is incompatible with requests. If you installed the latest idna, just upgrade requests and it will revert idna to a compatible version. I don't know whether this is the cause of your troubles, but it could be a factor.

Hope this helps.

jraVette commented 3 years ago

Hey y'all, thanks so much for taking the time to fix this! It could just be how my packages were managed, but when I ran pip install python-craigslist --upgrade, it updated requests but not urllib3. I guess urllib3 is used by requests. So upgrading python-craigslist alone did not work, but after updating both requests and urllib3 to the latest versions, I'm back up and running! Maybe consider adding urllib3 to the requirements? Thanks again!!

These versions are what got my code working:

>>> requests.__version__
'2.25.1'
>>> urllib3.__version__
'1.26.3'

PS. Great module, it's helped me get some great deals on Craigslist!

cwittwer commented 3 years ago

+1 this fixed everything. Good catch!

irahorecka commented 3 years ago

@cwittwer, @jraVette, @usctzen, @KeeonTabrizi, @luisandrecunha If you guys are interested in a new Craigslist API format, check out pycraigslist. I enjoy python-craigslist, but there were some features I wanted to implement immediately. Some additional features are in the works.

usctzen commented 3 years ago

Thanks, I'll check it out.

usctzen commented 3 years ago

Ira,

Just gave it a quick try and I am getting an error. The script finds forsale.mca but does not recognize forsale.mcy (mca is all motorcycles; mcy is motorcycles by owner).

Traceback (most recent call last):
  File "C:/Users/mgpd/PycharmProjects/molivo/py_clist.py", line 3, in <module>
    print(pycraigslist.forsale.mcy.get_filters())
AttributeError: type object 'forsale' has no attribute 'mcy'

Marc @usctzen

irahorecka commented 3 years ago

Hey @usctzen, I always appreciate your feedback. Could you post the same issue in pycraigslist issues? I’ll address it there :)

usctzen commented 3 years ago

Sure thing!

juliomalegria commented 3 years ago

Hey everyone! Sorry for the delay. I've updated the requirements in 88a6b73 and pushed a new version to PyPI. Could anyone confirm whether this fixes the issue? Thanks for all the patience!

Agwebberley commented 1 year ago

I am still having this issue