nicolas-gervais / predicting-car-price-from-scraped-data

Picture and specifications scraper

Scraping high res photos #1

Closed fcakyon closed 4 years ago

fcakyon commented 4 years ago

Hello,

Firstly, thanks for sharing such a comprehensive dataset on car models!

You mentioned in the Reddit post that scraping high-res photos is possible by modifying the scraper. Would you mind telling me which element ID or GET request is related to the high-res photos, so that I can modify the scraper script accordingly?

Best

nicolas-gervais commented 4 years ago

Hi,

Thanks for reaching out. Essentially, for each of the 2,000ish cars, I'm fetching the entire HTML of its page as a string. Then I'm using a regex to pull out the URLs of up to 150 pictures on each page.

Each picture has 3 versions:

https://images.hgmsites.net/sml/2020-porsche-911-carrera_100688947_s.jpg
https://images.hgmsites.net/med/2020-porsche-911-carrera_100688947_m.jpg
https://images.hgmsites.net/lrg/2020-porsche-911-carrera_100688947_l.jpg

So all you have to do to get the URL for the large one is change this line in scrape.py:

for ix, photo in enumerate(re.findall('sml.+?_s.jpg', fetch_pics_url)[:150], 1):

To:

for ix, photo in enumerate(re.findall('lrg.+?_l.jpg', fetch_pics_url)[:150], 1):
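For anyone who wants to try the pattern outside of scrape.py, here is a minimal sketch of the same idea. The function name `extract_photo_urls`, the `IMAGE_HOST` constant, and the sample HTML string are made up for illustration and are not part of the repo; the pattern is also slightly stricter than the one above (it escapes the dot and requires the slash after the folder name).

```python
import re

# Assumption: image host taken from the example URLs in this thread.
IMAGE_HOST = "https://images.hgmsites.net/"


def extract_photo_urls(html, size="lrg", suffix="l", limit=150):
    """Return up to `limit` full image URLs of the given size found in `html`.

    Hypothetical helper, not a function from scrape.py.
    """
    # Non-greedy match from the size folder to the first matching suffix.
    pattern = re.escape(size) + r"/.+?_" + re.escape(suffix) + r"\.jpg"
    return [IMAGE_HOST + path for path in re.findall(pattern, html)[:limit]]


# Stand-in for the page HTML that the scraper fetches as a string.
sample_html = (
    '<img src="https://images.hgmsites.net/lrg/2020-porsche-911-carrera_100688947_l.jpg">'
    '<img src="https://images.hgmsites.net/sml/2020-porsche-911-carrera_100688947_s.jpg">'
)

urls = extract_photo_urls(sample_html)
print(urls)
```

Calling `extract_photo_urls(sample_html, size="sml", suffix="s")` would pick up the small version instead.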

Some even have a /hug/ version (for "huge"), but I noticed it isn't available for every picture.
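If you want the huge version where it exists, one option is to try it first and fall back to the large one. This is only a sketch: the helper names are hypothetical, and I'm assuming the huge file follows the same naming scheme with an `_h.jpg` suffix, which you should verify against a real page before relying on it.

```python
import urllib.request


def url_exists(url, timeout=10):
    """Return True if a HEAD request for `url` answers with status 200."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers URLError, HTTPError (e.g. 404) and socket timeouts.
        return False


def best_available(large_url, exists=url_exists):
    """Return the huge URL if it exists, otherwise the given large URL."""
    # Assumption: the huge version swaps /lrg/ for /hug/ and _l.jpg for _h.jpg.
    huge_url = large_url.replace("/lrg/", "/hug/").replace("_l.jpg", "_h.jpg")
    return huge_url if exists(huge_url) else large_url
```

The `exists` parameter is there so the check can be swapped out, for example to cache results or to test without hitting the network. Keep in mind that this adds one extra request per picture.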

Be careful with your bandwidth usage! The scraper initially downloads something like 200,000 photos and only deletes many of the interior pictures afterwards. That's going to be a download of nearly 20 GB.

nicolas-gervais commented 4 years ago

I made a mistake in my comment above. I've edited it now.

fcakyon commented 4 years ago

Thanks a lot for the fast response, this is exactly what I was asking for :)