shedloadofcode / shedloadofcode-comments

Comments repository for shedloadofcode.com - for use with Utterances

blog/how-to-scrape-autotrader-with-python-and-selenium-to-search-for-multiple-makes-and-models #11

Open utterances-bot opened 3 months ago

utterances-bot commented 3 months ago

How to scrape AutoTrader with Python and Selenium to search for multiple makes and models | Shedload Of Code

Take this new AutoTrader UK web scraper for a spin! It can search for and filter multiple makes and models to help you easily compare and make the right decision quicker.

https://www.shedloadofcode.com/blog/how-to-scrape-autotrader-with-python-and-selenium-to-search-for-multiple-makes-and-models/

lionelfernandes commented 3 months ago

I guess they have changed the site's html again?

shedloadofcode commented 3 months ago

Yes, it looks like there was a small change: the "Page 1 of 3" text at the top of the page changed when launched in WebDriver, so the scraper could not identify the maximum number of pages. I will be adding a fix today in the number_of_pages line to target the next page arrow element at the bottom of the page instead 😊

Huwee2 commented 2 months ago

Nice!

To work with the latest version of xlsxwriter, I've updated: writer.save() to: writer.close()

In the case where there aren't multiple pages, I've updated this a little:

    pagination_next_element = content.find("a", attrs={"data-testid": "pagination-next"})
    if pagination_next_element is not None:
        number_of_pages = pagination_next_element.get("aria-label")[-1]
    else:
        number_of_pages = 1

I've also added a filter for a particular trim to the search URL with: "&aggregatedTrim={car['trim']}&"

And in the definition of cars, you can just leave the trim as an empty string if you aren't specifying one.
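A minimal sketch of how an optional trim could be handled when building the search URL. Only the aggregatedTrim parameter comes from the comment above; the base URL, the other parameter names, the postcode and the build_search_url helper are assumptions made for illustration and may not match the blog's script.

```python
from urllib.parse import urlencode

# Assumed base URL; the real script may build its URL differently.
BASE_URL = "https://www.autotrader.co.uk/car-search"


def build_search_url(car: dict, postcode: str) -> str:
    """Build a search URL, adding the trim filter only when a trim is given."""
    # "postcode", "make" and "model" are assumed parameter names.
    params = {
        "postcode": postcode,
        "make": car["make"],
        "model": car["model"],
    }
    trim = car.get("trim", "")
    if trim:
        # aggregatedTrim is the parameter named in the comment above.
        params["aggregatedTrim"] = trim
    return f"{BASE_URL}?{urlencode(params)}"


cars = [
    {"make": "Volkswagen", "model": "Golf", "trim": "GTI"},
    {"make": "Toyota", "model": "Yaris", "trim": ""},  # empty trim: no filter
]
for car in cars:
    print(build_search_url(car, "M1 1AA"))
```

Unlike the f-string approach, this drops the parameter entirely when the trim is empty instead of sending an empty aggregatedTrim value, and urlencode takes care of escaping spaces in makes, models or trims.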

Huwee2 commented 2 months ago

Something else that can be handy to output to Excel is the "subtitle" of the ad:

    try:
        subtitle = article.find("p", attrs={"data-testid": "search-listing-subtitle"}).text
        details["variant"] = subtitle
    except AttributeError:
        # find() returns None when the element is missing, so .text raises AttributeError
        print("Subtitle not found.")

This provides details on the variant (e.g. for an electric car, the battery capacity in kWh).

R2Squared commented 3 days ago

This is great, thank you for your work on this.

I tried the updated code, but the maximum page count is still hampering me: I can only scrape a few pages at most.

R2Squared commented 2 days ago

This code fixed the problem for me (it assumes import re at the top of the script):

    try:
        pagination_next_element = content.find("a", attrs={"data-testid": "pagination-next"})
        aria_label = pagination_next_element.get("aria-label")
        number_of_pages = int(re.search(r'of (\d+)', aria_label).group(1))
    except AttributeError:
        print("No results found or couldn't determine number of pages.")
        continue
    except Exception as e:
        print(f"An error occurred while determining number of pages: {e}")
        continue
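For anyone wondering why the earlier [-1] version stopped after a few pages: [-1] takes only the last character of the aria-label, so a label ending in "13" yields "3". The regex reads the whole number instead. Below is a small standalone sketch of that parsing step that can be tried without Selenium. The parse_number_of_pages helper and the example label text "Page 2 of 13" are assumptions for illustration; the real aria-label on the site may be worded differently.

```python
import re
from typing import Optional


def parse_number_of_pages(aria_label: Optional[str]) -> int:
    """Return the total page count from a pagination aria-label, or 1 if unknown."""
    if not aria_label:
        # No pagination element or no label: treat as a single page of results.
        return 1
    # Capture every digit after "of ", not just the last character.
    match = re.search(r"of (\d+)", aria_label)
    return int(match.group(1)) if match else 1


label = "Page 2 of 13"  # assumed label format
print(label[-1])                     # last character only
print(parse_number_of_pages(label))  # full number after "of "
print(parse_number_of_pages(None))   # fallback when there is no label
```

One difference from the snippet above: this falls back to a single page when the label is missing or unparseable rather than skipping the search with continue. Which behaviour is preferable depends on whether a missing arrow means "one page" or "no results" for your search.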