thematticusfits / testscode


What are you trying to scrape? #1

Open tunahorse opened 1 year ago

tunahorse commented 1 year ago

Please provide screenshots and the HTML you are trying to scrape.

thematticusfits commented 1 year ago

https://www.linkedin.com/company/medtronic/people/?keywords=device%20sales

[Screenshot: CleanShot 2023-06-21 at 14 59 33]

I want to grab each person's name and the link to their profile.

[Screenshot: CleanShot 2023-06-21 at 15 01 39]

tunahorse commented 1 year ago

Okay, using your code I get a 999 status, meaning LinkedIn is saying stop. With headers set, I get redirected to the login page instead. Two options:

1. Rotate IPs (complicated).
2. Use the API.

```python
import scrapy


class PeopleScraper(scrapy.Spider):
    name = "people_scraper"
    allowed_domains = ["linkedin.com"]
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
    }

    def start_requests(self):
        # Start URL: swap "medtronic" for another company name or ID,
        # and change "?keywords=..." to filter on a different keyword.
        start_url = 'https://www.linkedin.com/company/medtronic/people/?keywords=device%20sales'

        # handle_httpstatus_all is a Request.meta key, not a spider attribute,
        # so set it here to let non-2xx responses (such as 999) reach parse().
        yield scrapy.Request(
            url=start_url,
            headers=self.headers,
            meta={"handle_httpstatus_all": True},
            callback=self.parse,
        )

    def parse(self, response):
        # Log the status so a 999 block or a login redirect is visible.
        self.logger.info("Got status %s for %s", response.status, response.url)

        # Extract the HTML elements that contain the name and URL for each person.
        for person in response.css('.search-result__info'):
            name = person.css('.ember-view.lt-line-clamp.lt-line-clamp--single-line.org-people-profile-card__profile-title.t-black::text').get()
            url = person.css('a::attr(href)').get()

            # Skip cards with no name or link instead of crashing on None.
            if not name or not url:
                continue

            # Strip extra whitespace and convert the relative URL to an absolute one.
            yield {'name': name.strip(), 'url': response.urljoin(url)}
```
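Since the spider lets every status through to `parse`, it helps to tell a block apart from a real page before running the selectors. Below is a minimal sketch of that check, not part of the original code. The helper name `classify_response` and the login path hints are assumptions for illustration; only the 999 status and the login redirect come from this thread.

```python
from urllib.parse import urlparse

# 999 is the non-standard status reported earlier in this thread.
BLOCKED_STATUS = 999
# Assumption: path prefixes that suggest a login or auth wall. Adjust to
# whatever the redirect target actually looks like.
LOGIN_PATH_HINTS = ("/login", "/authwall", "/uas/login")


def classify_response(status: int, final_url: str) -> str:
    """Return 'blocked', 'login_redirect', or 'ok' for a fetched page."""
    if status == BLOCKED_STATUS:
        return "blocked"
    # Compare only the URL path, so query strings do not affect the match.
    path = urlparse(final_url).path.lower()
    if any(path.startswith(hint) for hint in LOGIN_PATH_HINTS):
        return "login_redirect"
    return "ok"
```

Inside `parse`, this could be called as `classify_response(response.status, response.url)`, returning early for anything other than `"ok"` so that an empty result is not mistaken for a page with no people on it.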

thematticusfits commented 1 year ago

I had a feeling LinkedIn was blocking. So is this a LinkedIn API that people can use to access the site?