lorey / mlscraper

🤖 Scrape data from HTML websites automatically by just providing examples
https://pypi.org/project/mlscraper/

Is it possible to handle anti-scraping measures? #42

Closed: omaiyiwa closed this issue 1 year ago

omaiyiwa commented 1 year ago

If a website has anti-scraping measures (even though response.status_code is 200), can we still train the model on content that was extracted from it manually? (I am a beginner.)

lorey commented 1 year ago

Hi @omaiyiwa. mlscraper uses whatever response you provide, so how you fetch the page is up to you. You can even use a full browser driven by Playwright. Just load the resulting HTML into a Page object afterwards and you're good to go.

lorey commented 1 year ago

For example:

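from mlscraper.html import Page
from mlscraper.samples import Sample, TrainingSet
from mlscraper.training import train_scraper

# dict mapping each URL to the output you expect for that page
expected_output_by_url = ...

training_set = TrainingSet()
for url in expected_output_by_url:
    # some sophisticated method to fetch html here (placeholder, not part of mlscraper)
    html = load_with_super_stealth_methods(url)
    # wrap the html in a Page and pair it with the expected output
    training_set.add_sample(Sample(Page(html), expected_output_by_url[url]))

# train the scraper with the created training set
scraper = train_scraper(training_set)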
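
As a rough illustration of what the load_with_super_stealth_methods placeholder could look like, here is a minimal sketch that renders the page in headless Chromium via Playwright's sync API. The function names fetch_rendered_html and fetch_html_by_url are made up for this example and are not part of mlscraper. A plain headless browser may still be blocked by some anti-scraping measures, so treat this as a starting point only.

```python
from typing import Callable, Dict, Iterable


def fetch_rendered_html(url: str) -> str:
    """Return the HTML of `url` after rendering it in headless Chromium."""
    # Imported lazily so the helper below also works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            page = browser.new_page()
            page.goto(url)
            # content() returns the full HTML of the rendered page
            return page.content()
        finally:
            browser.close()


def fetch_html_by_url(
    urls: Iterable[str],
    fetch: Callable[[str], str] = fetch_rendered_html,
) -> Dict[str, str]:
    """Fetch every URL with `fetch` and return a {url: html} mapping."""
    return {url: fetch(url) for url in urls}
```

The fetch function is passed in as a parameter so it can be swapped for requests, a proxy-backed client, or anything else without touching the training code.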

Another option would be to just store the HTML of the pages you fetch and then apply mlscraper to the stored pages. This also makes it a lot easier to debug if something goes wrong.
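
A minimal sketch of that store-then-train approach, assuming the pages are written to a local directory and keyed by a hash of their URL. The helper names (html_path, store_html, load_html, build_training_set) are hypothetical and only serve this example. The mlscraper imports follow the project README, and Page is assumed to accept the stored HTML string.

```python
import hashlib
from pathlib import Path
from typing import Dict


def html_path(directory: Path, url: str) -> Path:
    """Map a URL to a stable file name inside `directory`."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return directory / f"{digest}.html"


def store_html(directory: Path, url: str, html: str) -> Path:
    """Write the fetched HTML to disk and return the file path."""
    directory.mkdir(parents=True, exist_ok=True)
    path = html_path(directory, url)
    path.write_text(html, encoding="utf-8")
    return path


def load_html(directory: Path, url: str) -> str:
    """Read back the HTML previously stored for `url`."""
    return html_path(directory, url).read_text(encoding="utf-8")


def build_training_set(directory: Path, expected_output_by_url: Dict[str, dict]):
    """Create a TrainingSet from stored pages (requires mlscraper)."""
    # Imported lazily so the storage helpers work without mlscraper installed.
    from mlscraper.html import Page
    from mlscraper.samples import Sample, TrainingSet

    training_set = TrainingSet()
    for url, expected in expected_output_by_url.items():
        page = Page(load_html(directory, url))
        training_set.add_sample(Sample(page, expected))
    return training_set
```

Fetching and training are decoupled this way: each page only has to be fetched once, and the stored files can be opened directly to check what the site actually returned.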