Closed omaiyiwa closed 1 year ago
Hi @omaiyiwa. mlscraper works on whatever response you provide, so how you fetch the page is up to you. You can even use a fully emulated browser with Playwright. Just load the resulting HTML into a Page
object afterwards and you're good to go.
For example:
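from mlscraper.html import Page
from mlscraper.samples import Sample, TrainingSet
from mlscraper.training import train_scraper

expected_output_by_url = ...
training_set = TrainingSet()
for url in urls:
    # some sophisticated method to fetch html here
    html = load_with_super_stealth_methods(url)
    # pair the fetched page with the output you expect for it
    sample = Sample(Page(html), expected_output_by_url[url])
    training_set.add_sample(sample)

# train the scraper with the created training set
scraper = train_scraper(training_set)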
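As a rough idea of what the load_with_super_stealth_methods placeholder could look like with Playwright, here is a minimal sketch. The function name fetch_rendered_html is made up for illustration, and it assumes Playwright and its Chromium build are installed (pip install playwright, then playwright install chromium). Whether a plain headless browser is enough to get past a given site's anti-scraping measures is not guaranteed.

```python
def fetch_rendered_html(url: str) -> str:
    """Return the page's HTML after a headless Chromium has loaded it.

    Hypothetical helper, not part of mlscraper.
    """
    # Imported lazily so this file still loads where Playwright is missing.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            page = browser.new_page()  # Playwright's page, not mlscraper's Page
            page.goto(url)
            return page.content()  # serialized HTML of the loaded document
        finally:
            browser.close()
```

The returned string is what you would hand to mlscraper's Page(html) inside the loop.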
Another option would be to store the HTML of each page you fetch and then run mlscraper on the stored pages. This also makes it a lot easier to debug if something goes wrong.
Even though the response.status_code is 200, can we still train the model based on the manually extracted content from a website that has anti-scraping measures? (I am a beginner)