lorey / mlscraper

🤖 Scrape data from HTML websites automatically by just providing examples
https://pypi.org/project/mlscraper/
1.32k stars 90 forks

Find and fix issue with github profile pages #23

Closed lorey closed 2 years ago

lorey commented 2 years ago

Test case added in c3427b79a09d4ea4595ab775f8c267364975b60c

jonashaag commented 2 years ago

This is still broken for me. What am I doing differently from your test case?

import requests
from mlscraper.html import Page
from mlscraper.samples import Sample, TrainingSet
from mlscraper.training import train_scraper

jonas_url = "https://github.com/jonashaag"
resp = requests.get(jonas_url)
resp.raise_for_status()

page = Page(resp.content)
sample = Sample(
    page,
    {
        "name": "Jonas Haag",
        "followers": "329",
        "company": "@Quantco",
        "twitter": "@_jonashaag",
        "username": "jonashaag",
        "nrepos": "282",
    },
)

training_set = TrainingSet()
training_set.add_sample(sample)

scraper = train_scraper(training_set)

# Apply the trained scraper to a different profile page.
resp = requests.get("https://github.com/lorey")
resp.raise_for_status()
result = scraper.get(Page(resp.content))
print(result)

jonashaag commented 2 years ago

Are you maybe testing with an HTML dump saved from a logged-in session?
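
One way to check this is to see whether every sample value actually occurs in the HTML that an anonymous request returns. The sketch below is only an illustration and does not use any mlscraper API: `missing_sample_values` is a hypothetical helper, and it assumes the values should show up in the page's visible text rather than in attributes.

```python
import html as html_lib
import re


def missing_sample_values(html_text: str, sample: dict[str, str]) -> list[str]:
    """Return the sample keys whose values do not occur in the page text.

    Hypothetical helper for debugging, not part of mlscraper.
    """
    # Crude tag stripping: enough for a presence check, not a real HTML parser.
    text = re.sub(r"<[^>]+>", " ", html_text)
    # Decode entities such as &amp; so they compare equal to plain strings.
    text = html_lib.unescape(text)
    # Collapse runs of whitespace so multi-word values can match.
    text = " ".join(text.split())
    return [key for key, value in sample.items() if value not in text]
```

Calling it with `resp.text` from the anonymous request and the sample dict from the script above would list the fields that cannot be found in the logged-out page. Any key it reports is a candidate for a logged-in vs. logged-out difference.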