lorey / mlscraper

🤖 Scrape data from HTML websites automatically by just providing examples
https://pypi.org/project/mlscraper/

Feedback #19

Open jonashaag opened 2 years ago

jonashaag commented 2 years ago

Gave this a try :-)

Feedback:

import requests
from mlscraper.html import Page
from mlscraper.samples import Sample, TrainingSet
from mlscraper.training import train_scraper

# fetch the example profile page
jonas_url = "https://github.com/jonashaag"
resp = requests.get(jonas_url)
resp.raise_for_status()

page = Page(resp.content)

# describe the values we want the scraper to extract from this page
sample = Sample(
    page,
    {
        "name": "Jonas Haag",
        "followers": "329",  # Note that this doesn't work if 329 passed as an int.
        #'company': '@QuantCo',  # Does not work.
        "twitter": "@_jonashaag",  # Does not work without the "@".
        "username": "jonashaag",
        "nrepos": "282",
    },
)

# train a scraper from the single annotated sample
training_set = TrainingSet()
training_set.add_sample(sample)

scraper = train_scraper(training_set)

# apply the trained scraper to a different profile
resp = requests.get("https://github.com/lorey")
result = scraper.get(Page(resp.content))
print(result)
lorey commented 2 years ago

Hi Jonas, love the feedback. Thanks for taking the time. I might need to check more thoroughly, but here are some thoughts on things to fix/improve on my side:

lorey commented 2 years ago

Resulting issues and enhancements:


lorey commented 2 years ago

Just saw that @QuantCo is @Quantco on your profile. Maybe that's also related to #18

jonashaag commented 2 years ago

Just saw that @QuantCo is @Quantco on your profile

Oops, my bad.

jonashaag commented 2 years ago

Re: pip, it installs 0.1.2 for me oO

pip install --pre mlscraper --no-deps
Collecting mlscraper
  Using cached mlscraper-0.1.2-py2.py3-none-any.whl (12 kB)
Installing collected packages: mlscraper
Successfully installed mlscraper-0.1.2
lorey commented 2 years ago

Okay, issue identified, but the cause is still unclear. You would need the 1.0.0rc2 version.

Maybe it's because 1.0 requires Python 3.9+? If that's not it, I'm out of ideas. I just tried with Docker and ubuntu-latest, and it worked like a charm.
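A quick way to check that hypothesis (nothing mlscraper-specific, just the standard library; assumes Python 3.8+ for importlib.metadata):

# Check which interpreter pip installed into and which mlscraper release ended up there.
import sys
from importlib.metadata import version

print(sys.version_info)      # 1.0.0rc2 would need Python 3.9+ if the hypothesis above is right
print(version("mlscraper"))  # e.g. "0.1.2" if pip fell back to the old release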

jonashaag commented 2 years ago

Yep, that's the cause. User error, case closed :)

lorey commented 2 years ago

While fixing, found #23

lorey commented 2 years ago

I've added the GitHub profiles as a test case and reworked training; it should now work reasonably fast.

CSS selectors are flaky at times; I need to find a reasonable heuristic to prefer good ones.
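Purely as an illustration of the kind of heuristic I mean (this is not how mlscraper ranks selectors today), something that penalizes brittle positional rules and rewards short, semantic ones:

def selector_score(css_rule: str) -> int:
    """Toy scoring heuristic for candidate CSS rules (illustrative only)."""
    score = -len(css_rule)                       # shorter rules are usually more robust
    score -= 50 * css_rule.count(":nth-child")   # positional selectors break easily
    score += 20 * css_rule.count("#")            # ids tend to be stable
    score += 10 * css_rule.count("[")            # attribute selectors are descriptive
    return score

# e.g. best_rule = max(candidate_rules, key=selector_score)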

jonashaag commented 2 years ago

Here's another example that doesn't work, in case you're looking for work :-D

import requests
from mlscraper.html import Page
from mlscraper.samples import Sample, TrainingSet
from mlscraper.training import train_scraper

article1_url = "https://www.spiegel.de/politik/kristina-haenel-nach-abstimmung-ueber-219a-im-bundestag-dieser-kampf-ist-vorbei-a-f3c04fb2-8126-4831-bc32-ac6c58e1e520"
resp = requests.get(article1_url)
resp.raise_for_status()

page = Page(resp.content)
sample = Sample(
    page,
    {
        "title": "»Dieser Kampf ist vorbei«",
        "subtitle": "Ärztin Kristina Hänel nach Abstimmung über 219a",
        "teaser": "Der umstrittene Paragraf zum »Werbeverbot« für Abtreibung ist seit heute Geschichte – und die Gießenerin Kristina Hänel, die seit Jahren dafür gekämpft hat, kann aufatmen. Wie geht es für die Medizinerin jetzt weiter?",
        "author": "Nike Laurenz",
        "published": "24.06.2022, 14.26 Uhr",
    },
)

training_set = TrainingSet()
training_set.add_sample(sample)

scraper = train_scraper(training_set)

resp = requests.get("https://www.spiegel.de/politik/deutschland/abtreibung-abschaffung-von-paragraf-219a-fuer-die-muendige-frau-kommentar-a-784cd403-f279-4124-a216-e320042d1719")
result = scraper.get(Page(resp.content))
print(result)
lorey commented 2 years ago

What does "doesn't work" mean in that context?

I think it's impossible to get it right with one sample (and especially for two slightly different pages). I would most likely fail to write a scraper myself just by looking at one page, too.
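With the API from the snippet above (reusing its imports and article1_url), training on more than one annotated page would look roughly like this; sketch only, the second URL and both annotation dicts are placeholders, not real data:

# Sketch: the usual fix is to train on several annotated pages, not just one.
training_set = TrainingSet()
annotated_pages = [
    (article1_url, annotations_1),                           # the annotations from the snippet above
    ("https://www.spiegel.de/politik/...", annotations_2),   # placeholder: a second annotated article
]
for url, annotations in annotated_pages:
    resp = requests.get(url)
    resp.raise_for_status()
    training_set.add_sample(Sample(Page(resp.content), annotations))

scraper = train_scraper(training_set)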

jonashaag commented 2 years ago

It crashes (but with 1 sample only, haven’t tested more)

lorey commented 2 years ago

So regarding Spiegel Online, this was quite some work, as articles have different layouts. It took some major performance tweaks to get it running in a sensible amount of time without sacrificing correctness. I still have issues with missing authors, because the scraper class raises an error instead of assuming None when no author is found, but that's fixable.

Issue #25

Here's the code: https://gist.github.com/lorey/fdb88d6c8e41b9b6bc8df264cffc68e1

lorey commented 2 years ago

Fixed the authors issue; it now takes around 30s on my machine. Formatting of the log below is by me:

INFO:root:found DictScraper (scraper_per_key={
    'published': <ValueScraper self.selector=<CssRuleSelector self.css_rule='time'>, self.extractor=<TextValueExtractor>>, 
    'subtitle': <ValueScraper self.selector=<CssRuleSelector self.css_rule='h2 .font-sansUI'>, self.extractor=<TextValueExtractor>>, 
    'title': <ValueScraper self.selector=<CssRuleSelector self.css_rule='h2 > span:nth-child(2)'>, self.extractor=<TextValueExtractor>>, 
    'teaser': <ValueScraper self.selector=<CssRuleSelector self.css_rule='meta[name="description"]'>, self.extractor=<AttributeValueExtractor self.attr='content'>>, 
    'authors': <ListScraper self.selector=<CssRuleSelector self.css_rule='header a.border-b'> self.scraper=<ValueScraper self.selector=<mlscraper.selectors.PassThroughSelector object at 0x7efda0f969a0>, self.extractor=<TextValueExtractor>>>
})

# results of newly scraped pages
{'published': '07.07.2022, 11.34 Uhr', 'subtitle': 'Absage an Forderung der Union', 'title': 'Lambrecht will keine Transportpanzer in die Ukraine liefern', 'teaser': 'CDU und CSU fordern eine kurzfristige Lieferung von 200 Fuchs-Panzern an die Ukraine. Die Bundesverteidigungsministerin erteilt dem Vorschlag eine klare Absage – mit Hinweis auf eigene Sicherheitsinteressen.', 'authors': []}
{'published': '07.07.2022, 11.32 Uhr', 'subtitle': 'Größter Vermieter Deutschlands', 'title': 'Vonovia will nachts die Heizungen herunterdrehen', 'teaser': 'Um Energie zu sparen, will Deutschlands größter Wohnungskonzern während der Nachtstunden die Vorlauftemperatur der Heizungsanlage absenken. Die Räume werden dann allenfalls noch rund 17 Grad warm.', 'authors': []}
jonashaag commented 2 years ago

Impressive work 🤩

jonashaag commented 2 years ago

Example from a commercial application: the price doesn't work, but everything else works great.

"""
To use this:
pip install requests
pip install --pre mlscraper

To automatically build any scraper, check out https://github.com/lorey/mlscraper
"""

import logging

import requests

from mlscraper.html import Page
from mlscraper.samples import Sample, TrainingSet
from mlscraper.training import train_scraper

ARTICLES = (
    {
        'url': 'https://www.rahm24.de/schlafen-und-wohnen/komfortmatratzen/schaumstoffmatratze-burmeier-basic-fit',
        'title': "Schaumstoffmatratze Burmeier Basic-Fit",
        #'price': '230,00 € *',
        'manufacturer': 'Burmeier',
    },
    {
        'url': 'https://www.rahm24.de/medizintechnik/inhalationstherapie/inhalationsgeraet-omron-ne-c28p',
        'title': 'Inhalationsgerät Omron NE-C28P',
        #'price': '87,00 € *',
        'manufacturer': 'Omron',
    },
    {
        'url': 'https://www.rahm24.de/schlafen-und-wohnen/aufstehsessel/ruhe-und-aufstehsessel-innov-cocoon',
        'title': 'Ruhe- und Aufstehsessel Innov Cocoon',
        #'price': '1.290,00 € *',
        'manufacturer': 'Innov`Sa',
    },
)

def train_and_scrape():
    """
    This trains the scraper and then scrapes other pages with it.
    """
    scraper = train_medical_aid_scraper()

    urls_to_scrape = [
        'https://www.rahm24.de/pflegeprodukte/stoma/stoma-vlieskompressen-saliomed',
    ]
    for url in urls_to_scrape:
        # fetch page
        article_resp = requests.get(url)
        article_resp.raise_for_status()
        page = Page(article_resp.content)

        # extract result
        result = scraper.get(page)
        print(result)

def train_medical_aid_scraper():
    training_set = make_training_set_for_articles(ARTICLES)
    scraper = train_scraper(training_set, complexity=2)
    return scraper

def make_training_set_for_articles(articles):
    """
    This creates a training set to automatically derive selectors based on the given samples.
    """
    training_set = TrainingSet()
    for article in articles:
        # fetch page
        article_url = article['url']
        html_raw = requests.get(article_url).content
        page = Page(html_raw)

        # create and add sample
        sample = Sample(page, article)
        training_set.add_sample(sample)

    return training_set

if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)
    train_and_scrape()
lorey commented 2 years ago

There's some weird whitespace causing issues. But it works if you change the price to the proper dot-notation price (which is hidden in the HTML):

ARTICLES = (
    {
        'url': 'https://www.rahm24.de/schlafen-und-wohnen/komfortmatratzen/schaumstoffmatratze-burmeier-basic-fit',
        'title': "Schaumstoffmatratze Burmeier Basic-Fit",
        'price': '230.00',
        'manufacturer': 'Burmeier',
    },
    {
        'url': 'https://www.rahm24.de/medizintechnik/inhalationstherapie/inhalationsgeraet-omron-ne-c28p',
        'title': 'Inhalationsgerät Omron NE-C28P',
        'price': '87.00',
        'manufacturer': 'Omron',
    },
    {
        'url': 'https://www.rahm24.de/schlafen-und-wohnen/aufstehsessel/ruhe-und-aufstehsessel-innov-cocoon',
        'title': 'Ruhe- und Aufstehsessel Innov Cocoon',
        'price': '1290.00',
        'manufacturer': 'Innov`Sa',
    },
)

returns:

INFO:root:found DictScraper (scraper_per_key={
    'title': <ValueScraper self.selector=<CssRuleSelector self.css_rule='section header'>, self.extractor=<TextValueExtractor>>,
    'manufacturer': <ValueScraper self.selector=<CssRuleSelector self.css_rule='li:nth-child(2) > span'>, self.extractor=<TextValueExtractor>>,
    'price': <ValueScraper self.selector=<CssRuleSelector self.css_rule='.product--price meta'>, self.extractor=<AttributeValueExtractor self.attr='content'>>,
    'url': <ValueScraper self.selector=<CssRuleSelector self.css_rule='meta[itemprop="url"]'>, self.extractor=<AttributeValueExtractor self.attr='content'>>
})
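If you want to see where that dot-notation price comes from, the trained selector above points at a meta tag inside .product--price. A quick spot-check with BeautifulSoup (only used here for inspection; it's not part of mlscraper):

# Spot-check the hidden machine-readable price that the trained selector reads.
import requests
from bs4 import BeautifulSoup

resp = requests.get(ARTICLES[0]['url'])
soup = BeautifulSoup(resp.content, "html.parser")
meta = soup.select_one(".product--price meta")
print(meta["content"] if meta else "no .product--price meta found")  # e.g. "230.00"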
lorey commented 2 years ago

I think generally this needs to be fixed by #15