michaelhelmick / lassie

Web Content Retrieval for Humans™
https://lassie.readthedocs.org
MIT License

Can't get the full article. #42

yaseenox-personal closed 8 years ago

yaseenox-personal commented 8 years ago

Hi, I want to extract the full article from the source URL, but I only get the article's title and a short excerpt of it under the "description" key.

michaelhelmick commented 8 years ago

Which url are you trying to parse?

yaseenox-personal commented 8 years ago

I tried several articles. This one, for example: 'http://www.bbc.com/news/business-37618618'

michaelhelmick commented 8 years ago

Hm, it looks like they're setting the meta description as well as og:description and twitter:description.

You won't be able to extract the article with lassie, but you will be able to get the description. I'll see if I can fix this.
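
To illustrate the three description sources mentioned above, here is a minimal sketch using only Python's standard library. It is not lassie's implementation: the `DescriptionCollector` class, the `DESCRIPTION_KEYS` set, and the sample HTML are all made up for illustration. It simply collects the `content` of the plain, Open Graph, and Twitter description `<meta>` tags from an HTML string.

```python
from html.parser import HTMLParser

# Hypothetical names for illustration only; none of this is part of lassie.
DESCRIPTION_KEYS = {"description", "og:description", "twitter:description"}


class DescriptionCollector(HTMLParser):
    """Collect description-like <meta> tags from an HTML document."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        # Open Graph tags use `property`; plain and Twitter tags usually use `name`.
        key = attrs.get("property") or attrs.get("name")
        if key in DESCRIPTION_KEYS and attrs.get("content"):
            self.found[key] = attrs["content"]


# Made-up sample page standing in for a real article's markup.
SAMPLE_HTML = """
<html><head>
<meta name="description" content="Short summary.">
<meta property="og:description" content="Short summary for sharing.">
<meta name="twitter:description" content="Short summary for cards.">
</head><body><p>The full article text lives here, outside the meta tags.</p></body></html>
"""

collector = DescriptionCollector()
collector.feed(SAMPLE_HTML)
print(collector.found)
```

None of these tags carries the article body, which is why only a short summary comes back under "description".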

michaelhelmick commented 8 years ago

This is fixed. Version 0.8.3 is now available on PyPI!

The site you posted now returns:

{
    'site_name': u'BBC News',
    'description': u'Samsung has ceased production of its Galaxy Note 7 smartphones after reports of devices it had deemed safe catching fire.',
    'videos': [],
    'title': u'Samsung permanently stops Galaxy Note 7 production',
    'url': u'http://www.bbc.com/news/business-37618618',
    'status_code': 200,
    'locale': u'en_GB',
    'images': [{
        'src': 'http://www.bbc.com/news/business-37618618',
        'height': None,
        'width': None
    }, {
        'src': u'http://ichef.bbci.co.uk/news/1024/cpsprodpb/577B/production/_91759322_h2h1xuaj.jpg',
        'type': u'og:image'
    }]
}
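
As a small usage sketch, the returned dictionary can be read like any other dict. The snippet below copies a subset of the output above into a variable named `result` (a name chosen here for illustration) and picks out the Open Graph image. It uses `.get` because, as the output shows, not every image entry has a `type` key.

```python
# Subset of the output above, copied into a plain dict for illustration.
result = {
    'title': u'Samsung permanently stops Galaxy Note 7 production',
    'description': u'Samsung has ceased production of its Galaxy Note 7 smartphones after reports of devices it had deemed safe catching fire.',
    'images': [{
        'src': 'http://www.bbc.com/news/business-37618618',
        'height': None,
        'width': None
    }, {
        'src': u'http://ichef.bbci.co.uk/news/1024/cpsprodpb/577B/production/_91759322_h2h1xuaj.jpg',
        'type': u'og:image'
    }]
}

# Keep only entries explicitly tagged as og:image; entries without a
# 'type' key are skipped rather than raising KeyError.
og_images = [img['src'] for img in result['images'] if img.get('type') == 'og:image']

print(result['title'])
print(og_images)
```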