scrapy / scrapely

A pure-python HTML screen-scraping library
Obtaining sectioned article text #68

Open Shadonar opened 9 years ago

Shadonar commented 9 years ago

Hello,

I have a project that I was looking to use Scrapely for. From what I've read, it sounds like the right tool for the job, but I have run into a problem. When I pass a URL that contains sectioned article text (which appears to be almost all of my URLs), I only receive the first section of the text.

Here's a site that I tried: http://www.autostraddle.com/12-black-friday-deals-you-can-get-without-having-to-put-pants-on-266850/

and here's what I used to train scrapely:

{'title':'15 Things You Learn When You Move In With Your Girlfriend', 'author': 'by Kate', 'postdate':'November 10, 2014 at 9:00am PST', 'count':'82', 'content':'There comes a point in every relationship when it makes sense for you to think about cohabitation.'}

If I then have Scrapely scrape that same URL, it only gives me that first paragraph.

So my question is: how would I get Scrapely to obtain all of the article's main text (basically the text between the social media icons)?

Any help would be greatly appreciated!

Thanks

kalessin commented 9 years ago

Hi Shadonar,

Try passing a list as the 'content' value, with two elements: the text of the first paragraph and the text of the last one. That way you train the algorithm to perform an iterated extraction over all of them.
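
A minimal sketch of that suggestion, assuming the standard `Scraper.train(url, data)` / `Scraper.scrape(url)` calls from Scrapely. The helper names `build_training_data` and `train_and_scrape` are made up for illustration, and the last-paragraph string is a placeholder that has to be replaced with the real text from the page:

```python
def build_training_data(first_paragraph, last_paragraph, **other_fields):
    """Build a training dict whose 'content' value is a two-element list:
    the first and the last paragraph of the article body."""
    data = dict(other_fields)
    data['content'] = [first_paragraph, last_paragraph]
    return data


training_data = build_training_data(
    first_paragraph=('There comes a point in every relationship when it '
                     'makes sense for you to think about cohabitation.'),
    # Placeholder (assumption): copy the actual last paragraph from the page.
    last_paragraph='<text of the last paragraph of the article>',
    title='15 Things You Learn When You Move In With Your Girlfriend',
    author='by Kate',
    postdate='November 10, 2014 at 9:00am PST',
    count='82',
)


def train_and_scrape(url, data):
    """Train on one page and scrape it again. Needs Scrapely installed
    and network access, so it is defined here but not called."""
    from scrapely import Scraper  # third-party: pip install scrapely

    scraper = Scraper()
    scraper.train(url, data)
    return scraper.scrape(url)

# Usage (not executed here):
#   results = train_and_scrape('<article url>', training_data)
```

Whether this returns every section depends on how the page's markup is laid out, so it is worth checking the scraped 'content' against the article by hand.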