scrapy / scrapely

A pure-python HTML screen-scraping library

l #34

Closed ghost closed 10 years ago

ghost commented 11 years ago

l

hupili commented 11 years ago

I just came across this repo. It's what I'm looking for. It would be even better if we can make it more accurate.

For me, the advantage is that the extraction rule is learned automatically, without intervention from a developer. xpath is good for developers: it performs exact extraction on similarly structured pages. However, asking a normal user to write it is impossible, and the same goes for related techniques like selectors and regexes. The way scrapely presents the task fits a normal user.

To extend the discussion, I would like to know:

Many thanks if you can give some pointers.

shaneaevans commented 11 years ago

@jjk3 - The advantage of scrapely is that it doesn't require knowledge of xpath. In practice, this means that some websites can be scraped without requiring a developer. I have seen some people use examples + some "xpath minimization" technique + custom post-processing to achieve the same thing as Scrapely. I don't think it worked out as well, but I haven't seen any good comparisons.
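For readers new to the library, here is a minimal sketch of the example-driven workflow described above; the URLs and field values are placeholders, not real pages:

from scrapely import Scraper

s = Scraper()

# Train by giving a sample page and the values you want extracted from it,
# instead of writing xpath expressions by hand.
s.train('http://example.com/product/1', {'name': 'Example Product', 'price': '$9.99'})

# Scrape a similarly structured page on the same site.
print(s.scrape('http://example.com/product/2'))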

@hupili - I am not aware of a good alternative implementation, nor of any significant research since we wrote Scrapely. If you are looking into this, I'd be very interested to hear how you get on. I don't know of any datasets of examples. However, if someone is interested in researching this area (and especially if contributing to or improving upon scrapely) I think we could probably make some public datasets of labelled examples.

hupili commented 11 years ago

It seems prof. Liu has shifted his interest,

http://www.cs.uic.edu/~liub/

Data and Web Mining, Machine Learning (Before 1996-97)

I'm not working in this direction, so I can't afford to do a survey right now. I'll keep an eye on this repo. It will be a useful tool. Looking forward to hearing any updates from you.

shaneaevans commented 11 years ago

What do you mean by xpath minimization and custom processing? How does scrapely beat the performance of that?

This is where a user selects an HTML element. The xpath for that element is transformed to something more generic - typically by removing prefixes. I spoke to someone that did this recently and they thought scrapely was probably a better option, but it worked "good enough" for them.
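As an illustration of that idea (a hedged sketch of "xpath minimization" in general, not of scrapely or of the tool mentioned above): an absolute xpath recorded from a user's click is generalized by dropping positional indices and keeping only the distinguishing tail.

import re

def minimize_xpath(full_xpath, keep_tail=2):
    # Keep only the last few steps of an absolute xpath and drop the
    # positional indices, so the expression matches similar pages.
    steps = [s for s in full_xpath.split('/') if s]
    tail = [re.sub(r'\[\d+\]', '', step) for step in steps[-keep_tail:]]
    return '//' + '/'.join(tail)

# '/html/body/div[3]/div[1]/span[2]/b' -> '//span/b'
print(minimize_xpath('/html/body/div[3]/div[1]/span[2]/b'))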

I've taken a look at the conclusion in the paper by Zhai & Liu, and it almost seems too good to be true. Close to 98% accuracy is remarkable, and the number of pages required to train is actually very small.

Applying it to many different verticals, we have seen it work very well in some (98% happens) and worse in others. Taking some time to configure good field descriptors, with appropriate extractors and required attributes, helps. The other part of the problem is crawling (and not just extraction). If you are interested in doing both, it could be worth looking at slybot.

Now let's consider the possibilities, assuming that scrapely is robust and resilient.

It is. We use it as part of Scrapinghub's autoscraping service across hundreds of crawling projects, 100m+ items scraped from 10k+ websites... and it's still in closed beta!

How would you use scrapely to deal with multiple records on a page? Do you have to relabel each record on that page? What if a record on that page is missing a label - will it grab it anyway? What if the number of records on that page changes (e.g. search results)?

You have to label each record on the page. There is an exception for repeated data, where you only need to label the first couple of items and the last one (e.g. a search results list). From there it generalizes, and if the next page has a different number of items it will scrape that many. I have been thinking that it could be better to label repeated data explicitly instead of inferring it.
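In code, that tip amounts to something like the following sketch (placeholder URLs and values; hupili applies the same pattern with real data further down this thread):

from scrapely import Scraper

s = Scraper()

# Label only the first couple of items and the last one; scrapely infers
# the repetition and extracts however many items the scraped page has.
s.train('http://example.com/search?q=foo',
        {'result': ['First result', 'Second result', 'Last result']})

print(s.scrape('http://example.com/search?q=bar'))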

As you can probably gather - I extended the algorithm in Liu & Zhai's paper to handle repeated and nested data. Then we have many additions that have proven useful over the last few years.

Are you saying that once a site is labeled, it can be used on similar looking pages on other sites? I'm a bit skeptical of this claim because websites differ drastically from each other...

No, only on the same website - from pages that are pretty similar. If that website has vastly different pages you'll need more than one template.
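A sketch of what "more than one template" can look like, assuming (as the s._templates attribute used later in this thread suggests) that each train() call adds a template and scrape() applies the one that matches the page; URLs and fields are placeholders:

from scrapely import Scraper

s = Scraper()

# One template per page layout on the same website.
s.train('http://example.com/product/1', {'name': 'A product', 'price': '$5'})
s.train('http://example.com/article/1', {'headline': 'An article headline'})

# scrape() should pick whichever template fits the given page.
print(s.scrape('http://example.com/product/2'))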

TL;DR: are there any solid performance figures from the real world, or a paper?

We haven't published a paper on Scrapely, but we have a lot of real-world experience with it.

http://www.cs.uic.edu/~liub/ Data and Web Mining, Machine Learning (Before 1996-97)

I'm not working in this direction, so I can't afford to do a survey right now. I'll keep an eye on this repo. It will be a useful tool. Looking forward to hearing any updates from you.

Cool, I emailed them when we first released Scrapely to let them know we found the paper useful. Since then I have read some of Liu's papers on opinion mining for another project I was working on.

hupili commented 11 years ago

There is an exception for repeated data, where you only need to label the first couple of items and the last one (e.g. a search results list). From there it generalizes, and if the next page has a different number of items it will scrape that many. I have been thinking that it could be better to label repeated data explicitly instead of inferring it.

This is an awesome tip!

I tried to extract editors from wiki pages (Baidu Baike). I experimented with the same two pages two days ago but got some messy results (even partial HTML tags in the output). That was why I said the accuracy was not good (1 out of my 2 initial trials failed)... The pages have changed since, so you cannot repeat it (I should have stored the pages... sigh).

With the above tip, I got a perfect result (all editor names, and the output is clean).

# test.py
from scrapely import Scraper
import pprint

s = Scraper()

url1 = 'http://baike.baidu.com/view/10378153.htm'
data = {'editor': ['学车新人2011', '抗战了', '﹁_﹁颖子', '向钱转转']}
url2 = 'http://baike.baidu.com/view/750254.htm'

s.train(url1, data)
d = s.scrape(url2)
pprint.pprint(d)

$ python test.py
[{u'editor': [u'wsxxsw9',
              u'\u6211\u60f3\u4f60\u597d\u51e0\u5929',
              u'\u6740\u5e7f\u544a\u8005',
              u'andynoty',
              u'ssss9032',
              u'957264812',
              u'\u6731\u7c73\u6dc7',
              u'hiombi']}]

Some further questions:

sapanda commented 10 years ago

Thanks for the post. Did you ever get scrapely to train multiple items per list? I can see from the tests that the functionality exists in the extractor (e.g. see line 1260 in test_extraction.py). But is it possible to train it to do that as well?

For example, I've got an html page with 2 lists containing a title and a desc each, and this is the data I'm passing:

data = {
    'Title'  : ['Title_First', 'Title_Second', 'Title_Last'],
    'Source' : ['Source_First', 'Source_Second', 'Source_Last']
}

The training/extraction works fine if only one of those lines is present. But with both, the behavior is random. Does anyone have a solution?

sapanda commented 10 years ago

In case anyone hits the same issue, here's the workaround that I used:

import copy
from scrapely import Scraper

s = Scraper()
url = 'http://example.com/list'  # placeholder for the page in question

s.train(url, {'Title' : ['Title_1', 'Title_2']})
s.train(url, {'Source' : ['theatlantic.com', 'nytimes.com']})

# Scrape once per template so each field is extracted on its own.
data_list = []
for template in s._templates:
    s_copy = copy.copy(s)
    s_copy._templates = [template]
    data_list.append(s_copy.scrape(url))

# data_list now looks like:
#   [{'Title': ['T1', 'T2', 'T3']}, {'Source': ['S1', 'S2', 'S3']}]
# which I then merge into per-record items:
#   [{'Title': 'T1', 'Source': 'S1'},
#    {'Title': 'T2', 'Source': 'S2'},
#    {'Title': 'T3', 'Source': 'S3'}]

Of course, the main caveat here is when the lengths of the lists don't match. For example, in my case, the scraper might return 3 titles, but only 2 sources. At that point, it's hard to tell which item is missing the source, and so I can't use any of the sources (i.e. I return [{'Title': 'T1'}, {'Title': 'T2'}, {'Title': 'T3'}]). Definitely not ideal.
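For what it's worth, a minimal sketch of the merge step implied by the two result shapes above, assuming the per-field lists come back with matching lengths (the variable names follow the workaround code):

# Hypothetical merge of the per-field results into per-record dicts.
# Only safe when both lists have the same length, per the caveat above.
titles = data_list[0]['Title']
sources = data_list[1]['Source']
records = [{'Title': t, 'Source': s} for t, s in zip(titles, sources)]
# -> [{'Title': 'T1', 'Source': 'S1'}, {'Title': 'T2', 'Source': 'S2'}, ...]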

kmike commented 10 years ago

@sapanda I think the standard trick is to train 3 templates: annotate the list beginning, the list end and something in the middle, not to split templates by field. Not a scrapely expert though :)

There is an exception for repeated data, where you only need to label the first couple of items and the last one (e.g. a search results list).

sapanda commented 10 years ago

@kmike Thanks for the response! But I'm a little confused - do you mean train 3 separate templates like this:

s.train(url, {'Title' : 'Title_First', 'Source': 'Source_First'})
s.train(url, {'Title' : 'Title_Second', 'Source': 'Source_Second'})
s.train(url, {'Title' : 'Title_Last', 'Source': 'Source_Last'})

I can't seem to get it to work - in my experience s.scrape() only uses a single template (usually the first, but not always). Can you give an example?

If you mean the list passed in has to have the first, second and last items, then yeah, I've tried that and it works well for lists of strings, but not for lists of dicts. Browsing through the scrapely code (see train_from_htmlpage() in __init__.py), it looks like training only expects values to be either a string or a list of strings. And passing in multiple lists of strings seems to give undefined behavior (sometimes errors, sometimes empty lists, and sometimes weird data).