leerssej / google-refine

Automatically exported from code.google.com/p/google-refine

Feature Request: HTML parsing and XPath #220

Closed: GoogleCodeExporter closed this issue 8 years ago

GoogleCodeExporter commented 8 years ago
It would be really great to get support for fetching a URL, parsing its HTML
contents into a DOM (see the .NET HtmlAgilityPack), and being able to use XPath on it.

Original issue reported on code.google.com by niels.bo...@gmail.com on 16 Nov 2010 at 1:57

GoogleCodeExporter commented 8 years ago
Is this to create a new project or populate a column or ...?

A concrete use case, perhaps including an example of a real URL that you would 
use for this, would be helpful in understanding what you need.

Original comment by tfmorris on 16 Nov 2010 at 6:12

GoogleCodeExporter commented 8 years ago
To keep it simple, let's say I have a column of search phrases and I want to get
the first result on the Google SERP for each one.

So a real-world URL would be:
http://www.google.se/#hl=sv&source=hp&q=candy&aq=f&aqi=g10&aql=&oq=&gs_rfai=&fp=7f7a53791ad6dfb5

Using XPath to get data out of web pages is so much easier than using regular expressions.

Original comment by niels.bo...@gmail.com on 16 Nov 2010 at 8:13

GoogleCodeExporter commented 8 years ago
GREL functions for HTML parsing were added in r1948.
DOM elements can now be selected using jsoup's selector syntax:
http://jsoup.org/cookbook/extracting-data/selector-syntax

e.g.
value.parseHtml().select("a")[0].htmlAttr("href")

Original comment by iainsproat on 6 Dec 2010 at 11:22
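
For readers who want to try the idea outside Refine, here is a rough Python sketch of what the GREL expression above does: parse an HTML string, take the first `a` element, and read its `href` attribute. It uses only the standard library's `html.parser`, not jsoup, so it approximates the behaviour rather than showing how Refine implements it. The names `FirstLinkParser`, `first_link_href`, and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class FirstLinkParser(HTMLParser):
    """Hypothetical helper: remembers the href of the first <a> element seen."""

    def __init__(self):
        super().__init__()
        self.seen_anchor = False
        self.first_href = None

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        if tag == "a" and not self.seen_anchor:
            self.seen_anchor = True
            self.first_href = dict(attrs).get("href")


def first_link_href(html):
    """Return the href of the first <a> in the HTML string, or None."""
    parser = FirstLinkParser()
    parser.feed(html)
    parser.close()
    return parser.first_href


# Made-up cell value standing in for a fetched page.
sample = (
    '<html><body><p>Results</p>'
    '<a href="http://example.com/candy">Candy</a>'
    '<a href="http://example.com/other">Other</a>'
    '</body></html>'
)
print(first_link_href(sample))
```

Unlike the GREL expression, this sketch returns `None` when the page has no links or when the first link has no `href`, which is an assumption made here for simplicity.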

GoogleCodeExporter commented 8 years ago

Original comment by tfmorris on 9 Jun 2011 at 7:58