ghmo / google-refine

Automatically exported from code.google.com/p/google-refine

Support Wikipedia HTML tables #56

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
Feature request

Supporting HTML tables which contain Wikipedia links would be very valuable 
for loading Wikipedia data because the links represent pre-reconciled 
entities (except for red links).

Original issue reported on code.google.com by tfmorris on 21 May 2010 at 5:25

GoogleCodeExporter commented 8 years ago
What do you mean by "supporting"? Do you mean being able to point Gridworks at
a Wikipedia page containing a table and create a grid from it?

Original comment by stefano.mazzocchi@gmail.com on 21 May 2010 at 5:32

GoogleCodeExporter commented 8 years ago
This is probably best supported by an independent bookmarklet. Gridworks itself
should support creating a project by pointing to a data file URL and by pasting
in raw text.

Original comment by dfhu...@gmail.com on 21 May 2010 at 5:41

GoogleCodeExporter commented 8 years ago

Original comment by stefano.mazzocchi@gmail.com on 21 May 2010 at 5:43

GoogleCodeExporter commented 8 years ago
An HTML table importer would also be useful. Adapting the XML importer to deal
with <th>, <tr> and <td> would be a start.

Original comment by iainsproat on 21 May 2010 at 5:46
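
As a rough illustration of the importer idea above (a sketch, not Gridworks
code): the Python snippet below walks <tr>, <th> and <td> with the standard
library's html.parser and keeps the first link in each cell next to its text,
so the Wikipedia link survives the import. The names TableParser and
parse_table are made up for this example, and nested tables, rowspan and
colspan are deliberately ignored.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect table rows as lists of (text, href) cells.

    href is the first link target found in the cell, or None.
    Hypothetical helper for illustration; not part of Gridworks.
    """

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row currently being read
        self._text = None   # text fragments of the current cell
        self._href = None   # first link target seen in the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._text, self._href = [], None
        elif tag == "a" and self._text is not None and self._href is None:
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._text is not None:
            self._row.append(("".join(self._text).strip(), self._href))
            self._text = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._text = None  # drop any cell left unclosed


def parse_table(html):
    """Return every row found in the given HTML fragment."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows
```

For example, parse_table('<table><tr><td><a href="/wiki/Paris">Paris</a></td></tr></table>')
gives [[("Paris", "/wiki/Paris")]], so each cell carries both the display
value and the link.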

GoogleCodeExporter commented 8 years ago
I'm open concerning implementation as long as it preserves the Wikipedia link
and uses it to resolve to an exact Freebase topic without human intervention.

Possibilities that come to mind include:
- allowing an HTML table cut from a web page to be pasted into Gridworks
- recognizing Excel / Open Office spreadsheets which contain Wikipedia links

Bonus points for the fewest manual steps to produce useful results.

Original comment by tfmorris on 21 May 2010 at 5:50

GoogleCodeExporter commented 8 years ago
One possibility is to convert the table content to CSV. A bookmarklet for that: 
http://table2csv.zeusi.user.dev.freebaseapps.com/index

One problem with Wikipedia links is that blue and red links are mixed. If we
convert only the good (blue) ones, we will end up with IDs and names in the
same column, but we should be able to reconcile them separately with facets.

Original comment by antonio....@gmail.com on 25 May 2010 at 10:12
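
To make the blue/red split concrete, here is a hedged Python sketch that tags
each link so the two kinds could later be faceted apart. It assumes the usual
MediaWiki URL shapes: existing pages look like /wiki/Title, and red links point
at index.php with redlink=1 in the query string. The function name
classify_wikipedia_link is invented for this example, and every /wiki/ link
counts as "blue", including File: or Category: pages.

```python
from urllib.parse import parse_qs, unquote, urlparse


def classify_wikipedia_link(href):
    """Return (kind, title), where kind is 'blue', 'red' or 'other'.

    Assumes MediaWiki-style URLs; hypothetical helper for illustration.
    """
    if not href:
        return ("other", None)
    parts = urlparse(href)
    query = parse_qs(parts.query)
    # Red links carry redlink=1 and the page title in the query string.
    if query.get("redlink") == ["1"]:
        return ("red", query.get("title", [None])[0])
    # Blue links put the percent-encoded title in the path.
    if parts.path.startswith("/wiki/"):
        return ("blue", unquote(parts.path[len("/wiki/"):]))
    return ("other", None)
```

Storing the kind in its own column would let a facet select only the blue
links for key-based matching and leave the red ones for ordinary name
reconciliation.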

GoogleCodeExporter commented 8 years ago

Original comment by iainsproat on 14 Oct 2010 at 5:02

GoogleCodeExporter commented 8 years ago
I figured out where HTML links are stored in OO Calc, so I may be able to
easily add the ability to optionally convert linked cells into a value + link
pair (or even convert the Wikipedia link to a properly escaped Freebase key).
Hmmm, thinking out loud, a general HTML link -> Freebase key function in Refine
that was functionally the same as the old web client link parser could be very
useful. I think all the URI templates are still available even though they
aren't being used (on input) any more.
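
A very rough Python sketch of what such a link -> key function might look like
follows. It rests on two assumptions that are not confirmed anywhere in this
thread: that English Wikipedia titles live under a /wikipedia/en namespace, and
that keys are escaped by replacing every character other than ASCII letters,
digits and interior '_' or '-' with $ plus four uppercase hex digits. The names
quote_key and wikipedia_link_to_key are hypothetical, this is not the old web
client parser, and characters outside the Basic Multilingual Plane are not
handled.

```python
import string
from urllib.parse import unquote, urlparse

_ALNUM = string.ascii_letters + string.digits


def quote_key(text):
    """Escape text as a key using the assumed $XXXX convention.

    Letters and digits pass through; '_' and '-' pass through only when
    they are neither the first nor the last character.
    """
    out = []
    last = len(text) - 1
    for i, ch in enumerate(text):
        interior = 0 < i < last
        if ch in _ALNUM or (interior and ch in "_-"):
            out.append(ch)
        else:
            out.append("$%04X" % ord(ch))
    return "".join(out)


def wikipedia_link_to_key(href, namespace="/wikipedia/en"):
    """Turn a /wiki/Title link into namespace/escaped-title, else None.

    The default namespace is an assumption made for this sketch.
    """
    path = urlparse(href or "").path
    if not path.startswith("/wiki/"):
        return None
    title = unquote(path[len("/wiki/"):])
    if not title:
        return None
    return namespace + "/" + quote_key(title)
```

Under those assumptions, /wiki/Paris_(Texas) would become
/wikipedia/en/Paris_$0028Texas$0029.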

Another peculiarity of Wikipedia tables that I just discovered the other day is 
the use of <span style="display:none"> elements as sort helpers.  I don't know 
about Excel, but OO Calc can't handle this at all.

A typical usage (from memory) might be something like

  <span style="display:none">0000123456000000</span>1,234.56 

where an invisible, zero-padded, fixed-point, numeric-only string is created so
that a plain alphabetic sort puts the values in numeric order. Unfortunately
the few tools that I tried ignored the styling and munged the two strings
together, making them pretty much useless without manual cleanup.

Not sure what can be done about that one.
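
One option, sketched here in Python purely as an illustration, is to drop the
hidden text while parsing instead of cleaning it up afterwards. The snippet
skips everything inside an element whose inline style contains display:none.
It only looks at inline style attributes (hiding done through CSS classes would
be missed), and the names VisibleTextParser and visible_text are invented for
this example.

```python
import re
from html.parser import HTMLParser

_HIDDEN = re.compile(r"display\s*:\s*none", re.IGNORECASE)
# Elements that never get a closing tag, so they must not affect nesting depth.
_VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
              "input", "link", "meta", "source", "track", "wbr"}


class VisibleTextParser(HTMLParser):
    """Collect text that is not inside an inline display:none element."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._hidden_depth = 0  # > 0 while inside a hidden element

    def handle_starttag(self, tag, attrs):
        if tag in _VOID_TAGS:
            return
        style = dict(attrs).get("style") or ""
        # Once hidden, every nested element deepens the hidden region.
        if self._hidden_depth or _HIDDEN.search(style):
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in _VOID_TAGS:
            return
        if self._hidden_depth:
            self._hidden_depth -= 1

    def handle_data(self, data):
        if not self._hidden_depth:
            self.parts.append(data)


def visible_text(fragment):
    """Return the visible text of an HTML fragment, stripped of whitespace."""
    parser = VisibleTextParser()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.parts).strip()
```

With the sample cell above,
visible_text('<span style="display:none">0000123456000000</span>1,234.56')
returns "1,234.56", leaving only the value a reader actually sees.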

Original comment by tfmorris on 14 Oct 2010 at 6:25

GoogleCodeExporter commented 8 years ago

Original comment by tfmorris on 18 Sep 2012 at 3:16