pattersonkl / google-refine

Automatically exported from code.google.com/p/google-refine

XPath support for creating columns from XML would be nice #293

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
I keep finding that Refine's heuristics for grabbing data from XML don't match 
my needs and lack the control I'd like to have.

To me it would seem very useful to have XPath support for XML data, so one 
could specify one XPath pointing to the XML elements that make up the records 
and another array of XPaths pointing to the fields inside those records.
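As a rough illustration of the idea (not Refine's importer, and not an existing Refine API), the sketch below uses Python's standard `xml.etree.ElementTree`, which supports only a limited XPath subset. The function name `extract_rows`, the sample document, and the paths are all made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; element names are invented for this sketch.
SAMPLE = """<library>
  <book><title>Dune</title><author>Herbert</author></book>
  <book><title>Emma</title><author>Austen</author></book>
</library>"""


def extract_rows(xml_text, record_path, field_paths):
    """Return one row per element matched by record_path.

    record_path selects the elements that become records; each entry in
    field_paths is evaluated relative to a record and becomes one column.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.findall(record_path):
        # findtext returns None when a field is missing from a record
        rows.append([record.findtext(path) for path in field_paths])
    return rows


rows = extract_rows(SAMPLE, ".//book", ["title", "author"])
for row in rows:
    print(row)
```

With this shape the number of rows is fixed by the record path alone, so nested elements inside a record cannot multiply the row count.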

That may not be quite as automatic as the current approach. But it would mean I 
could feed data directly into Refine without a need for preprocessing.

I also suspect that this approach will provide a workable solution for other 
situations in which data has to be extracted from a web page, say (e.g. case 
56).

Original issue reported on code.google.com by cv4...@gmail.com on 22 Dec 2010 at 10:36

GoogleCodeExporter commented 8 years ago
This is actually available now in the current trunk build, if you have the 
know-how to build from trunk (see [DevelopersGuide]). The Jsoup.org library 
that was integrated by Iain Sproat lets you take advantage of that. You can 
also use Python as your language of choice in the expression editor and import 
lxml.html and lxml.etree, if you prefer that syntax and style over Jsoup or 
BeautifulSoup (see [Jython]). Feel free to reach out to me and others on our 
mailing list as well for additional help on any of the above.

Original comment by thadguidry on 22 Dec 2010 at 4:27

GoogleCodeExporter commented 8 years ago
Thanks for the hint. I got the current trunk from SVN and got it running.

But it's still non-obvious to me how to import data in the way I had 
in mind: When I open an XML file, I don't see how I can apply an XPath before 
importing. Instead, even for a moderately large (15MB) file, Refine takes ages 
(> 15 minutes) and loads of memory (> 700MB) to read the file. Presumably it 
needs so much memory because it creates a lot of records (> 100000), even 
though, if read correctly, the file only contains 2000 records.

I don't see how I can recover my data which has been split up so much during 
the import. That's why I thought being able to explicitly point the software to 
the relevant XPaths might improve both efficiency and results considerably.
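To make the efficiency argument concrete, here is a minimal sketch of streaming extraction with the standard library's `xml.etree.ElementTree.iterparse`. It is not how Refine reads XML; the function name `stream_records` and the tag names are hypothetical, and it matches records by tag name rather than by a full XPath.

```python
import io
import xml.etree.ElementTree as ET


def stream_records(source, record_tag, field_tags):
    """Yield one row per record element without building the whole tree."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == record_tag:
            # The record is complete at its "end" event, so its fields
            # can be read now; missing fields come back as None.
            yield [elem.findtext(tag) for tag in field_tags]
            # Release the record's contents once it has been emitted.
            elem.clear()


# Hypothetical input; a real file path or file object works the same way.
data = io.BytesIO(
    b"<root><rec><a>1</a><b>2</b></rec><rec><a>3</a></rec></root>"
)
records = list(stream_records(data, "rec", ["a", "b"]))
print(records)
```

Because each record is emitted and cleared as soon as it is parsed, memory use depends mostly on the size of a single record rather than on the size of the file.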

I attach the data file I am seeing the problems with, so you can get a better 
idea of what I'm talking about.

Original comment by cv4...@gmail.com on 23 Dec 2010 at 9:11

Attachments:

GoogleCodeExporter commented 8 years ago
Don't split it. Don't do anything to it during import. Uncheck everything and 
then let it load. (It should load everything into one cell, I think.) Once it 
is loaded, perform your slicing and dicing, XPath expressions, etc. I think the 
team has only briefly talked about performing expressions prior to loading, 
and we still had to reach a consensus on how to handle the UX for it all. James 
Home has given us some concepts to work with in Issue 284 and Issue 285. Take a 
look at screenshots 17 and 18 in there and you'll see how we intend to deal 
with XML specifically and give you a nice UI selector. Then let us know your 
thoughts in Issue 285.

Iain, Tom, David, can you recall if we perhaps left that off our [Roadmap] for 
good reason, since we're still in the planning & implementation stage for that 
feature?

Original comment by thadguidry on 23 Dec 2010 at 2:37

GoogleCodeExporter commented 8 years ago
Thanks for the hints and comments, @thadguidry.

I unchecked all options on the import page but still get a very slow import 
with the splitting (using the current trunk version). My – possibly 
unqualified – impression from other experiments is that I can only suppress 
the splitting into columns but not that into lines.

I took a look at the screenshots in Issue 285 and like what I'm seeing there. 
Your idea of a more intuitive selector to determine which XML elements should 
become records in Refine seems much more accessible than my ad-hoc idea using 
XPath.

Original comment by cv4...@gmail.com on 27 Dec 2010 at 1:12

GoogleCodeExporter commented 8 years ago
Thad, yes, we purposely left the picking of XML XPaths until the next version.

Original comment by dfhu...@gmail.com on 5 Jan 2011 at 9:40