Open GoogleCodeExporter opened 8 years ago
This is actually available now in the current trunk build. If you have the
know-how to build from trunk, see [DevelopersGuide]. The Jsoup.org library that
was integrated by Iain Sproat lets you take advantage of that. You can also use
Python as your language of choice in the expression editor and import lxml.html
and lxml.etree, if you prefer that syntax and style over Jsoup or
BeautifulSoup. (See [Jython].) Feel free to reach out to me and others on our
mailing list as well for additional help on any of the above.
Original comment by thadguidry
on 22 Dec 2010 at 4:27
Thanks for the hint. I got the current trunk from SVN and got it running.
But it's still non-obvious to me how to import data in the way I had
in mind: When I open an XML file, I don't see how I can apply an XPath before
importing. Instead, even for a moderately large (15MB) file, Refine takes ages
(> 15 minutes) and loads of memory (> 700MB) to read the file. Presumably it
needs so much memory because it creates a lot of records (> 100000), even
though, if read correctly, the file only contains 2000 records.
I don't see how I can recover my data which has been split up so much during
the import. That's why I thought being able to explicitly point the software to
the relevant XPaths might improve both efficiency and results considerably.
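To make the idea concrete, here is a minimal sketch of the behaviour described above, written as a standalone Python script rather than anything Refine does today. It assumes the records can be addressed by a simple path that the standard library's ElementTree understands (a limited XPath subset). The sample XML, the element names, and the `extract_records` helper are all hypothetical and are not taken from the attached file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; element names are illustrative only.
SAMPLE_XML = """\
<collection>
  <header><source>example</source></header>
  <record id="1"><title>First</title><year>2009</year></record>
  <record id="2"><title>Second</title><year>2010</year></record>
</collection>
"""


def extract_records(xml_text, record_path):
    """Return one flat dict per element matched by record_path.

    record_path uses the limited XPath subset supported by ElementTree
    and is evaluated relative to the document's root element.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for element in root.findall(record_path):
        # Attributes of the record element become columns.
        row = dict(element.attrib)
        for child in element:
            # Direct children become columns; deeper nesting is ignored
            # in this sketch.
            row[child.tag] = (child.text or "").strip()
        rows.append(row)
    return rows


rows = extract_records(SAMPLE_XML, "./record")
print(len(rows), "records")
```

With an explicit record path, only the matched elements turn into rows, so unrelated elements such as the header above never produce records of their own.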
I attach the data file I am seeing the problems with, so you get a better idea
of what I'm talking about.
Original comment by cv4...@gmail.com
on 23 Dec 2010 at 9:11
Don't split it. Don't do anything to it during import. Uncheck everything and
then let it load. (Should load everything into one cell, I think.) Once loaded,
perform your slicing and dicing and XPath expressions, etc. I think the
team has only briefly talked about performing expressions prior to loading. But
we had to get a consensus on how to handle the UX for it all. James Home has
given us some concepts to work with in Issue 284 and Issue 285. Take a look at
screenshots 17 and 18 in there and you'll see how we intend to deal with
XML specifically, giving you a nice UI selector. And then let us know your
thoughts in Issue 285.
Iain, Tom, David, can you recall if we perhaps left that off our [Roadmap] for
good reason, since we're still in the planning & implementation for that
feature?
Original comment by thadguidry
on 23 Dec 2010 at 2:37
Thanks for the hints and comments, @thadguidry.
I unchecked all options on the import page but still get a very slow import
with the splitting (using the current trunk version). My – possibly
unqualified – impression from other experiments is that I can only suppress
the splitting into columns but not that into lines.
I took a look at the screenshots in Issue 285 and like what I'm seeing there.
Your idea of a more intuitive selector to determine which XML elements should
become records in Refine seems much more accessible than my ad-hoc idea using
XPath.
Original comment by cv4...@gmail.com
on 27 Dec 2010 at 1:12
Thad, yes, we purposely left the picking of XML XPaths until the next version.
Original comment by dfhu...@gmail.com
on 5 Jan 2011 at 9:40
Original issue reported on code.google.com by
cv4...@gmail.com
on 22 Dec 2010 at 10:36