Doing this before too many importers get written would be desirable, but note
that:
a) this may leave you with a project too big to open again
b) not all parsers support streaming modes (i.e., you may have to pay the penalty regardless of what the interface looks like)
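To illustrate point b), the difference is between a parser that hands records to the importer one at a time and one that materializes the whole document before anything can be read. A generic Java sketch of the two styles, not anything in the Refine code base:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class StreamingVsInMemory {
        // Streaming: rows are seen one at a time, so memory use stays flat
        // regardless of file size.
        static long countRowsStreaming(String path) throws IOException {
            long rows = 0;
            try (BufferedReader in = new BufferedReader(new FileReader(path))) {
                while (in.readLine() != null) {
                    rows++;
                }
            }
            return rows;
        }

        // Non-streaming: the whole file is loaded before the first row is
        // available, so the memory penalty is paid no matter what the
        // importer interface looks like.
        static long countRowsInMemory(String path) throws IOException {
            return Files.readAllLines(Paths.get(path)).size();
        }
    }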
Original comment by tfmorris
on 30 Sep 2010 at 2:26
I would like to support a different way of doing things, as the metadata files I work with on a regular basis run to millions of rows. I
realize that trying to do this on a MacBook Pro with 3GB of RAM is not ideal
(to say the least), but there are still many others with less RAM at their
disposal. As it stands, I have yet to successfully load a file of 7 million rows (about a dozen fields per row, most no more than a couple of dozen bytes each). Could there somehow be a background batch "pre-processing"
step that I could run to set up the data for viewing and manipulation via my
web browser? If you want to try this yourself, grab the file here and feel my
pain:
<http://www.hathitrust.org/sites/www.hathitrust.org/files/hathifiles/hathi_full_20101101.txt.gz>
Original comment by roytenn...@gmail.com
on 11 Nov 2010 at 5:36
The current architecture requires all data to be in memory for processing. Changing that would require a major reworking of things and would mean backing GRefine with a "real" database to do all the processing.
For the foreseeable future, you should assume that roughly 1.x × N memory is required for a dataset of size N, where the goal is to keep x small (say 5-20%).
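As a worked example of that rule of thumb: with x = 10%, a 2 GB dataset would need roughly 1.1 × 2 GB ≈ 2.2 GB of heap just to hold the data, before any working space for transforms (the numbers are only an illustration of the estimate above, not measurements).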
Original comment by tfmorris
on 11 Nov 2010 at 4:26
Also, in addition to what Tom says, going beyond 4 GB of RAM requires 64-bit Java to be installed. I myself have the luxury of working with Refine using 9 GB of RAM on Windows 7 with 64-bit Java, and have successfully performed 1-million-row transforms with 4 fields after allocating enough RAM to Refine. See the FAQ:
http://code.google.com/p/google-refine/wiki/FaqAllocateMoreMemory
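For the record, what that FAQ boils down to is raising the maximum heap of the JVM that runs Refine; the exact mechanism varies by platform, so treat these lines as illustrative rather than authoritative:

    ./refine -m 4096m    # launcher option on Linux/Mac (check your version's refine script)
    java -Xmx4096m ...   # the underlying JVM flag that the setting ultimately controls

Heaps much beyond 4 GB are only usable on a 64-bit JVM, which is the reason for the 64-bit Java requirement above.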
Original comment by thadguidry
on 11 Nov 2010 at 4:36
Thanks a lot, I guess I'll have to find a machine that is up to the task. I
eventually got the "java heap space" error. I really appreciate the speedy and
informative replies.
Original comment by roytenn...@gmail.com
on 11 Nov 2010 at 5:32
Yes, that's one of the unfortunate aspects of Java memory management. In cases
where it's going to fail eventually anyway, the garbage collector tries harder
and harder, slowly grinding to a halt before eventually failing altogether.
I'll take a look and see if we can add a warning for the situation where it looks like you are grossly underconfigured for the task you are attempting.
Original comment by tfmorris
on 11 Nov 2010 at 5:50
Yeah, some sort of warning would have saved me a lot of time waiting around for something to happen, and spared my disk the thrashing. ;-)
Original comment by roytenn...@gmail.com
on 11 Nov 2010 at 8:23
What about something like linking two or more projects together? E.g. automatically run operations on linked project(s) in serial by clicking a button. The user creates a set of operations on a subset of the uploaded file.* Once this reaches the desired level of cleanliness, the user then applies those operations to the linked projects.
This is the pipeline that I think is manageable:
- user uploads file
- file checked for size
- if the file is likely to hit the memory limit, then offer to partition data
- when partitioning
- the system saves manageable chunks, 50 MB (?)
- metadata are stored about the relationship between the chunks
- load first chunk into Refine
- user refines data
- user clicks "Process linked projects"
- operations from first project are replicated in serial to the others
- happy user, happy data
* I think that it would be possible to partition text files fairly easily (see the sketch below), though I have my doubts about binary ones.
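A minimal sketch of the chunking step described above, in plain Java (the 50 MB threshold, file names and header handling are placeholders, not anything Refine currently does):

    import java.io.*;

    /** Split a large delimited text file into roughly 50 MB chunks,
     *  repeating the header line so each chunk can be loaded on its own. */
    public class ChunkSplitter {
        private static final long CHUNK_BYTES = 50L * 1024 * 1024;

        public static void main(String[] args) throws IOException {
            File input = new File(args[0]);
            BufferedReader in = new BufferedReader(new FileReader(input));
            String header = in.readLine();       // assume the first line is a header
            PrintWriter out = null;
            long written = CHUNK_BYTES;          // forces a new chunk before the first row
            int chunkIndex = 0;
            String line;
            while ((line = in.readLine()) != null) {
                if (written >= CHUNK_BYTES) {    // current chunk is full: start the next one
                    if (out != null) out.close();
                    chunkIndex++;
                    out = new PrintWriter(new BufferedWriter(
                            new FileWriter(input.getName() + ".part" + chunkIndex)));
                    out.println(header);
                    written = header.length() + 1;
                }
                out.println(line);
                written += line.length() + 1;    // rough byte count; good enough for sizing
            }
            if (out != null) out.close();
            in.close();
            System.out.println("Wrote " + chunkIndex + " chunks");
        }
    }

The metadata about the relationship between chunks could then be as small as the shared header plus an ordered list of chunk file names, which is what a "Process linked projects" action would iterate over.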
Original comment by mcnamara.tim@gmail.com
on 12 Nov 2010 at 3:21
I've created issue 467 to track the suggestion of a JVM heap monitor as part of the progress indication. I've implemented a rudimentary version of this as a start.
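For the curious, the numbers such a monitor needs are available from java.lang.Runtime; a rudimentary check along these lines (purely illustrative, not the code attached to issue 467) would be:

    /** Rough heap check: warn when usage approaches the -Xmx ceiling. */
    public class HeapCheck {
        public static void warnIfNearLimit() {
            Runtime rt = Runtime.getRuntime();
            long used = rt.totalMemory() - rt.freeMemory(); // bytes currently in use
            long max = rt.maxMemory();                      // the configured maximum heap
            if ((double) used / max > 0.9) {
                System.err.println("Warning: " + (used / (1024 * 1024)) + " of "
                        + (max / (1024 * 1024)) + " MB heap in use; this operation"
                        + " may fail with an OutOfMemoryError");
            }
        }

        public static void main(String[] args) {
            warnIfNearLimit();
        }
    }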
Original comment by tfmorris
on 21 Oct 2011 at 4:26
Original issue reported on code.google.com by
iainsproat
on 30 Sep 2010 at 11:59