ghmo / google-refine

Automatically exported from code.google.com/p/google-refine

Support streaming file import for project creation #147


GoogleCodeExporter commented 8 years ago
At the moment, as I understand it, an imported file is uploaded to Refine and the entire importing process loads everything into memory. Only once the whole import has finished is the Refine data file saved and the memory released. This causes memory limits to be exceeded when dealing with large files.

I'd like to propose refactoring the import process to stream files and handle the data import on a record-by-record basis, i.e. the Refine project file is appended to after every few records and the memory is cleared for the next batch.
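
To make the idea concrete, here is a rough sketch of what a record-batched import loop could look like; the Record and ProjectStore types are hypothetical stand-ins rather than actual Refine classes, and the batch size is arbitrary.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Sketch only: reads a delimited file line by line and flushes each batch of
// parsed records to disk instead of holding the whole file in memory.
// Record and ProjectStore are hypothetical placeholders, not Refine classes.
public class StreamingImportSketch {
    private static final int BATCH_SIZE = 1000;

    public static void importFile(String path, ProjectStore store) throws IOException {
        List<Record> batch = new ArrayList<>(BATCH_SIZE);
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                batch.add(Record.parse(line));   // parse one record
                if (batch.size() >= BATCH_SIZE) {
                    store.append(batch);         // append this batch to the project file
                    batch.clear();               // release memory for the next batch
                }
            }
        }
        if (!batch.isEmpty()) {
            store.append(batch);                 // flush the final partial batch
        }
    }

    // Placeholders so the sketch is self-contained.
    interface ProjectStore { void append(List<Record> records); }
    static class Record {
        final String[] fields;
        Record(String[] fields) { this.fields = fields; }
        static Record parse(String line) { return new Record(line.split("\t")); }
    }
}
```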

Original issue reported on code.google.com by iainsproat on 30 Sep 2010 at 11:59

GoogleCodeExporter commented 8 years ago
Doing this before too many importers get written would be desirable, but note 
that:

a) this may leave you with a project too big to open again
b) not all parsers support streaming modes (i.e. you may have to pay the memory penalty regardless of what the interface looks like)

Original comment by tfmorris on 30 Sep 2010 at 2:26

GoogleCodeExporter commented 8 years ago
I would like to support a different way of doing things, as the metadata files I work with on a regular basis include millions of rows. I realize that trying to do this on a MacBook Pro with 3 GB of RAM is not ideal (to say the least), but there are still many others with less RAM at their disposal. As it stands, I have yet to successfully load a file of 7 million rows (with about a dozen fields per row, most no more than a couple dozen bytes each). Could there somehow be a background batch "pre-processing" step that I could run to set up the data for viewing and manipulation via my web browser? If you want to try this yourself, grab the file here and feel my pain:
<http://www.hathitrust.org/sites/www.hathitrust.org/files/hathifiles/hathi_full_20101101.txt.gz>

Original comment by roytenn...@gmail.com on 11 Nov 2010 at 5:36

GoogleCodeExporter commented 8 years ago
The current architecture requires all data to be in memory for processing. Changing that would be a major reworking and would mean backing GRefine with a "real" database to do all the processing.

For the foreseeable future, you should assume that roughly 1.x × N memory is required for an N-sized dataset, where the goal is to keep x small (say 5-20%). Under that assumption, for example, a 1 GB data file needs roughly 1.05-1.2 GB of heap just to hold the data.

Original comment by tfmorris on 11 Nov 2010 at 4:26

GoogleCodeExporter commented 8 years ago
Also, in addition to what Tom says, allocating more than 4 GB of RAM requires 64-bit Java to be installed. I myself have the luxury of working with Refine using 9 GB of RAM on Windows 7 with 64-bit Java, and have successfully performed transforms on 1 million rows with 4 fields after allocating enough RAM to Refine. See the FAQ:
http://code.google.com/p/google-refine/wiki/FaqAllocateMoreMemory
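
Not from the FAQ, but a quick way to confirm that a changed heap setting actually took effect is to ask the JVM itself:

```java
// Prints the maximum heap the running JVM will use, in megabytes.
public class HeapCheck {
    public static void main(String[] args) {
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap: " + (maxBytes / (1024 * 1024)) + " MB");
    }
}
```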

Original comment by thadguidry on 11 Nov 2010 at 4:36

GoogleCodeExporter commented 8 years ago
Thanks a lot, I guess I'll have to find a machine that is up to the task. I 
eventually got the "java heap space" error. I really appreciate the speedy and 
informative replies.

Original comment by roytenn...@gmail.com on 11 Nov 2010 at 5:32

GoogleCodeExporter commented 8 years ago
Yes, that's one of the unfortunate aspects of Java memory management.  In cases 
where it's going to fail eventually anyway, the garbage collector tries harder 
and harder, slowly grinding to a halt before eventually failing altogether.

I'll take a look and see if we can add a warning for the situation where it looks like you are grossly under-configured for the task you are attempting.
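
One possible shape for such a warning check, with a made-up 3x expansion factor for illustration (not a measured Refine figure):

```java
import java.io.File;

// Sketch of a pre-import check: warn if the uploaded file is unlikely to fit
// in the remaining heap. The expansion factor is a guess, not a measured value.
public class ImportSizeCheck {
    private static final double EXPANSION_FACTOR = 3.0;

    public static boolean likelyToFit(File upload) {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        long available = rt.maxMemory() - used;
        return upload.length() * EXPANSION_FACTOR < available;
    }

    public static void warnIfTooLarge(File upload) {
        if (!likelyToFit(upload)) {
            System.err.println("Warning: " + upload.getName()
                    + " will probably exceed available memory; consider allocating more heap.");
        }
    }
}
```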

Original comment by tfmorris on 11 Nov 2010 at 5:50

GoogleCodeExporter commented 8 years ago
Yeah, some sort of warning would have saved me a lot of time waiting around for something to happen, and saved my disk from thrashing. ;-)

Original comment by roytenn...@gmail.com on 11 Nov 2010 at 8:23

GoogleCodeExporter commented 8 years ago
What about something like linking two or more projects together? E.g. automatically run operations on the linked project(s) in serial by clicking a button. The user creates a set of operations on a subset of the uploaded file.* Once this reaches the desired level of cleanliness, the user then applies those same operations to the linked projects.

This is the pipeline I think would be manageable:

  - user uploads file
  - file is checked for size
  - if the file is likely to hit the memory limit, offer to partition the data
  - when partitioning
    - the system saves manageable chunks, 50 MB (?) each (see the sketch after this list)
    - metadata is stored about the relationship between the chunks
    - the first chunk is loaded into Refine
    - user refines the data
    - user clicks "Process linked projects"
      - operations from the first project are replayed in serial against the others
  - happy user, happy data

* I think it would be possible to partition text files fairly easily; however, I have my doubts about binary ones.
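
For the partitioning step on line-oriented text files, something along these lines could work; the 50 MB target and the chunk file naming are arbitrary choices for illustration:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

// Sketch: split a text file into chunks of roughly 50 MB, breaking only on
// line boundaries so no record is cut in half.
public class TextFilePartitioner {
    private static final long CHUNK_BYTES = 50L * 1024 * 1024;

    public static void split(String inputPath, String outputPrefix) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(inputPath))) {
            int chunkIndex = 0;
            long written = 0;
            BufferedWriter writer = newChunk(outputPrefix, chunkIndex);
            String line;
            while ((line = reader.readLine()) != null) {
                if (written >= CHUNK_BYTES) {
                    writer.close();
                    writer = newChunk(outputPrefix, ++chunkIndex);
                    written = 0;
                }
                writer.write(line);
                writer.newLine();
                written += line.length() + 1;   // rough byte count, good enough for sizing
            }
            writer.close();
        }
    }

    private static BufferedWriter newChunk(String prefix, int index) throws IOException {
        return new BufferedWriter(new FileWriter(prefix + "-part" + index + ".txt"));
    }
}
```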

Original comment by mcnamara.tim@gmail.com on 12 Nov 2010 at 3:21

GoogleCodeExporter commented 8 years ago
I've created issue 467 to track the suggestion of a JVM heap monitor as part of the progress indication. I've implemented a rudimentary version of this as a start.
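
For reference, one way such a monitor could sample heap usage; this is an illustrative sketch, not the implementation referenced in issue 467:

```java
// Samples heap usage as a percentage of the maximum heap, e.g. for a progress display.
public class HeapMonitor {
    public static int usedHeapPercent() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return (int) (100 * used / rt.maxMemory());
    }

    public static void main(String[] args) throws InterruptedException {
        while (true) {
            System.out.println("Heap used: " + usedHeapPercent() + "%");
            Thread.sleep(5000);   // sample every five seconds
        }
    }
}
```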

Original comment by tfmorris on 21 Oct 2011 at 4:26