thagale / google-refine

Automatically exported from code.google.com/p/google-refine

17 million row (68m cells) file requires more than 8G of heap to import #242

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?

1. Windows 7 x64 with 8GB of physical memory

2. Large csv file (733mb)

3. In google-refine.l4j.ini, set

# max memory memory heap size
-Xmx4096M

4. Save 

5. Double click Google-refine.exe

6. Immediate error: 

Invalid maximum heap size: -Xmx4096M

7. If I set -Xmx1550M (about the maximum I seem to be able to use), then after 
about 5 minutes of importing the csv file, I get an out of stack memory error.

8. The system reports that prior to running Refine, 2gb is in use.

Original issue reported on code.google.com by leebel...@gmail.com on 20 Nov 2010 at 3:59

GoogleCodeExporter commented 8 years ago
Do you have 64-bit Java installed and also set as your default? In other 
words, does your JAVA_HOME environment variable point to it? You'll need 
64-bit Java in order to go beyond 3.5 GB of RAM usage. See attached screenshot.

Original comment by thadguidry on 20 Nov 2010 at 4:24

Attachments:

GoogleCodeExporter commented 8 years ago
Addressed the x64 Java issue, set Xmx5120 and tried again - after 5 minutes of 
loading, then got

HTTP ERROR 500

Problem accessing /command/core/create-project-from-upload. Reason:

    GC overhead limit exceeded

Caused by:

java.lang.OutOfMemoryError: GC overhead limit exceeded
    at au.com.bytecode.opencsv.CSVParser.parseLine(Unknown Source)
    at au.com.bytecode.opencsv.CSVParser.parseLineMulti(Unknown Source)
    at com.google.refine.importers.TsvCsvImporter.getCells(TsvCsvImporter.java:196)
    at com.google.refine.importers.TsvCsvImporter.read(TsvCsvImporter.java:163)
    at com.google.refine.importers.TsvCsvImporter.read(TsvCsvImporter.java:74)

Original comment by leebel...@gmail.com on 20 Nov 2010 at 9:03

GoogleCodeExporter commented 8 years ago
Is that a typo in your comment or did you really not have a trailing 'M' on 
your size?  If you used 5120 instead of 5120M, you probably set your heap to 
5120 bytes or perhaps 5120 KBytes, neither of which will work very well.

Original comment by tfmorris on 20 Nov 2010 at 3:11
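
On the suffix question raised above: as far as the standard java launcher documents it, a -Xmx value with no suffix is read as bytes, while k, m and g select kilobytes, megabytes and gigabytes. The following is a minimal Python sketch of that rule, not the JVM's or Refine's own parsing code; the name xmx_to_bytes is made up for illustration.

```python
import re

# Multipliers for the size suffixes accepted on -Xmx (case-insensitive).
# An empty suffix means the number is taken as plain bytes.
_SUFFIX_MULTIPLIER = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def xmx_to_bytes(flag):
    """Return the heap size in bytes requested by a flag such as '-Xmx5120M'.

    Hypothetical helper for illustration only.
    """
    match = re.fullmatch(r"-Xmx(\d+)([kKmMgG]?)", flag.strip())
    if match is None:
        raise ValueError("not a recognisable -Xmx flag: %r" % flag)
    number, suffix = match.groups()
    return int(number) * _SUFFIX_MULTIPLIER[suffix.lower()]


if __name__ == "__main__":
    # Compare the flag with and without the trailing 'M'.
    for flag in ("-Xmx5120", "-Xmx5120M"):
        print(flag, "->", xmx_to_bytes(flag), "bytes")
```

Under that rule, dropping the trailing M turns a 5 GB request into a few kilobytes, which is why the question matters.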

GoogleCodeExporter commented 8 years ago
Also...look at the 2nd line in your command window when you start refine with:
C:\your_path_to_refine.bat_file\refine /m 5132m

The 2nd line in the command window log should show how much memory was 
successfully allocated to the Java.exe process.

Another thing to try is to start Task Manager, click on the Processes tab, and 
look at how much memory the Java.exe process is using.

Finally double check things at our [FaqAllocateMoreMemory]

Original comment by thadguidry on 20 Nov 2010 at 3:23

GoogleCodeExporter commented 8 years ago
I actually did have "Xmx5120m". I then did another run using:

refine.bat /m 6000m >refine.out

and, after about 30 minutes got

HTTP ERROR 500

Problem accessing /command/core/create-project-from-upload. Reason:

    Java heap space

Caused by:

java.lang.OutOfMemoryError: Java heap space

The 'Peak working set memory' got to around 6295076K. I've attached the 
refine.out file. Any help would be appreciated (and thank you for the prompt 
help that you have given so far - it is appreciated)

Original comment by leebel...@gmail.com on 21 Nov 2010 at 5:32

Attachments:

GoogleCodeExporter commented 8 years ago
Can you split spatialinfo2.csv and try again? That's a BIG file. I've tested 
with a 1 GB file with 4 columns before, but the data was interspersed and so 
Refine absorbed it in about 10 mins. Your file, on the other hand, might just 
be TOO BIG for the current architecture. Anyone else have ideas for this 
fellow?

Original comment by thadguidry on 21 Nov 2010 at 5:52

GoogleCodeExporter commented 8 years ago
Yes, sorry, it is a big file: 17 million records by 4 columns. I can split it, 
but it makes the analysis considerably more complicated. Refine looked like THE 
ideal way to analyse these records.

The file contains the complete location records of all species occurrences in 
the Australian region for the Atlas of Living Australia (www.ala.org.au: I am 
the Spatial Data Manager). Columns 1 and 2 are latitude and longitude in 
decimal degrees. Column 3 is spatial accuracy (numeric, text, or both) and 
column 4 is a text description of the location.

I am trying to analyse all the variations of column 3 and, in conjunction with 
columns 1, 2 and 4, develop an estimate of spatial uncertainty.

Needless to say, any ideas would be greatly appreciated. I certainly have 
greatly appreciated your help on this one. 

I went out Saturday to purchase 4 x 2GB memory sticks in the hope that 8GB 
would suffice. I did, BTW, have 1GB set up as paging on an SSD. I will try to 
see if 6500mb gets me closer.

Original comment by leebel...@gmail.com on 21 Nov 2010 at 8:28
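
The column 3 tally described above does not need the whole file in memory. As a rough sketch outside Refine (plain Python, assuming a comma-separated UTF-8 file with no header row; tally_column and its parameters are hypothetical names, not part of Refine), the file can be streamed one row at a time while counting each distinct value:

```python
import csv
from collections import Counter


def tally_column(path, column_index=2, encoding="utf-8"):
    """Count how often each distinct value occurs in one column of a CSV file.

    Rows are read one at a time, so memory use grows with the number of
    distinct values rather than with the number of rows.
    column_index is zero-based: 2 means the third column.
    """
    counts = Counter()
    with open(path, newline="", encoding=encoding) as handle:
        for row in csv.reader(handle):
            # Treat rows that are too short as having an empty value.
            value = row[column_index].strip() if column_index < len(row) else ""
            counts[value] += 1
    return counts
```

Calling tally_column("spatialinfo2.csv").most_common(20) would then list the twenty most frequent spellings, which may be enough to see how varied the column is before deciding how to split the file.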

GoogleCodeExporter commented 8 years ago
That should be plenty of memory for this case unless Refine is being grossly 
inefficient or the text descriptions are enormous.  What is the total raw 
(uncompressed) size of the input data?

Thad - for your 1M row case, what was the size of the input data and what was 
the resulting virtual size of the Refine process?  It will vary by data type, 
but I'd expect memory usage to be basically linear in this range (1M-17M rows).

Original comment by tfmorris on 21 Nov 2010 at 12:43

GoogleCodeExporter commented 8 years ago
My test filesize was 1 GB on disk.  1 million rows, interspersed data
along 4 columns - it was an injection of the NFDC data, so my column 1
was REALLY long, like 800 chars at times I recall, 20% blanks in
columns 3 & 4.

The virtual size of the Refine (Gridworks) NFDC test project came out
to around 350 - 400 MB. hmm...maybe the blanks helped reduce that
here?

After 10 minutes of importing (back in 1.1 days) my Java.exe. process
peaked to 800 MB in Windows7 using Java64bit and 8 GB Ram for heap.
During my initial testing...the automatic saving project to disk was a
bit too aggressive and David tuned it a bit in Issue-3.

I'm thinking that he could probably use Rapid Miner instead to handle
his analysis.  It is a good match for doing exactly that kind of
analysis as well.  You might want to download it and give it a try.
We have the link under RelatedSoftware on wiki.

Still, I think that we probably need to go back and really test
Refine's memory utilization (post 1.1) to make sure that it is within
parameters still.  I haven't done it to that capacity in a while.

Original comment by thadguidry on 21 Nov 2010 at 5:48

GoogleCodeExporter commented 8 years ago
Thanks for the reference to Rapid Miner. I will take a look. But I hope you are 
not admitting defeat for Refine on my data :) I'd be happy for you guys to take 
a look at our data as a test case. To me, it looked like a classic fit for 
Refine.

The zipped file is 40mb (attachment limit is 10mb) so I've put it here: 
http://dl.dropbox.com/u/8650868/spatialinfo2.zip. 

Please let me know when you have it (or if you don't want it). Thanks again for 
your support on this issue. Impressive.

Original comment by leebel...@gmail.com on 21 Nov 2010 at 8:39

GoogleCodeExporter commented 8 years ago
I downloaded it. Sure enough, so far I get the same results you do, which is 
unfortunate. Using refine /m 6144m, Java.exe climbed to 6545m usage and 
seemed to progress well until it got to 66% uploading complete, then tanked, 
rapidly jumped to 100% complete, and hit the Error 500 Java heap space, all 
within 5 mins.

Thanks, we'll investigate more and let you know. (Diving into profiling now.)

Original comment by thadguidry on 22 Nov 2010 at 12:08

Attachments:

GoogleCodeExporter commented 8 years ago
With JAVA_OPTIONS="-XX:-UseParallelGC" and 8144m, it got to 82% and then:

java.lang.OutOfMemoryError: Java heap space
    at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:45)
    at java.lang.StringBuilder.<init>(StringBuilder.java:80)
    at au.com.bytecode.opencsv.CSVParser.parseLine(Unknown Source)
    at au.com.bytecode.opencsv.CSVParser.parseLineMulti(Unknown Source)
    at com.google.refine.importers.TsvCsvImporter.getCells(TsvCsvImporter.java:196)
    at com.google.refine.importers.TsvCsvImporter.read(TsvCsvImporter.java:163)
    at com.google.refine.importers.TsvCsvImporter.read(TsvCsvImporter.java:74)
    at com.google.refine.commands.project.CreateProjectCommand.internalInvokeImporter(CreateProjectCommand.java:478)
    at com.google.refine.commands.project.CreateProjectCommand.load(CreateProjectCommand.java:341)
    at com.google.refine.commands.project.CreateProjectCommand.internalImportFile(CreateProjectCommand.java:327)
    at com.google.refine.commands.project.CreateProjectCommand.internalImport(CreateProjectCommand.java:169)
    at com.google.refine.commands.project.CreateProjectCommand.doPost(CreateProjectCommand.java:112)
    at com.google.refine.RefineServlet.service(RefineServlet.java:170)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1166)
    at org.mortbay.servlet.UserAgentFilter.doFilter(UserAgentFilter.java:81)
    at org.mortbay.servlet.GzipFilter.doFilter(GzipFilter.java:132)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:418)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:326)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
    at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:938)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:755)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)

Original comment by thadguidry on 22 Nov 2010 at 1:15

GoogleCodeExporter commented 8 years ago
Switching testing on my Win7 system to JDK 6u21 and enabling the experimental 
G1 garbage collector to get a different profile for comparison.

Original comment by thadguidry on 22 Nov 2010 at 1:41

GoogleCodeExporter commented 8 years ago
The issue appears to be within CSVParser or the wiring to cells. Refine was 
not originally designed to handle more than 100,000 rows. This will require a 
revisit, with possible underlying architecture changes, in later revisions. (In 
other words, we're admitting defeat for now rather than spending more time on 
the issue, until we can devote time to thinking through an architecture 
redesign to handle larger datasets such as this.)

  Thanks again for trying Google Refine and don't lose hope, we'll get there I'm sure. Like Tom says, this file should be easy cheesy. For instance, I was able to fully open your .csv file in Notepad++ with it taking only about 600MB of memory, so it should be possible in Refine as well, in theory. We just need time to track down the (as yet undiscovered) bugs behind it.

Original comment by thadguidry on 22 Nov 2010 at 4:29

GoogleCodeExporter commented 8 years ago
Thanks, guys. Appreciate the work you put in. It would have been nice to Refine 
the file, but if it can help identify and address issues, it's not a total loss.

Original comment by leebel...@gmail.com on 22 Nov 2010 at 5:12

GoogleCodeExporter commented 8 years ago
I'm getting a 404 from the dropbox url.  Was it a one time download?

Thad - since you have the only copy, can you post the memory profile and the 
results of your debugging?

Original comment by tfmorris on 22 Nov 2010 at 5:53

GoogleCodeExporter commented 8 years ago
Sorry, thought you had finished with it. I've put it back up at 
http://dl.dropbox.com/u/8650868/spatialinfo2.zip 

Original comment by leebel...@gmail.com on 22 Nov 2010 at 6:03

GoogleCodeExporter commented 8 years ago
Thanks for putting it back.  I've got my copy, although Stefano or David may 
want one as well.  We can probably arrange to share among the team members if 
you want to take your copy down.

From the back of the envelope calculations that I did, assuming that the file 
is relatively homogeneous, you're looking at heap requirements of over 8GB to 
import the whole file.  If you set your max heap size to, say 9 or 10 GB, and 
had a sufficiently sized page file, you should be able to get it imported, but 
you'll be hitting disk for paging with every pass through the data, which could 
put a significant damper on performance (really depends on the access 
characteristics of the algorithms that you end up using for your analysis, 
although I'd assume the vast majority of them are just linear sweeps through 
the rows).

A typical row contains four cells: two doubles, an empty cell, and an 
88-character string, totaling 480 bytes on a 64-bit machine (it'll be slightly 
less on a 32-bit processor).

I think we can probably do better than this, but for now that's what you're 
dealing with...

Original comment by tfmorris on 23 Nov 2010 at 6:10
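
The back-of-envelope estimate above can be reproduced in a few lines. This sketch only multiplies the figures quoted in the thread (17 million rows, roughly 480 bytes per typical row); it assumes every row looks like the typical one and ignores any other overhead Refine may carry. The function name is illustrative.

```python
ROWS = 17_000_000      # from the thread: 17 million records
BYTES_PER_ROW = 480    # estimate quoted above for a typical row on a 64-bit JVM


def estimated_heap_bytes(rows, bytes_per_row):
    """Naive linear estimate: every row is assumed to cost the same."""
    return rows * bytes_per_row


if __name__ == "__main__":
    total = estimated_heap_bytes(ROWS, BYTES_PER_ROW)
    print("{:,} bytes".format(total))
    print("{:.2f} GB (decimal)".format(total / 10 ** 9))
    print("{:.2f} GiB (binary)".format(total / 2 ** 30))
```

With these inputs the product is 8,160,000,000 bytes, which is where the "over 8GB" figure comes from. Under the same assumptions, half the rows would come to about 4.08 billion bytes.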

GoogleCodeExporter commented 8 years ago
Thanks. I took down the file again but happy to put it back up if you need it. 
I looked at RapidMiner but it's way too broad a system for me to get into for 
this one application.

So, I split the file in half and got the first half into Refine with no 
problems. Now I'm starting to come to grips with it. Even in half, many 
operations take a while - but that is AOK with me. Slow is fine, busted is 
something else.

Thanks again for your support with this one!

Original comment by leebel...@gmail.com on 23 Nov 2010 at 6:44
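
For anyone else who needs to split a file the way described above, here is a hedged sketch in plain Python. It assumes a comma-separated UTF-8 file; the function name split_csv, the has_header flag and the .partN output naming are all illustrative choices, not anything Refine provides. It makes two passes over the file: one to count rows and one to write the parts.

```python
import csv
import math
from pathlib import Path


def split_csv(source, parts=2, has_header=False, encoding="utf-8"):
    """Split a CSV file into `parts` files with roughly equal row counts.

    Rows are streamed, so the whole file is never held in memory.
    Returns the list of output paths, written next to the source file.
    """
    source = Path(source)

    # First pass: count data rows (csv.reader copes with quoted newlines).
    with source.open(newline="", encoding=encoding) as handle:
        reader = csv.reader(handle)
        header = next(reader, None) if has_header else None
        total_rows = sum(1 for _ in reader)
    rows_per_part = max(1, math.ceil(total_rows / parts))

    # Second pass: copy rows, starting a new output file every rows_per_part rows.
    outputs = []
    out = None
    writer = None
    with source.open(newline="", encoding=encoding) as handle:
        reader = csv.reader(handle)
        if has_header:
            next(reader, None)  # skip the header; it is re-written per part
        try:
            for index, row in enumerate(reader):
                if index % rows_per_part == 0:
                    if out is not None:
                        out.close()
                    name = "{}.part{}{}".format(
                        source.stem, len(outputs) + 1, source.suffix)
                    path = source.with_name(name)
                    out = path.open("w", newline="", encoding=encoding)
                    writer = csv.writer(out)
                    if header is not None:
                        writer.writerow(header)
                    outputs.append(path)
                writer.writerow(row)
        finally:
            if out is not None:
                out.close()
    return outputs
```

For example, split_csv("spatialinfo2.csv", parts=2) would write spatialinfo2.part1.csv and spatialinfo2.part2.csv alongside the original, each of which can then be imported as its own project.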

GoogleCodeExporter commented 8 years ago
I've updated the header to align better with the actual issue.  I'm not sure 
it's something that's fixable, but I'll leave it open as a data point for some 
future person working on memory performance optimization.

Original comment by tfmorris on 7 Jan 2011 at 4:28

GoogleCodeExporter commented 8 years ago
Thanks. When I get time, I'm still plugging away using half the file (using 
"refine /m 6000m"): Refine is a very neat tool.

Original comment by leebel...@gmail.com on 9 Jan 2011 at 9:18

GoogleCodeExporter commented 8 years ago
Issue 346 has been merged into this issue.

Original comment by dfhu...@gmail.com on 11 Mar 2011 at 7:48

GoogleCodeExporter commented 8 years ago
Hello,

I am also a biologist working with large files, and I have the same issue that 
is discussed above when trying to load a 3.0G file in Refine. I think that my 
database contains many, many more cells than the example above. That is, many 
more columns but fewer rows. Has there been any progress on fixing this bug 
since March? Thanks!

Original comment by dylan.o....@gmail.com on 28 Jun 2012 at 8:30

GoogleCodeExporter commented 8 years ago
Dylan,
You may want to look at Taverna http://www.taverna.org.uk/ for your specific 
needs instead.

Original comment by thadguidry on 28 Jun 2012 at 3:30

GoogleCodeExporter commented 8 years ago
Hello

I was trying to cluster rows using the text clustering feature on 50,000 rows. 
Initially my file was around 800,000 rows, but I reduced it to 50,000 rows and 
also increased the VM memory to 5120M on my machine. I have a Mac with 8GB of 
memory. I wonder if there is a feasible solution for clustering rows? My file 
size is 1.1 MB. Has anyone in the past been successful in using the text 
mining feature with large files? Any comments or suggestions are greatly 
appreciated.

Veeresh

Original comment by vthumm...@gmail.com on 9 Apr 2014 at 3:05

GoogleCodeExporter commented 8 years ago
Hello,
I work at a software company in Brazil and we are currently developing a 
tool for data cleansing, but only for datasets on the order of millions of 
records. I would love to hear about your problems and help with them. We will 
have a free version of our tool.
I leave my email: pedro.magalhaes@stoneage.com.br
vthumm, dylan, leebel feel free to contact me.

Original comment by pedror...@gmail.com on 18 Sep 2014 at 8:09