slaveofcode / boilerpipe3

A fork of boilerpipe with Python 3 support and small fixes, ported from https://pypi.python.org/pypi/boilerpipe-py3.

java.lang.OutOfMemoryError: Java heap space #8

Open cgr71ii opened 2 years ago

cgr71ii commented 2 years ago

Hi!

I've been using Boilerpipe with Bitextor, and everything has worked fine. The problem is that when I processed one particular PDF file, I ran out of memory and the execution failed. The error message I got is:

Traceback (most recent call last):                                                                                                                                                                           
  File "BoilerpipeSAXInput.java", line 51, in de.l3s.boilerpipe.sax.BoilerpipeSAXInput.getTextDocument                                                                                                       
  File "BoilerpipeSAXInput.java", line 63, in de.l3s.boilerpipe.sax.BoilerpipeSAXInput.getTextDocument                                                                                                       
  File "org.apache.xerces.parsers.AbstractSAXParser.java", line -1, in org.apache.xerces.parsers.AbstractSAXParser.parse                                                                                     
  File "org.apache.xerces.parsers.XMLParser.java", line -1, in org.apache.xerces.parsers.XMLParser.parse                                                                                                     
  File "HTMLConfiguration.java", line 452, in org.cyberneko.html.HTMLConfiguration.parse                                                                                                                     
  File "HTMLConfiguration.java", line 499, in org.cyberneko.html.HTMLConfiguration.parse                                                                                                                     
  File "HTMLScanner.java", line 907, in org.cyberneko.html.HTMLScanner.scanDocument                                                                                                                          
  File "HTMLScanner.java", line 1967, in org.cyberneko.html.HTMLScanner$ContentScanner.scan                                                                                                                  
  File "HTMLScanner.java", line 2291, in org.cyberneko.html.HTMLScanner$ContentScanner.scanCharacters                                                                                                        
  File "DefaultFilter.java", line 152, in org.cyberneko.html.filters.DefaultFilter.characters                                                                                                                
  File "HTMLTagBalancer.java", line 954, in org.cyberneko.html.HTMLTagBalancer.characters                                                                                                                    
  File "org.apache.xerces.parsers.AbstractSAXParser.java", line -1, in org.apache.xerces.parsers.AbstractSAXParser.characters                                                                                
  File "BoilerpipeHTMLContentHandler.java", line 293, in de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler.characters                                                                                       
  File "BitSet.java", line 447, in java.util.BitSet.set                                                                                                                                                      
  File "BitSet.java", line 352, in java.util.BitSet.expandTo                                                                                                                                                 
  File "BitSet.java", line 337, in java.util.BitSet.ensureCapacity                                                                                                                                           
  File "Arrays.java", line 3308, in java.util.Arrays.copyOf                                                                                                                                                  
Exception: Java Exception                                                                                                                                                                                    

The above exception was the direct cause of the following exception:                                                                                                                                         

Traceback (most recent call last):                                                                                                                                                                           
  File "<stdin>", line 1, in <module>
  File "/home/cgarcia/miniconda3/envs/bitextor/lib/python3.8/site-packages/boilerpipe/extract/__init__.py", line 67, in __init__
    self.source = BoilerpipeSAXInput(InputSource(reader)).getTextDocument()
java.lang.OutOfMemoryError: java.lang.OutOfMemoryError: Java heap space                                                                                                                                      

To take Bitextor out of the picture, I am attaching the HTML file that Bitextor generated from the PDF; this HTML is what triggers the problem. The file is 9.4 MB, and I don't know whether that is too large for Boilerpipe to handle. The problem is not related to PDF input as such, since I processed other PDFs and they finished without errors.

In the end, I figured out that the problem really was memory (initially I thought it was a memory leak), which seemed very strange to me for a 9.4 MB file. I fixed the problem by increasing the maximum heap size of the JVM that JPype starts. Processing the 9.4 MB HTML file required ~52 GB of memory! My system has 126 GB, so the default max. heap size of the JVM is 30 GB. Since the process needed 52 GB and the max. heap size was 30 GB, I was running out of memory.

I opened this issue to alert other people who might hit the same problem and to ask the following question: do these numbers make sense? I mean, 52 GB of memory for a 9.4 MB HTML file?

The code which triggers the error:

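from boilerpipe.extract import Extractor

# Read the whole HTML file into a single string
with open("boilerpipe_error.html") as f:
    text = f.read().strip()

# Constructing the Extractor parses the HTML; this is where the OutOfMemoryError is raised
Extractor(extractor='ArticleExtractor', html=text)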
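
To put a number on how much memory the call above really needs, one option is to read the peak resident set size of the Python process, which also covers the JVM heap because JPype runs the JVM in-process. The following is only a minimal sketch under that assumption; `peak_rss_mb` is a hypothetical helper name, and the `resource` module is available on Unix-like systems only.

```python
import resource
import sys


def peak_rss_mb():
    """Return the peak resident set size of this process in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


if __name__ == "__main__":
    before = peak_rss_mb()
    # ... run the Extractor(...) call here ...
    after = peak_rss_mb()
    print(f"peak RSS went from {before:.0f} MiB to {after:.0f} MiB")
```

The value is a high-water mark for the whole process, so it should be read after the extraction finishes (or after the exception is caught), not per call.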

The fix (run it before the code above; it should work, but I haven't tested it outside of the file it was taken from, so I might have missed something):

import os
import importlib.util
import jpype

# Maximum JVM heap size in MB (80 GB here); a negative value keeps the JVM default
boilerpipe_max_heap_size = 80 * 1024 # TODO change this value

if not jpype.isJVMStarted():
    max_heap_size = f"-Xmx{boilerpipe_max_heap_size}M" if boilerpipe_max_heap_size >= 0 else ''
    jars = []

    # Collect every .jar shipped in the data directory of the boilerpipe package
    package_dir = os.path.dirname(importlib.util.find_spec("boilerpipe").origin)

    for top, dirs, files in os.walk(os.path.join(package_dir, 'data')):
        for nm in files:
            if nm.endswith(".jar"):
                jars.append(os.path.join(top, nm))

    jpype.addClassPath(os.pathsep.join(jars))

    jargs = [jpype.getDefaultJVMPath()]

    if max_heap_size != '':
        jargs.append(max_heap_size)

    # The JVM must be started before boilerpipe is used, otherwise the heap option is not applied
    jpype.startJVM(*jargs, convertStrings=False)

# ... run boilerpipe
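
To confirm that the `-Xmx` option was actually picked up, the running JVM can be asked for its heap limits through `java.lang.Runtime`. This is a sketch, not part of boilerpipe: `bytes_to_gib` and `jvm_heap_report` are hypothetical helper names, and the second one assumes JPype is installed and the JVM has already been started.

```python
def bytes_to_gib(num_bytes):
    """Convert a byte count to GiB."""
    return num_bytes / (1024 ** 3)


def jvm_heap_report():
    """Return (used, committed, max) JVM heap sizes in GiB.

    Assumes JPype is installed and the JVM is already running.
    """
    import jpype  # imported lazily so bytes_to_gib works without JPype

    runtime = jpype.JClass("java.lang.Runtime").getRuntime()
    total = int(runtime.totalMemory())   # heap currently committed by the JVM
    free = int(runtime.freeMemory())     # unused part of the committed heap
    maximum = int(runtime.maxMemory())   # upper bound, controlled by -Xmx
    return tuple(bytes_to_gib(n) for n in (total - free, total, maximum))
```

Calling `jvm_heap_report()` right after `jpype.startJVM(...)` should show a maximum of roughly 80 GiB with the settings above (the JVM may report slightly less than the `-Xmx` value), and calling it again after the extraction gives a rough idea of how much heap the parse consumed.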

html.tar.gz

Thrameos commented 2 years ago

What version of JPype is being used here?

cgr71ii commented 2 years ago

Package JPype1, version 1.3.0

Thrameos commented 2 years ago

Given that there is only one JPype call, I suspect this is all on the Java side. Older versions of JPype have reference counting issues that can cause memory to leak on the Java side. I would repeat the experiment in pure Java to verify that the issue is indeed a pure Java problem.