misja / python-boilerpipe

Python interface to Boilerpipe, Boilerplate Removal and Fulltext Extraction from HTML pages

A fatal error has been detected by the Java Runtime Environment: SIGSEGV (0xb) #17

Open rshiva opened 10 years ago

rshiva commented 10 years ago

Hey, python-boilerpipe works perfectly from the console and as a script, but when I try it in my Flask application it breaks. It breaks when I instantiate Extractor and pass it the URL. This is what I get:

http://pastebin.com/Rhzfh3hE

Initially I thought the problem was coming from jpype, so I raised a ticket there too. It didn't help much: https://github.com/originell/jpype/issues/22

Environment details

Python - 2.7.3
java version "1.7.0_45"

Flask==0.10.1
JPype1==0.5.4.5
boilerpipe==1.2.0.0

I saw that a similar issue had already been raised, but it didn't help much :-/ Any help will be appreciated. Thanks.

originell commented 10 years ago

So the snippet posted in originell/jpype#22 does not help? If this is a boilerpipe issue, I can close the issue in our jpype fork. ;>

rshiva commented 10 years ago

I don't know. As @tcalmant mentioned, I can start the JVM and attach the thread, so I think the problem is with boilerpipe. I have also posted on Stack Overflow, which may give you a better idea of the problem: http://stackoverflow.com/questions/21310011/jvm-crashes-while-implementing-python-boilerpipe-in-flask-app

tcalmant commented 10 years ago

According to the trace posted on pastebin, this is a class loading problem. I suppose it comes from lines 56-57 in boilerpipe/extract/__init__.py, where jPype is used to load the specified extractor.

Could you add some traces around these lines? (Use a logger, and/or don't forget to flush sys.stdout/sys.stderr.) Also, is the buggy code public, or do you have a snippet with similar behaviour? I'll check the problem this evening (European time zone).
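A minimal sketch of the kind of tracing meant here, in case it helps. The `traced` decorator and the `load_extractor` stand-in are hypothetical names, not part of boilerpipe or jPype; the idea is only to log before and after the suspect call and flush immediately, so that the last line written survives a hard JVM crash.

```python
import functools
import logging
import sys

# Log to stderr with the thread name, which matters for JVM attach problems.
logging.basicConfig(stream=sys.stderr, level=logging.DEBUG,
                    format="%(asctime)s %(threadName)s %(message)s")
log = logging.getLogger("boilerpipe.trace")


def traced(func):
    """Log entry and exit of func, flushing stderr after each message."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        log.debug("entering %s", func.__name__)
        sys.stderr.flush()
        try:
            return func(*args, **kwargs)
        finally:
            # Not reached if the process dies with SIGSEGV; in that case
            # the "entering" line is the last trace in the log.
            log.debug("leaving %s", func.__name__)
            sys.stderr.flush()
    return wrapper


@traced
def load_extractor(name):
    # Hypothetical stand-in for the jPype class-loading call under suspicion.
    return "loaded " + name
```

If the log ends with an "entering" line and no matching "leaving" line, the crash happened inside that call.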

tcalmant commented 10 years ago

OK, I've reproduced the bug: the thread that calls the JVM is not attached to it, so the calls to JVM internals fail. The bug comes from boilerpipe (see below).

First, the monkey patch: in the code you posted on Stack Overflow, you just have to attach the thread before creating the extractor, like this:

import jpype
from boilerpipe.extract import Extractor

class ExtractingContent:
    @classmethod
    def processingContent(cls, sourceUrl, extractorType="DefaultExtractor"):
        # Attach the calling thread to the JVM before touching any Java class.
        print("State= %s" % jpype.isThreadAttachedToJVM())
        if not jpype.isThreadAttachedToJVM():
            print("Needs to attach...")
            jpype.attachThreadToJVM()
            print("Check Attached= %s" % jpype.isThreadAttachedToJVM())
        extractor = Extractor(extractor=extractorType, url=sourceUrl)

About boilerpipe: the check `if threading.activeCount() > 1` in boilerpipe/extract/__init__.py, line 50, is wrong. The calling thread must always be attached to the JVM, even if it is the only thread.
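To illustrate the point, here is a rough sketch of an attach-if-needed helper that does not look at the thread count at all. `FakeJVM`, `ensure_attached` and `run_in_worker` are hypothetical names used only so the example runs without a JVM; `FakeJVM` mimics just the two jpype functions used above (`isThreadAttachedToJVM` and `attachThreadToJVM`). This is not the actual boilerpipe fix.

```python
import threading


class FakeJVM(object):
    """Hypothetical stand-in for the jpype module, tracking attached threads."""

    def __init__(self):
        self._attached = set()
        self._lock = threading.Lock()

    def isThreadAttachedToJVM(self):
        with self._lock:
            return threading.current_thread() in self._attached

    def attachThreadToJVM(self):
        with self._lock:
            self._attached.add(threading.current_thread())


def ensure_attached(jvm):
    """Attach the calling thread if needed; return True if an attach happened."""
    if jvm.isThreadAttachedToJVM():
        return False
    jvm.attachThreadToJVM()
    return True


def run_in_worker(jvm):
    """Call ensure_attached from a fresh thread and report what it did."""
    result = []
    worker = threading.Thread(target=lambda: result.append(ensure_attached(jvm)))
    worker.start()
    worker.join()
    return result[0]
```

With the real library, the call would be `ensure_attached(jpype)` at the top of any function that uses a Java class. Every new thread needs its own attach, whether or not other threads are running, which is why a check on the number of active threads is not a reliable substitute.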

rshiva commented 10 years ago

@tcalmant Thanks for the patch, it's working fine :) @originell I think you can close the issue in jpype, since the problem comes from boilerpipe.

originell commented 10 years ago

Alright! Thanks for the clarification!

rshiva commented 10 years ago

@tcalmant Hey, I'm running the same example in production with nginx and uwsgi, but it breaks at this line:

 extractor = Extractor(extractor=extractorType, url=sourceUrl)

The log doesn't show any error; it just gets stuck there. It still works on its own as a script and in the Python console. Any idea?
(java version "1.7.0_51")

jimishjoban commented 10 years ago

To give more details: it fails silently when we run it via Python-rq (extracting the article in the background). I can see the rq worker running, but nothing really happens.

tcalmant commented 10 years ago

Hi, I'm not an expert in Python-rq or nginx/uwsgi :( Could you provide a test case?