How to make Ollie or OpenIE work with medium size of data in Java project

knowitall / ollie

Ollie is an open information extractor that uses bootstrapped dependency paths.

http://knowitall.github.io/ollie/

How to make Ollie or OpenIE work with medium size of data in Java project #27

Closed Yongyao closed 6 years ago

Yongyao commented 8 years ago

Hi everyone,

I am trying to use either Ollie or OpenIE to extract knowledge from around 100 - 1,000 ocean science web pages. I have imported OpenIE (4.2) into my Java project through Maven. It works well with a few sentences, but once the input grows beyond about two pages, I start to see an "out of memory" error. I have set the heap size to -Xmx2700m.

Is there any way to make it work without modifying the source code (I don't know Scala)?

BTW, I also tried Stanford OpenIE. Although it hasn't given me any errors so far, the triples it extracted were pretty messy.

Thanks, Cody

swarnaHub commented 8 years ago

Could you increase the memory option, for example to -Xmx4g? If it still doesn't work, let me know approximately how many sentences you are working with.
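When the flag is set in an IDE run configuration or a Maven setting, it may not reach the JVM that actually runs the extraction, so it can help to check the effective limit from inside the project. A minimal sketch; the class name `HeapCheck` is hypothetical, and `Runtime.maxMemory()` is standard Java:

```java
// Hypothetical helper: reports the maximum heap the running JVM will try to use.
class HeapCheck {
    /** Returns the JVM's maximum heap size in MiB, as reported by the runtime. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // With -Xmx4g this should report a value close to 4096 MiB; the exact
        // number depends on the JVM and garbage collector in use.
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

If the reported value is far below the value passed with -Xmx, the option is probably being applied to a different JVM than the one running the extractor.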

Yongyao commented 8 years ago

The web page I was trying to process is this: https://podaac.jpl.nasa.gov/ADEOS-II , which is a pretty small page.

Yes, it works when I increase the heap to 4g, but what if I want to process a large document such as an academic paper? After looking at the code, I started to wonder whether it is really necessary to pass the entire text as one string into OpenIE, because I didn't see any cross-sentence dependency when experimenting with "Obama is the US president. He works in White House." Therefore, I split the raw text into sentences with OpenNLP and pass each of them to openIE(sentence[i]). That works with the original heap size (2700m).

I compared the two sets of results and there is a slight difference, which makes me think there might be some additional processing inside. Also, it looks like Stanford OpenIE (http://nlp.stanford.edu/software/openie.html) is somehow built on Ollie or ReVerb, so why are their results so different for the same input?

swarnaHub commented 8 years ago

There is actually no need to pass the entire text as a string into OpenIE. The OpenIE demo (https://github.com/OpenIE-HelperCodes/OpenIEDemo1/blob/master/src/runner/RunMe.java) shows the usage for a single sentence, and one is expected to pass a single sentence at a time.
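The sentence-at-a-time pattern can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the linked demo: it uses the JDK's `java.text.BreakIterator` as a stand-in for the OpenNLP sentence detector mentioned earlier in the thread, and the `extractor` argument is a hypothetical placeholder for whatever call the project makes into OpenIE for one sentence.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

class SentenceBySentence {
    /** Splits raw text into trimmed, non-empty sentences using the JDK's BreakIterator. */
    static List<String> splitSentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    /**
     * Runs the extractor on one sentence at a time and collects the results,
     * so the extractor never sees the whole document as a single string.
     * The extractor is a placeholder for the real per-sentence OpenIE call.
     */
    static <T> List<T> extractAll(String text, Function<String, List<T>> extractor) {
        List<T> all = new ArrayList<>();
        for (String sentence : splitSentences(text)) {
            all.addAll(extractor.apply(sentence));
        }
        return all;
    }

    public static void main(String[] args) {
        String text = "Obama is the US president. He works in White House.";
        // Placeholder extractor: echoes the sentence instead of producing triples.
        List<String> results = extractAll(text, s -> Collections.singletonList("sentence: " + s));
        results.forEach(System.out::println);
    }
}
```

A rule-based splitter like this can mis-split around abbreviations, so a trained sentence detector such as the OpenNLP one is likely the better choice for real web pages; the loop structure stays the same either way.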

mishumausam commented 8 years ago

Also, while the current version was designed with the lessons from ReVerb and Ollie in mind, the code itself has no direct relation to them. So the results are expected to be different: typically better, but for some sentences they could be worse.

Thanks! Mausam

Yongyao commented 8 years ago

Alright. Thanks guys. It looks like both packages cannot handle pronouns: given the text "Obama is the US president. He works in White House.", they don't know that "he" in the second sentence is "Obama". In that case, I think I can just split the text into sentences myself anyway.

mishumausam commented 8 years ago

That's correct. You can easily combine it with any coreference resolution system and alpha-substitute pronouns to referents.
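As a rough sketch of the substitution step only: assume some coreference system has already produced character spans for the pronouns in a sentence, together with their referents. The `Mention` class and `substitute` method below are hypothetical names for illustration and do not come from OpenIE or any particular coreference library.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class PronounSubstitution {
    /** Hypothetical coreference output: a pronoun span [start, end) and its referent. */
    static final class Mention {
        final int start;
        final int end;
        final String referent;

        Mention(int start, int end, String referent) {
            this.start = start;
            this.end = end;
            this.referent = referent;
        }
    }

    /** Replaces each pronoun span with its referent; spans are assumed not to overlap. */
    static String substitute(String sentence, List<Mention> mentions) {
        List<Mention> sorted = new ArrayList<>(mentions);
        // Apply replacements right to left so earlier offsets stay valid.
        sorted.sort(Comparator.comparingInt((Mention m) -> m.start).reversed());
        StringBuilder sb = new StringBuilder(sentence);
        for (Mention m : sorted) {
            sb.replace(m.start, m.end, m.referent);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "He" occupies characters 0..2 of the second sentence.
        List<Mention> mentions = Collections.singletonList(new Mention(0, 2, "Obama"));
        System.out.println(substitute("He works in White House.", mentions));
        // prints "Obama works in White House."
    }
}
```

The rewritten sentence can then be passed to the extractor on its own, like any other single sentence.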

Thanks! Mausam