machinalis / iepy

Information Extraction in Python
BSD 3-Clause "New" or "Revised" License

Unable to retrace the segments from the text #115

Closed SindhuBairavi closed 7 years ago

SindhuBairavi commented 7 years ago

When I use rules to fetch evidence candidates, the output only contains the segment id and true/false. How do I fetch the segment text from the content? I understand that I can use the offsets, but these offsets are not character-level; they seem to be token-level. Which word tokeniser/segmenter is used, so that I can replicate the same split? Otherwise the numbers may not match.

Any help will be appreciated! Thanks!

jmansilla commented 7 years ago

Hi there.

The csv contains EvidenceCandidates ids. Each EvidenceCandidate contains: a text segment, a left entity occurrence, a right entity occurrence and a relation.
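On the token-offset question, here is a minimal sketch of how a segment's text can be rebuilt from token-level offsets. It assumes the document keeps its tokens as a list and the segment stores a half-open token range. The `Segment` class and its field names are hypothetical stand-ins for illustration, not iepy's actual models.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """Hypothetical stand-in for a text segment (not iepy's model)."""
    doc_tokens: List[str]  # tokens of the whole document
    offset: int            # index of the first token (assumed inclusive)
    offset_end: int        # one past the last token (assumed exclusive)

    def tokens(self) -> List[str]:
        # Offsets count tokens, not characters, so slice the token list.
        return self.doc_tokens[self.offset:self.offset_end]

    def text(self) -> str:
        # Joining on spaces is a simplification; original spacing is lost.
        return " ".join(self.tokens())


doc_tokens = "Peter was born in 1916 . He married Anna in 1930 .".split()
segment = Segment(doc_tokens, offset=6, offset_end=12)
print(segment.text())
```

Slicing the tokens that were stored at preprocessing time avoids re-running a tokeniser; re-tokenising with a different tool may split the text differently and shift the indexes.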

To visualize everything together you can use the TerminalEvidenceFormatter (import it with: from iepy.extraction.terminal import TerminalEvidenceFormatter).

You can see an example of the formatter in use here: https://github.com/machinalis/iepy/blob/develop/iepy/instantiation/rules_verifier.py

SindhuBairavi commented 7 years ago

If I don't want to use the UI, but just want to generate the list of sentences/segments which the rules have marked as evidence candidates, how do I fetch the segments?

jmansilla commented 7 years ago

Here I go again.

As I tried to explain before, an iepy runner returns EvidenceCandidates. Each one is a piece of text with some important parts highlighted (the EntityOccurrences). In some cases, the same piece of text may be part of several different EvidenceCandidates. For example, consider the following text:

"Peter was born in 1916, he married Anna in 1930, and died in 1950"

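If you have a relation "Person" - "Date", you would have the following 6 EvidenceCandidates, all sharing that same text: (Peter, 1916), (Peter, 1930), (Peter, 1950), (Anna, 1916), (Anna, 1930), (Anna, 1950).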
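The six candidates are every (person, date) pairing over the same text. A plain-Python sketch of that pairing, with the entity lists written by hand as an assumption rather than detected by iepy:

```python
from itertools import product

text = "Peter was born in 1916, he married Anna in 1930, and died in 1950"

# Hand-written for illustration; not the output of any iepy call.
persons = ["Peter", "Anna"]
dates = ["1916", "1930", "1950"]

# One candidate per (person, date) pair, all sharing the same text.
candidates = [(text, person, date) for person, date in product(persons, dates)]

for _, person, date in candidates:
    print(person, date)
```

Every tuple carries the same text, so the text alone cannot tell two candidates apart; the entity pair is what distinguishes them.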

That's why I still insist that you may need not only the "sentence or text" but all the information.

Moreover, the TerminalEvidenceFormatter I mentioned before is a tool that prints a piece of text to standard output, highlighting the corresponding entity occurrences in different colors. If that is not exactly what you want, you could adapt this piece of code to your needs: the function "colored_text" here https://github.com/machinalis/iepy/blob/develop/iepy/extraction/terminal.py#L141:L166
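As a rough idea of what such an adaptation might look like, the sketch below marks the two entity occurrences with plain-text tags instead of terminal colors. It is not the code of "colored_text"; the function name `mark_entities`, the tag format, and the half-open (start, end) token spans relative to the segment are all assumptions made for this example.

```python
from typing import List, Tuple


def mark_entities(tokens: List[str],
                  left: Tuple[int, int],
                  right: Tuple[int, int]) -> str:
    """Join tokens, wrapping two token spans in <L>...</L> and <R>...</R>.

    Spans are half-open (start, end) token indexes within the segment
    and are assumed not to overlap.
    """
    pieces = []
    for i, token in enumerate(tokens):
        piece = token
        if i == left[0]:
            piece = "<L>" + piece
        if i == right[0]:
            piece = "<R>" + piece
        if i == left[1] - 1:
            piece += "</L>"
        if i == right[1] - 1:
            piece += "</R>"
        pieces.append(piece)
    return " ".join(pieces)


tokens = "Peter was born in 1916".split()
print(mark_entities(tokens, left=(0, 1), right=(4, 5)))
```

Writing one such line per candidate to a file would give a list of segments with both entities still visible, without going through the UI.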

Hope it helps

-- Javier Mansilla - Technical Leader www.machinalis.com