hcss-utils / spacy-phrases

Extract phrases using spaCy.

capture "paragraph" #17

Closed hp0404 closed 2 years ago

hp0404 commented 2 years ago

Implemented a 'structural' solution in https://github.com/hcss-utils/spacy-phrases/commit/7aa01933e6f4ae58a6314d09901bd3e194150fcf, but for building new datasets in the future we still want to get the actual paragraphs (before the 'cleaning' step removes the '\n's).

Assumptions:

hp0404 commented 2 years ago

ok documenting stuff to pick it up tomorrow: I thought that setting custom boundaries would solve the issue

```python
from spacy.language import Language

@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text.startswith("\n"):
            doc[token.i + 1].is_sent_start = True
    return doc
```

- changing sentencizer pipeline:
```python
config = {"punct_chars": ["\n", "\n\n"]}
sentencizer = nlp.add_pipe("sentencizer", config=config)
```

but instead of splitting on 'paragraphs', both approaches split on sentence boundaries as well as on '\n' chars, so there should be another way to split only on '\n's rather than on the default punct_chars.

hp0404 commented 2 years ago

funnily enough, the custom Language.component works once you remove 'parser' from the pipeline:

```python
import spacy
from spacy.language import Language

@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text.startswith("\n"):
            doc[token.i + 1].is_sent_start = True
    return doc

nlp = spacy.load("en_core_web_sm", disable=["parser"])
nlp.add_pipe("set_custom_boundaries", first=True)
```

hp0404 commented 2 years ago

tested on the following sample:

test = """Both versions convey a topic; it’s pretty easy to predict that the paragraph will be about epidemiological evidence, but only the second version establishes an argumentative point and puts it in context. The paragraph doesn’t just describe the epidemiological evidence; it shows how epidemiology is telling the same story as etiology. Similarly, while Version A doesn’t relate to anything in particular, Version B immediately suggests that the prior paragraph addresses the biological pathway (i.e. etiology) of a disease and that the new paragraph will bolster the emerging hypothesis with a different kind of evidence. As a reader, it’s easy to keep track of how the paragraph about cells and chemicals and such relates to the paragraph about populations in different places.\n A last thing to note about key sentences is that academic readers expect them to be at the beginning of the paragraph. (The first sentence this paragraph is a good example of this in action!) This placement helps readers comprehend your argument. To see how, try this: find an academic piece (such as a textbook or scholarly article) that strikes you as well written and go through part of it reading just the first sentence of each paragraph. You should be able to easily follow the sequence of logic. When you’re writing for professors, it is especially effective to put your key sentences first because they usually convey your own original thinking. It’s a very good sign when your paragraphs are typically composed of a telling key sentence followed by evidence and explanation.\n\n
Knowing this convention of academic writing can help you both read and write more effectively. When you’re reading a complicated academic piece for the first time, you might want to go through reading only the first sentence or two of each paragraph to get the overall outline of the argument. Then you can go back and read all of it with a clearer picture of how each of the details fit in. And when you’re writing, you may also find it useful to write the first sentence of each paragraph (instead of a topic-based outline) to map out a thorough argument before getting immersed in sentence-level wordsmithing."""

seems to work on the sample, but I couldn't run it on the test corpora - need to check before merging

hp0404 commented 2 years ago

I think the issue might be caused by texts that contain no \n chars at all. In that case I need to add boundaries manually, overriding the \n rule (the first char of the doc is the start, the last one is the end).

hp0404 commented 2 years ago

It appears that doc's first character is automatically set to sent_start, so the solution was to explicitly annotate each token as either start of the sentence or not (as opposed to only annotating true cases).
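
A minimal sketch of that "annotate every token" idea, not the committed code. It assumes a blank English pipeline (`spacy.blank("en")`) rather than `en_core_web_sm`, and the component name `set_paragraph_boundaries` and the helper `paragraphs` are hypothetical names used only for illustration:

```python
import spacy
from spacy.language import Language


@Language.component("set_paragraph_boundaries")  # hypothetical name
def set_paragraph_boundaries(doc):
    # Annotate every token explicitly: True for the first token and for any
    # token that follows a newline token, False for everything else.
    starts_paragraph = True
    for token in doc:
        token.is_sent_start = starts_paragraph
        starts_paragraph = token.text.startswith("\n")
    return doc


# Assumption: a blank pipeline, so there is no parser to disable.
nlp = spacy.blank("en")
nlp.add_pipe("set_paragraph_boundaries", first=True)


def paragraphs(text):
    # Each "sentence" is now a paragraph; strip the trailing newline token.
    return [sent.text.strip() for sent in nlp(text).sents]


with_newlines = paragraphs("First one. Still first.\nSecond one.\n\nThird one.")
without_newlines = paragraphs("No newline here. Still one paragraph.")
```

Because the tokens that do not start a paragraph are marked `False` instead of being left unset, `doc.sents` should also work on a text without any `\n`, returning the whole text as a single paragraph.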

hp0404 commented 2 years ago

Also, since we now have a proper solution (more or less), I pushed another commit, c992da2ad391a34691a4d9f61a4fd808c9a5a62e, adjusting the function name in the dep_matcher script to make it clear that we optionally capture context as one sentence before and after the 'relevant' one, not paragraphs as delimited by \n or other special characters. This is because we might pass 'clean' texts (without \n's) into dep_matcher, in which case it would be unable to identify paragraphs correctly but would still capture 'context'.
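
To illustrate what "context as one sentence before and after" means, here is a rough sketch over a plain list of sentence strings. The function name `capture_context` and its `window` parameter are hypothetical and are not taken from the dep_matcher script:

```python
from typing import List


def capture_context(sentences: List[str], index: int, window: int = 1) -> str:
    """Return the sentence at `index` plus up to `window` neighbours on each side.

    Hypothetical helper: the slice is clamped to the list bounds, so the first
    and last sentences simply get less context.
    """
    start = max(0, index - window)
    end = min(len(sentences), index + window + 1)
    return " ".join(sentences[start:end])


sents = ["A first.", "B relevant.", "C after.", "D far away."]
middle_context = capture_context(sents, 1)
edge_context = capture_context(sents, 0)
```

This works on 'clean' texts as well, since it only relies on sentence order and never looks for `\n`.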