ljvmiranda921 / comments.ljvmiranda921.github.io

Blog comments for my personal blog: ljvmiranda921.github.io

spaCy Internals: Rules-based rules! #50

Open utterances-bot opened 1 year ago

utterances-bot commented 1 year ago

spaCy Internals: Rules-based rules!

spaCy has a comprehensive way to define rules for matching tokens, phrases, entities (and more!) to enhance statistical models. In this blog post, I'll share...

https://ljvmiranda921.github.io/notebook/2022/12/25/rules-based-rules/

alphomeg commented 1 year ago

Hi, thank you for the great article. I'm having a problem running the assemble command with the ruler.cfg file you provided. The error I'm getting is as follows:

alphomeg commented 1 year ago

✘ Error parsing config section. Perhaps a section name is wrong?
initialize -> components -> span_ruler  Section 'components' is not defined
{'nlp': {'pipeline': ['tok2vec', 'ner', 'span_ruler']}, 'components': {'ner': {'source': '/content/drive/MyDrive/output_spacy/model-best'}, 'span_ruler': {'factory': 'span_ruler', 'spans_key': None, 'annotate_ents': True, 'ents_filter': {'@misc': 'spacy.prioritize_new_ents_filter.v1'}, 'validate': True, 'overwrite': False}, 'tok2vec': {'source': '/content/drive/MyDrive/output_spacy/model-best'}}, 'initialize': {}}

alphomeg commented 1 year ago

Can you please help?

ljvmiranda921 commented 1 year ago

Hi, sorry about that. I didn't mention that the ruler.cfg is just an excerpt. Will update in a few. I suggest looking at the example project to see the full config (this is from a forked PR; we'll merge it into the main projects repository very soon).
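To make the "excerpt vs. full config" point concrete, here is a minimal sketch that lists which top-level sections a config file is missing. It uses Python's standard configparser rather than spaCy's own config loader, and both the EXPECTED_SECTIONS list and the EXCERPT string are assumptions for illustration (the section names are the ones a generated training config usually has, not an authoritative schema, and EXCERPT is not the actual ruler.cfg).

```python
import configparser

# Assumption: top-level sections a full spaCy training config usually has.
# This is for illustration only, not an authoritative schema.
EXPECTED_SECTIONS = [
    "paths", "system", "nlp", "components",
    "corpora", "training", "initialize",
]

# Hypothetical excerpt, standing in for a partial config file.
EXCERPT = """
[nlp]
pipeline = ["tok2vec","ner","span_ruler"]

[components.span_ruler]
factory = "span_ruler"
annotate_ents = true
"""


def top_level_sections(cfg_text):
    """Return the set of top-level section names (the part before the first dot)."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    return {name.split(".")[0] for name in parser.sections()}


def missing_sections(cfg_text):
    """Return the expected top-level sections that the config does not define."""
    present = top_level_sections(cfg_text)
    return [name for name in EXPECTED_SECTIONS if name not in present]


print(missing_sections(EXCERPT))
```

If the list is not empty, the file is probably an excerpt and needs to be merged into a full config before running commands on it.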

alphomeg commented 1 year ago

Hi, thanks for clarifying. Much appreciated :)

Kau832 commented 1 year ago

Hi :)

Many thanks for this post, as it made the use of span_ruler a bit clearer. I do, however, have some trouble understanding the pipeline architecture when using a span_ruler together with spancat.

I have used simple TEXT/lower patterns that match whole sentences and used sentencizer as an annotating component and as a component in the pipeline (["sentencizer","tok2vec","spancat"], in this order). This worked even though I had no [components.span_ruler] in my training config.

I now used a pattern similar to the one you posted, with an additional ENT_TYPE pattern, and the training returns 0.00 scores on all scoring metrics. Do I need to pass any component to annotating_components = []?

Currently, my pipeline components are: ["tok2vec", "spancat", "span_ruler"] and the span_ruler and spancat components are:

[components.span_ruler]
factory = "span_ruler"
spans_key = "ruler"
validate = true
overwrite = false

[components.spancat]
factory = "spancat"
max_positive = null
scorer = {"@scorers":"spacy.spancat_scorer.v1"}
spans_key = "ruler"
threshold = 0.5

Since data debug finds no issues with my training data, I assume the issue must be with either 1) the order in which my components are initialized or 2) the parameters in the config itself.
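For anyone who wants to inspect those two points side by side, here is a minimal sketch that prints each pipeline component in order together with its spans_key. It does not diagnose the 0.00 scores; it only surfaces the pipeline order and which components share a key. It uses Python's standard configparser and json modules instead of spaCy's config loader (an assumption that works for simple values like the ones above), and the CFG string is just the excerpt from this comment plus an [nlp] block.

```python
import configparser
import json

# The component blocks from the comment above, plus the pipeline order.
CFG = """
[nlp]
pipeline = ["tok2vec","spancat","span_ruler"]

[components.span_ruler]
factory = "span_ruler"
spans_key = "ruler"
validate = true
overwrite = false

[components.spancat]
factory = "spancat"
max_positive = null
scorer = {"@scorers":"spacy.spancat_scorer.v1"}
spans_key = "ruler"
threshold = 0.5
"""


def summarize(cfg_text):
    """Return (position, name, has_block, spans_key) for each pipeline component."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    # Values in this excerpt are JSON-compatible, so json.loads can read them.
    pipeline = json.loads(parser["nlp"]["pipeline"])
    rows = []
    for position, name in enumerate(pipeline):
        section = "components." + name
        if parser.has_section(section):
            raw_key = parser[section].get("spans_key", fallback="null")
            rows.append((position, name, True, json.loads(raw_key)))
        else:
            # Listed in the pipeline but no matching [components.<name>] block.
            rows.append((position, name, False, None))
    return rows


for row in summarize(CFG):
    print(row)
```

With this excerpt, the output shows that tok2vec has no block in the snippet, that spancat runs before span_ruler, and that both use the spans key "ruler".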

Thanks a lot for any help, and apologies for reaching out here instead of on GitHub.

Kau832 commented 1 year ago

To add to that, the config.cfg of the trained model (trained with 0.00 scores, so not really) looks like this:

[components.span_ruler]
factory = "span_ruler"
annotate_ents = false
ents_filter = {"@misc":"spacy.first_longest_spans_filter.v1"}
matcher_fuzzy_compare = {"@misc":"spacy.levenshtein_compare.v1"}
overwrite = false
phrase_matcher_attr = null
spans_filter = null
spans_key = "ruler"
validate = true

[components.span_ruler.scorer]
@scorers = "spacy.overlapping_labeled_spans_scorer.v1"
spans_key = "ruler"

[components.spancat]
factory = "spancat"
max_positive = null
scorer = {"@scorers":"spacy.spancat_scorer.v1"}
spans_key = "ruler"
threshold = 0.5