nipunsadvilkar / pySBD

🐍💯pySBD (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection module that works out-of-the-box.
MIT License
813 stars 84 forks

Update PySBD component to support spaCy v3 #114

Open nipunsadvilkar opened 2 years ago

nipunsadvilkar commented 2 years ago

PySBD component using Language.factory

codecov-commenter commented 2 years ago

Codecov Report

Merging #114 (e07808a) into master (5905f13) will decrease coverage by 0.08%. The diff coverage is 50.00%.

@@            Coverage Diff             @@
##           master     #114      +/-   ##
==========================================
- Coverage   98.43%   98.35%   -0.09%     
==========================================
  Files          38       39       +1     
  Lines        1150     1153       +3     
==========================================
+ Hits         1132     1134       +2     
- Misses         18       19       +1     
| Flag | Coverage Δ |
| --- | --- |
| unittests | 98.35% <50.00%> (-0.09%) :arrow_down: |

| Impacted Files | Coverage Δ |
| --- | --- |
| pysbd/utils.py | 73.33% <42.85%> (-2.53%) :arrow_down: |
| pysbd/about.py | 100.00% <100.00%> (ø) |
| pysbd/__init__.py | 100.00% <0.00%> (ø) |

davidberenstein1957 commented 2 years ago

Are you still working on this? Otherwise I could have a look.

nipunsadvilkar commented 2 years ago

Hey @davidberenstein1957, sure, you can take a look at it. But I'm not sure what the best approach would be: I want to keep pysbd lightweight, and supporting pysbd with spaCy v3 requires Language.factory, which would force me to add spaCy as a dependency.

Let me know if you happen to work on the recommendations suggested by @rmitsch above.
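
For context, one common way to keep a package lightweight while still offering an integration is to probe for the optional dependency at import time. This is a minimal sketch, not code from this PR; the helper name `has_module` and the flag `SPACY_AVAILABLE` are hypothetical.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module can be imported, without importing it."""
    return importlib.util.find_spec(name) is not None


# Hypothetical flag: the spaCy component would only be registered when this is True.
SPACY_AVAILABLE = has_module("spacy")
print("spaCy integration enabled:", SPACY_AVAILABLE)
```

With a check like this, spaCy can stay out of the hard requirements and be listed as an optional extra instead.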

rbroderi commented 9 months ago

Here is an option for updating the factory method without making spaCy a hard requirement of pysbd.

from typing import Any

import pysbd

try:
    from spacy.language import Language
    langfac = Language.factory
except ImportError:
    # spaCy is not installed: fall back to a no-op decorator factory that
    # returns the decorated class unchanged, so pysbd still imports and the
    # class stays usable.
    def langfac(*args: Any, **kwargs: Any):
        def decorator(obj: Any):
            return obj
        return decorator


@langfac(name="pysbd", default_config={"language": "en"})
class PySBDFactory:
    """pysbd as a spaCy component through entrypoints"""

    def __init__(self, nlp, name, language="en"):
        self.nlp = nlp
        self.name = name
        self.seg = pysbd.Segmenter(language=language, clean=False,
                                   char_span=True)

    def __call__(self, doc):
        sents_char_spans = self.seg.segment(doc.text_with_ws)
        # Character offsets at which pysbd starts a sentence
        sent_start_chars = {sent.start for sent in sents_char_spans}
        for token in doc:
            # A token opens a sentence if its offset matches a span start
            token.is_sent_start = token.idx in sent_start_chars
        return doc
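
To make the `__call__` step above easier to follow, here is a spaCy-free sketch of the same idea: a token is flagged as a sentence start when its character offset equals the start offset of a sentence span. `FakeToken`, `FakeSpan`, and `mark_sentence_starts` are hypothetical stand-ins, and the spans are written by hand rather than produced by pysbd.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FakeSpan:
    """Hypothetical stand-in for a character span with start/end offsets."""
    start: int
    end: int


@dataclass
class FakeToken:
    """Hypothetical stand-in for a token; idx is its character offset."""
    text: str
    idx: int
    is_sent_start: Optional[bool] = None


def mark_sentence_starts(tokens: List[FakeToken],
                         spans: List[FakeSpan]) -> List[FakeToken]:
    # Collect the start offsets once so each membership test is O(1).
    start_offsets = {span.start for span in spans}
    for token in tokens:
        token.is_sent_start = token.idx in start_offsets
    return tokens


# "Hi there. Bye now." with hand-written offsets (assumed, not pysbd output).
tokens = [FakeToken("Hi", 0), FakeToken("there", 3), FakeToken(".", 8),
          FakeToken("Bye", 10), FakeToken("now", 14), FakeToken(".", 17)]
spans = [FakeSpan(0, 10), FakeSpan(10, 18)]

marked = mark_sentence_starts(tokens, spans)
print([t.text for t in marked if t.is_sent_start])  # ['Hi', 'Bye']
```

The match is exact, so a span whose start offset falls on whitespace rather than on a token would leave that sentence unmarked; whether that can happen depends on how the segmenter reports its offsets.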