jina-ai / executors

internal-only
Apache License 2.0
31 stars 12 forks source link

Executor-sentencizer new feature #246

Closed albertocarpentieri closed 2 years ago

albertocarpentieri commented 2 years ago

Add of smart_tokenizer parameter to segment function of Sentencizer executor in order to not consider dots in abbreviations. The feature uses nltk.tokenize.sent_tokenize, for which a new requirement is inserted.

albertocarpentieri commented 2 years ago

Hi, I do not understand the second comment, can you please explain it to me again? Furthermore, I received the email about the tests and I do not understand why it did not pass some of the tests.

Let me know,

Alberto Il venerdì 24 settembre 2021, 21:30:44 CEST, Joan Fontanals @.***> ha scritto:

@JoanFM commented on this pull request.

Looks like a really good PR, I have some small comments

In jinahub/segmenters/Sentencizer/sentencizer.py:

@@ -82,6 +85,32 @@ def init( r'\s([^{0}]+)(?<!\s)[{0}]'.format(''.join(set(self.punct_chars))) )

I think is better to make them private

In jinahub/segmenters/Sentencizer/sentencizer.py:

@@ -35,6 +36,8 @@ def init( :param punct_chars: the punctuation characters to split on, whatever is in the list will be used, for example ['!', '.', '?'] will use '!', '.' and '?'

  • If smart_tokenizer=True is passed to segment, punct_chars is

I think it would be good to have an init parameter that indeed can be overriden by parameters

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe. Triage notifications on the go with GitHub Mobile for iOS or Android.

JoanFM commented 2 years ago

The idea is that u are reading the parameters from the parameters in the call. but normally there is an object parameter determining the default behavior

albertocarpentieri commented 2 years ago

Ok thanks, I will add to my local branch this modification. I will send another pull request asap!

Il lunedì 27 settembre 2021, 13:26:13 CEST, Joan Fontanals ***@***.***> ha scritto:  

The idea is that u are reading the parameters from the parameters in the call. but normally there is an object parameter determining the default behavior

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe. Triage notifications on the go with GitHub Mobile for iOS or Android.

cristianmtr commented 2 years ago

Hey @albertocarpentieri

Any update on this?

albertocarpentieri commented 2 years ago

Hi Cristian,  it was a PR for the recruitment process. But when I'm free from work I'll continue on that.

Il venerdì 8 ottobre 2021, 14:11:39 CEST, cristian ***@***.***> ha scritto:  

Hey @albertocarpentieri

Any update on this?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe. Triage notifications on the go with GitHub Mobile for iOS or Android.

cristianmtr commented 2 years ago

Hey @albertocarpentieri

Thanks for the initiative.

However, we will close this PR for now, as it's been left unattended for some time. If you want to continue the work, let us know and we can re-open it.