`'icu.BreakIterator' object has no attribute 'getRuleStatus'` when using locale like `en@ss=standard`

qtdaniel commented 4 years ago

I would like to use the sentence break filters, i.e. lists of common abbreviations, that adjust the behaviour of the sentence tokeniser as documented here: http://userguide.icu-project.org/boundaryanalysis#TOC-Sentence-Break-Filter
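
To make the goal concrete, here is a minimal sketch of sentence splitting driven by a break iterator. The helper name `split_sentences` is hypothetical and not part of PyICU; it relies only on the iterator's `setText()` and `nextBoundary()` methods. It also assumes the text contains no characters outside the Basic Multilingual Plane, so that ICU's UTF-16 offsets coincide with Python string indices.

```python
def split_sentences(text, break_iterator):
    """Yield the pieces of `text` delimited by `break_iterator`.

    `break_iterator` is expected to behave like a PyICU sentence iterator,
    e.g. icu.BreakIterator.createSentenceInstance(icu.Locale("en@ss=standard")).
    Assumption: `text` is BMP-only, so ICU offsets match Python indices.
    """
    break_iterator.setText(text)
    start = 0
    while True:
        end = break_iterator.nextBoundary()
        if end == -1:  # -1 is the value of ICU's BreakIterator::DONE
            return
        yield text[start:end]
        start = end
```

With PyICU installed, this could be called as `list(split_sentences(text, icu.BreakIterator.createSentenceInstance(icu.Locale("en@ss=standard"))))`. The helper never calls `getRuleStatus()`, so it is unaffected by the error described below.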

I expected to achieve this by simply changing my locale string from "en" to "en@ss=standard", but this raises the error 'icu.BreakIterator' object has no attribute 'getRuleStatus' when getRuleStatus is called on the resulting break iterator.

I've tried this via icu.BreakIterator.createSentenceInstance and via icu.RuleBasedBreakIterator.createSentenceInstance but both exhibit the same error.

Here is a minimal reproduction:

import icu

bi = icu.BreakIterator.createSentenceInstance(icu.Locale("en"))
print(bi.getRuleStatus())

bi = icu.RuleBasedBreakIterator.createSentenceInstance(icu.Locale("en"))
print(bi.getRuleStatus())

bi = icu.BreakIterator.createSentenceInstance(icu.Locale("en@ss=standard"))
try:
    print(bi.getRuleStatus())
except Exception as exception:
    print(
        "Failed to get rule status when using en@ss=standard locale with BreakIterator",
        exception
    )

bi = icu.RuleBasedBreakIterator.createSentenceInstance(icu.Locale("en@ss=standard"))
try:
    print(bi.getRuleStatus())
except Exception as exception:
    print(
        "Failed to get rule status when using en@ss=standard locale with"
        " RuleBasedBreakIterator",
        exception
    )

When run, that code emits

0
0
Failed to get rule status when using en@ss=standard locale with BreakIterator 'icu.BreakIterator' object has no attribute 'getRuleStatus'
Failed to get rule status when using en@ss=standard locale with RuleBasedBreakIterator 'icu.BreakIterator' object has no attribute 'getRuleStatus'

I am using icu and pyicu via conda-forge:

icu                       64.2                 he1b5a44_1    conda-forge
pyicu                     2.4.2            py37h8412b87_0    conda-forge
python                    3.7.6           cpython_h8356626_6    conda-forge

ovalhub commented 4 years ago

On Tue, 23 Jun 2020, qtdaniel wrote:

> I would like to use the sentence break filters, i.e. lists of common abbreviations, that adjust the behaviour of the sentence tokeniser as documented here: http://userguide.icu-project.org/boundaryanalysis#TOC-Sentence-Break-Filter
>
> I was expecting to be able to achieve this by simply changing my locale string from "en" to "en@ss=standard" but this causes the error 'icu.BreakIterator' object has no attribute 'getRuleStatus' when getRuleStatus is called on the resulting break iterator.
>
> I've tried this via icu.BreakIterator.createSentenceInstance and via icu.RuleBasedBreakIterator.createSentenceInstance but both exhibit the same error.

Thank you for the detailed reproduction steps. I was able to reproduce the issue as described. At first glance, it looks like:

>>> bi = icu.BreakIterator.createSentenceInstance(icu.Locale("en"))
>>> bi
<RuleBasedBreakIterator: 0x7ff74dc0f7b0>

class RuleBasedBreakIterator has method getRuleStatus()

>>> bi = icu.RuleBasedBreakIterator.createSentenceInstance(icu.Locale("en@ss=standard"))
>>> bi
<BreakIterator: 0x7ff74de0cb00>

class BreakIterator does not have this method

Why one call creates a RuleBasedBreakIterator and the other just a BreakIterator is a question for the ICU users list; I don't know the answer myself. getRuleStatus() is defined on both C++ classes but is missing from the PyICU BreakIterator wrapper. It looks like it was only added to BreakIterator in ICU 52, whereas it has been on RuleBasedBreakIterator since ICU 2.2, which explains the oversight. I have now added the missing wrapper in HEAD.
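
Until a PyICU build that includes the fix is installed, calling code can guard against the missing method. The following is a minimal sketch, not part of PyICU: the helper name `get_rule_status` and its `default` parameter are hypothetical.

```python
def get_rule_status(break_iterator, default=None):
    """Return break_iterator.getRuleStatus(), or `default` if unavailable.

    Older PyICU builds expose getRuleStatus() on RuleBasedBreakIterator
    but not on the plain BreakIterator wrapper, so look the method up
    defensively instead of calling it directly.
    """
    method = getattr(break_iterator, "getRuleStatus", None)
    if method is None:
        return default
    return method()
```

Returning a caller-chosen `default` rather than 0 keeps "status unavailable" distinguishable from a genuine rule status of 0.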

Thank you for the report!

Andi..


qtdaniel commented 4 years ago

Thanks.

I tried building pyicu from source so that I could test this, but the build failed. I'm afraid I can't spend the time needed to resolve that right now, since I won't be able to make use of the change until the package is updated on conda-forge anyway.

The change sounds like it should resolve the problem so I think this issue can be closed now.

ovalhub / pyicu

`'icu.BreakIterator' object has no attribute 'getRuleStatus'` when using locale like `en@ss=standard` #133