mozillazg / python-pinyin

Chinese characters to pinyin (pypinyin)
https://pypinyin.readthedocs.io
MIT License
4.92k stars, 612 forks

Preserve the word segmentation structure? #299

Closed. npuichigo closed this issue 1 year ago

npuichigo commented 1 year ago

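Problem description

When I already have a word segmentation result, how can I make the returned pinyin keep the structure of that segmentation?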

Steps to reproduce

import pypinyin
pinyin = pypinyin.lazy_pinyin(['我', '是', 'NBA', '的', '球员'],
                              style=pypinyin.Style.TONE3, neutral_tone_with_five=True)
assert pinyin == ['wo3', 'shi4', 'NBA', 'de5', 'qiu2', 'yuan2']

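pypinyin returns a flattened pinyin sequence. Normally, counting the characters in each word is enough to map the words onto the pinyin result. Once foreign-language text appears, however, this approach breaks down, and I have to work out separately which parts of the original input were not converted.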
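
To illustrate the misalignment described above, here is a minimal sketch that does not call pypinyin at all. The helper name regroup_by_length is hypothetical, and the flat list is simply the one from the assert above. It regroups the flat result by counting characters per word, which goes wrong as soon as 'NBA' (three characters, one output item) appears.

```python
def regroup_by_length(words, flat):
    """Naively give each word len(word) items from the flat pinyin list.

    Hypothetical helper for illustration only; not part of pypinyin.
    """
    groups = []
    start = 0
    for word in words:
        end = start + len(word)
        groups.append(flat[start:end])
        start = end
    return groups


words = ['我', '是', 'NBA', '的', '球员']
flat = ['wo3', 'shi4', 'NBA', 'de5', 'qiu2', 'yuan2']

# 'NBA' has three characters but contributes a single item to the flat
# list, so every group after it is shifted and the last one ends up empty.
naive = regroup_by_length(words, flat)
print(naive)
```

With this input, the third group swallows the syllables that belong to the following words, which is why the character count alone cannot recover the segmentation.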

Expected output

pinyin = pypinyin.lazy_pinyin2(['我', '是', 'NBA', '的', '球员'])
assert pinyin == [['wo3'], ['shi4'], ['NBA'], ['de5'], ['qiu2', 'yuan2']]
mozillazg commented 1 year ago

For now, this can be achieved indirectly with something like the following:

In [1]: from pypinyin import lazy_pinyin

In [2]: def lazy_pinyin2(words):
   ...:     for w in words:
   ...:         yield lazy_pinyin(w)
   ...:

In [3]: list(lazy_pinyin2(['我', '是', 'NBA', '的', '球员']))
Out[3]: [['wo'], ['shi'], ['NBA'], ['de'], ['qiu', 'yuan']]
npuichigo commented 1 year ago

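@mozillazg Thanks for the answer. Compared with calling lazy_pinyin once on the whole sentence, does calling it separately on each word make any difference in results or performance?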

npuichigo commented 1 year ago

@mozillazg ping

mozillazg commented 1 year ago

@npuichigo If you change yield lazy_pinyin(w) to yield lazy_pinyin([w]), there is essentially no difference. Internally, lazy_pinyin also just loops over the list it is given with a for loop.
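
The suggestion above can be sketched as a small generator that wraps each word in a one-element list before converting it. To keep the sketch runnable without pypinyin installed, the converter is passed in as a parameter. The names pinyin_by_word, toy_convert, and TOY_TABLE are hypothetical and exist only for this illustration. With pypinyin available, one would presumably pass lazy_pinyin itself as the converter.

```python
def pinyin_by_word(words, convert):
    """Yield one pinyin list per input word, preserving segmentation.

    `convert` is any callable that takes a list of already-segmented
    words and returns a flat list of syllables (for example, pypinyin's
    lazy_pinyin). Each word is wrapped in a one-element list, following
    the maintainer's suggestion of lazy_pinyin([w]).
    """
    for word in words:
        yield convert([word])


# Toy stand-in for a real converter, so the sketch runs on its own.
# Assumption: unknown words (such as 'NBA') are passed through unchanged.
TOY_TABLE = {'我': ['wo'], '是': ['shi'], '的': ['de'], '球员': ['qiu', 'yuan']}


def toy_convert(segmented_words):
    syllables = []
    for word in segmented_words:
        syllables.extend(TOY_TABLE.get(word, [word]))
    return syllables


result = list(pinyin_by_word(['我', '是', 'NBA', '的', '球员'], toy_convert))
print(result)

# With pypinyin installed, the equivalent call would be:
#   from pypinyin import lazy_pinyin
#   list(pinyin_by_word(words, lazy_pinyin))
```

Because each word is converted on its own, the nesting of the output mirrors the input segmentation regardless of how many syllables a word produces.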