Open MonkandMonkey opened 6 years ago
This is indeed problematic when you have Chinese paragraphs with capital English letters in them. Good catch! Using lower() to preprocess Chinese text would be a safe workaround until this bug is fixed.
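A minimal sketch of the lower() workaround, using only built-in string methods. The helper name `lowercase_for_parsing` is made up for illustration and is not part of sumy; the idea is simply that a lowercased string can never satisfy str.isupper().

```python
def lowercase_for_parsing(text: str) -> str:
    """Lowercase the text so that no line can satisfy str.isupper().

    Hypothetical helper, not part of sumy. Chinese characters have no
    case, so only the embedded Latin letters are changed.
    """
    return text.lower()


sample = "你好啊,这儿有N盘蛋糕可以吃。"
prepared = lowercase_for_parsing(sample)

print(sample.isupper())    # True: the only cased character is "N"
print(prepared.isupper())  # False: "N" became "n"
```

The trade-off is that the sentences in the summary come back lowercased too, so product names or acronyms written in capitals lose their original spelling.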
I found that sumy distinguishes headings from other sentences, so I checked the source code and found that whether a line is a heading is decided by str.isupper(). But for a string composed of Chinese characters that contains even one uppercase letter, isupper() returns True, although the line is just a normal sentence and not a heading.
For example:
    s1 = "你好啊,这儿有N盘蛋糕可以吃。"
    s2 = "N你好啊,这儿有盘蛋糕可以吃。"
    s3 = "你好啊,这儿有盘蛋糕可以吃。"
    s1.isupper()  # True
    s2.isupper()  # True
    s3.isupper()  # False
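The behavior above follows from how str.isupper() is defined: it returns True when the string has at least one cased character and every cased character is uppercase. Chinese characters are uncased, so they are ignored. One possible stricter check, sketched here under the assumption that a heading should consist only of uppercase letters, is to require every alphabetic character to be uppercase. The function name `looks_like_heading` is hypothetical and is not sumy's API.

```python
def looks_like_heading(line: str) -> bool:
    """Return True only if the line contains letters and all are uppercase.

    Hypothetical replacement for a bare str.isupper() test. Chinese
    characters count as alphabetic (isalpha() is True) but are never
    uppercase, so any line containing them is rejected.
    """
    letters = [ch for ch in line if ch.isalpha()]
    return bool(letters) and all(ch.isupper() for ch in letters)


print(looks_like_heading("你好啊,这儿有N盘蛋糕可以吃。"))  # False
print(looks_like_heading("N你好啊,这儿有盘蛋糕可以吃。"))  # False
print(looks_like_heading("INTRODUCTION"))                    # True
```

Note that with this check a heading written in Chinese is never detected, so it only avoids the false positives; it does not add heading support for Chinese text.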
I encountered the same problem, so I rewrote the document() function in plaintext.py to deal with it.
@seven-linglx Can you share your solution with us? Can you add here the code snippet?
Of course, I'm glad to share it with everyone. Here is my rewritten function:
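    # Replacement for document() in plaintext.py; it no longer checks
    # line.isupper(), so no line is ever treated as a heading.
    def document(self):
        current_paragraph = []
        paragraphs = []
        for line in self._text.splitlines():
            line = line.strip()
            if line:
                current_paragraph.append(line)
            elif current_paragraph:
                # A blank line closes the paragraph; repeated blank lines
                # are skipped so no empty paragraph is added.
                sentences = self._to_sentences(current_paragraph)
                paragraphs.append(Paragraph(sentences))
                current_paragraph = []
        # Flush the last paragraph (the text may not end with a blank line).
        sentences = self._to_sentences(current_paragraph)
        paragraphs.append(Paragraph(sentences))
        print(paragraphs)  # debug preview; remove once verified
        return ObjectDocumentModel(paragraphs)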
In fact, this may not really solve the problem, because it simply ignores headings altogether. In other words, it suits the case where the document has no headings, or where you don't mind headings being treated as ordinary text.
" 1987年9月,以“民间科技企业”身份获深圳市工商局批准获得注册,注册资本2.1万元,员工14人,主要业务为代理中资控股的香港康力投资有限公司的HAX小型模拟交换机。这个名字意味“中华有为”。[2]成立早期代理小型程控交换机,在通信设备核心技术方面的第一次突破,是1994年推出的“C&C08”大型程控交换机,之后逐渐占据中国内固定交换机接入网等通信设备市场,市场份额逐渐扩大,至90年代末期已经在中国国内市场上与其他少数竞争对手共同占有大部分市场份额。至2007年,华为在光传输网络、移动及固定交换网络、数据通信网络几大领域内拥有较强实力,并在全球电信市场与爱立信、阿尔卡特、思科等老牌通讯公司展开激烈竞争。
技术有限公司在IT泡沫之前是一间籍籍无名的公司。但从IT泡沫之后该公司以中国为据点急速成长,快速吸引各界注目,市场不局限于发展中国家。 "
1987年9月,以“民间科技企业”身份获深圳市工商局批准获得注册,注册资本2.1万元,员工14人,主要业务为代理中资控股的香港康力投资有限公司的HAX小型模拟交换机。
Thank you all. I think this is trickier than it looks. I tried to find a solution, but it seems I would have to introduce a new parser. Maybe I could add a MarkdownParser and let PlaintextParser handle really plain text, but some summarizers use headings to give you better results. Or I could introduce a new parser for the annotated text formats commonly used for summarization. I don't know.
I would like to ask for your opinion, because I know nothing about Chinese texts.
Thanks in advance and sorry for the really late reply. Have a nice day 🌞
I agree with you that PlaintextParser should handle really plain text, but you could provide an optional API in PlaintextParser that lets the user specify the headings of the plain text instead of having PlaintextParser detect them, because text in an ideal format is hard to come by. Letting the user decide is better than introducing a new parser.
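To make the suggestion concrete, here is a rough, library-independent sketch of user-specified headings. Everything in it is an assumption for illustration: `split_paragraphs` and its `headings` parameter are hypothetical names, not an existing sumy API, and the (line, is_heading) tuples stand in for whatever sentence objects a real parser would build.

```python
from typing import Iterable, List, Tuple


def split_paragraphs(
    text: str, headings: Iterable[str] = ()
) -> List[List[Tuple[str, bool]]]:
    """Split text into paragraphs of (line, is_heading) pairs.

    A line is a heading only if the caller listed it in `headings`;
    nothing is guessed from letter case.
    """
    heading_set = {h.strip() for h in headings}
    paragraphs: List[List[Tuple[str, bool]]] = []
    current: List[Tuple[str, bool]] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line:
            current.append((line, line in heading_set))
        elif current:
            # A blank line ends the current paragraph.
            paragraphs.append(current)
            current = []
    if current:
        paragraphs.append(current)
    return paragraphs


text = "简介\n你好啊,这儿有N盘蛋糕可以吃。\n\n第二段。"
result = split_paragraphs(text, headings=["简介"])
print(result)
```

With no `headings` argument every line is ordinary text, which matches the behavior of the rewritten document() function, while callers who know their headings can still pass them in.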
About your doubt:
I found that sumy distinguishes headings from other sentences, so I checked the source code and found that whether a line is a heading is decided by str.isupper(). But for a string composed of Chinese characters that contains even one uppercase letter, isupper() returns True, although the line is just a normal sentence and not a heading.
For example: