Heading in Chinese - Githubissues

MonkandMonkey commented 6 years ago

I noticed that sumy distinguishes headings from other sentences, so I checked the source code and found that whether a line is a heading is decided by str.isupper(). However, for a string made up of Chinese characters, isupper() returns True as soon as the string contains an uppercase Latin letter and no lowercase ones, because Chinese characters have no case and are ignored by the check. Such a line is just a normal sentence, not a heading.

For example:

s1 = "你好啊，这儿有N盘蛋糕可以吃。"
s2 = "N你好啊，这儿有盘蛋糕可以吃。"
s3 = "你好啊，这儿有盘蛋糕可以吃。"

s1.isupper()  # True
s2.isupper()  # True
s3.isupper()  # False
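
This follows from how Python defines str.isupper(): it is True when the string has at least one cased character and every cased character is uppercase. A stricter check could look like the sketch below. It assumes a heading must consist of uppercase letters only, and `is_all_caps_heading` is a hypothetical helper, not part of sumy.

```python
def is_all_caps_heading(line: str) -> bool:
    """Return True only if the line has letters and every letter is uppercase.

    Unlike str.isupper(), uncased letters (e.g. Chinese characters) are not
    ignored: a single one is enough to reject the line.
    """
    letters = [ch for ch in line if ch.isalpha()]
    return bool(letters) and all(ch.isupper() for ch in letters)


s1 = "你好啊，这儿有N盘蛋糕可以吃。"
s3 = "你好啊，这儿有盘蛋糕可以吃。"

print(s1.isupper())                         # True: "N" is the only cased character
print(is_all_caps_heading(s1))              # False
print(is_all_caps_heading(s3))              # False
print(is_all_caps_heading("INTRODUCTION"))  # True
```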

ArtificialNotImbecile commented 6 years ago

This is indeed problematic when Chinese paragraphs contain capital English letters. Good catch! Until this bug is fixed, preprocessing Chinese text with lower() would be a safe workaround.
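
A minimal sketch of that workaround, assuming the text is preprocessed before it reaches the parser. Calling lower() on the whole text would also flatten genuine all-caps headings, so this hypothetical helper (not part of sumy) lowercases only the lines that contain uncased letters such as Chinese characters.

```python
def lower_mixed_script_lines(text: str) -> str:
    """Lowercase only the lines that contain uncased letters (e.g. Chinese)."""
    result = []
    for line in text.splitlines():
        # A letter that is neither upper- nor lowercase has no case at all.
        has_uncased_letter = any(
            ch.isalpha() and not ch.isupper() and not ch.islower() for ch in line
        )
        result.append(line.lower() if has_uncased_letter else line)
    return "\n".join(result)


text = "INTRODUCTION\n你好啊，这儿有N盘蛋糕可以吃。"
cleaned = lower_mixed_script_lines(text)
for line in cleaned.splitlines():
    # The all-caps line still passes isupper(); the Chinese line no longer does.
    print(line, line.isupper())
```

The trade-off is that acronyms inside the affected lines are lowercased too, which may matter for tokenization.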

seven-linglx commented 5 years ago

I encountered the same problem, and I rewrote the document() function in plaintext.py to deal with it.

miso-belica commented 5 years ago

@seven-linglx Can you share your solution with us? Can you add here the code snippet?

seven-linglx commented 5 years ago

Of course, I'm happy to share it with everyone. This is my rewritten function:

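    # Replacement for the document() method in plaintext.py.
    # Assumes Paragraph and ObjectDocumentModel are in scope, as they are in
    # that module (they come from sumy.models.dom).
    # Keep whatever decorator the original method carries.
    def document(self):
        current_paragraph = []
        paragraphs = []
        for line in self._text.splitlines():
            line = line.strip()
            if line:
                # No isupper() check: every non-empty line is ordinary text.
                current_paragraph.append(line)
            elif current_paragraph:
                # A blank line closes the paragraph; repeated blanks are skipped.
                sentences = self._to_sentences(current_paragraph)
                paragraphs.append(Paragraph(sentences))
                current_paragraph = []

        # Flush the last paragraph.
        sentences = self._to_sentences(current_paragraph)
        paragraphs.append(Paragraph(sentences))
        print(paragraphs)  # preview

        return ObjectDocumentModel(paragraphs)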

In fact, this may not really solve the problem, because I simply ignore headings altogether. In other words, it suits cases where the document has no headings, or where you don't mind headings being treated as ordinary text.
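
The grouping logic of this rewrite can be tried without sumy. The sketch below is a standalone illustration, not sumy code: `split_paragraphs` is a hypothetical name, and it returns plain lists of lines instead of Paragraph objects. Unlike the method above, it never emits an empty paragraph.

```python
from typing import List


def split_paragraphs(text: str) -> List[List[str]]:
    """Group non-empty lines into paragraphs separated by blank lines."""
    paragraphs: List[List[str]] = []
    current: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            # Every non-empty line is body text; no heading detection at all.
            current.append(line)
        elif current:
            paragraphs.append(current)
            current = []
    if current:
        paragraphs.append(current)
    return paragraphs


sample = "你好啊，这儿有N盘蛋糕可以吃。\n\n\n你好啊，这儿有盘蛋糕可以吃。\n"
for paragraph in split_paragraphs(sample):
    print(paragraph)
```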

This is the test document:

" 1987年9月，以“民间科技企业”身份获深圳市工商局批准获得注册，注册资本2.1万元，员工14人，主要业务为代理中资控股的香港康力投资有限公司的HAX小型模拟交换机。这个名字意味“中华有为”。[2]成立早期代理小型程控交换机，在通信设备核心技术方面的第一次突破，是1994年推出的“C&C08”大型程控交换机，之后逐渐占据中国内固定交换机接入网等通信设备市场，市场份额逐渐扩大，至90年代末期已经在中国国内市场上与其他少数竞争对手共同占有大部分市场份额。至2007年，华为在光传输网络、移动及固定交换网络、数据通信网络几大领域内拥有较强实力，并在全球电信市场与爱立信、阿尔卡特、思科等老牌通讯公司展开激烈竞争。

技术有限公司在IT泡沫之前是一间籍籍无名的公司。但从IT泡沫之后该公司以中国为据点急速成长，快速吸引各界注目，市场不局限于发展中国家。 "

This is the result:

1987年9月，以“民间科技企业”身份获深圳市工商局批准获得注册，注册资本2.1万元，员工14人，主要业务为代理中资控股的香港康力投资有限公司的HAX小型模拟交换机。

miso-belica commented 5 years ago

Thank you all. I think this is trickier than it looks. I tried to find a solution, but it seems I would need to introduce a new parser, perhaps a MarkdownParser, and let PlaintextParser handle truly plain text. However, some summarizers use headings to give better results. Alternatively, I could introduce a new parser for the annotated text formats commonly used for summarization. I don't know.

I would like to ask you a few questions, because I know nothing about Chinese texts.

Are there any common ways to detect headings?
Does it even make sense to do so in Chinese?
Are there any common text formats used for Chinese texts in summarization, or in NLP in general?
Is there anything else that differs from English (or European) texts?

Thanks in advance and sorry for the really late reply. Have a nice day 🌞

seven-linglx commented 5 years ago

I agree with you that PlaintextParser should handle truly plain text. However, you could provide an optional API in PlaintextParser that lets the user specify the headings of the text, instead of having PlaintextParser detect them, because detection is difficult unless the text is ideally formatted. Leaving this decision to the user is better than introducing a new parser.
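
One possible shape for such an option, as a standalone sketch. Nothing here is an existing sumy API: `label_lines` and its `is_heading` parameter are hypothetical, and the heading string is made-up sample data.

```python
from typing import Callable, List, Optional, Tuple


def label_lines(
    text: str,
    is_heading: Optional[Callable[[str], bool]] = None,
) -> List[Tuple[str, bool]]:
    """Pair each non-empty line with a heading flag chosen by the caller.

    Without a predicate, nothing is treated as a heading.
    """
    labeled = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        flag = bool(is_heading(line)) if is_heading is not None else False
        labeled.append((line, flag))
    return labeled


known_headings = {"蛋糕"}  # supplied by the user, not detected
text = "蛋糕\n\n你好啊，这儿有N盘蛋糕可以吃。"
for line, flag in label_lines(text, is_heading=lambda l: l in known_headings):
    print(flag, line)
```

With this design the uppercase "N" in the second line no longer matters, because the parser never guesses.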

About your questions:

In my opinion, there is no common method for detecting headings. In a formatted document, you can detect headings by font size, since headings tend to use larger fonts. In plain text, the first paragraph is very likely the heading.
I think sumy's results on Chinese text are not bad. I am sorry that I can't help more, because I don't have enough experience with NLP. If I learn more, I will update my reply.
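
The first-paragraph heuristic for plain text could be sketched as follows. This is only an illustration under stated assumptions: `guess_heading` is a hypothetical helper, and the single-line requirement and the `max_length` of 30 characters are arbitrary choices, not values from sumy or from this thread.

```python
from typing import Optional


def guess_heading(text: str, max_length: int = 30) -> Optional[str]:
    """Return the first paragraph if it looks like a heading, else None.

    "Looks like a heading" here means: exactly one line, at most
    max_length characters long.
    """
    first_paragraph = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            first_paragraph.append(line)
        elif first_paragraph:
            break  # the first blank line after content ends the paragraph
    if len(first_paragraph) == 1 and len(first_paragraph[0]) <= max_length:
        return first_paragraph[0]
    return None


body = "你好啊，这儿有N盘蛋糕可以吃。" * 3
print(guess_heading("蛋糕\n\n" + body))  # the short first line is returned
print(guess_heading(body))               # None: the first paragraph is too long
```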

miso-belica / sumy

Heading in Chinese #110
