intersystems / iknow

Community development repository for iKnow
MIT License

Japanese: add a sub-word split hint attribute to specific particles within Concepts #234

Closed makorin0315 closed 2 years ago

makorin0315 commented 2 years ago

This issue came out of a conversation with @Rei-hub and Dr. Torikai from Gunma University Hospital on March 16th, 2022 JST.

While iknowpy's ability to identify word groups that carry indivisible meaning has many benefits, it is difficult to get word frequency information from the Concepts identified. To assist in splitting some of the typically grouped words, we would like to explore the possibility of adding attributes to specific particles that exist within Concepts.

  1. The particle の, which behaves like the English possessive 's. For example, in 3年以内の薬のネット販売 ("online sales of drugs within three years"), both instances of の would receive the attribute JPsplitNO.

  2. The particle な that is preceded by an adjectival noun and followed by another Concept. For example, in 無力な国連 ("powerless United Nations"), the character な would receive the new attribute JPsplitNA.

In both cases, the word-group Concept should remain a single Concept. Rules that combine the surrounding Concepts into a single Concept already exist, so this should be relatively easy to achieve.

Rei-hub commented 2 years ago

I really appreciate you raising my request as a new issue. As you mentioned, Concepts in iknowPy often consist of a group of words, and this unique and powerful feature allows us to interpret large texts through only a few Concepts.

In contrast, in Japanese medical texts, disease or symptom names are often modified by words describing degree, time, or body parts, e.g. "consolidation周囲の胸膜陥入" (in English, "pleural indentation around consolidation").

This causes the single word "胸膜陥入" ("pleural indentation") and the compound "consolidation周囲の胸膜陥入" to be recognized as two different symptom expressions, even though they refer to essentially the same symptom. When developing a machine learning model after word segmentation by iknowPy, "subConcepts", i.e. finer units of Concepts, should also be detected so that identical symptom names can be consolidated.

In fact, the accuracy of my machine learning model for disease discrimination improved slightly when Concepts were additionally split before and after "の" or "な".

As I mentioned earlier, the original Concepts are a very useful and unique feature, so I think both Concepts and "subConcepts" are necessary. It's just an idea, but I think the two could be combined by splitting into subConcepts by default and introducing a "Concept Id" that assigns the same number to all components of the same Concept, analogous to Sentence Id or Source Id. (An example from my research is attached.)
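The "Concept Id" idea described above could be sketched roughly as follows. The data shapes here are purely hypothetical illustrations, not iknowpy's actual output format: subConcepts become the default unit, and members of the same original Concept share an id, so the coarse Concept can always be recovered by grouping.

```python
# Hypothetical sketch of the proposed "Concept Id": subConcepts carry the id
# of the word-group Concept they came from, analogous to Sentence Id / Source Id.
sub_concepts = [
    {"text": "consolidation周囲", "concept_id": 0},
    {"text": "胸膜陥入",          "concept_id": 0},  # same Concept as the line above
    {"text": "意識障害",          "concept_id": 1},
]

def group_by_concept(sub_concepts):
    """Recover the original word-group Concepts from the shared ids."""
    groups = {}
    for sc in sub_concepts:
        groups.setdefault(sc["concept_id"], []).append(sc["text"])
    return groups

# group_by_concept(sub_concepts)[0] yields ["consolidation周囲", "胸膜陥入"]
```

With this arrangement, a model can count word frequencies over subConcepts while an application can still reconstruct the indivisible word group when needed.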

If there is anything I can do such as validation in the real text, please let me know. Thank you.

[image attachment: example from the machine learning research mentioned above]

bdeboe commented 2 years ago

This is an interesting subject that seems fairly unique to Japanese, as in most Western languages we'd have split those sub-elements into separate concepts in the default iKnow output. You would still see a difference in specificity between, for example, "pleural indentation" and "indentation", but unlike the example in Japanese, splitting these further is not desirable. In the IRIS-embedded version, we offer queries for getting "similar" entities and a dictionary matching infrastructure that help navigate these concepts with differing levels of specificity, and we have experimented with exposing less-specific concepts as additional features to ML algorithms.

This request leaves me wondering how much of this additional splitting for Japanese requires linguistic rule processing (e.g. in the form of attributes), or whether, if only for experimentation, simply post-processing the iknowpy output and splitting it on these two particles with Python's split() function would suffice. Going through a semantic attribute (the Generic Attribute comes to mind) may actually require a lot more Python code on the application side to navigate and interpret the output.
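The naive post-processing route can be sketched in a few lines. This is a minimal illustration only (using `re.split` with a character class, since there are two separator characters); its blind spot is that it also splits nouns that merely contain these characters, which is precisely the weakness discussed in the following comment.

```python
import re

def naive_split(concept_text):
    """Post-processing sketch: split a Concept's surface text on the
    particles の and な. Purely character-based, so it cannot tell a
    particle from the same character inside an ordinary noun."""
    return [p for p in re.split("[のな]", concept_text) if p]

print(naive_split("3年以内の薬のネット販売"))  # ['3年以内', '薬', 'ネット販売']
print(naive_split("無力な国連"))              # ['無力', '国連']
print(naive_split("きのこ"))                  # ['き', 'こ'] -- wrong: の is part of the noun "mushroom"
```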

makorin0315 commented 2 years ago

@bdeboe - As mentioned in our call this morning, post-processing can do some of the splitting correctly, but considering that the characters の and な can be part of nouns (i.e., not used as particles), assigning the attributes in the language model would produce better output, because the rules have more concrete information about the entities preceding and following the particle characters.

makorin0315 commented 2 years ago

The initial work for the new attribute "JPno_join_Con" is complete with the recent commit above. When the character の is used within a Concept and acts as an adnominal (attributive) particle joining two nouns, the character now retains the attribute label at the end of processing.

@Rei-hub, there are other expressions that may also benefit from similar implementation. For example:

Also, there are some additional phrases that are currently not implemented with JPno_join_Con attribute in this commit/issue due to the nature of the phrases:

I would appreciate it if you could provide feedback on this latest implementation. I will close this issue and create another one to track the items above for further discussion.