Japanese: Request for some improvements of entity extraction algorithm in terms of more accurate analysis of medical colloquial text

Rei-hub commented 3 years ago

I’m Rei Noguchi from Gunma University Hospital, and I really appreciate the prompt implementation of “negation expansion” in Japanese (#33). I’m now trying to analyze daily progress notes in electronic medical records. Unlike discharge summaries, which are stylized documents, progress notes are often written in a colloquial or narrative style and include incomplete sentences, which causes some problems. To analyze this casual medical text more accurately, I would like to propose the following three improvements.

**1. Extract a word followed by +/- without parentheses as a single entity**

**2. Resolve the different entity extraction results depending on punctuation mark (Japanese period “。” or just a space)**

**3. Detect time expression**

The details are as follows.

1. Extract a word followed by +/- without parentheses as a single entity

The previous improvement (#31) enabled Katakana or numbers enclosed in parentheses to be concatenated with the preceding Concept as a single entity. This works in many cases, especially in stylized documents, and is useful for identifying negation (e.g. heart murmur(-) → no heart murmur). However, informal text such as daily progress notes poses a problem: some entities are followed by +/- without parentheses. Even in these cases, the +/- symbol should be concatenated with the preceding Concept as a single entity, because doctors write it with the same intention, and doing so clarifies the negation relation. Is this improvement technically possible? Importantly, in many of these cases there is no space between the entity and +/-, whereas there is often a half-width or full-width space after +/- to separate it from the next entity.
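To make the requested pattern concrete, here is a minimal Python sketch using only the standard `re` module. It is not iKnow's implementation or API; the pattern `MARKED` and the function `extract_marked_entities` are hypothetical names for illustration. It assumes the +/- sign directly follows the word and is followed by a half-width space, a full-width space, or the end of the text.

```python
import re

# Hypothetical pattern (an assumption, not iKnow code):
# a run of characters that are not whitespace, signs, or parentheses,
# immediately followed by one +/- sign (ASCII or full-width),
# which must itself be followed by a half-/full-width space or end of text.
MARKED = re.compile(r"([^\s+\-＋－()（）]+)([+\-＋－])(?=[ \u3000]|$)")

def extract_marked_entities(text):
    """Return (entity, is_negated) pairs for words directly followed by +/-."""
    results = []
    for m in MARKED.finditer(text):
        word, sign = m.group(1), m.group(2)
        # A minus sign marks absence of the finding, i.e. negation.
        results.append((word + sign, sign in "-－"))
    return results

print(extract_marked_entities("心雑音- 浮腫+　咳嗽－"))
```

Parenthesized forms such as 心雑音(-) are deliberately not matched here, since that case is already covered by #31.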

2. Resolve the different entity extraction results depending on punctuation mark (Japanese period “。” or just a space)

“熱はなし” (no fever) is currently extracted as a single entity, probably because the phrase contains the all-hiragana homonym “はなし”. In contrast, if the phrase ends with a punctuation mark (i.e. the Japanese period “。”), as in “熱はなし。”, it is divided into multiple entities. The latter behavior seems preferable for identifying the negation relation. However, because doctors often end a sentence with just a space in place of the Japanese period “。”, I think that a phrase ending with a space should be divided into multiple entities in the same manner as one ending with “。”.
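One way to picture the requested equivalence is a pre-processing step that treats a space as a sentence boundary, just like “。”. The sketch below is only an illustration in plain Python, not a description of how iKnow segments text; `BOUNDARY` and `split_segments` are hypothetical names, and the rule that every half-width or full-width space ends a segment is an assumption that real notes may violate.

```python
import re

# Assumption for illustration: the Japanese period and half-/full-width
# spaces all count as segment boundaries.
BOUNDARY = re.compile(r"[。 \u3000]+")

def split_segments(text):
    """Split text at periods or spaces and drop empty segments."""
    return [seg for seg in BOUNDARY.split(text) if seg]

# Both endings yield the same segment, which is the behavior requested above.
print(split_segments("熱はなし。"))
print(split_segments("熱はなし "))
```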

3. Detect time expression

In medical progress notes, there are many time expressions, so it would be very useful if they could be identified with something like markers. Some examples:

2015-06-16 12:47:42 -> 2015-06-16 (Date) + 12:47:42 (Time) or 2015-06-16 12:47:42 (Datetime)
「12月ごろ花粉症の内服処方」 (extracted as a single entity) -> 12月ごろ花粉症の内服処方 (Month) (in English: Around December prescription of medication for hay fever)
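The two examples above could be detected with patterns along these lines. This is a hedged sketch in plain Python, not iKnow functionality: the names `DATETIME`, `MONTH`, and `find_time_expressions` are hypothetical, the labels are taken from the examples, and it assumes half-width digits and the exact `YYYY-MM-DD HH:MM:SS` layout shown.

```python
import re
from datetime import datetime

# Assumed formats, based only on the two examples in this request.
DATETIME = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}")
MONTH = re.compile(r"(?<![0-9])(?:1[0-2]|[1-9])月")

def find_time_expressions(text):
    """Return (label, matched_text) pairs for datetimes and month mentions."""
    found = []
    for m in DATETIME.finditer(text):
        try:
            # Reject strings that look like a datetime but are not valid.
            datetime.strptime(m.group(0), "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue
        found.append(("Datetime", m.group(0)))
    for m in MONTH.finditer(text):
        found.append(("Month", m.group(0)))
    return found

print(find_time_expressions("2015-06-16 12:47:42"))
print(find_time_expressions("12月ごろ花粉症の内服処方"))
```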

iKnow is an indispensable tool especially in a medical field, where there are many unknown words. I realize the great value of iKnow and expect further improvement. Thank you for your help.

makorin0315 commented 3 years ago

@Rei-hub - thank you very much for your requests. As discussed in our conversation yesterday, I believe most of these can be accommodated to your liking. I think it would be best to create an issue for each request, so that we can have focused discussions. With your permission, I would like to close this issue and open 3 new ones. Please let me know.

Rei-hub commented 3 years ago

@makorin0315 Thank you for your quick response. I really appreciate your positive consideration of my requests. Regarding your suggestion to create an issue for each request, I completely agree; it's actually better that way. Thank you for your help.

makorin0315 commented 3 years ago

This issue has been split into 3 issues (#138, #140, #139). Closing.
