ncbo / bioportal-project

Serves to consolidate (in Zenhub) all public issues in BioPortal
BSD 2-Clause "Simplified" License

annotating longer strings with stop words or stop characters #207

Open graybeal opened 3 years ago

graybeal commented 3 years ago

In RCIT_A1 (private ontology), with screenshots omitted (sorry):

(1) For the classes ending with "L/min", only 3 of 6 are annotated. Why can half of them be annotated while the rest cannot?

(2) Concepts like "3 days ago", "3 days later", and "three days ago" (a synonym of "3 days ago") are not recognized; instead they are wrongly annotated with another class, "3 days". Since they are 3-character concepts, they should satisfy the indexing rule you mentioned.

AND

"Admitted with acute respiratory failure" and "admitted for heart failure" are not annotated.

see also #206

graybeal commented 3 years ago

It seems to me that '6-10', '11-15', and '1-5' all pass as tokens >= 3 characters; '6' and '> 15' do not ('> 15' is 2 tokens separated by a space, and each token is 1 or 2 characters).
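To make that reasoning concrete, here is a toy model of the length rule. It is only a sketch under stated assumptions, not the Annotator's actual code: it assumes labels are split on whitespace and that a token is indexed only if it has at least 3 characters. The names `MIN_TOKEN_LEN` and `indexable_tokens` are hypothetical.

```python
# Toy model of the ">= 3 characters" rule; not the Annotator's real tokenizer.
MIN_TOKEN_LEN = 3  # assumed threshold, taken from the rule described in this thread


def indexable_tokens(label: str) -> list[str]:
    """Split a label on whitespace and keep tokens of at least MIN_TOKEN_LEN characters."""
    return [tok for tok in label.split() if len(tok) >= MIN_TOKEN_LEN]


if __name__ == "__main__":
    for label in ["6-10", "11-15", "1-5", "6", "> 15"]:
        # An empty list means nothing in the label would survive the length filter.
        print(repr(label), "->", indexable_tokens(label))
```

Under these assumptions, '6-10', '11-15', and '1-5' each survive as a single token, while '6' and '> 15' yield nothing to index.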

I can't speak to the details of the second set of strings. We'll see if anyone else on the team or on this list can speak to them.

graybeal commented 3 years ago

Regarding the third case, what is likely happening is that the stop words 'for' and 'with' break up the longer string. This is somewhat analogous to cases like '> 15', where the space means the tokens on either side are ignored.
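A rough illustration of that hypothesis, assuming stop words act as hard boundaries during matching. The stop-word set here is hypothetical and contains only the two words named above, and `split_on_stop_words` is an illustrative helper, not part of any BioPortal API.

```python
# Sketch of stop words breaking a phrase into fragments; an assumption about
# the behavior, not the Annotator's actual implementation.
STOP_WORDS = {"for", "with"}  # hypothetical list: only the two words named in this thread


def split_on_stop_words(text: str) -> list[str]:
    """Return the fragments of `text` that remain when stop words are treated as boundaries."""
    fragments: list[str] = []
    current: list[str] = []
    for tok in text.lower().split():
        if tok in STOP_WORDS:
            # A stop word closes the fragment collected so far.
            if current:
                fragments.append(" ".join(current))
            current = []
        else:
            current.append(tok)
    if current:
        fragments.append(" ".join(current))
    return fragments


if __name__ == "__main__":
    for phrase in ["Admitted with acute respiratory failure", "admitted for heart failure"]:
        print(repr(phrase), "->", split_on_stop_words(phrase))
```

If matching works this way, neither full class label is ever seen as one string, which would explain why the two classes are not annotated.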

I'm wondering whether the string patterns should include stop words and tokens shorter than 3 characters. Can these be indexed separately, i.e., all tokenized strings longer than 3 characters get indexed even if they contain stop words or spaces? This might have to be a different process if the stop words are processed as the first step.
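One way the separate index might look, purely as a sketch of the proposal: whole labels are normalized and stored intact (stop words and spaces included) when they are longer than 3 characters, and input text is checked against them on word boundaries. All names here (`normalize`, `build_whole_label_index`, `find_whole_labels`) are hypothetical, and the matching is a naive scan rather than anything the Annotator does today.

```python
# Sketch of a separate whole-label index; a hypothetical design, not existing behavior.

def normalize(text: str) -> str:
    """Lowercase and collapse runs of whitespace to single spaces."""
    return " ".join(text.lower().split())


def build_whole_label_index(labels: list[str], min_len: int = 3) -> set[str]:
    """Keep every normalized label longer than `min_len` characters, stop words included."""
    return {normalize(label) for label in labels if len(normalize(label)) > min_len}


def find_whole_labels(text: str, index: set[str]) -> list[str]:
    """Return indexed labels that occur in `text` as whole, space-delimited phrases."""
    padded = f" {normalize(text)} "
    return sorted(label for label in index if f" {label} " in padded)


if __name__ == "__main__":
    index = build_whole_label_index(["3 days", "3 days ago", "admitted for heart failure"])
    print(find_whole_labels("patient admitted for heart failure 3 days ago", index))
```

In this sketch both "3 days" and "3 days ago" match the sample text, so a real implementation would still need a rule, such as preferring the longest match, to pick between overlapping labels.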