mcs07 / ChemDataExtractor

Automatically extract chemical information from scientific documents
http://chemdataextractor.org
MIT License

_in_stoplist should return True for entities trimmed out of existence #12

Closed dan2097 closed 7 years ago

dan2097 commented 7 years ago

For an entity such as "-aromatic", which is in IGNORE_SUFFIX, the entity that remains after running _in_stoplist has length 0. The entity should therefore be ignored (i.e. the function should return True) rather than reported as a zero-length entity.

For an entity that matches both IGNORE_PREFIX and IGNORE_SUFFIX, you can get into a situation where the end index is actually before the start index!

    >>> from chemdataextractor import Document
    >>> d = Document("non-aromatic")
    >>> d.cems
    [Span(u'', 4, 3)]

I assume that adding a check that the resultant entity's length is > 0 will fix that case as well.
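To make the proposed check concrete, here is a minimal sketch. It is not the actual ChemDataExtractor code: `trim_offsets` and `is_trimmed_away` are hypothetical helpers, and the two lists are one-element stand-ins for the real IGNORE_PREFIX and IGNORE_SUFFIX. It assumes the prefix and suffix are both matched against the original entity text, which is what the `Span(u'', 4, 3)` output above suggests.

```python
# Hypothetical one-element stand-ins for the library's real lists.
IGNORE_PREFIX = ["non-"]
IGNORE_SUFFIX = ["-aromatic"]


def trim_offsets(text, start, end):
    """Shrink (start, end) by any ignorable prefix and suffix of `text`.

    Both checks look at the untrimmed text, so a prefix and a suffix that
    overlap (as in "non-aromatic") can push `end` below `start`.
    """
    for prefix in IGNORE_PREFIX:
        if text.startswith(prefix):
            start += len(prefix)
            break
    for suffix in IGNORE_SUFFIX:
        if text.endswith(suffix):
            end -= len(suffix)
            break
    return start, end


def is_trimmed_away(text, start=0):
    """Return True when trimming leaves nothing, so the entity should be ignored."""
    new_start, new_end = trim_offsets(text, start, start + len(text))
    # Covers both the zero-length case and the inverted-span case.
    return new_end <= new_start
```

With these stand-in lists, `trim_offsets("non-aromatic", 0, 12)` gives the inverted offsets `(4, 3)` from the report, and `is_trimmed_away` returns True for both "-aromatic" and "non-aromatic". A single `new_end <= new_start` comparison handles the zero-length and negative-length cases together.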

mcs07 commented 7 years ago

Oops! Nice catch, thanks.