tarsqi / ttk

Tarsqi Toolkit
Apache License 2.0

Make Chunker sensitive to known terms #63

Closed by marcverhagen 7 years ago

marcverhagen commented 7 years ago

Chunking is currently based purely on tag sequences. Add a mechanism that uses the terms found in a document and makes sure they end up as chunks or as parts of chunks. For example, the string "hypodense lesion within the caudate lobe" is currently chunked as

   [hypodense lesion]ng within [the caudate lobe]ng

But if we knew that the whole string was a term, we would chunk it as

   [hypodense lesion within the caudate lobe]ng

Incidentally, this should also solve issue #62.
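
A minimal sketch of the idea, not the actual TTK code: it assumes chunks and terms are given as token offsets with an exclusive end, that terms do not overlap each other, and that every term is an NG chunk. The names `apply_terms` and `render` are hypothetical. Tag-based chunks that overlap a term are simply dropped here, which is the simplest possible policy.

```python
from typing import List, Tuple

Chunk = Tuple[int, int, str]   # (start, end, category), end is exclusive
Term = Tuple[int, int]         # (start, end), end is exclusive


def apply_terms(chunks: List[Chunk], terms: List[Term]) -> List[Chunk]:
    """Add each term as a chunk and drop tag-based chunks that overlap a term."""
    # Assumption for this sketch: every term is a noun group.
    result = [(start, end, "ng") for start, end in terms]
    for c_start, c_end, cat in chunks:
        # Keep a tag-based chunk only if it overlaps none of the terms.
        if all(c_end <= t_start or c_start >= t_end for t_start, t_end in terms):
            result.append((c_start, c_end, cat))
    return sorted(result)


def render(tokens: List[str], chunks: List[Chunk]) -> str:
    """Print tokens with chunks in the bracket notation used in this issue."""
    starts = {start: (end, cat) for start, end, cat in chunks}
    out, i = [], 0
    while i < len(tokens):
        if i in starts:
            end, cat = starts[i]
            out.append("[" + " ".join(tokens[i:end]) + "]" + cat)
            i = end
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)


tokens = "hypodense lesion within the caudate lobe".split()
tag_chunks = [(0, 2, "ng"), (3, 6, "ng")]
print(render(tokens, tag_chunks))
print(render(tokens, apply_terms(tag_chunks, [(0, 6)])))
```

With the whole string registered as a term, the second call prints the single NG chunk from the example above.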

marcverhagen commented 7 years ago

This was done, but minor issues remain.

For chunks derived from terms, we might want to consider adding all terms at all offsets and then merging them afterwards. At the moment, if we find an NG chunk at 1-3 and there is a term at 2-4, we do not get a chunk that contains the term. Instead, we would like 1-4 as a chunk. This is illustrated by

   [showed bleeding]vg in [two vessels]ng

where the existence of the "showed bleeding" NG chunk prevents the term "bleeding in two vessels" from being found as a chunk. In this case, the NG chunk is probably an error, but there are cases where the chunker does the right thing.

This would be a more global change that is not restricted to the _consume_term() method.
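
The merging step could look roughly like the sketch below. It is only an illustration under stated assumptions: spans are (start, end) pairs with inclusive offsets, as in the 1-3 and 2-4 example, and `merge_spans` is a hypothetical helper, not an existing TTK function. Which category the merged chunk should get is left open.

```python
def merge_spans(spans):
    """Merge overlapping (start, end) spans; both offsets are inclusive."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous span: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# An NG chunk at 1-3 and a term at 2-4 become one span.
print(merge_spans([(1, 3), (2, 4)]))  # -> [(1, 4)]
```

Spans that merely touch, such as 1-3 and 4-5, are kept apart in this sketch because they share no token.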

The import also introduces some chunking errors that did not exist before. For example, in one of our test files where "positive for protein" was a term, we now get a chunk with the wrong category:

   [positive for protein]ng

We do not allow chunks other than NG or VG chunks to be added. The simple chunk rules got away with checking only the head, but that is not sufficient here.
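
One possible stricter check is sketched below, purely as an illustration. It assumes Penn Treebank-style tags, assumes that "checking the head" means looking at the last token, and uses a hypothetical heuristic: take the head to be the last token before the first preposition rather than the last token of the whole term. The function name `term_category` and the tag sequences in the usage lines are assumptions, not taken from TTK.

```python
def term_category(tags):
    """Guess 'ng' or 'vg' for a term from its tag sequence, or return None."""
    # Only look at the tokens before the first preposition (IN or TO).
    core = []
    for tag in tags:
        if tag in ("IN", "TO"):
            break
        core.append(tag)
    if not core:
        return None
    head = core[-1]
    if head.startswith("NN"):
        return "ng"
    if head.startswith("VB"):
        return "vg"
    return None  # neither NG nor VG, so the term should not become a chunk


# "positive for protein", assumed tags JJ IN NN: the head is the adjective.
print(term_category(["JJ", "IN", "NN"]))  # -> None
# "hypodense lesion within the caudate lobe", assumed tags JJ NN IN DT JJ NN.
print(term_category(["JJ", "NN", "IN", "DT", "JJ", "NN"]))  # -> ng
```

A check on the last token alone would see the noun "protein" and accept the first term as an NG chunk, which is the error described above.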