Add initial BestMatch logic to Crossmap endpoint

blu3id commented 5 years ago

An initial attempt a creating BestMatch logic for complex map sets - specifically ICD-10 - to the Crossmap endpoint.

The logic currently is as follows:

Exclude inactive concepts as potential targets
For Complex mappings ensure that matching 'map_target` has the highest priority for that concept (this is very crude and ignores does not account for cases with multiple map_groups)
Determine which concepts (that haven't been excluded as above) subsume which others
Pick the concept that subsumes the most other concepts (not all)
If there is no winner then determine which concept has the shortest path to root and pick that

The following issues need to be resolved/improved:

Expensive ~~non-parallel (goroutined)~~ use of AllParents which for codes with lots of entries like C509 can cause significant wait (>10s).
Expensive ~~non-parallel (goroutined)~~ use of ShortestPathToRoot (though cant find an edge case yet mostly gets called when there are only 1-2 candidate concepts e.g C508
Edge cases where there are no active concepts (though this may not be a "bug")
Where there are multiple map_groups / general improvement of logic around map_priority e.g 315004001 |Metastasis from malignant tumor of breast (disorder) when searching for C509

Edit: Now with ~50% performance boost with goroutines on expensive loops

The following also needs to be added to the upstream snomed.proto for this PR to work:

message TranslateFromResponse {
    repeated Item translations = 1;
    int64 BestMatch = 2;

    message Item {
        ReferenceSetItem reference_set_item = 1;
...

This also partially fixes a bug with import of map_target and index logic in leveldb-service.go:

Imports of map_targets where including an extra space at the end of codes that ended with a number (didn't affect codes like G35X, T66X) causing problems with determining RefsetComponent key from index entry.

However there is still a bug in the index logic as trying e.g G400 to G409 causes the same error that was occurring when an extra space was being imported. Suspect there needs to be an uncommon delimiter e.g @ or / between compound keys and/or target key stored as value of index entry.

wardle commented 5 years ago

I like it.

Do you think we should optimise the AllParents() first... because that is used in lots of places and will be needed in ever more logic? In the old java version, I build a simple transitive closure table so it was fast, and essentially the framework is there in the KV store to store something similar that would probably fit all parents into a single mmap so v.quick.... I recently added the hooks for precomputations building and clearing.

But in any case, I had forgotten about the cross map priority value so this is looking good. Would be great to have some unit tests to check those edge cases and prove it's working as designed, and then easy to swap it all around and know we've not broken anything. TranslateResponse proto change will need to be lowercase.

wardle commented 5 years ago

What about sorting the item list to "best" match(es) first... and then having a generic match that subsumes all matches? Because how do we define "best" depends on usage I guess. For analytics, we might use most generic (ie lowest common subsumer of all matches - which may not be in the list of matches of course), but for display to the user, we might need different. So I'm tempted to merge and then rename to "generic_match" which would match the IDL for entity extraction as well.

wardle commented 5 years ago

Merged with plan for more changes...

blu3id commented 5 years ago

Do you think we should optimise the AllParents()

Probably a good idea before production deployment - would then suggest removing the hacky gouroutines that were added for performance. Also as above might be worth looking at how indexes are handled with the current leveldb code.

Would be great to have some unit tests to check those edge cases and prove it's working as designed..

Definitely, if there is one thing I'm good at it is forgetting to write test code.

That said there is some need to be slightly careful here as the current logic is surprisingly tolerant to errors - case in point 315004001 |Metastasis from malignant tumor of breast (disorder) when searching for C509 the current code currently definitely does not manage map_groups and map_priority correctly for this but the whole logic together doesn't pick this concept (correctly) for C509 or for C799.

So perhaps some of the logic should be broken up so components can be tested sperately and/or reused elsewhere (TODO...)

rename to "generic_match"

Yeah, I wasn't really sure where to slot this in API wise. Was even contemplating a separate API. The struggle is I view the current "crossmaps" endpoint as just providing "raw" matches as there is (prior to this) no logic handling complex maps (and even now given that some rely on real-time rule evaluation better suited to client apps rather than the terminology server) and the whole issue of mapping to inactive concepts.

There is also the interesting issue of where there are multiple ICD codes that when taken together would best match to a single SNOMED concept (reusing the same example) eg. C50.9 and C79.9 should map to 315004001 |Metastasis from malignant tumor of breast (disorder) rather than 128462008 | Secondary malignant neoplastic disease (disorder) and 254837009 | Malignant neoplasm of breast (disorder) though I guess expression parsing could help... the complexity always grows

wardle / go-terminology

Add initial BestMatch logic to Crossmap endpoint #1