wardle / hermes

A library and microservice implementing the health and care terminology SNOMED CT with support for cross-maps, inference, fast full-text search, autocompletion, compositional grammar and the expression constraint language.
Eclipse Public License 2.0

Does not recognise simple refsets without "Simple" in the (actually optional) Summary of the "ContentSubType" element. #31

Closed sidharthramesh closed 3 years ago

sidharthramesh commented 3 years ago

Hey @wardle, I've been having issues querying members of refsets using the constraint parameter of the API.

Version: v0.8.1

Querying one of the refsets to get its members using: http://localhost:8080/v1/snomed/search?constraint=^1131000189100

Gives a 404 Not Found error.

The release files I've used to index and search can be found here.

The query http://localhost:8080/v1/snomed/search?constraint=^1101000189108 (members of the 1101000189108 |CTV3 simple map reference set (foundation metadata concept)| refset) seems to work just fine.

Investigating further, looking at one of the Refset files ./SnomedCT_IndiaReferenceSetsRF2_PRODUCTION_202108067T120000Z/Snapshot/Refset/Content/der2_Refset_cardiologySnapshot_IN1000189_20210806.txt:

id  effectiveTime   active  moduleId    refsetId    referencedComponentId
a60050b1-8079-49ab-a3ad-3cb59ed33bdc    20201127    1   1121000189102   1131000189100   1001000119102

The concept 1001000119102 |Pulmonary embolism with pulmonary infarction (disorder)| has only the following refsets:

  "refsets": [
    900000000000497000,
    447562003
  ]

and does not include 1131000189100 which is part of the file used to index. However, the concept 1131000189100 does exist in the server.
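
To double-check the source data independently of hermes, the member rows can be read straight from the RF2 file. This is a minimal Python sketch, assuming the tab-delimited layout with a header row shown above; `active_members` and `SAMPLE` are names made up for illustration and are not part of hermes.

```python
import csv
import io


def active_members(rf2_lines, refset_id):
    """Yield referencedComponentId for each active row of the given refset."""
    # RF2 files are tab-delimited with a header row and no quoting.
    reader = csv.DictReader(rf2_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row["refsetId"] == str(refset_id) and row["active"] == "1":
            yield int(row["referencedComponentId"])


# The header and the single row quoted in this issue.
SAMPLE = (
    "id\teffectiveTime\tactive\tmoduleId\trefsetId\treferencedComponentId\n"
    "a60050b1-8079-49ab-a3ad-3cb59ed33bdc\t20201127\t1\t1121000189102"
    "\t1131000189100\t1001000119102\n"
)

members = list(active_members(io.StringIO(SAMPLE), 1131000189100))
print(members)  # [1001000119102]
```

If the file lists a member that the server does not report for the same refset, the file was most likely never imported, which is what the rest of this thread goes on to confirm.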

I believe the issues can be replicated by just using the packages: ./SnomedCT_IndiaReferenceSetsRF2_PRODUCTION_202108067T120000Z and ./SnomedCT_InternationalRF2_PRODUCTION_20210131T120000Z (link here). It might have something to do with the naming conventions and directory structure of the files?

wardle commented 3 years ago

If you use the "list" command at the command line it will tell you what files it finds. Hermes may be too strict about file name conventions.

Also, after import try using the status command to list the installed reference sets to double check what is installed, or not.

Also, is the file in question listed as a file imported during import?

dharsanb commented 3 years ago

On listing the refsets in SnomedCT_IndiaReferenceSetsRF2_PRODUCTION_202108067T120000Z, only 6 out of 32 files in the directory were listed. Directory and output of list pasted below.

Directory:
der2_cRefset_AssociationReferenceSnapshot_IN1000189_20210806.txt
der2_cRefset_AttributeValueSnapshot_IN1000189_20210806.txt
der2_Refset_cardiologySnapshot_IN1000189_20210806.txt
der2_Refset_cardiothoracicAndVascularSurgerySnapshot_IN1000189_20210806.txt
der2_Refset_cataractSnapshot_IN1000189_20210806.txt
der2_Refset_cervicalCancerSnapshot_IN1000189_20210806.txt
der2_Refset_childhoodDiarrheaSnapshot_IN1000189_20210806.txt
der2_Refset_dengueSnapshot_IN1000189_20210806.txt
der2_Refset_dermatologySnapshot_IN1000189_20210806.txt
der2_Refset_emergencySnapshot_IN1000189_20210806.txt
der2_Refset_fetalMedicineSnapshot_IN1000189_20210806.txt
der2_Refset_gastroenterologySnapshot_IN1000189_20210806.txt
der2_Refset_generalSurgerySnapshot_IN1000189_20210806.txt
der2_Refset_geriatricsSnapshot_IN1000189_20210806.txt
der2_Refset_iodineDeficiencySnapshot_IN1000189_20210806.txt
der2_Refset_leprosySnapshot_IN1000189_20210806.txt
der2_Refset_lymphaticFilariasisSnapshot_IN1000189_20210806.txt
der2_Refset_malariaSnapshot_IN1000189_20210806.txt
der2_Refset_nephrologySnapshot_IN1000189_20210806.txt
der2_Refset_neurologySnapshot_IN1000189_20210806.txt
der2_Refset_neurosurgerySnapshot_IN1000189_20210806.txt
der2_Refset_obstetricsAndGynecologySnapshot_IN1000189_20210806.txt
der2_Refset_oncologySnapshot_IN1000189_20210806.txt
der2_Refset_oralCancerSnapshot_IN1000189_20210806.txt
der2_Refset_orthopedicsSnapshot_IN1000189_20210806.txt
der2_Refset_pediatricsSnapshot_IN1000189_20210806.txt
der2_Refset_pregnancyRelatedAnemiaSnapshot_IN1000189_20210806.txt
der2_Refset_psychiatrySnapshot_IN1000189_20210806.txt
der2_Refset_radiologySnapshot_IN1000189_20210806.txt
der2_Refset_rheumatologySnapshot_IN1000189_20210806.txt
der2_Refset_strokeSnapshot_IN1000189_20210806.txt
der2_Refset_tuberculosisSnapshot_IN1000189_20210806.txt

OUTPUT:

=================================================================================================
| Distribution files in D:/SNOMED/SnomedCT_IndiaReferenceSetsRF2_PRODUCTION_202108067T120000Z:6 |

| :filename                                                        | :component           | :version-date | :format | :content-subtype             | :content-type |
|------------------------------------------------------------------+----------------------+---------------+---------+------------------------------+---------------|
| der2_cRefset_AssociationReferenceSnapshot_IN1000189_20210806.txt | AssociationRefset    | 2021-08-06    | 2       | AssociationReferenceSnapshot | cRefset       |
| der2_cRefset_AttributeValueSnapshot_IN1000189_20210806.txt       | AttributeValueRefset | 2021-08-06    | 2       | AttributeValueSnapshot       | cRefset       |
| der2_cRefset_LanguageSnapshot-en_IN1000189_20210806.txt          | LanguageRefset       | 2021-08-06    | 2       | LanguageSnapshot-en          | cRefset       |
| sct2_Concept_Snapshot_IN1000189_20210806.txt                     | Concept              | 2021-08-06    | 2       | Snapshot                     | Concept       |
| sct2_Description_Snapshot-en_IN1000189_20210806.txt              | Description          | 2021-08-06    | 2       | Snapshot-en                  | Description   |
| sct2_Relationship_Snapshot_IN1000189_20201127.txt                | Relationship         | 2020-11-27    | 2       | Snapshot                     | Relationship  |

wardle commented 3 years ago

See https://confluence.ihtsdotools.org/plugins/servlet/mobile?contentId=56330817#content/view/56330817

Am I being too strict in my interpretation of the file name conventions here? What type of refsets are these? Are they simple?

sidharthramesh commented 3 years ago

They are simple refsets. They are released by the government body in India. Not sure if they are using a wrong naming convention. Will have to give the document you mentioned a read.

wardle commented 3 years ago

Looks as if I'm using the "Summary" field of the "ContentSubType" element of the filename to determine reference set type. This works with the UK edition, but it isn't working with other distributions. We can determine the type of reference set in other ways: a der2_Refset file is always a simple refset, while a der2_[pattern]Refset file will include additional columns, which we could examine to determine the reference set type if it is not included in the Summary field.
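
To make the idea concrete, here is a rough Python sketch of reading the pattern out of the ContentType element instead of relying on the Summary. It is not the hermes implementation (hermes is written in Clojure); the regular expression is my own reading of the file naming convention, and `REFSET_FILENAME` and `classify` are hypothetical names.

```python
import re

# Assumed layout:
#   der2_[pattern]Refset_[Summary]<ReleaseType>[-lang]_<namespace>_<yyyymmdd>.txt
REFSET_FILENAME = re.compile(
    r"^der2_(?P<pattern>[a-z]*)Refset_"
    r"(?P<summary>.*?)(?P<release_type>Full|Snapshot|Delta)"
    r"(?:-(?P<lang>[A-Za-z-]+))?"
    r"_(?P<namespace>.+)_(?P<version>\d{8})\.txt$"
)


def classify(filename):
    """Return pattern and summary for a refset filename, or None if it is not one."""
    m = REFSET_FILENAME.match(filename)
    if m is None:
        return None
    pattern = m["pattern"]
    return {
        "pattern": pattern,
        "summary": m["summary"],  # optional free text, e.g. "cardiology"
        "simple": pattern == "",  # no pattern prefix means no extra columns
    }


print(classify("der2_Refset_cardiologySnapshot_IN1000189_20210806.txt"))
print(classify("der2_cRefset_AttributeValueSnapshot_IN1000189_20210806.txt"))
```

With this approach the Indian files are classified as simple refsets regardless of what the Summary says, and only files with a non-empty pattern need their extra columns examined.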

I don't think they are using the wrong naming convention - it's just different to the UK, and a bit more complicated to determine the type of reference set. Looks like a bug on my part.

sidharthramesh commented 3 years ago

@wardle Great! Yes. I just finished reading the docs you gave me, and the naming convention seems okay. Regarding how to detect if it’s a simple refset, I think your approach seems sensible.

wardle commented 3 years ago

The fix for this will also fix #30 which is nice.

sidharthramesh commented 3 years ago

Yes!! Great! Any way we could help?

wardle commented 3 years ago

I'm pretty much done but I'd like to add more testing.

But as part of doing that, I have come across an issue with the Spanish refsets, which don't use patterns or names in file names. It's not completely clear to me yet, but I may need to try to deduce file types by looking at column headings. Patterns are meant to tell you how to serialise the user-defined data in the item (c, i or s), but without that, all one can do is treat the values as strings and leave all the work to the client.
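
A small Python sketch of what the pattern buys you, and of the string fallback when it is missing or does not fit. This is illustrative only and not hermes code; `PARSERS` and `parse_extra_fields` are made-up names, and I am assuming one pattern character per extra column (c = component identifier, i = integer, s = string).

```python
# One parser per pattern character. Identifiers and integers both fit in a
# Python int (a long in JVM terms); strings are kept as they are.
PARSERS = {"c": int, "i": int, "s": str}


def parse_extra_fields(pattern, values):
    """Parse the columns that follow referencedComponentId using the pattern.

    If the pattern is missing, has the wrong length, or contains an unknown
    character, fall back to returning the raw strings unchanged.
    A malformed value for a usable pattern raises ValueError.
    """
    usable = len(pattern) == len(values) and all(ch in PARSERS for ch in pattern)
    if not usable:
        return list(values)
    return [PARSERS[ch](value) for ch, value in zip(pattern, values)]


print(parse_extra_fields("sc", ["A01", "447562003"]))  # ['A01', 447562003]
print(parse_extra_fields("", ["A01", "447562003"]))    # ['A01', '447562003']
```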

Currently the Indian refsets you shared are picked up nicely and there's no issue because they're all simple refsets.

It would be helpful if you can test with the full distribution. I can push work so far to a different branch for you to test if you're willing, or I will make a synthetic distribution with these issues in it and test with that and release when I'm happy.

sidharthramesh commented 3 years ago

Thank you. That sounds doable. Please push your changes and let me know. I’ll test it with all the distributions I have and see if it’s picking up all files and report if there are any missing. I think I also have access to the Spanish distribution. Will check that too if I have time.

wardle commented 3 years ago

Thanks @sidharthramesh - the big issue is with the Spanish distribution. Here is just one example:

Filename

der2_cRefset_VMPPCNSpainDrugMapSnapshot_es-ES_es_20211001.txt

So what is that? Well it is a refset. It should have the basic structure and then an extra column 'c' - ie a concept identifier. So let's take a look:

id                                      effectiveTime   active  moduleId                      refsetId        referencedComponentId   mapGroup    mapPriority     mapRule mapAdvice       mapTarget       correlationId   mapCategoryId referencedComponentTerm orderGroup
7e539731-3c21-41fb-ad09-66606c164a60    20180501        1       90000011000140108       90000091000140102       720507  1       1       TRUE    ALWAYS RELATED MAPTARGET        54851000140103  22681000122102  22651000122108   TADALAFILO TECNIGEN 20 MG  COMPRIMIDOS RECUBIERTOS CON PELICULA EFG , 4 COMPRIMIDOS    1

So it's a cross map, with an 'orderGroup' on the end - so not a concept identifier but an integer. Which is fine from a serialisation point of view, because we'd store both as a long anyway, but not ideal. I have other examples in which they've put a string in there and not used a correct pattern.

So at the moment, hermes complains, because it can't work out that this is a cross-map.

The options are:

  1. Flag as an error and just fail. Suggest those working with that distribution get it fixed or manually add naming e.g. SimpleMap to the errant filenames. More robust, but puts onus on user.
  2. Let hermes match on "Map" and assume it's a "SimpleMap". Might be flaky and I'm not sure it's a correct assumption.
  3. Let hermes use a combination of things to try to match - e.g. naming patterns AND column headings. Might be flaky, but less than (2).
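
For the column-heading half of option 3, a Python sketch of what matching on headings could look like. This is an assumption-laden illustration and not what hermes does: `STANDARD`, `KNOWN_EXTRA_HEADINGS` and `guess_refset_type` are hypothetical names, the type labels are only for illustration, and the lookup requires an exact match on the extra headings.

```python
# The six columns every refset file starts with.
STANDARD = ("id", "effectiveTime", "active", "moduleId",
            "refsetId", "referencedComponentId")

COMPLEX_MAP = ("mapGroup", "mapPriority", "mapRule", "mapAdvice",
               "mapTarget", "correlationId")

# Extra headings (after the standard six) mapped to an illustrative type label.
KNOWN_EXTRA_HEADINGS = {
    (): "Simple",
    ("mapTarget",): "SimpleMap",
    COMPLEX_MAP: "ComplexMap",
    COMPLEX_MAP + ("mapCategoryId",): "ExtendedMap",
}


def guess_refset_type(header_line):
    """Return a type label for a refset header line, or None if unrecognised."""
    cols = tuple(header_line.rstrip("\r\n").split("\t"))
    if cols[:6] != STANDARD:
        raise ValueError("not a refset header")
    return KNOWN_EXTRA_HEADINGS.get(cols[6:])


spanish = "\t".join(STANDARD + COMPLEX_MAP + (
    "mapCategoryId", "referencedComponentTerm", "orderGroup"))
print(guess_refset_type("\t".join(STANDARD)))  # Simple
print(guess_refset_type(spanish))              # None
```

With an exact match, the Spanish file above is not recognised because of the two trailing columns, so it would still need either a looser prefix match (which is where the flakiness comes in) or the fail-fast behaviour of option 1.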

sidharthramesh commented 3 years ago

I’ve always felt that failing outright is better than making assumptions on unstable grounds. Guessing might even behave differently from what the user expects. I would always go with simple and transparent over “automagical” behaviour.

With the Spanish reference set, I think the naming doesn't follow the convention. I think the error should just say something like “refset format and naming convention mismatch” and ask the end user (or ideally the Spanish release center) to rename it correctly.

Until then, instead of failing outright, you could raise a warning and skip the file during importing and indexing.
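
Roughly what I have in mind, as a Python sketch rather than a proposal for the actual hermes code; `select_importable`, the `recognise` callback and the `strict` flag are all hypothetical names.

```python
import warnings


def select_importable(filenames, recognise, strict=True):
    """Return the recognised files; fail or warn on the rest.

    recognise is any callable that returns True for a file the importer
    understands. With strict=True the first unrecognised file raises
    ValueError; with strict=False it is skipped with a warning.
    """
    importable = []
    for name in filenames:
        if recognise(name):
            importable.append(name)
        elif strict:
            raise ValueError(
                f"refset format and naming convention mismatch: {name}")
        else:
            warnings.warn(f"skipping unrecognised file: {name}")
    return importable
```

Either way the user is told exactly which file was the problem, and nothing is imported on a guess.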

wardle commented 3 years ago

Hi @sidharthramesh - it should work now for the Indian reference sets.

fc77e8aea2c1024dac4907fbc3c731681615fda4 also fails fast if there is an issue.

Let me know how you get on.

wardle commented 3 years ago

Dear @sidharthramesh and @DharsanB : You can now use the v0.8.3 release and see whether it resolves the issue. Thanks for testing. Let me know if there are any issues.

clj -M:run list ~/Downloads/SnomedCT_IndiaReferenceSetsRF2_PRODUCTION_202108067T120000Z

identifies a more complete list of importable files now:

[image: output of the list command]

sidharthramesh commented 3 years ago

Hey @wardle, thank you for the incredibly quick fix. We've tested it with our release files, and so far everything is getting indexed. Thanks!