kids-first / kf-lib-data-ingest

🏭 Kids First Data Ingest Library
https://kids-first.github.io/kf-lib-data-ingest/
Apache License 2.0
5 stars 0 forks source link

Google's FHIR requires GUIDs for ID (and overwrites anything you provide for ID at create time) #577

Closed torstees closed 3 years ago

torstees commented 3 years ago

For study, the ingestion library forces the use of a cleaned version of the study's name (as best as I can tell) to be used for the ID field for actual study resources and also references back to them. This is done by way of adding the transformed study's name the the record under the target_service_id.

When testing out my ingest package using google's healthcare fhir service, those IDs were stripped out and GUID were used instead, resulting in the all attempts to reference the study to fail (since we were assuming that ResearchStudy/study_id would be an actual FHIR resource.

Removing the following lines from load_v1.py (413-414) resolved the issue for me. I would be happy to issue a PR, but I honestly don't know what sort of implications this may have on other parts of the library.

`

guarantee existence of the study ID column

            for r in transformed_records:
                r[CONCEPT.STUDY.TARGET_SERVICE_ID] = self.study_id

`

I believe there are other ways to reference an object, however, using those IDs is the cleanest. The current approach works fine for the hapi FHIR server, but it seems google is a bit picker.

fiendish commented 3 years ago

The loader uses plugins for server-specific behaviors. (See https://kids-first.github.io/kf-lib-data-ingest/tutorial/target_service_plugins/index.html). The library's "default" plugin is written for the Kids First non-FHIR dataservice. Our initial attempt at creating a prototype plugin for FHIR-specific (HAPI) behavior is over in https://github.com/kids-first/kf-model-fhir/tree/master/kf_model_fhir/ingest_plugin (probably buggy since we aren't actually using it). I think a modified GoogleFHIR plugin that accounts for the necessary differences would be the right model for the kind of change that you're suggesting. I think at some point it will probably make sense to migrate the Kids First FHIR plugin into this repository (we just haven't switched over to using FHIR for our server yet), so I'd be super willing to add a Google FHIR plugin of your design to this repository (or the kf-model-fhir repository, but I fear that that repo may have fallen behind the NCPI discussions) if you're working on something that aligns with Kids First.

torstees commented 3 years ago

It's not the plugin, but the load_v1.py script, lines 404 and 405. They cause the CONCEPT.STUDY.TARGET_SERVICE_ID with the value of the study name (possibly modified to remove invalid characters) to be added to the record, which is later picked up as the ID. This is accepted by HAPI FHIR server and works just fine, bit is overwritten by google's FHIR server with a GUID. But, because the library has identified that this has a valid ID, it is forever returned as the study name instead of the actual ID that google has assigned. The only way around it for me was to remove those two lines of code from the load_v1.py inside the kf ingest library.

From: Avi Kelman notifications@github.com Sent: Monday, February 15, 2021 4:17 PM To: kids-first/kf-lib-data-ingest kf-lib-data-ingest@noreply.github.com Cc: Torstenson, Eric Shawn eric.s.torstenson@vumc.org; Author author@noreply.github.com Subject: Re: [kids-first/kf-lib-data-ingest] Google's FHIR requires GUIDs for ID (and overwrites anything you provide for ID at create time) (#577)

The loader uses plugins for target service-specific behaviors. The library's "default" plugin is written for the Kids First non-FHIR dataservice. Our attempt at creating FHIR-specific behavior is over in https://github.com/kids-first/kf-model-fhir/tree/master/kf_model_fhir/ingest_pluginhttps://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fkids-first%2Fkf-model-fhir%2Ftree%2Fmaster%2Fkf_model_fhir%2Fingest_plugin&data=04%7C01%7Ceric.s.torstenson%40vumc.org%7Cebb3f2cd86b04f9c177a08d8d1ff67ce%7Cef57503014244ed8b83c12c533d879ab%7C0%7C0%7C637490242188766219%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=OBaf3ES%2BCONKpci93f6F6vjxRuMnPOZQonELj0ze1B8%3D&reserved=0 and would be the right model for what you're suggesting.

- You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fkids-first%2Fkf-lib-data-ingest%2Fissues%2F577%23issuecomment-779471695&data=04%7C01%7Ceric.s.torstenson%40vumc.org%7Cebb3f2cd86b04f9c177a08d8d1ff67ce%7Cef57503014244ed8b83c12c533d879ab%7C0%7C0%7C637490242188776210%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=Eqbu0lOU%2BbHhNV1QYoDOEkzVN8FEbBOnU7IrEdBX%2BvI%3D&reserved=0, or unsubscribehttps://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAAGQXQ5IZISVF52DJ5LBB6LS7GMNPANCNFSM4XU6PGMA&data=04%7C01%7Ceric.s.torstenson%40vumc.org%7Cebb3f2cd86b04f9c177a08d8d1ff67ce%7Cef57503014244ed8b83c12c533d879ab%7C0%7C0%7C637490242188776210%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=X%2FkTwQC83PZ9STwOuCyGz10AXYuTkE73C5zYlPRy70s%3D&reserved=0. [ WARNING : This email came from an external source. Please treat this message with additional caution.]

fiendish commented 3 years ago

I updated and expanded my previous comment. (Bad habit, I know). I don't know whether updates come through via email.

It is very likely that I'm misunderstanding the situation. Is it possible for me to see the FHIR plugin you've been working on? That will probably help explain things to me.

But in case I do actually understand currently...

They cause the CONCEPT.STUDY.TARGET_SERVICE_ID with the value of the study name (possibly modified to remove invalid characters) to be added to the record

That tabular record doesn't get sent to the server directly. The plugin first determines how the record gets interpreted and used when submitting things to the server (V2 plugins also allow you to query things from the server to eliminate the requirement for purely local caching to avoid duplication).

All that lines 413-414 are doing in the loader are making a study identifier available to the plugin for every record, but those two lines don't dictate how or even whether the plugin uses it. We should definitely rename the column in those lines to CONCEPT.STUDY.UNIQUE_KEY instead of TARGET_SERVICE_ID or something that isn't definitively the target service ID, but I think that's likely a separate issue.

One could for instance change this line in the old prototype plugin https://github.com/kids-first/kf-model-fhir/blob/master/kf_model_fhir/ingest_plugin/target_api_builders/kfdrc_research_study.py#L58 and then the submitted id field would not be populated that way. One could further change https://github.com/kids-first/kf-model-fhir/blob/72673652efa8fcb61c1c817196eb7f71da53004c/kf_model_fhir/ingest_plugin/kids_first_fhir.py#L96 to change where we try to send things.

N.B. It is very likely that the kf-model-fhir plugin repository I keep referencing is out of date for the current ingest library code. I've been making changes (hopefully all improvements) to the ingest library but have not been directly working with the FHIR subgroup recently, so that repo has languished. As soon as I'm done with my current set of library upgrades I do plan to reconnect with the group to make sure that FHIR development is supported by the library better.

torstees commented 3 years ago

The current plugin hasn't been checked in yet, since I'm still ironing a few other issues that stem from changing a few of those IDs, but this is my basic understanding of where the problem lies:

Inside the load_v1._run(), those two lines were injecting the study name into the actual record itself under the target_service_id associated with the study object. Later, as a result of that inject, the function, load_v1._get_target_id_from_record() always returns the study name. So, in my plugin, where I use get_target_id_from_record(ResearchStudy, record), it always returns the study name even if the object is new and has never been "CREATE"d. As a result, that name is cached as the ID, even though google created a GUID for the ID. The HAPI fhir server actually is fine with those study names as IDs (I'm sure there are rules, about what it would accept, but it's never complained about the study names I've used in the past). This is just a new wrinkle introduced by playing with google's offering.

That said, I may very well be doing something wrong that is causing it to behave strangely.

fiendish commented 3 years ago

[mostly copied from slack for posterity]

The root issue seems to be that we need to make the library compatible with loading studies that don't already exist in the target server.

Kids First study entities are currently created in our dataservice long before we ever actually try to load data into them (for logistical reasons relating to things like AWS bucket creation). That means we always know the correct target identifier for the study before we ingest, and that identifier is what we put in our ingest_package_config.py in the "study" field. If you look at the example contents of ingest_package_config.py at https://kids-first.github.io/kf-lib-data-ingest/tutorial/ingest_package.html#ingest-package-components note that study there is set to SD_MEOWMEOW. That's a dataservice autogenerated identifier, basically identical in function to the GUIDs that Google's FHIR server is forcing you to assign.

Unfortunately I think the library currently enforces the idea that the identifier needs to be pre-assigned and not discovered during operation, which is incompatible with your needs. Clearly that assumption that the study already exists has to go away and we need to make a more flexible design that allows loading research study details inside of the ingest process.

The problem with the workaround that you discovered is that some globally unique identifier of the study really does need to be passed to the other entities in order to uniquely identify them against things (people/specimens/etc) from other studies that might use the same research IDs, since there's generally no law which says that two studies can't use overlapping identifiers. If this part of the explanation is unclear, see e.g. this part of the Kids First service plugin indicating that a participant research identifier is unique within its study: https://github.com/kids-first/kf-lib-data-ingest/blob/1b3eeaa541717c491537f2c9dc40938db7248366/kf_lib_data_ingest/target_api_plugins/kids_first_dataservice.py#L170-L174 We need to preserve the ability to do this, and the person extracting the data is almost certainly not going to want to spray study identifiers everywhere in their extracts/transforms to accomodate this need so the required study ID wouldn't otherwise be available there.

So we still need to make sure that the research study identifier gets special treatment, just without requiring you to know it in advance. I think we don't actually need to use the server's identifiers for any of this and could instead just make a few judicious alternations from study "target service identifier" to some other unique identifier and then we can keep choosing some kind of unique thing that just isn't restricted by the whims of the server.

fiendish commented 3 years ago

@torstees Can you try #580 ? I'm hoping it will just work without requiring further amendments like when I accidentally forced the V2 transition on you