nlesc-sigs / data-sig

Linked data, data & modeling SIG

Generic data mapping tool/method #58

Closed c-martinez closed 10 months ago

c-martinez commented 3 years ago

From @CunliangGeng:

Hi Data SIGers,

I came across a data mapping issue in the STRAP project while making an existing system, GameBus, FHIR-enabled.

Let me give some background info:

To do the mapping, a quick approach is to encode the mapping rules directly in a programming language (I did a bit of this in Java for a demo). But that takes effort to maintain, and the code cannot easily be reused in other projects. So I'm looking for a method or tool to make the mapping process more generic.

I have prepared a structured mapping codebook; see a simple example in Fig 2: the mapping between GameBus Player and FHIR Patient. One idea for a generic mapping method is to create a mapping code generator that reads the codebook and automatically generates the code that describes the mapping rules.
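
To make the code-generator idea concrete, here is a minimal sketch in Python. The codebook layout (source field, target field, optional conversion name) is an assumption, since the real codebook is only shown in Fig 2; the `gender` row and the names `CODEBOOK` and `generate_mapper` are hypothetical and only there for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical codebook rows: (source field, target field, conversion name or None).
# Only dateOfBirth -> birthDate comes from the thread; the gender row is made up.
Rule = Tuple[str, str, Optional[str]]
CODEBOOK: List[Rule] = [
    ("dateOfBirth", "birthDate", "epoch_ms_to_date"),
    ("gender", "gender", None),
]


def generate_mapper(func_name: str, rows: List[Rule]) -> str:
    """Return Python source for a function that applies the codebook rows."""
    lines = [f"def {func_name}(source, converters):", "    target = {}"]
    for src, dst, conv in rows:
        value = f"source[{src!r}]"
        if conv is not None:
            # Value conversion is delegated to a named, predefined function.
            value = f"converters[{conv!r}]({value})"
        lines.append(f"    target[{dst!r}] = {value}")
    lines.append("    return target")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Show the generated mapping code for the GameBus Player -> FHIR Patient rows.
    print(generate_mapper("player_to_patient", CODEBOOK))
```

The generated function takes the source record plus a dictionary of conversion functions, so the codebook stays the single place where mapping rules are edited.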

What do you think? Do you have any other ideas?

Thanks & Cheers, Cunliang

Fig 1:

fig1_GameBus-FHIR_architecture

Fig 2:

fig2_codebook_mapping_GBplayer_FHIRpatient

c-martinez commented 3 years ago

It sounds like what you want is to map from one schema to another. I believe @bpmweel did something very similar in the past, for the Data Quality project? Also I’m not sure if this is a good starting point.

CunliangGeng commented 3 years ago

From @bpmweel,

Hi Cunliang, All,

Indeed, we did something similar in the Data Quality project. There, it was a database with columns that needed to be mapped to certain concepts in an ontology. For that we used a version of the RDF Mapping Language (https://rml.io/specs/rml/). I see there is also one available for mapping JSON.

There is a bit of a learning curve to this mapping language, but I believe it builds on a W3C standard (R2RML) and is supported by multiple tools, so it should be more generic than creating the mapping directly in code. For instance, I remember being able to load the data from a relational database into an RDF store, add the RML mapping to the same store, and then query the RDF store on the mapped concepts.

Hope this helps!

Kind regards, Berend

CunliangGeng commented 3 years ago

Many thanks, @c-martinez @bpmweel

I took a careful look at what you provided, mainly two tools/methods: R2RML (RDB to RDF Mapping Language) and RML (RDF Mapping Language). RML is based on and extends R2RML, so they are essentially the same thing.

After going through the RML examples and syntax, I see that it's a specific language for schema mapping or, to be more precise, concept mapping. The issue I raised involves not only schema mapping but also value conversion. For example, Fig 2 includes the mapping from GameBus.player.dateOfBirth (type INT, value 1577836800000) to FHIR.patient.birthDate (type date, value 2020-01-01). We have to map the concept player.dateOfBirth to patient.birthDate, and also convert the value from INT 1577836800000 to date 2020-01-01.
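
The value-conversion half of this example can be sketched in a few lines of Python. This assumes the GameBus integer is milliseconds since the Unix epoch, interpreted in UTC, which is consistent with the two values quoted above; the function name `epoch_ms_to_fhir_date` is hypothetical.

```python
from datetime import datetime, timezone


def epoch_ms_to_fhir_date(epoch_ms: int) -> str:
    """Convert milliseconds since the Unix epoch to a YYYY-MM-DD date string (UTC)."""
    moment = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
    return moment.date().isoformat()


if __name__ == "__main__":
    print(epoch_ms_to_fhir_date(1577836800000))  # 2020-01-01
```

If the source system stores local time rather than UTC, the time zone passed to `fromtimestamp` would need to change accordingly.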

RML can handle the concept mapping but not the value conversion. The value conversion requires predefined conversion functions to do the actual transformation work. So I'm wondering:

  1. Is there any language or engine like RML that can also take predefined conversion functions as input and then perform concept mapping and value conversion at the same time?
  2. Is it possible to avoid writing the mapping code manually (e.g. in RML, see this example https://rml.io/specs/rml/#example-JSON), for instance by using a code generator that reads the structured codebook in Fig 2 and generates the mapping code automatically? My intuition tells me it's not possible, but I'd still like to hear any other ideas.
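
As a rough illustration of the first question, the sketch below shows a tiny rule engine in Python that takes a registry of predefined conversion functions and applies concept mapping and value conversion in one pass. It is not RML or any existing tool; the rule layout and the names `CONVERTERS`, `PLAYER_TO_PATIENT` and `apply_rules` are assumptions made for this example.

```python
from datetime import datetime, timezone
from typing import Any, Callable, Dict, List, Optional, Tuple

Converter = Callable[[Any], Any]
Rule = Tuple[str, str, Optional[str]]  # (source field, target field, converter name)

# Predefined conversion functions, looked up by name from the rules.
CONVERTERS: Dict[str, Converter] = {
    # Assumes the input is milliseconds since the Unix epoch, in UTC.
    "epoch_ms_to_date": lambda ms: datetime.fromtimestamp(
        ms / 1000, tz=timezone.utc
    ).date().isoformat(),
}

# The one rule taken from the thread: player.dateOfBirth -> patient.birthDate.
PLAYER_TO_PATIENT: List[Rule] = [
    ("dateOfBirth", "birthDate", "epoch_ms_to_date"),
]


def apply_rules(
    source: Dict[str, Any],
    rules: List[Rule],
    converters: Dict[str, Converter] = CONVERTERS,
) -> Dict[str, Any]:
    """Rename fields (concept mapping) and convert their values in a single pass."""
    target: Dict[str, Any] = {}
    for src, dst, conv in rules:
        if src not in source:
            continue  # Skip fields that are absent from the source record.
        value = source[src]
        target[dst] = converters[conv](value) if conv else value
    return target


if __name__ == "__main__":
    player = {"dateOfBirth": 1577836800000}
    print(apply_rules(player, PLAYER_TO_PATIENT))  # {'birthDate': '2020-01-01'}
```

Because the rules are plain data, they could in principle be loaded from a structured codebook instead of being written by hand.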

c-martinez commented 3 years ago

Could something like https://openrefine.org/ be useful for these predefined conversions?

I will also ask around to see if there is a SPARQL way of doing this with RDF, so you could combine it with RML.

CunliangGeng commented 3 years ago

Thanks! It's an interesting idea to separate concept mapping and value conversion by handling each with different tools.

CunliangGeng commented 3 years ago

Thanks to Carlos and Adam for the brainstorming!

As promised during the meeting, I'm putting the Google healthcare data harmonization tool (let's just call it the Whistle tool) here for the record.

What it is

The Whistle tool is an engine (written in Go) that converts data from one structure to another, based on a configuration file that describes how.

Quick example

image

input.json is the source data structure, and we want to convert it to the target data format shown in output.json. All we need to do is write the mapping rules in the file trans.wstl. You can see that the mapping rules look quite simple and straightforward.

To run this mapping, just use a command like

go run PathOfWhistleMainProgram -input_file_spec=input.json -output_dir=. -mapping_file_spec=trans.wstl

Try it yourself

If you want to get your hands dirty, DO NOT follow the Jupyter Notebook example recommended in the repo README. That requires Google Cloud services, which makes things much more complex. You just need to do the following:

  1. Download the repo.
  2. Install Go, the Java JDK and protoc, as required here.
  3. Run build_all.sh in the repo directory.

Then you can run the mapping tool on your local machine by following the simple examples here.

Want to know more

A short talk and its slides

CunliangGeng commented 3 years ago

I got a new idea for avoiding writing the mapping rules by hand: it looks promising to convert a structured codebook into the Whistle mapping file trans.wstl, and to leave the implementation of the transformation functions to developers, or maybe just use predefined transformation functions. I'll look into it.

CunliangGeng commented 3 years ago

Thanks to @fdiblen, who recommends GraphQL, which can be used to "merge as many REST APIs as you like to form your own API and return the fields you need or new custom fields." See: https://www.apollographql.com/blog/backend/layering-graphql-on-top-of-rest/