google / fhir-data-pipes

A collection of tools for extracting FHIR resources and analytics services on top of that data.
https://google.github.io/fhir-data-pipes/
Apache License 2.0

Create and enable end-to-end test for the batch mode. #8

Closed bashir2 closed 3 years ago

bashir2 commented 4 years ago

We need end-to-end tests that start from a fixed OpenMRS instance, bring up the pipelines in streaming and batch modes to ETL data from the source OMRS into the sink FHIR-store/data-warehouse, and then verify the content of the sink.

This depends on #7.
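One way to sketch the final "verify the content of the sink" step: compare resource counts in the sink against counts known to exist in the fixed OpenMRS test data. This is a hypothetical illustration — the function name, resource types, and counts are placeholders, not part of the actual test suite.

```python
# Hypothetical verification helper for the e2e test: given counts pre-baked
# into the OpenMRS test image and counts read back from the sink after the
# pipeline run, report every resource type that does not match.

def find_count_mismatches(expected, actual):
    """Return {resource_type: (expected, actual)} for every type that differs."""
    mismatches = {}
    for resource_type, expected_count in expected.items():
        actual_count = actual.get(resource_type, 0)
        if actual_count != expected_count:
            mismatches[resource_type] = (expected_count, actual_count)
    return mismatches

# Illustrative counts only; a real test would query the FHIR store or count
# rows in the Parquet output.
expected = {"Patient": 20, "Encounter": 55, "Observation": 300}
actual = {"Patient": 20, "Encounter": 55, "Observation": 300}
assert find_count_mismatches(expected, actual) == {}
```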

pmanko commented 4 years ago

For testing the streaming modules, we have to trigger the creation and update of clinical data on the OpenMRS side.

I've previously done something similar in Postman using the OpenMRS Rest API, but I think Create/Update is now supported for many resources in the FHIR2 module, so we could use FHIR requests to trigger the streaming side.
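A sketch of what triggering the streaming side via a FHIR create request might look like. The endpoint path, payload shape, and helper names below are assumptions for illustration, not taken from this thread or the project's code.

```python
import json

def build_test_patient(family, given):
    """Build a minimal FHIR Patient resource as a plain dict."""
    return {
        "resourceType": "Patient",
        "name": [{"family": family, "given": [given]}],
        "gender": "female",
        "birthDate": "1990-01-01",
    }

def create_patient(base_url, auth, family, given):
    """POST the Patient to a running OpenMRS instance.

    Requires the third-party `requests` package and a live server; the
    `/ws/fhir2/R4/Patient` path is an assumption about the FHIR2 module's
    endpoint layout.
    """
    import requests  # third-party; only needed when actually sending
    resp = requests.post(
        f"{base_url}/ws/fhir2/R4/Patient",
        data=json.dumps(build_test_patient(family, given)),
        headers={"Content-Type": "application/fhir+json"},
        auth=auth,
    )
    resp.raise_for_status()
    return resp.json()
```

In a test, each created resource would become an event for the streaming pipeline to pick up.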

Also, @bashir2 do you have an approach in mind for testing? This write-up might be useful: https://engineering.zalando.com/posts/2019/02/end-to-end-microservices.html#:~:text=End%20to%20end%20testing%20is,of%20them%20can%20get%20tricky.

bashir2 commented 4 years ago

@pmanko re. testing the streaming version, what I was thinking about was much simpler, i.e., start from a set of events on the server side (OpenMRS) and a "clean history" on the client side (our pipeline code). Clean history means that the streaming pipeline will read the events and fetch the corresponding records (e.g., by clearing the DB tables for the Atom Feed client or the event index file of Debezium). IOW, imagine we create a few events in OMRS and then freeze the DB to be used for all subsequent tests. Each test will start from scratch on the client side and read these events from the OMRS server.

Also thanks for the pointer.

bashir2 commented 4 years ago

Here is some information on how a simple version of this can work. There are three main pieces: (1) the input, i.e., OpenMRS, (2) the pipeline to read FHIR resources from the input and do transformations, (3) the output. For all scenarios, we need simple scripts that bring up the input server, run the pipeline, and validate the output. This should be run as part of Travis on each PR.

1) The input is OpenMRS for which we rely on what is done for issue #22, i.e., docker images for OpenMRS and MySQL with some test data pre-loaded. Note this is blocked by issue #38 right now.

2) There are three modes for the pipeline: batch, atom-feed based streaming, and debezium based streaming.

3) The output is either Parquet files or a target FHIR store.

We don't need to cover all combinations.

To test the streaming mode, we can create the events in the DB and save them in the docker image as the initial state. Each time the streaming pipeline runs in the test environment, it reads the events it has not processed (starting from a fake offset in the past) and creates the output.
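The three-piece flow above could be orchestrated by a small driver script. Everything in this sketch is a placeholder — the compose file, pipeline jar, flags, and validation script are invented for illustration; the injectable `run` parameter just makes the driver testable without real services.

```python
import subprocess

def run_e2e(run=subprocess.run):
    """Run the three e2e steps in order; returns the number of steps run.

    Each command here is a stand-in: (1) bring up the OpenMRS + MySQL
    images, (2) run the pipeline in one mode, (3) validate the output
    (Parquet files or sink FHIR store).
    """
    steps = [
        ["docker-compose", "up", "-d"],                    # (1) input
        ["java", "-jar", "pipeline.jar", "--mode=batch"],  # (2) pipeline
        ["python", "validate_output.py"],                  # (3) output check
    ]
    for cmd in steps:
        run(cmd, check=True)  # fail fast if any step exits non-zero
    return len(steps)
```

On Travis, a wrapper like this would run once per pipeline mode under test, against the same frozen input image.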

bashir2 commented 3 years ago

@jecihjoy I am changing the scope of this issue to be for the batch mode only. I will create a separate one for the streaming mode (seems @gitcliff has free cycles to work on that). Please close this once PR #56 is submitted (and thanks).

bashir2 commented 3 years ago

Fixed by #56