spe-uob / 2020-HealthcareLake

A reasonably secure data lake for healthcare analytics
MIT License

Implementing other data transfer protocols #103

Closed: victorkingi closed this issue 3 years ago

victorkingi commented 3 years ago

Describe the issue

As part of the data simulator to-do, we were to simulate at least four modes of data transport, including scheduled SFTP, message broker technology, HTTPS, and data federation. HTTPS is sorted, since we have sent POST requests successfully; we now need a public key and login details for SFTP. We are still researching data federation and message brokers, so those requirements will be added by updating this issue.
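For context, the HTTPS mode we have working is essentially a POST of a FHIR resource to the ingestion endpoint. A minimal sketch in Python (the endpoint URL and payload here are illustrative, not the actual spec):

```python
import requests

# Hypothetical ingestion endpoint; the real URL comes from the OpenAPI spec.
API_URL = "https://api.example-lake.example.com/fhir/Patient"

# Minimal FHIR Patient resource used as a test payload.
patient = {
    "resourceType": "Patient",
    "name": [{"family": "Doe", "given": ["Jane"]}],
    "gender": "female",
    "birthDate": "1980-01-01",
}

resp = requests.post(API_URL, json=patient, timeout=10)
resp.raise_for_status()  # fail loudly if the record is rejected
print(resp.status_code)
```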


joekendal commented 3 years ago

Option 1: AWS Transfer Family

https://aws.amazon.com/aws-transfer-family/

[Diagram: how AWS Transfer Family works with S3 and EFS]
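If we go this route, standing up an SFTP endpoint in front of a staging bucket would look roughly like this with boto3 (the IAM role, bucket, and user name are placeholders, not decided values):

```python
import boto3

transfer = boto3.client("transfer")

# Create a service-managed SFTP server backed by S3.
server = transfer.create_server(
    Domain="S3",
    Protocols=["SFTP"],
    IdentityProviderType="SERVICE_MANAGED",
)

# Register the simulator as a user; this is where your public key goes.
transfer.create_user(
    ServerId=server["ServerId"],
    UserName="data-simulator",                      # placeholder
    Role="arn:aws:iam::123456789012:role/sftp-s3",  # placeholder IAM role
    HomeDirectory="/healthcare-lake-staging/sftp",  # placeholder bucket/prefix
    SshPublicKeyBody="ssh-rsa AAAA...",             # the key you'd send us
)
```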

Option 2: AWS DataSync

https://aws.amazon.com/datasync/

[Diagram: how AWS DataSync works with S3, EFS, and FSx]
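DataSync is more of a bulk/scheduled transfer tool. Wiring an on-prem share to the staging bucket would look roughly like this (boto3 sketch; the agent ARN, hostname, bucket, and roles are all placeholders):

```python
import boto3

datasync = boto3.client("datasync")

# Source: an on-prem NFS share, reached through a deployed DataSync agent.
src = datasync.create_location_nfs(
    ServerHostname="files.hospital.internal",  # placeholder
    Subdirectory="/exports/ehr-archive",
    OnPremConfig={
        "AgentArns": ["arn:aws:datasync:eu-west-1:123456789012:agent/agent-0123"]
    },
)

# Destination: the lake's staging bucket.
dst = datasync.create_location_s3(
    S3BucketArn="arn:aws:s3:::healthcare-lake-staging",  # placeholder
    Subdirectory="/datasync",
    S3Config={"BucketAccessRoleArn": "arn:aws:iam::123456789012:role/datasync-s3"},
)

# One task ties them together; executions can be run on a schedule.
task = datasync.create_task(
    SourceLocationArn=src["LocationArn"],
    DestinationLocationArn=dst["LocationArn"],
    Name="historical-ehr-import",
)
datasync.start_task_execution(TaskArn=task["TaskArn"])
```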

joekendal commented 3 years ago

We were given a spec from the client for our data ingestion, and it used a RESTful OpenAPI design. That makes sense for EHRs as they are created. For historical data, we are better off using one of these two options; for real-time analytics, we could use a Kinesis Data Firehose.

[Diagram: Amazon Kinesis Data Firehose product overview]
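On our side that would be roughly one delivery stream pointed at the lake bucket. A boto3 sketch (the stream name, role, and bucket ARNs are placeholders):

```python
import boto3

firehose = boto3.client("firehose")

# A DirectPut stream: producers call PutRecord, Firehose batches into S3.
firehose.create_delivery_stream(
    DeliveryStreamName="ehr-realtime",  # placeholder
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-s3",  # placeholder
        "BucketARN": "arn:aws:s3:::healthcare-lake",              # placeholder
        "Prefix": "realtime/",
        "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 60},
    },
)
```
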
victorkingi commented 3 years ago

Ok, which one of them supports both a message broker and SFTP? I was thinking of using RabbitMQ on my side, and you would use Amazon MQ with a consumer EC2 instance to support the message broker system. How would that look?
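Roughly what I had in mind for the publisher side, assuming Amazon MQ with the RabbitMQ engine (the broker endpoint, credentials, and queue name below are placeholders):

```python
import json
import ssl

import pika

# Amazon MQ RabbitMQ brokers expose AMQPS on port 5671; endpoint is a placeholder.
params = pika.ConnectionParameters(
    host="b-0123-abcd.mq.eu-west-1.amazonaws.com",
    port=5671,
    credentials=pika.PlainCredentials("simulator", "change-me"),  # placeholders
    ssl_options=pika.SSLOptions(ssl.create_default_context()),
)

with pika.BlockingConnection(params) as conn:
    channel = conn.channel()
    channel.queue_declare(queue="ehr-events", durable=True)  # placeholder queue
    channel.basic_publish(
        exchange="",
        routing_key="ehr-events",
        body=json.dumps({"resourceType": "Patient", "id": "example"}),
        properties=pika.BasicProperties(delivery_mode=2),  # persist the message
    )
```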


joekendal commented 3 years ago

Hi, from our perspective this is the first time we've encountered any of these requirements. We were given a spec that defined a RESTful API only. I believe these would be enhancement proposals unless they are hard requirements; if so, I would recommend we split our teams and work on this integration separately from our core sprint.

By the looks of it, we can add SFTP support in a future version by simply adding a new staging bucket. We would need to ensure our ETL process produces something consistent when that data is added to the lake; a sketch of what I mean is below.
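To make it concrete: an S3-triggered Lambda that normalises whatever lands in the SFTP staging bucket before it reaches the lake (the bucket names and the normalisation step are assumptions, not our actual ETL):

```python
import json
import urllib.parse

import boto3

s3 = boto3.client("s3")
LAKE_BUCKET = "healthcare-lake"  # placeholder

def handler(event, context):
    """Triggered by s3:ObjectCreated on the SFTP staging bucket."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        resource = json.loads(body)  # assume uploads are FHIR JSON documents

        # Placeholder normalisation: the real ETL would validate/flatten here
        # so SFTP-delivered records match what the REST API produces.
        resource.setdefault("meta", {})["source"] = "sftp"

        s3.put_object(
            Bucket=LAKE_BUCKET,
            Key=f"sftp/{key}",
            Body=json.dumps(resource).encode(),
        )
```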

In terms of a message broker, you are essentially talking about one of two things: FHIR messages sent to a server (which can just be the API we originally provided), or a streaming server we provide (like Kafka), which in our case would in fact be Kinesis.

Are these your requirements or ours? We can help you work in the same cloud environment if you want to build this stuff now, but I'm not aware that the client requested this of the data lake team.

victorkingi commented 3 years ago

Let me consult the team and get back to you

victorkingi commented 3 years ago

What would having a streaming server look like?

joekendal commented 3 years ago

@victorkingi https://docs.aws.amazon.com/firehose/latest/dev/basic-write.html
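Per that doc, the producer side is just writes to the delivery stream, e.g. (boto3 sketch; the stream name matches the placeholder from my earlier sketch):

```python
import json

import boto3

firehose = boto3.client("firehose")

# Each record is an opaque blob (up to 1,000 KiB); newline-delimiting JSON
# keeps the S3 objects Firehose produces easy to query later.
event = {"resourceType": "Observation", "id": "example"}
firehose.put_record(
    DeliveryStreamName="ehr-realtime",  # placeholder stream name
    Record={"Data": (json.dumps(event) + "\n").encode()},
)
```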

joekendal commented 3 years ago

also https://medium.com/faun/apache-kafka-vs-apache-kinesis-57a3d585ef78