Closed by d0choa 6 days ago
Wanted to put down a summary of the discussion we had in the @opentargets/be-team with @d0choa. Summary is (please add/correct me):

1. Remove the concern of backend formatting from the ETL and keep it in POS. If we change the backend stack, the ETL should not have to care: the ETL is for producing data, POS is for distributing it.
2. Start with a simple CLI utility that efficiently converts Parquet to JSON Lines. polars looks promising.
3. The first iteration should be easy to invoke from the existing POS, e.g. run with Docker or installed from PyPI.
4. Keep the utility expandable in the future.
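To make the CLI idea in the summary concrete, here is a minimal sketch of what such a utility could look like. It is only an illustration, not the agreed implementation: the program name `parquet2jsonl`, the function names, and the default `.jsonl` output path are all hypothetical, and it assumes a polars version that provides `scan_parquet` and `LazyFrame.sink_ndjson`.

```python
import argparse
from pathlib import Path
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Command-line interface; the name and arguments are illustrative only."""
    parser = argparse.ArgumentParser(
        prog="parquet2jsonl",  # hypothetical name, not an existing tool
        description="Convert a Parquet file to JSON Lines.",
    )
    parser.add_argument("input", type=Path, help="path to the Parquet file")
    parser.add_argument(
        "output",
        type=Path,
        nargs="?",
        default=None,
        help="output path (default: input path with a .jsonl suffix)",
    )
    return parser


def default_output(input_path: Path) -> Path:
    """Derive the output path by swapping the file suffix for .jsonl."""
    return input_path.with_suffix(".jsonl")


def convert(input_path: Path, output_path: Path) -> None:
    """Write the Parquet data as one JSON object per line."""
    # Imported lazily so that argument parsing works without polars installed.
    import polars as pl

    # Lazy scan + sink lets polars stream the data rather than
    # materialising the whole dataset in memory first.
    pl.scan_parquet(input_path).sink_ndjson(output_path)


def main(argv: Optional[Sequence[str]] = None) -> int:
    args = build_parser().parse_args(argv)
    output = args.output or default_output(args.input)
    convert(args.input, output)
    return 0
```

Keeping `main` as a plain function would let it be exposed as a console-script entry point in a PyPI package or called as the entry command of a Docker image, which matches the two invocation options mentioned in point 3. To run it directly as a script, the usual `if __name__ == "__main__": raise SystemExit(main())` guard would be added at the bottom.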
I think this summarizes it pretty well. Removing any data conversions from the ETL is a good separation of concerns, as:
We can see this as a core (ETL), with two outer layers, one input (PIS), and one output (POS).
Here's a start. I'll plug it into POS and see if we can get that to work.
Currently, all data generated by gentropy is Parquet-only. This is different from platform-etl, which produces both Parquet and JSON.
As of today, we use JSON to load data into OpenSearch. If we keep doing this and there is no way to load Parquet directly, we need to define a strategy: either handle it at the gentropy level, or build a format-conversion tool that can be plugged into the orchestration.