opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Scope gentropy POS loading strategy (JSON vs Parquet) #3470

Open d0choa opened 1 week ago

d0choa commented 1 week ago

Currently, all data generated by gentropy is parquet-only. This is different from platform-etl, which produces both Parquet and JSON.

As of today, we load data into OpenSearch from JSON. If we keep doing this and there is no way to load Parquet directly, we need to define a strategy: either handle the conversion at the gentropy level, or build a format-conversion tool that can be plugged into the orchestration.
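For context on why JSON is convenient here: the OpenSearch `_bulk` endpoint takes a newline-delimited body in which each document line is preceded by an action line, and the body ends with a newline. The sketch below shows that mapping from JSON lines to a bulk body. It is only an illustration and not how POS actually loads data; the function name `to_bulk_lines` and the index name `target_demo` are made up for the example.

```python
import json
from typing import Iterable, Iterator


def to_bulk_lines(json_lines: Iterable[str], index: str) -> Iterator[str]:
    """Yield _bulk body lines: one action line, then the document line."""
    # The same "index" action is reused for every document.
    action = json.dumps({"index": {"_index": index}})
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        json.loads(line)  # fail early if a line is not valid JSON
        yield action
        yield line


# Hypothetical documents standing in for one JSON-lines file.
docs = ['{"id": "ENSG1", "symbol": "A"}', "", '{"id": "ENSG2", "symbol": "B"}']

# The bulk body must be terminated by a newline.
body = "\n".join(to_bulk_lines(docs, index="target_demo")) + "\n"
```

Each input line stays one document, which is why a plain parquet-to-JSON-lines conversion is enough to feed this kind of loader.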

jdhayhurst commented 3 days ago

Wanted to put down a summary of the discussion we had in the @opentargets/be-team with @d0choa. Summary is (please add/correct me):

  1. Remove the concern of backend formatting from the ETL and keep it in POS. If we change the backend stack, the ETL should not have to care. ETL should be for data, POS for distributing the data.
  2. Start with a simple utility CLI program that efficiently takes parquet and converts it to JSON lines. polars looks promising.
  3. The first iteration should be easy to invoke from the existing POS, e.g. run with docker or installable through PyPI.
  4. Keep it in scope to make this utility app expandable in the future.
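A minimal sketch of what the utility in point 2 might look like, assuming polars is installed. The function names, the `.jsonl` suffix, and the two positional arguments are assumptions for illustration, not an agreed interface. `pl.read_parquet` and `DataFrame.write_ndjson` are existing polars calls; polars is imported lazily so the argument parsing can be exercised without it.

```python
import argparse
from pathlib import Path


def output_path_for(parquet_path: Path, out_dir: Path) -> Path:
    """Map e.g. in/part-0000.parquet to <out_dir>/part-0000.jsonl."""
    return out_dir / (parquet_path.stem + ".jsonl")


def convert_file(parquet_path: Path, out_path: Path) -> None:
    """Convert one parquet file to JSON lines using polars."""
    import polars as pl  # lazy import: only needed when converting

    out_path.parent.mkdir(parents=True, exist_ok=True)
    # Reads the whole file into memory before writing; a lazy scan
    # could be swapped in if the inputs are too large for that.
    pl.read_parquet(parquet_path).write_ndjson(out_path)


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Convert a directory of parquet files to JSON lines"
    )
    parser.add_argument("input_dir", type=Path)
    parser.add_argument("output_dir", type=Path)
    return parser


def main(argv=None) -> int:
    args = build_parser().parse_args(argv)
    # One output file per input file, so part files stay aligned.
    for parquet_path in sorted(args.input_dir.glob("*.parquet")):
        convert_file(parquet_path, output_path_for(parquet_path, args.output_dir))
    return 0
```

Calling `main(["some/input/dir", "some/output/dir"])` (hypothetical paths) would convert every `*.parquet` file in the input directory. Wrapping `main` in a console entry point or a docker image would cover point 3.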

javfg commented 2 days ago

I think this summarizes it pretty well. Removing any data conversions from ETL is good separation of concerns, as:

  1. PIS grabs a bunch of data from many sources and puts them in a specific place
  2. ETL generates our dataset from the source at that specific place, and puts it in another specific place, in a standard format (parquet)
  3. POS distributes that dataset to the many targets that need it, converting it if needed

We can see this as a core (ETL), with two outer layers, one input (PIS), and one output (POS).
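One way to picture the output layer is a small lookup in POS: the ETL always emits parquet, and POS converts only when a target asks for something else. The sketch below is purely illustrative. `Target`, `CONVERTERS`, and `plan_delivery` are hypothetical names, and the converter is a placeholder that only computes the output path rather than converting any data.

```python
from dataclasses import dataclass
from typing import Callable, Dict

CANONICAL_FORMAT = "parquet"  # the single format the ETL emits


@dataclass(frozen=True)
class Target:
    name: str
    wanted_format: str


# Hypothetical registry: format name -> function returning the converted path.
# The entry here is a placeholder that only derives the new file name.
CONVERTERS: Dict[str, Callable[[str], str]] = {
    "jsonl": lambda path: path.rsplit(".", 1)[0] + ".jsonl",
}


def plan_delivery(dataset_path: str, target: Target) -> str:
    """Return the path POS would ship to the target, converting if needed."""
    if target.wanted_format == CANONICAL_FORMAT:
        return dataset_path  # no conversion needed
    if target.wanted_format not in CONVERTERS:
        raise ValueError(f"no converter for format {target.wanted_format!r}")
    return CONVERTERS[target.wanted_format](dataset_path)
```

Adding a new output format then means registering one more converter in POS, with no change to the ETL.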

jdhayhurst commented 2 days ago

Here's a start. I'll plug it into POS and see if we can get that to work.