As part of the partnership between OT and EuroPMC, we are feeding normalized co-occurrences back to EuroPMC with each release. Based on a meeting we had on 28th September, we agreed upon a schema (see notes: here).
I have implemented a first prototype in PySpark, which was used to generate a draft version of the data. This draft was then fed back to EuroPMC, who confirmed that the data looks OK and is ready to be ingested.
The PySpark implementation of the logic is here: gist
**Important:** the above implementation doesn't account for the required maximum number of rows (10k) per resulting partition.
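One way to address the missing partition cap is to derive the partition count from the total row count before writing. The sketch below is only an illustrative approach, not the gist's actual code; `MAX_ROWS_PER_PARTITION` and `partition_count` are names invented here, and the commented PySpark call assumes a standard `repartition` + `write` flow.

```python
import math

# 10k row cap per output file, as required by EuroPMC (assumption:
# the cap applies per written partition file)
MAX_ROWS_PER_PARTITION = 10_000

def partition_count(total_rows: int, max_rows: int = MAX_ROWS_PER_PARTITION) -> int:
    """Smallest number of partitions that keeps each one at or under max_rows."""
    return max(1, math.ceil(total_rows / max_rows))

# In the PySpark job this count could feed a repartition before writing, e.g.:
#   n = partition_count(df.count())
#   df.repartition(n).write.json(output_path)
```

This keeps every output file under the 10k limit at the cost of one extra `count()` pass over the data.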
Tasks
[x] Implement logic in the ETL
[x] Decide on file name (EuroPMC_cooccurrence?)
[ ] Add logic to POS to make sure the data is exported. (Since this is for EBI internal use only, exposing it on FTP is sufficient — they will pick it up from /nfs/ftp/.... — so no BQ export or listing on the UI is required.)
[ ] Let the EPMC team know where the data can be picked up from.