As part of the partnership between OT and EuroPMC, we are feeding normalized co-occurrences back to EuroPMC with each release. Based on a meeting we had on 28th September, we agreed upon a schema (see notes: here).
I have implemented a first prototype in PySpark, which was used to generate a draft version of the data. This draft was then fed back to EuroPMC, who confirmed that the data looks OK and is ready to be ingested.
The PySpark implementation of the logic is here: gist
**Important:** the above implementation doesn't account for the required maximum number of rows (10k) per resulting partition.
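One way to address the missing partition cap is to derive the partition count from the total row count before writing. The sketch below is only an illustrative approach, not the gist's actual code; `MAX_ROWS_PER_PARTITION` and `partition_count` are names invented here, and the commented PySpark call assumes a standard `repartition` + `write` flow.

```python
import math

# 10k row cap per output file, as required by EuroPMC (assumption:
# the cap applies per written partition file)
MAX_ROWS_PER_PARTITION = 10_000

def partition_count(total_rows: int, max_rows: int = MAX_ROWS_PER_PARTITION) -> int:
    """Smallest number of partitions that keeps each one at or under max_rows."""
    return max(1, math.ceil(total_rows / max_rows))

# In the PySpark job this count could feed a repartition before writing, e.g.:
#   n = partition_count(df.count())
#   df.repartition(n).write.json(output_path)
```

This keeps every output file under the 10k limit at the cost of one extra `count()` pass over the data.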
Tasks
[x] Implement logic in the ETL
[x] Decide on file name (EuroPMC_cooccurrence?)
[ ] Add logic to POS to make sure the data is exported. (Since this is for EBI internal use only, exposing it on FTP is sufficient — they will pick it up from /nfs/ftp/.... — so no BQ export or listing on the UI is required.)
[ ] Let the EPMC team know where the data can be picked up from.