uc-cdis / tube

ETL
Apache License 2.0

PXD-2226 ⁃ gather requirements from GDC for elastic search ETL #42

Closed philloooo closed 5 years ago

philloooo commented 5 years ago

➤ Thanh Nguyen commented:

After discussing with [~yanliu] and [~maxpmy], we defined a list of requirements for an ETL that works with GDC, as follows: [~phillis.tt@gmail.com]

philloooo commented 5 years ago

➤ Yan Liu commented:

[~thanhnd]: Thanks for capturing this. One clarification on parallelization: after Sqoop ingests the graph DB, we may want to filter by project so that we can selectively generate indices for only the selected project(s). The other option is to ingest only the graph data for those projects, but that could get tricky, so for this POC we suggest applying the project selection after the full set of graph data has been ingested. Spark will then process all the data across the selected projects in parallel. No further per-project parallelization is needed.
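To make the "ingest everything, then select projects" option concrete, here is a minimal sketch in plain Python. It is not tube's code and does not use Spark; the row layout, the `project_id` field name, and the helpers `select_projects` and `group_by_project` are all assumptions made for illustration. In a real job the list comprehension would presumably be a Spark filter over the ingested data.

```python
from collections import defaultdict

# Hypothetical rows standing in for graph nodes after a full ingest.
# The field names here are illustrative assumptions, not tube's schema.
INGESTED_NODES = [
    {"node_id": "c1", "project_id": "PROJ-A", "label": "case"},
    {"node_id": "c2", "project_id": "PROJ-B", "label": "case"},
    {"node_id": "c3", "project_id": "PROJ-C", "label": "case"},
    {"node_id": "s1", "project_id": "PROJ-A", "label": "sample"},
]


def select_projects(nodes, project_ids):
    """Keep only the nodes that belong to one of the selected projects."""
    selected = set(project_ids)
    return [node for node in nodes if node["project_id"] in selected]


def group_by_project(nodes):
    """Group nodes by project so each project's rows can be indexed together."""
    grouped = defaultdict(list)
    for node in nodes:
        grouped[node["project_id"]].append(node)
    return dict(grouped)


# The selection happens after the full ingest, not during it.
selected_nodes = select_projects(INGESTED_NODES, ["PROJ-A", "PROJ-B"])
nodes_by_project = group_by_project(selected_nodes)
```

The point of the sketch is only the ordering: the full data set is loaded first, and the project selection is a cheap filter applied afterwards, so the ingest step never needs to know which projects were requested.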

philloooo commented 5 years ago

Yeah, that makes more sense. We can add filter syntax to the ETL mapping.
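As a rough idea of what a filter in the mapping could look like, here is a sketch with the mapping written as a Python dict. The `filter` key, its `field`/`in` shape, and every other key shown are hypothetical; the actual mapping syntax was not decided in this thread.

```python
# Hypothetical mapping entry; none of these keys are confirmed tube syntax.
MAPPING = {
    "name": "case_index",
    "root": "case",
    "filter": {"field": "project_id", "in": ["PROJ-A", "PROJ-B"]},
}

# Hypothetical rows standing in for ingested graph data.
ROWS = [
    {"node_id": "c1", "project_id": "PROJ-A"},
    {"node_id": "c2", "project_id": "PROJ-C"},
    {"node_id": "c3", "project_id": "PROJ-B"},
]


def row_matches(row, row_filter):
    """Return True if the row passes the filter; no filter means keep all rows."""
    if row_filter is None:
        return True
    return row.get(row_filter["field"]) in row_filter["in"]


def apply_mapping_filter(rows, mapping):
    """Apply the mapping's optional filter clause to the ingested rows."""
    row_filter = mapping.get("filter")
    return [row for row in rows if row_matches(row, row_filter)]


filtered_rows = apply_mapping_filter(ROWS, MAPPING)
```

Keeping the filter optional means existing mappings without a `filter` key would keep indexing every project.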