philloooo closed this issue 5 years ago
➤ Yan Liu commented:
[~thanhnd]: Thanks for capturing this. One clarification on parallelization: the idea is that after sqoop ingests the graph DB, we may want to apply filtering per project so we can selectively generate indices for those projects. The other option is to ingest only the graph data for those projects, but that could get tricky, so for this POC we suggest applying the project selection after the full set of graph data is ingested. Spark will then process all data in parallel across the selected projects; no further parallelization per project is needed.
Yeah, that makes more sense; we can add filter syntax to the ETL mapping.
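The approach above (ingest everything first, filter per project afterwards) can be sketched in plain Python. The record shape, field names, and project codes here are hypothetical stand-ins for the sqoop-ingested graph data, not the actual GDC schema:

```python
# Sketch of post-ingestion project filtering, assuming each ingested
# record carries a "project_id" field. Names are illustrative only.

def filter_by_projects(records, selected_projects):
    """Keep only records belonging to the selected project(s)."""
    return [r for r in records if r.get("project_id") in selected_projects]

# Stand-in for the fully ingested graph data set.
ingested = [
    {"node_id": "n1", "project_id": "TCGA-BRCA"},
    {"node_id": "n2", "project_id": "TCGA-LUAD"},
    {"node_id": "n3", "project_id": "TCGA-BRCA"},
]

# Select one project after full ingestion; index generation would
# then run only over this filtered subset.
selected = filter_by_projects(ingested, {"TCGA-BRCA"})
print([r["node_id"] for r in selected])  # ['n1', 'n3']
```

In Spark this would be a `filter` over the ingested dataset rather than a list comprehension, but the selection logic is the same.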
➤ Thanh Nguyen commented:
After discussing with [~yanliu] and [~maxpmy], we defined the following list of requirements for an ETL that works with GDC: [~phillis.tt@gmail.com]
- The `case` index has a list of `file_id` values that refer to the ids in the `file` index. Conversely, the `file` index has a `case_id` that refers to the corresponding case in the `case` index.
- The `case` and `file` indices also contain nested objects for the entities to which `case` and `file` have a `1-to-1`, `1-to-many`, or `many-to-many` relation.
- For `cases` and `files` that have links to more than one project, we simply skip that link while traversing the graph.
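A minimal sketch of the index document shapes described above. The cross-reference fields (`case_id`, `file_id`) come from the requirements; all other field names and values are illustrative assumptions, not the actual GDC schema:

```python
# Hypothetical case-index document: holds a list of file_ids pointing
# into the file index, plus a nested object (here a 1-to-1 relation).
case_index_doc = {
    "case_id": "case-1",
    "project_id": "TCGA-BRCA",
    "file_id": ["file-1", "file-2"],        # refers to ids in the file index
    "demographic": {"gender": "female"},    # example nested 1-to-1 object
}

# Hypothetical file-index document: holds a case_id pointing back
# into the case index, plus its own nested object.
file_index_doc = {
    "file_id": "file-1",
    "case_id": "case-1",                    # refers back to the case index
    "metadata": {"data_format": "BAM"},     # example nested object
}

# The two indices must agree in both directions.
assert file_index_doc["file_id"] in case_index_doc["file_id"]
assert file_index_doc["case_id"] == case_index_doc["case_id"]
print("cross-references consistent")
```

The bidirectional references let a query resolve files from a case or the case from a file without re-walking the graph; links to more than one project are simply dropped during traversal, so each document ends up under exactly one project.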