philloooo closed this issue 5 years ago
➤ Yan Liu commented:
[~thanhnd]: Thanks for capturing this. One clarification on parallelization: the idea is that after sqoop ingests the graph DB, we may want to apply filtering per project so we can selectively generate indices for those projects. The other option is to ingest only the graph data for those projects, but that could get tricky, so for this POC we suggest applying the project selection after the full set of graph data is ingested. Spark will then process all data in parallel across the selected projects; no further parallelization per project is needed.
Yeah, that makes more sense; we can add filter syntax to the ETL mapping.
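The approach above (ingest everything first, filter per project afterwards) can be sketched in plain Python. The record shape, field names, and project codes here are hypothetical stand-ins for the sqoop-ingested graph data, not the actual GDC schema:

```python
# Sketch of post-ingestion project filtering, assuming each ingested
# record carries a "project_id" field. Names are illustrative only.

def filter_by_projects(records, selected_projects):
    """Keep only records belonging to the selected project(s)."""
    return [r for r in records if r.get("project_id") in selected_projects]

# Stand-in for the fully ingested graph data set.
ingested = [
    {"node_id": "n1", "project_id": "TCGA-BRCA"},
    {"node_id": "n2", "project_id": "TCGA-LUAD"},
    {"node_id": "n3", "project_id": "TCGA-BRCA"},
]

# Select one project after full ingestion; index generation would
# then run only over this filtered subset.
selected = filter_by_projects(ingested, {"TCGA-BRCA"})
print([r["node_id"] for r in selected])  # ['n1', 'n3']
```

In Spark this would be a `filter` over the ingested dataset rather than a list comprehension, but the selection logic is the same.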
➤ Thanh Nguyen commented:
After discussing with [~yanliu] and [~maxpmy], we defined the following list of requirements for an ETL that works with GDC: [~phillis.tt@gmail.com]
- The `case` index has a list of `file_id` values that refer to the ids in the `file` index. Conversely, the `file` index has a `case_id` that refers to the corresponding case in the `case` index.
- The `case` and `file` indices also contain nested objects for the entities to which `case` and `file` have a `1-to-1`, `1-to-many`, or `many-to-many` relation.
- For `cases` and `files` that have links to more than one project, we simply skip that link while traversing the graph.
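A minimal sketch of the index document shapes described above. The cross-reference fields (`case_id`, `file_id`) come from the requirements; all other field names and values are illustrative assumptions, not the actual GDC schema:

```python
# Hypothetical case-index document: holds a list of file_ids pointing
# into the file index, plus a nested object (here a 1-to-1 relation).
case_index_doc = {
    "case_id": "case-1",
    "project_id": "TCGA-BRCA",
    "file_id": ["file-1", "file-2"],        # refers to ids in the file index
    "demographic": {"gender": "female"},    # example nested 1-to-1 object
}

# Hypothetical file-index document: holds a case_id pointing back
# into the case index, plus its own nested object.
file_index_doc = {
    "file_id": "file-1",
    "case_id": "case-1",                    # refers back to the case index
    "metadata": {"data_format": "BAM"},     # example nested object
}

# The two indices must agree in both directions.
assert file_index_doc["file_id"] in case_index_doc["file_id"]
assert file_index_doc["case_id"] == case_index_doc["case_id"]
print("cross-references consistent")
```

The bidirectional references let a query resolve files from a case or the case from a file without re-walking the graph; links to more than one project are simply dropped during traversal, so each document ends up under exactly one project.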