teragrep / cfe_39

HDFS Data Ingestion for PTH_06 use
GNU Affero General Public License v3.0

Implement datasource #8

Open Tiihott opened 6 months ago

Tiihott commented 6 months ago

Description: Implement the datasource in pth_06 for querying the semi-latest data from HDFS.

Use case or motivation behind the feature request: To actually use the semi-latest data collected in HDFS, the datasource must be implemented in pth_06 for querying.

Related issues: Part of issue #2.

Tiihott commented 6 months ago

The datasource will be based on the v3migration branch of pth_06, which uses Spark 3.4.0 instead of the older Spark 2.4.5.

Tiihott commented 5 months ago

Initial design draft written and reviewed. The draft was updated according to the review comments and is now waiting for a second round of review. Meanwhile, implementing in pth_06 the class structures that are common to all the data sources.

Tiihott commented 5 months ago

Design draft reviewed and approved. Continuing on to implement the HDFS datasource-specific code in the pth_06 planner and scheduler according to the design draft.

Tiihott commented 5 months ago

HDFS file metadata querying with Spark query conditions (topic filtering) implemented in the HDFS datasource planner in pth_06. Also implemented the mapping of Kafka consumer start offsets from HDFS files, for use by the Kafka datasource planner. Work remains on the planner and scheduler so that the queried HDFS metadata is passed to the tasker properly.
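A minimal sketch of the idea behind the topic filtering and start offset mapping, not the actual pth_06 code. It assumes a hypothetical file layout of `<topic>/<partition>.<lastOffset>` (the thread does not state the real layout), and the class name `HdfsOffsetSketch`, the method `startOffsets`, and the `topic-partition` key format are all illustrative. File paths are passed in as plain strings so that the example runs without a Hadoop or Kafka dependency.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

/** Illustrative only: derives Kafka start offsets from HDFS file names. */
public final class HdfsOffsetSketch {

    private HdfsOffsetSketch() {
    }

    /**
     * Keeps the files whose topic directory matches topicFilter and returns,
     * per "topic-partition" key, the offset that follows the highest offset
     * already stored in HDFS.
     *
     * ASSUMPTION: paths look like ".../<topic>/<partition>.<lastOffset>".
     * Paths that do not fit this layout are skipped.
     */
    public static Map<String, Long> startOffsets(List<String> filePaths, Pattern topicFilter) {
        Map<String, Long> nextOffsets = new TreeMap<>();
        for (String path : filePaths) {
            int slash = path.lastIndexOf('/');
            int dot = path.lastIndexOf('.');
            if (slash <= 0 || dot <= slash + 1 || dot == path.length() - 1) {
                continue; // not in the assumed layout
            }
            // The topic is the name of the directory that contains the file.
            int previousSlash = path.lastIndexOf('/', slash - 1);
            String topic = path.substring(previousSlash + 1, slash);
            if (!topicFilter.matcher(topic).matches()) {
                continue; // filtered out by the query condition
            }
            int partition;
            long lastOffset;
            try {
                partition = Integer.parseInt(path.substring(slash + 1, dot));
                lastOffset = Long.parseLong(path.substring(dot + 1));
            } catch (NumberFormatException e) {
                continue; // file name is not numeric, skip it
            }
            // A Kafka consumer would resume one past the last offset held in HDFS.
            nextOffsets.merge(topic + "-" + partition, lastOffset + 1, Long::max);
        }
        return nextOffsets;
    }
}
```

Under these assumptions, if partition 0 of a topic has the files `0.9` and `0.13`, the mapped start offset for that partition is 14, so a Kafka-side planner would not re-read records that are already available from HDFS.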

Tiihott commented 4 months ago

Initial HDFS planner/scheduler setup implemented in pth_06. Now continuing on to implement an HDFS test environment in pth_06, similar to the one used in cfe_39 for HDFS write/read/pruning testing.

Tiihott commented 4 months ago

Test environment for HDFS implemented successfully in pth_06 after solving several gson and Hadoop dependency conflicts. Also, the datasource implementation must be rebased onto the pth_06 object-refactoring branch instead of the v3migration branch. The object-refactoring branch is the up-to-date branch among the Spark v3 branches.

Tiihott commented 4 months ago

Rebase from the older v3migration branch to the newer object-refactoring branch in progress. The rebase requires changes in the HDFS query processor and ArchiveMicroStreamReader, as several classes between the query processors and ArchiveMicroStreamReader have been removed in the new branch.