stackabletech / stackablectl

Commandline tool to interact with a Stackable Data Platform

GDPR right to be forgotten #153

Closed snocke closed 6 months ago

snocke commented 1 year ago

Scenario:

Event data from your e-commerce webshop is stored in HBase, while detail and master data are stored elsewhere. Common questions arise around this setup.

Analyzing the event data can therefore require an additional ETL job and use up more space. Since event data can get very large, you may need to aggregate it and thus lose detail. Another challenge with webshop event data arises from the GDPR.

A customer exercises their right to be forgotten and wants all of their data deleted. With HBase, you have access to individual rows, so a company can quickly find the customer, the events connected with the customer, and the master data. Customers won't store all their data in HBase, though, so they need to query different sources to get the big picture. This demo will try to connect event data stored in HBase with data stored in S3 to enrich the events (data federation).
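The federation step could look like the following Trino query. This is only a sketch: the `hive` catalog name, the `customers` table, and the `customer_name` column are assumptions for illustration, not part of the demo.

```sql
-- Hypothetical federation query: enrich events stored in HBase
-- (via the Phoenix connector) with master data stored in S3
-- (assumed to be exposed through a Hive catalog named "hive").
SELECT e.visitorid,
       e.event,
       e."timestamp",
       c.customer_name          -- assumed column in the S3 master data
FROM phoenix.demo_schema.t_events AS e
JOIN hive.demo_schema.customers AS c
  ON e.visitorid = c.visitorid;
```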

As of now, ACLs still need to be introduced on the HBase side, so we cannot execute the deletion query yet. This will become possible with a future enhancement of the demo. Another future development could be modeling the data as a data vault and applying data warehouse concepts.
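For reference, the deletion query we would eventually like to run (once ACLs are in place) might look like the following. This is a sketch that assumes the `t_events` table used in the demo and that the Trino Phoenix connector permits `DELETE` in this setup.

```sql
-- Hypothetical right-to-be-forgotten deletion for one customer,
-- identified by visitorid. Row-level access in HBase/Phoenix makes
-- this a targeted delete instead of a full-table rewrite.
DELETE FROM phoenix.demo_schema.t_events
WHERE visitorid = '964404';
```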

Tasks:

snocke commented 1 year ago

The configuration and connection between HBase, Phoenix, and Trino has been established. So far, the hbase-site.xml needs to be adjusted as follows:

<configuration>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.regionserver.wal.codec</name>
    <value>org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec</value>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zookeeper-server-default-0.zookeeper-server-default.gdpr.svc.cluster.local:2282/znode-d6933fb5-a660-4fe7-802e-7770877663b2</value>
  </property>
  <property>
    <name>phoenix.log.saltBuckets</name>
    <value>2</value>
  </property>
  <property>
    <name>phoenix.schema.isNamespaceMappingEnabled</name>
    <value>true</value>
  </property>
</configuration>

The trino-coordinator-default-catalog and trino-worker-default-catalog need the following adjustment:

data:
  phoenix.properties: |
    connector.name=phoenix5
    phoenix.config.resources=/stackable/config/catalog/phoenix/hbase-config/hbase-site.xml
    phoenix.connection-url=jdbc\:phoenix\:zookeeper-server-default-0.zookeeper-server-default.gdpr.svc.cluster.local\:2282\:/znode-d6933fb5-a660-4fe7-802e-7770877663b2/hbase
    unsupported-type-handling=CONVERT_TO_VARCHAR

The HBase discovery ConfigMap needs to be extended in the following way:

<configuration>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zookeeper-server-default-0.zookeeper-server-default.gdpr.svc.cluster.local:2282/znode-d6933fb5-a660-4fe7-802e-7770877663b2</value>
  </property>
  <property>
    <name>phoenix.schema.isNamespaceMappingEnabled</name>
    <value>true</value>
  </property>
</configuration>

However, with DBeaver we can't see actual data. The row count returns the correct number of rows, but the actual values are not displayed. This needs further investigation.

create schema "demo_schema";
SHOW SCHEMAS FROM phoenix;

set session phoenix.unsupported_type_handling='CONVERT_TO_VARCHAR';
USE phoenix.demo_schema;
CREATE TABLE t_events (
  hash_col varchar,
  event varchar,
  itemid varchar,
  "timestamp" varchar,
  transactionid varchar,
  visitorid varchar
)
WITH (
  rowkeys = 'hash_col',
  default_column_family = 'cf1'
);

describe t_events;
SHOW TABLES FROM phoenix.demo_schema;
show columns from t_events;

select *
from phoenix.demo_schema.t_events
where visitorid in ('964404');

drop table t_events;
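To narrow down the DBeaver issue, it might help to bypass the client and check whether values come through in the Trino CLI. This is a sketch, assuming the problem is in type mapping or client-side rendering rather than missing data:

```sql
-- Row count is reported correctly, so compare it against the raw values.
SELECT count(*) FROM phoenix.demo_schema.t_events;

-- Explicitly cast columns to varchar; if these show up in the Trino CLI
-- but not in DBeaver, the problem is on the client side.
SELECT CAST(visitorid AS varchar) AS visitorid,
       CAST(event AS varchar) AS event
FROM phoenix.demo_schema.t_events
LIMIT 10;
```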

snocke commented 1 year ago

For the release, the following issues need to be resolved:

https://github.com/stackabletech/trino-operator/issues/331
https://github.com/stackabletech/hbase-operator/issues/289
https://github.com/stackabletech/hbase-operator/issues/288

lfrancke commented 6 months ago

We have row-level deletes in the Trino Iceberg demo, which demonstrates what we wanted to achieve here. Closing.