practo / tipoca-stream

Near real time cloud native data pipeline in AWS (CDC+Sink). Hosts code for RedshiftSink. RDS to RedshiftSink Pipeline with masking and reloading support.
https://towardsdatascience.com/open-sourcing-tipoca-stream-f261cdcc3a13
Apache License 2.0
47 stars, 5 forks

Masking feature: regex pattern boolean keys #232

Closed by alok87 3 years ago

alok87 commented 3 years ago

Why? This keeps free-text columns masked while adding a boolean column that indicates what kind of value the free-text column contains.

What?

Masking Feature added

Regex Pattern Boolean Keys

Free-text columns can contain PII, so we do not unmask them, but we still want users to be able to run aggregate analysis on the non-PII data in them. With this feature, the user gets a boolean column stating whether a given text/regex is present anywhere in the free text.

For example: we add a boolean column favourite_quote_has_philosphy. If the value in column favourite_quote matches the regex 'life|time' (case insensitive), the value in the extra column favourite_quote_has_philosphy is true; otherwise it is false.

regex_pattern_boolean_keys:
    customers:
        favourite_quote:
            has_philosphy: 'life|time'
            has_text_funny: 'funny'
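
The mapping above can be read as: for each table, for each free-text column, add one boolean column named <column>_<key> whose value says whether the regex matches. Below is a minimal Python sketch of that reading, for illustration only. It is not the project's implementation; the function name boolean_columns and the dict form of the config are hypothetical, and treating a missing (NULL) value as false is an assumption.

```python
import re

# Hypothetical in-memory form of the regex_pattern_boolean_keys config above.
REGEX_PATTERN_BOOLEAN_KEYS = {
    "customers": {
        "favourite_quote": {
            "has_philosphy": "life|time",
            "has_text_funny": "funny",
        }
    }
}


def boolean_columns(table, row, config=REGEX_PATTERN_BOOLEAN_KEYS):
    """Return the extra boolean columns for one row of `table`.

    Each extra column is named <column>_<key> and is True when the
    regex matches anywhere in the free-text value (case insensitive).
    """
    extra = {}
    for column, patterns in config.get(table, {}).items():
        value = row.get(column)
        # Assumption: a missing/NULL free-text value yields False, not empty.
        text = "" if value is None else str(value)
        for key, pattern in patterns.items():
            matched = re.search(pattern, text, re.IGNORECASE) is not None
            extra[f"{column}_{key}"] = matched
    return extra


if __name__ == "__main__":
    row = {"id": 1, "favourite_quote": "Life is what happens to you"}
    print(boolean_columns("customers", row))
```

The free-text value itself is never returned, only the derived booleans, which matches the intent of keeping the column masked while still allowing aggregate analysis.
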
alok87 commented 3 years ago

Bug: the boolean columns show empty values instead of false. (Screenshot 2021-05-21 at 3 25 25 PM)

alok87 commented 3 years ago

Testing in production.

alok87 commented 3 years ago

If length key columns already exist, they need to be recreated when their names are not in order.