k2v-academy / K2View-Academy


Catalog - Discovery for different interfaces from different spaces #1092

Closed yBqdo2VLaCdftea1MqgSdtEhrPZtV5oJRr4eIUo closed 2 months ago

yBqdo2VLaCdftea1MqgSdtEhrPZtV5oJRr4eIUo commented 2 months ago

Hello - what is the best practice / recommendation for a workflow that involves Catalog Discovery and multiple interfaces?

Our customer has 5 big DB sources, and their implementation team is working on each DB source separately (dev1 running discovery on DB1, dev2 on DB2 and DB3, dev3 on DB4, etc.), doing regex tuning, PII configuration updates, and so on.

Everyone currently has an individual WebStudio space (with a relatively small RAM allocation to keep the load on their on-prem K8s server reasonable).

Once all these individual discoveries / implementations are done, they need to merge all their work into a single project baseline and then use this baseline to deploy to the Fabric execution server.

Current challenges / questions:


Best regards, Andrey

tZajFGR0CidT8AVERBHw8puD36HY6oWViykmIIb commented 2 months ago

Hi Andrey, If I understand correctly, you need to combine the artifacts from separate spaces rather than split them, since a different team works on each interface in a separate space. In any case, the catalog artifact (catalog_field_info.csv) is just a file: separate files can be combined manually into one single file, uploaded to one space, and deployed. Having said that, you can also create a space with more memory, which would allow you to define all the interfaces and run the discovery on them one by one.

Regarding the regex definitions - you cannot have separate rules for different interfaces. If that is the issue, running discovery in separate spaces with different rules seems like a good approach.
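For illustration only, here is a minimal Python sketch of the manual merge described above: concatenating per-space catalog_field_info.csv exports into one file while keeping a single header row. The paths are hypothetical, and the sketch assumes all exports share an identical header; verify both against your actual files.

```python
import csv
from pathlib import Path

# Hypothetical per-space exports -- adjust the paths to your project layout.
sources = [Path(f"space{i}/catalog_field_info.csv") for i in range(1, 6)]
target = Path("merged/catalog_field_info.csv")
target.parent.mkdir(parents=True, exist_ok=True)

header = None
with target.open("w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    for src in sources:
        with src.open(newline="", encoding="utf-8") as f:
            rows = csv.reader(f)
            first = next(rows)              # header row of this export
            if header is None:
                header = first
                writer.writerow(header)     # write the header only once
            elif first != header:
                raise ValueError(f"{src}: header differs from the first file")
            writer.writerows(rows)          # append this space's data rows

print(f"Merged {len(sources)} files into {target}")
```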

yBqdo2VLaCdftea1MqgSdtEhrPZtV5oJRr4eIUo commented 2 months ago

Hi Nataly, If we choose to merge the artifacts manually - do we need to do this for catalog_field_info.csv only, or are there other files we should be merging as well (profiling, metainfo, etc.)?

With the regex definitions - even if we run discovery in different spaces with different definitions (to isolate the config per interface), we still have to merge all 5 DB results into one single file (pii_profiling.csv, data_profiling.csv, etc.) before we deploy. So it looks like we need to make this config the same for all DBs somehow anyway, otherwise it will conflict on the Fabric execution server...
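As an illustrative aside, a manually merged file can be sanity-checked for exactly this kind of conflict before deploying. Below is a minimal Python sketch; the file paths and the key-column count are assumptions, so inspect the real headers to pick the columns that uniquely identify a field.

```python
import csv
from collections import defaultdict
from pathlib import Path

# Hypothetical per-space profiling exports; KEY_COLS = 3 is an assumption --
# choose the columns that actually identify a field in your schema.
KEY_COLS = 3
sources = [Path(f"space{i}/pii_profiling.csv") for i in range(1, 6)]

seen = defaultdict(set)  # field key -> set of full rows observed for it
for src in sources:
    with src.open(newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader)                        # skip the header row
        for row in reader:
            seen[tuple(row[:KEY_COLS])].add(tuple(row))

conflicts = {key: rows for key, rows in seen.items() if len(rows) > 1}
for key, rows in conflicts.items():
    print(f"CONFLICT for {key}:")
    for row in rows:
        print("   ", row)
if not conflicts:
    print("No conflicting rows across spaces.")
```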

One follow-up question - if we manage to create this aggregated artifact file manually with DB1, DB2, DB3, and DB4, and later dev4 has to re-run discovery for DB3 in his space - will it re-create the artifact file with DB3 data only (as dev4 doesn't have the other DBs' data in his graph DB), or will it update only the portion relevant to DB3?


Best regards, Andrey

tZajFGR0CidT8AVERBHw8puD36HY6oWViykmIIb commented 2 months ago

I understand that the context of your questions is sensitive data masking. The Masking Catalog actors use 2 files / MTables:

- catalog_field_info.csv
- catalog_classification_generators.csv

See the detailed logic in this article.

If you manually merge the artifacts from several environments into one, make sure that catalog_classification_generators.csv includes all Classifications with their respective settings (generator, unique, consistent...). Other files (pii_profiling.csv, data_profiling.csv, metadata_profiling.csv) are not used by the masking mechanism. They are only used by the Discovery job.
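As an illustration of that check, here is a small Python sketch that verifies every classification referenced in a merged catalog_field_info.csv has an entry in catalog_classification_generators.csv. The column name "classification" and the file paths are assumptions -- confirm them against the actual CSV headers.

```python
import csv
from pathlib import Path

# Column name below is an assumption -- check the real CSV headers first.
CLASSIFICATION_COL = "classification"
FIELD_INFO = Path("merged/catalog_field_info.csv")
GENERATORS = Path("merged/catalog_classification_generators.csv")

def distinct_values(path: Path, column: str) -> set:
    """Collect the distinct non-empty values of one named column."""
    with path.open(newline="", encoding="utf-8") as f:
        return {row[column] for row in csv.DictReader(f) if row.get(column)}

used = distinct_values(FIELD_INFO, CLASSIFICATION_COL)      # assigned to fields
defined = distinct_values(GENERATORS, CLASSIFICATION_COL)   # have generator settings

missing = used - defined
if missing:
    print("Classifications without generator settings:", sorted(missing))
else:
    print("All classifications in the merged catalog are covered.")
```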

On your second question: the build artifact operation works at the Catalog level, so it creates an artifact for all the interfaces in that space's Catalog. You can switch to an earlier Catalog version, and the created artifact will then reflect the Catalog of the selected version.
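Building on that answer: since a rebuild in dev4's space would produce an artifact covering only the interfaces in his Catalog (DB3), one hedged way to refresh the aggregated file is to splice the fresh DB3 rows over the stale ones. This sketch assumes both files share the same header and that some column (here called "interface") identifies the source interface -- both assumptions must be verified against the real files.

```python
import csv
from pathlib import Path

# Assumptions: both files share the same header, and a column named
# "interface" identifies the source DB. Verify against your actual files.
INTERFACE_COL = "interface"
INTERFACE = "DB3"
merged = Path("merged/catalog_field_info.csv")
rebuilt = Path("dev4_space/catalog_field_info.csv")   # DB3-only artifact
updated = Path("merged/catalog_field_info.updated.csv")

with merged.open(newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    header = reader.fieldnames
    kept = [r for r in reader if r[INTERFACE_COL] != INTERFACE]  # drop stale DB3 rows

with rebuilt.open(newline="", encoding="utf-8") as f:
    fresh = [r for r in csv.DictReader(f) if r[INTERFACE_COL] == INTERFACE]

with updated.open("w", newline="", encoding="utf-8") as out:
    writer = csv.DictWriter(out, fieldnames=header)
    writer.writeheader()
    writer.writerows(kept + fresh)

print(f"Replaced DB3 rows: {len(fresh)} fresh rows written to {updated}")
```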