tokern / piicatcher

Scan databases and data warehouses for PII data. Tag tables and columns in data catalogs like Amundsen and Datahub
https://tokern.io/piicatcher/
Apache License 2.0

Unclear example of export to datahub #198

Open ian-lewis-d opened 1 year ago

ian-lewis-d commented 1 year ago

Loving PiiCatcher but I'm having some trouble pushing PII data to Datahub.

I'm following the instructions here https://tokern.io/docs/catalog/export/#installation and here https://tokern.io/docs/catalog/export/#basic-recipe.

The problem seems to occur when reading my SQLite database (which definitely has data).

Original Scan

$ piicatcher scan mysql --name mands_test_cat --database testdb --port 3306 --uri ***.rds.amazonaws.com --username *** --password ****

The SQLite file is called None and has the following structure:

None
  -> main
          -> alembic_version
          -> column_lineage
          -> columns
          -> default_schema
          -> job_executions
          -> jobs
          -> schemata
          -> sources
          -> sqlite_master
          -> tables
          -> tasks
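
For reference, a listing like the one above can be produced with the sqlite3 command-line shell (the filename is simply whatever file the scan wrote, here literally None):

$ sqlite3 None ".tables"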

This is my current export.yml file, which partially works (it talks to Datahub as expected):

source:
  type: dbcat.datahub.CatalogSource
  config:
    database: main
    source_names:
      - mands_test_cat
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"

On running $ datahub ingest -c ./export.yml, I receive the following error message:

sqlite3.OperationalError: no such table: sources

Do you have any suggestions for how I can approach this?

vrajat commented 1 year ago

Thanks for the kind words. You’ll have to specify “path” which has to contain the path to the SQLite file.
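
For illustration, the recipe with that key added might look like the following (the placement of path under config is an assumption based on the note above, and the file location is a placeholder for wherever the scan wrote its SQLite catalog):

source:
  type: dbcat.datahub.CatalogSource
  config:
    path: /path/to/catalog.db   # placeholder: the SQLite file the piicatcher scan wrote
    database: main
    source_names:
      - mands_test_cat
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"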

ian-lewis-d commented 1 year ago

Hi @vrajat,

My understanding from the Datahub engineering team is that they do not support piicatcher input.

Obviously, if you have a working solution it would be great to see it.

I am now able to run the datahub import command against the piicatcher SQLite db (using the dbcat.datahub.CatalogSource configuration).

However, the import to Datahub is unsuccessful and tags are not applied.

Here is an example of the errors I see:


[2022-11-24 09:40:47,171] INFO     {datahub.ingestion.run.pipeline:174} - Sink configured successfully. DataHubRestEmitter: configured to talk to http://localhost:8080/
[2022-11-24 09:40:53,745] INFO     {datahub.ingestion.run.pipeline:197} - Source configured successfully.
[2022-11-24 09:40:53,746] INFO     {datahub.cli.ingest_cli:120} - Starting metadata ingestion
-[2022-11-24 09:40:53,985] ERROR    {datahub.ingestion.run.pipeline:57} -  failed to write record with workunit loan_management.account_holder with ('Unable to emit metadata to DataHub GMS', {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException', 'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status:422]: com.linkedin.metadata.entity.validation.ValidationException: Failed to validate record with class com.linkedin.entity.Entity: ERROR :: 
/value/com.linkedin.metadata.snapshot.DatasetSnapshot/aspects/0/com.linkedin.schema.SchemaMetadata/fields/4/globalTags/tags/1/tag :: "Provided urn urn.li.tag.ADDRESS" is invalid\nERROR :: 
/value/com.linkedin.metadata.snapshot.DatasetSnapshot/aspects/0/com.linkedin.schema.SchemaMetadata/fields/5/globalTags/tags/1/tag :: "Provided urn urn.li.tag.PERSON" is invalid\n', 'message': 
'com.linkedin.metadata.entity.validation.ValidationException: Failed to validate record with class com.linkedin.entity.Entity: 
ERROR :: /value/com.linkedin.metadata.snapshot.DatasetSnapshot/aspects/0/c', 'status': 422, 'id': 'urn:li:dataset:
(urn:li:dataPlatform:mysql,<snip>.account_holder,PROD)'}) and info {'exceptionClass': 
'com.linkedin.restli.server.RestLiServiceException', 'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status:422]: com.linkedin.metadata.entity.validation.ValidationException:
....
....
              {'error': 'Unable to emit metadata to DataHub GMS',
               'info': {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException',
                        'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status:422]: '
                                      'com.linkedin.metadata.entity.validation.ValidationException: Failed to validate record with class '
                                      'com.linkedin.entity.Entity: ERROR :: '
                                      '/value/com.linkedin.metadata.snapshot.DatasetSnapshot/aspects/0/com.linkedin.schema.SchemaMetadata/fields/3/globalTags/tags/1/tag '
                                      ':: "Provided urn urn.li.tag.PERSON" is invalid\n'
                                      '\n'
                                      '\tat com.linkedin.metadata.resources.entity.EntityResource.ingest(EntityResource.java:213)',
                        'message': 'com.linkedin.metadata.entity.validation.ValidationException: Failed to validate record with class '
                                   'com.linkedin.entity.Entity: ERROR :: /value/com.linkedin.metadata.snapshot.DatasetSnapshot/aspects/0/c',
                        'status': 422,
                        'id': 'urn:li:dataset:(urn:li:dataPlatform:mysql,<snip>.editions,PROD)'}},
              '... sampled of 87 total elements'],
 'start_time': '2022-11-24 09:40:47.165725 (11.66 seconds ago).',
 'current_time': '2022-11-24 09:40:58.824433 (now).',
 'total_duration_in_seconds': '11.66',
 'gms_version': 'v0.9.2',
 'pending_requests': '0'}

 Pipeline finished with at least 87 failures ; produced 181 events in 5.08 seconds.
It seems the errors are all similar to "Provided urn urn.li.tag.ADDRESS" is invalid.
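
For what it's worth, DataHub tag URNs are colon-delimited, so the dot-delimited form in these errors would be rejected by GMS as malformed regardless of the tag name (names taken from the log above):

urn.li.tag.ADDRESS    (form appearing in the errors; rejected)
urn:li:tag:ADDRESS    (expected DataHub tag URN format)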

What am I missing to get this to ingest?
vrajat commented 1 year ago

@yjagdale will you be able to help?

yjagdale commented 1 year ago

Sure @vrajat, will have a look at it!