ian-lewis-d opened this issue 1 year ago
Thanks for the kind words. You'll have to specify "path", which has to contain the path to the SQLite file.
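For anyone else hitting this, here is a minimal sketch of where that might go, assuming the `path` key sits under the source `config` block alongside `database` (both the key placement and the file path below are assumptions; check the dbcat docs for your version):

```yaml
type: dbcat.datahub.CatalogSource
config:
  # Assumed placement of the "path" key mentioned above;
  # the file path itself is a hypothetical example.
  path: /path/to/piicatcher/catalog.db
  database: main
  source_names:
    - mands_prod_cat
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
```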
On Tue, 1 Nov 2022 at 21:29, Ian Lewis wrote:
Loving PiiCatcher but I'm having some trouble pushing PII data to Datahub.
I'm following the instructions here https://tokern.io/docs/catalog/export/#installation and here https://tokern.io/docs/catalog/export/#basic-recipe.
The problem seems to be when reading my SQLite database (which definitely has data).
Original Scan
$ piicatcher scan mysql --name mands_test_cat --database testdb --port 3306 --uri .rds.amazonaws.com --username --password ****
The SQLite file is called None and has the following structure:
None -> main -> alembic_version -> column_lineage -> columns -> default_schema -> job_executions -> jobs -> schemata -> sources -> sqlite_master -> tables -> tasks
This is my current export.yml file, which partially works (talks to Datahub as expected):

```yaml
type: dbcat.datahub.CatalogSource
config:
  database: main
  source_names:
    - mands_prod_cat
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
```
On running $ datahub ingest -c ./export.yml I receive the following error message.
sqlite3.OperationalError: no such table: sources
Do you have any suggestions for how I can approach this?
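One way to sanity-check what the catalog file actually contains is to list its tables with Python's built-in sqlite3 module (the filename below is a placeholder; point it at the actual catalog file):

```python
import sqlite3

# Placeholder filename; point this at the actual PiiCatcher catalog file.
catalog_file = "catalog.db"

conn = sqlite3.connect(catalog_file)
# sqlite_master is SQLite's built-in schema table; every user table is listed in it.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
conn.close()

print(tables)
```

If `sources` does not appear in the output, the ingest recipe is probably pointed at the wrong file, or at a file the scan never wrote to.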
Hi @vrajat,
My understanding from the Datahub engineering team is that they do not support piicatcher input.
Obviously, if you have a working solution it would be great to see it.
I am now able to run the `datahub ingest` command against the piicatcher SQLite db (using the dbcat.datahub.CatalogSource configuration).
However, the import to Datahub is unsuccessful and tags are not applied.
Here is an example of the errors I see:
[2022-11-24 09:40:47,171] INFO {datahub.ingestion.run.pipeline:174} - Sink configured successfully. DataHubRestEmitter: configured to talk to http://localhost:8080/
[2022-11-24 09:40:53,745] INFO {datahub.ingestion.run.pipeline:197} - Source configured successfully.
[2022-11-24 09:40:53,746] INFO {datahub.cli.ingest_cli:120} - Starting metadata ingestion
-[2022-11-24 09:40:53,985] ERROR {datahub.ingestion.run.pipeline:57} - failed to write record with workunit loan_management.account_holder with ('Unable to emit metadata to DataHub GMS', {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException', 'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status:422]: com.linkedin.metadata.entity.validation.ValidationException: Failed to validate record with class com.linkedin.entity.Entity: ERROR ::
/value/com.linkedin.metadata.snapshot.DatasetSnapshot/aspects/0/com.linkedin.schema.SchemaMetadata/fields/4/globalTags/tags/1/tag :: "Provided urn urn.li.tag.ADDRESS" is invalid\nERROR ::
/value/com.linkedin.metadata.snapshot.DatasetSnapshot/aspects/0/com.linkedin.schema.SchemaMetadata/fields/5/globalTags/tags/1/tag :: "Provided urn urn.li.tag.PERSON" is invalid\n', 'message':
'com.linkedin.metadata.entity.validation.ValidationException: Failed to validate record with class com.linkedin.entity.Entity:
ERROR :: /value/com.linkedin.metadata.snapshot.DatasetSnapshot/aspects/0/c', 'status': 422, 'id': 'urn:li:dataset:
(urn:li:dataPlatform:mysql,<snip>.account_holder,PROD)'}) and info {'exceptionClass':
'com.linkedin.restli.server.RestLiServiceException', 'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status:422]: com.linkedin.metadata.entity.validation.ValidationException:
....
....
{'error': 'Unable to emit metadata to DataHub GMS',
'info': {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException',
'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status:422]: '
'com.linkedin.metadata.entity.validation.ValidationException: Failed to validate record with class '
'com.linkedin.entity.Entity: ERROR :: '
'/value/com.linkedin.metadata.snapshot.DatasetSnapshot/aspects/0/com.linkedin.schema.SchemaMetadata/fields/3/globalTags/tags/1/tag '
':: "Provided urn urn.li.tag.PERSON" is invalid\n'
'\n'
'\tat com.linkedin.metadata.resources.entity.EntityResource.ingest(EntityResource.java:213)',
'message': 'com.linkedin.metadata.entity.validation.ValidationException: Failed to validate record with class '
'com.linkedin.entity.Entity: ERROR :: /value/com.linkedin.metadata.snapshot.DatasetSnapshot/aspects/0/c',
'status': 422,
'id': 'urn:li:dataset:(urn:li:dataPlatform:mysql,<snip>.editions,PROD)'}},
'... sampled of 87 total elements'],
'start_time': '2022-11-24 09:40:47.165725 (11.66 seconds ago).',
'current_time': '2022-11-24 09:40:58.824433 (now).',
'total_duration_in_seconds': '11.66',
'gms_version': 'v0.9.2',
'pending_requests': '0'}
Pipeline finished with at least 87 failures ; produced 181 events in 5.08 seconds.
It seems the errors are all similar to `"Provided urn urn.li.tag.ADDRESS" is invalid`.
What am I missing to get this to ingest?
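One thing that stands out in the logs: the rejected values use dots (`urn.li.tag.ADDRESS`) where DataHub tag URNs use colons (`urn:li:tag:ADDRESS`). As a hedged sketch, a post-processing step along these lines would rewrite the rejected values into the colon-separated form (whether the dots originate in dbcat or somewhere else in the pipeline is an open question):

```python
import re

def normalize_tag_urn(urn: str) -> str:
    """Rewrite dot-separated pseudo-URNs such as 'urn.li.tag.ADDRESS'
    into the colon-separated 'urn:li:tag:ADDRESS' form DataHub expects.
    Values that already look valid are returned unchanged."""
    m = re.fullmatch(r"urn\.li\.tag\.(.+)", urn)
    if m:
        return f"urn:li:tag:{m.group(1)}"
    return urn

print(normalize_tag_urn("urn.li.tag.ADDRESS"))  # -> urn:li:tag:ADDRESS
print(normalize_tag_urn("urn:li:tag:PERSON"))   # already valid, unchanged
```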
@yjagdale will you be able to help?
Sure @vrajat, will have a look at it!