maropu / spark-data-repair-plugin

Provide functionality to build statistical models to repair dirty tabular data in Spark
Apache License 2.0
12 stars 4 forks source link

Generate training data for key inferences via JDBC #2

Closed maropu closed 4 years ago

maropu commented 4 years ago

This ticket intends to generate training data (key or not, column name) for primary/foreign/unique key inferences from column names.

maropu commented 4 years ago

https://github.com/maropu/scavenger/commit/fd32dcffc874799a689d2e7068ba918f4a377dc3

maropu commented 4 years ago
postgres=# \d
          List of relations
 Schema |    Name    | Type  | Owner  
--------+------------+-------+--------
 public | department | table | maropu
 public | staff      | table | maropu
(2 rows)

postgres=# \d department
                    Table "public.department"
 Column |         Type          | Collation | Nullable | Default 
--------+-----------------------+-----------+----------+---------
 name   | character varying(10) |           | not null | 
Indexes:
    "department_pkey" PRIMARY KEY, btree (name)
Referenced by:
    TABLE "staff" CONSTRAINT "staff_depname_fkey" FOREIGN KEY (depname) REFERENCES department(name)

postgres=# \d staff
                        Table "public.staff"
  Column   |         Type          | Collation | Nullable | Default 
-----------+-----------------------+-----------+----------+---------
 id        | integer               |           |          | 
 staffname | character varying(10) |           |          | 
 depname   | character varying(10) |           |          | 
Foreign-key constraints:
    "staff_depname_fkey" FOREIGN KEY (depname) REFERENCES department(name)

>>> df = schemaspy.setDbName('postgres').setDriverName('postgresql').setProps('host=localhost,port=5432').catalogToDataFrame()
>>> df.show()
+----------+----------+-------+--------+------------+------------+
| tableName|columnName|   type|nullable|isPrimaryKey|isForeignKey|
+----------+----------+-------+--------+------------+------------+
|department|      name|varchar|    true|        true|       false|
|     staff|        id|    int|   false|       false|       false|
|     staff| staffname|varchar|   false|       false|       false|
|     staff|   depname|varchar|   false|       false|        true|
+----------+----------+-------+--------+------------+------------+