ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0
5.27k stars 595 forks source link

feat: a lot of tables listed by default in Postgres #8763

Closed lostmygithubaccount closed 7 months ago

lostmygithubaccount commented 7 months ago

Is your feature request related to a problem?

a lot (247) of tables are listed in con.list_tables() for a default postgres connection after running just up postgres:

[ins] In [1]: import ibis

[ins] In [2]: con = ibis.postgres.connect(host="localhost", user="postgres", password="postgres", database="ibis_testing")

[ins] In [3]: len(con.list_tables())
Out[3]: 247

most of these seem to be system tables (though I'm not that familiar w/ Postgres details):

See all tables ```python [ins] In [4]: con.list_tables() Out[4]: ['_pg_foreign_data_wrappers', '_pg_foreign_servers', '_pg_foreign_table_columns', '_pg_foreign_tables', '_pg_user_mappings', 'addr', 'addrfeat', 'administrable_role_authorizations', 'applicable_roles', 'attributes', 'bg', 'character_sets', 'check_constraint_routine_usage', 'check_constraints', 'collation_character_set_applicability', 'collations', 'column_column_usage', 'column_domain_usage', 'column_options', 'column_privileges', 'column_udt_usage', 'columns', 'constraint_column_usage', 'constraint_table_usage', 'county', 'county_lookup', 'countysub_lookup', 'cousub', 'data_type_privileges', 'direction_lookup', 'domain_constraints', 'domain_udt_usage', 'domains', 'edges', 'element_types', 'enabled_roles', 'faces', 'featnames', 'foreign_data_wrapper_options', 'foreign_data_wrappers', 'foreign_server_options', 'foreign_servers', 'foreign_table_options', 'foreign_tables', 'geocode_settings', 'geocode_settings_default', 'geography_columns', 'geometry_columns', 'information_schema_catalog_name', 'key_column_usage', 'layer', 'loader_lookuptables', 'loader_platform', 'loader_variables', 'pagc_gaz', 'pagc_lex', 'pagc_rules', 'parameters', 'pg_aggregate', 'pg_am', 'pg_amop', 'pg_amproc', 'pg_attrdef', 'pg_attribute', 'pg_auth_members', 'pg_authid', 'pg_available_extension_versions', 'pg_available_extensions', 'pg_backend_memory_contexts', 'pg_cast', 'pg_class', 'pg_collation', 'pg_config', 'pg_constraint', 'pg_conversion', 'pg_cursors', 'pg_database', 'pg_db_role_setting', 'pg_default_acl', 'pg_depend', 'pg_description', 'pg_enum', 'pg_event_trigger', 'pg_extension', 'pg_file_settings', 'pg_foreign_data_wrapper', 'pg_foreign_server', 'pg_foreign_table', 'pg_group', 'pg_hba_file_rules', 'pg_ident_file_mappings', 'pg_index', 'pg_indexes', 'pg_inherits', 'pg_init_privs', 'pg_language', 'pg_largeobject', 'pg_largeobject_metadata', 'pg_locks', 'pg_matviews', 'pg_namespace', 'pg_opclass', 'pg_operator', 'pg_opfamily', 'pg_parameter_acl', 'pg_partitioned_table', 'pg_policies', 'pg_policy', 'pg_prepared_statements', 'pg_prepared_xacts', 'pg_proc', 'pg_publication', 'pg_publication_namespace', 'pg_publication_rel', 'pg_publication_tables', 'pg_range', 'pg_replication_origin', 'pg_replication_origin_status', 'pg_replication_slots', 'pg_rewrite', 'pg_roles', 'pg_rules', 'pg_seclabel', 'pg_seclabels', 'pg_sequence', 'pg_sequences', 'pg_settings', 'pg_shadow', 'pg_shdepend', 'pg_shdescription', 'pg_shmem_allocations', 'pg_shseclabel', 'pg_stat_activity', 'pg_stat_all_indexes', 'pg_stat_all_tables', 'pg_stat_archiver', 'pg_stat_bgwriter', 'pg_stat_database', 'pg_stat_database_conflicts', 'pg_stat_gssapi', 'pg_stat_progress_analyze', 'pg_stat_progress_basebackup', 'pg_stat_progress_cluster', 'pg_stat_progress_copy', 'pg_stat_progress_create_index', 'pg_stat_progress_vacuum', 'pg_stat_recovery_prefetch', 'pg_stat_replication', 'pg_stat_replication_slots', 'pg_stat_slru', 'pg_stat_ssl', 'pg_stat_subscription', 'pg_stat_subscription_stats', 'pg_stat_sys_indexes', 'pg_stat_sys_tables', 'pg_stat_user_functions', 'pg_stat_user_indexes', 'pg_stat_user_tables', 'pg_stat_wal', 'pg_stat_wal_receiver', 'pg_stat_xact_all_tables', 'pg_stat_xact_sys_tables', 'pg_stat_xact_user_functions', 'pg_stat_xact_user_tables', 'pg_statio_all_indexes', 'pg_statio_all_sequences', 'pg_statio_all_tables', 'pg_statio_sys_indexes', 'pg_statio_sys_sequences', 'pg_statio_sys_tables', 'pg_statio_user_indexes', 'pg_statio_user_sequences', 'pg_statio_user_tables', 'pg_statistic', 'pg_statistic_ext', 'pg_statistic_ext_data', 'pg_stats', 'pg_stats_ext', 'pg_stats_ext_exprs', 'pg_subscription', 'pg_subscription_rel', 'pg_tables', 'pg_tablespace', 'pg_timezone_abbrevs', 'pg_timezone_names', 'pg_transform', 'pg_trigger', 'pg_ts_config', 'pg_ts_config_map', 'pg_ts_dict', 'pg_ts_parser', 'pg_ts_template', 'pg_type', 'pg_user', 'pg_user_mapping', 'pg_user_mappings', 'pg_views', 'place', 'place_lookup', 'referential_constraints', 'role_column_grants', 'role_routine_grants', 'role_table_grants', 'role_udt_grants', 'role_usage_grants', 'routine_column_usage', 'routine_privileges', 'routine_routine_usage', 'routine_sequence_usage', 'routine_table_usage', 'routines', 'schemata', 'secondary_unit_lookup', 'sequences', 'spatial_ref_sys', 'sql_features', 'sql_implementation_info', 'sql_parts', 'sql_sizing', 'state', 'state_lookup', 'street_type_lookup', 'tabblock', 'tabblock20', 'table_constraints', 'table_privileges', 'tables', 'topology', 'tract', 'transforms', 'triggered_update_columns', 'triggers', 'udt_privileges', 'usage_privileges', 'user_defined_types', 'user_mapping_options', 'user_mappings', 'view_column_usage', 'view_routine_usage', 'view_table_usage', 'views', 'zcta5', 'zip_lookup', 'zip_lookup_all', 'zip_lookup_base', 'zip_state', 'zip_state_loc'] ```

Describe the solution you'd like

if possible, avoid listing system tables

What version of ibis are you running?

main

What backend(s) are you using, if any?

Postgres; would be nice to do this for others if it applies

Code of Conduct

gforsyth commented 7 months ago

Can you try out my no schema branch? I think I fixed this over there

lostmygithubaccount commented 7 months ago

mostly fixed! down to 3:

[ins] In [1]: import ibis

[ins] In [2]: con = ibis.postgres.connect(host="localhost", user="postgres", password="postgres", database="ibis_testing")

[ins] In [3]: con.list_tables()
Out[3]: ['geography_columns', 'geometry_columns', 'spatial_ref_sys']

if it's possible/you want to remove those you can leave this open, or perhaps just mark your PR as closing this issue?

gforsyth commented 7 months ago

I'll self-assign this and tackle it as a followup after #8655 is merged.

gforsyth commented 7 months ago

Hmm, well, I'm not sure what to do about those tables -- it's two views and one table and they are in the public "schema" (grrr).

ibis_testing=# \dt public.*
              List of relations
 Schema |      Name       | Type  |  Owner   
--------+-----------------+-------+----------
 public | spatial_ref_sys | table | postgres
(1 row)

ibis_testing=# \dv public.*
              List of relations
 Schema |       Name        | Type |  Owner   
--------+-------------------+------+----------
 public | geography_columns | view | postgres
 public | geometry_columns  | view | postgres
(2 rows)

We can add special handling to remove them from the output of list_tables but I think we'd want to know that users wouldn't ever need to see them.

jcrist commented 7 months ago

We can add special handling to remove them from the output of list_tables but I think we'd want to know that users wouldn't ever need to see them.

IMO I think the optimal behavior for "system" tables like these would be:

Filtering them from list_tables seems fine to me.

gforsyth commented 7 months ago

Currently on the kill-schema branch, if no catalog or database is provided, we use the default values of self.current_catalog and self.current_database -- which for the Ibis testing container should be ibis_testing.public.

I think filtering out the above system tables for a "clean" con.list_tables() sounds good.

Do we want to be less prescriptive if the user is explicitly asking for things in ibis_testing.public?

e.g. do we want something like this? ( I genuinely don't know -- I think I don't like this but I'm waffling)

>>> con.list_tables()
[]
>>> con.list_tables(database=("ibis_testing", "public"))
['geography_columns', 'geometry_columns', 'spatial_ref_sys']
cpcloud commented 7 months ago

I think if we consider non-system tables to be any table that wasn't created by postgres itself, that was created using a CREATE TABLE statement or the corresponding C API, then we should include tables created by extensions in the output of list_tables().

jcrist commented 7 months ago

I think we should be uniform in how we're filtering, so if "system" tables can appear in other databases/catalogs then we should also filter there.

I could also go with Phillip's definition of system tables above, although I think the geo-extension is common enough (and creates some known tables) so special-casing and excluding those here also seems fine to me.

cpcloud commented 7 months ago

It seems like extension tables are probably placed into the public schema for a reason, so listing them doesn't strike me as a horrible choice

cpcloud commented 7 months ago

Postgres' default behavior is not as simple as "look in public", it's got something analogous to PATH but for the schema search space, called search_path, further complicating things.

lostmygithubaccount commented 7 months ago

from a UX perspective, I mainly just want to easily see the tables I create in the database. that's hard with 247 but fine with 3 in there by default. no strong opinion from me on whether those 3 should be listed or not but it's fine for my use cases at least

NickCrews commented 7 months ago

I vote don't do anything clever, show everything in public. If you don't want clutter (which is reasonable) then this should get fixed at the geospatial extension level.

gforsyth commented 7 months ago

I vote don't do anything clever that's a good guiding principle.

Ok, I'm going set #8655 to close this