turbot / steampipe-plugin-csv

Use SQL to instantly query data from CSV files. Open source CLI. No DB required.
https://hub.steampipe.io/plugins/turbot/csv
Apache License 2.0

Use file paths to qualify table names for csv files #45

Closed · mahiki closed 1 year ago

mahiki commented 1 year ago

Problem: When CSV files in the search path have the same name, all but one are inaccessible. I work with local CSV files a lot; they arrive as the result of Spark/Trino/pandas transforms, and I land them like a Hive warehouse to stay organized:

tree data/
data/
└── extracts
    └── csv
        ├── dataset=Page_Traffic
        │   └── rundate=2022-10-23
        │       └── data.csv
        ├── dataset=hdp_sessions
        │   ├── rundate=2022-10-17
        │   │   └── data.csv
        │   └── rundate=2022-10-18
        │       └── data.csv
        ├── dataset=funnel_metrics
        │   ├── rundate=2022-10-21
        │   │   └── data.csv
        ... etc....

# paths = [ "**/*.csv"]
steampipe query
> .inspect csv
+-------+--------------------------------------------------------------------------------------------+
| table | description                                                                                |
+-------+--------------------------------------------------------------------------------------------+
| data  | CSV file at /Users/../data/extracts/csv/dataset=funnel_metrics/rundate=2022-10-22/data.csv |
+-------+--------------------------------------------------------------------------------------------+

Only one of these files becomes part of the csv schema. It is very common for CSV files to have the same filename but be organized by path.

I'd like easy access to all of the tables.

Solution: Qualify the table name by prepending the file's path relative to the current working directory. Transform characters like /, =, {, and . to underscore (_). You could limit the prefix to the 3 parent directories above the .csv file. Long table names are not a hindrance, and the current user experience would be unchanged. All of these data.csv locations would then be imported under distinct names derived from their paths.

cd data/extracts/csv/; steampipe query

> .inspect csv
+------------------------------------------------+------------------------------------------------------------------+
|                  table                         | description                                                      |
+------------------------------------------------+------------------------------------------------------------------+
| dataset_Page_Traffic_rundate_2022-10-23_data   | CSV file at ./dataset=Page_Traffic/rundate=2022-10-23/data.csv   |
+------------------------------------------------+------------------------------------------------------------------+
| dataset_funnel_metrics_rundate_2022-10-21_data | CSV file at ./dataset=funnel_metrics/rundate=2022-10-21/data.csv |
+------------------------------------------------+------------------------------------------------------------------+
| dataset_hdp_sessions_rundate_2022-10-17_data   | CSV file at ./dataset=hdp_sessions/rundate=2022-10-17/data.csv   |
+------------------------------------------------+------------------------------------------------------------------+
| dataset_hdp_sessions_rundate_2022-10-18_data   | CSV file at ./dataset=hdp_sessions/rundate=2022-10-18/data.csv   |
+------------------------------------------------+------------------------------------------------------------------+

Describe alternatives you've considered

  1. Right now I have to cd data/extracts/csv/dataset=funnel_metrics/Page_Traffic/rundate=2022-10-22 to determine the table definition. Unfortunately, I can't join or combine results from multiple tables unless I rename them or move a set to a directory.
  2. Rename and move tables around before using Steampipe to interact with them.

Additional context: I love the Steampipe tool. It's a universal interface with so many uses; it is quickly becoming a daily tool for me. Great decisions by the designers.

misraved commented 1 year ago

Welcome to Steampipe @mahiki and thanks for raising the request 👍 .

Great to see that you are leaning towards the csv plugin.

I agree that it's a limitation at the moment, but prepending the directory path to the table name might increase the complexity of generating the dynamic table names.

@cbruno10 thoughts?

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 30 days.

mahiki commented 1 year ago

I keep thinking about this, and it's just too common, especially when working with Hive metadata-managed data stores, for the CSV file name to be the same except for the qualifying path.

There isn't a convenient workaround with the current table naming system; I'd have to manually rename a lot of files every time I want to use Steampipe for CSV.

As to the table name complexity, I can give an example in Python, since I'm unfamiliar with Go:

from pathlib import Path

ppath = Path('/tmp/data/extracts/csv/dataset=funnel_metrics/rundate=2022-10-22/data.csv')
ppath.parts[-4:-1]
# ('csv', 'dataset=funnel_metrics', 'rundate=2022-10-22')

tablename = '_'.join(ppath.parts[-4:-1]).replace('=','_')
# 'csv_dataset_funnel_metrics_rundate_2022-10-22'
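
To also carry the file stem, as in the proposed table names above, the stem could be joined in too (same idea, extended):

tablename = '_'.join(ppath.parts[-4:-1] + (ppath.stem,)).replace('=', '_')
# 'csv_dataset_funnel_metrics_rundate_2022-10-22_data'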

Perhaps this behavior could be provided via an option in the paths definition, or when invoking steampipe query?

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 30 days.

cbruno10 commented 1 year ago

One flexible approach would be to have a config argument that lets users specify how many directory levels above the file to prepend to the table name, e.g., given a layout like:

tree data/
data/
└── extracts
    └── csv
        ├── dataset=Page_Traffic
        │   └── rundate=2022-10-23
        │       └── data.csv
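
Against a layout like that, a minimal sketch of what the argument could look like, assuming a hypothetical option name table_name_depth (nothing like it exists in the plugin today):

connection "csv" {
  plugin = "csv"
  paths  = ["~/data/**/*.csv"]

  # Hypothetical: prepend the last 2 parent directories (plus the file
  # stem) to each table name, mapping characters like = and / to _,
  # yielding e.g. dataset_Page_Traffic_rundate_2022-10-23_data
  table_name_depth = 2
}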

Additional questions:

@mahiki Would a solution like this help with your use case?

johnsmyth commented 1 year ago

@mahiki would you also want to include / map multiple files into a table? Just brainstorming, but what if we allowed a map of table names to paths, something like this:

connection "csv" {
  plugin = "csv"

  table_defs = {
    page_traffic   = ["~/data/extracts/csv/dataset=Page_Traffic/*/*.csv"]
    hdp_sessions   = ["~/data/extracts/csv/dataset=hdp_sessions/*/*.csv"]
    funnel_metrics = ["~/data/extracts/csv/dataset=funnel_metrics/*/*.csv"]

  }
}
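
Each key would then presumably surface as its own table, queryable directly, e.g.:

> select * from csv.page_traffic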

There would be a bunch of details to work out, but does something like that seem appropriate to your case?

cbruno10 commented 1 year ago

Another point to consider: when a table name longer than 63 characters is loaded in Steampipe, it is automatically truncated to 63 characters (the standard Postgres identifier limit).
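
For illustration, extending the earlier Python sketch with two more parent directories plus the file stem already lands one character over:

full = 'data_extracts_csv_dataset_funnel_metrics_rundate_2022-10-22_data'
len(full)   # 64 -- one over the limit
full[:63]   # 'data_extracts_csv_dataset_funnel_metrics_rundate_2022-10-22_dat'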

cbruno10 commented 1 year ago

I think @johnsmyth's solution of adding a way to define table names would be the most flexible, especially considering the table name limit I mentioned above (even if we offered a way to specify how many directory levels to include, long directory names would still be truncated silently).

I'm not sure what work would be required in the CSV plugin and plugin SDK, but assuming it wouldn't introduce major breaking changes, it seems like a scalable solution for this plugin (and others that read in local or remote files).

Before taking the dive into the plugin and plugin SDK though, I think it'd be good to get some additional user feedback and see if this solution would be helpful.

@mahiki (and others), if you have any feedback on the proposed solution, please let us know in the issue. Else, we will let this issue go stale and close it (but will re-open if there's interest in it again).

cbruno10 commented 1 year ago

Closing this issue for now, as I think it'd be helpful to get more user input before dedicating time to this feature, which would most likely require plugin SDK updates as well.

If anyone is also encountering this issue, please add a comment and we can re-open it.

oaulicino commented 1 year ago

I read this thread, and even though I may not be able to offer a solution, my issue is somewhat similar. In my case, I get many vulnerability-scanning CSV files, all named very similarly.

For a query to work across all these files, I need to do a UNION ALL with many select statements, one for each file, which is not only onerous but may also lead to mistakes (e.g. there are 10 files and I only included 8 of them in my select / union all).
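
For instance, assuming the files surface as tables named report20230801_envname1, report20230801_envname2, and so on (hypothetical names for illustration), the workaround looks like:

select * from report20230801_envname1
union all
select * from report20230801_envname2
union all
select * from report20230801_envname3
-- ...one select per file, and easy to silently miss one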

So, it would be great if we could have something like below:

select * from report20230801*

This would allow all similarly named files to be included in the result, considering they all have the same headings. Not sure if this is feasible, but it would be of great help.

cbruno10 commented 1 year ago

Hi @oaulicino , I don't think Postgres allows selecting from multiple tables matching on wildcard names in general (but please correct me if I'm wrong), so the example you provided wouldn't be possible.

However, one solution that may work is to create multiple connections, e.g.,

In your ~/.steampipe/config/csv.spc:

connection "csv_all" {
  plugin = "csv"
  type = "aggregator"
  connections = [ "*" ]
}

connection "csv_1" {
  plugin = "csv"
  paths = ["/Users/foo/csv-aggr/test1/*.csv"]
}

connection "csv_2" {
  plugin = "csv"
  paths = ["/Users/foo/csv-aggr/test2/*.csv"]
}

connection "csv_3" {
  plugin = "csv"
  paths = ["/Users/foo/csv-aggr/test3/*.csv"]
}

Then run the query:

> select * from csv_all.test_csv
+-----------------+----------------------+-----------------------------+
| name            | employer             | _ctx                        |
+-----------------+----------------------+-----------------------------+
| Elaine Benes    | J Peterman Inc       | {"connection_name":"csv_2"} |
| cosmo framer    | kramerica industries | {"connection_name":"csv_1"} |
| George Costanza | Vandelay Industries  | {"connection_name":"csv_3"} |
+-----------------+----------------------+-----------------------------+

The query results will contain any columns found in any of the CSV files (so if there are any columns in one CSV but not the others, there will be a null value for the CSV files missing that column).

If you have multiple files in one directory, you can also configure the connection like:

connection "csv_all" {
  plugin = "csv"
  type = "aggregator"
  connections = [ "*" ]
}

connection "csv_1" {
  plugin = "csv"
  paths = ["/Users/foo/csv-aggr/report*envname1.csv"]
}

connection "csv_2" {
  plugin = "csv"
  paths = ["/Users/foo/csv-aggr/report*envname2.csv"]
}

connection "csv_3" {
  plugin = "csv"
  paths = ["/Users/foo/csv-aggr/report*envname3.csv"]
}

oaulicino commented 1 year ago

Hey @cbruno10, yes, I was given this suggestion as well, but it is really an onerous task, so I was hoping for a different solution. I will use that one, but if there is a chance for something simpler, it would be great.