opensearch-project / opensearch-spark

Spark Accelerator framework; it enables secondary indices on remote data stores.
Apache License 2.0

[FEATURE] Support load CSV in PPL (inputlookup or search) #638

Open LantaoJin opened 2 months ago

LantaoJin commented 2 months ago

Support loading data from a CSV file.

File location

There are two options for where the CSV file can be stored:

  1. Upload CSV files to the Spark scratch dir set by the SPARK_LOCAL_DIRS environment variable or the spark.local.dir config, for example $SPARK_LOCAL_DIRS/<some_identities>/lookups/test.csv. But uploading to a local dir could introduce potential security issues, especially if the Spark application runs on a cloud service.
  2. (Preferred) Upload CSV files to an external URL, for example s3://<bucket>/foo/bar/test.csv or file:///foo/bar/test.csv. The user must ensure the application has access permission to the external URL. A sketch of reading such a URL with plain Spark APIs follows this list.
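
For reference, a minimal sketch of reading a CSV from an external URL with plain Spark APIs; the path, options, and app name are illustrative only, and S3 access may require the s3a:// connector depending on the Hadoop setup:

```scala
// Minimal sketch, assuming a standard Spark setup with credentials for the
// bucket already configured. Path and options are illustrative only.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-lookup-sketch").getOrCreate()

// Spark resolves the URL scheme (s3a://, file://, ...) through its Hadoop
// filesystem configuration, so the same call covers both options above.
val lookup = spark.read
  .option("header", "true")      // treat the first line as column names
  .option("inferSchema", "true") // derive column types from the data
  .csv("s3a://bucket_name/foo/bar/test.csv")

lookup.show()
```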

PPL syntax

There are also two options to support this feature:

A. Introduce a new command inputlookup or input:

input <fileUrl> [predicate]

Usage:

input "s3://bucket_name/folder1/folder2/flights.csv" FlightDelay > 500

The predicate FlightDelay > 500 only works when flights.csv contains a CSV header.
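
A plausible translation into Spark, just to illustrate why the header is required (hypothetical sketch, reusing the `spark` session from the earlier example): the predicate references the FlightDelay column by name, which only exists once the header row is parsed.

```scala
// Hypothetical translation of: input "s3://.../flights.csv" FlightDelay > 500
// filter() references FlightDelay by name, which is only available when the
// header row was parsed; without a header the columns are just _c0, _c1, ...
val flights = spark.read
  .option("header", "true")
  .option("inferSchema", "true") // so FlightDelay > 500 compares numerically
  .csv("s3a://bucket_name/folder1/folder2/flights.csv")
  .filter("FlightDelay > 500")
```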

B. Modify the current search command to support file:

search file=<fileUrl> [predicate]

Usage:

search file="s3://bucket_name/folder1/folder2/flights.csv" FlightDelay > 500

PS: the current search command syntax is

search index=<indexName> [predicate]
search source=<indexName> [predicate]

Both options A and B could be used in a sub-search:

search source=os dept=ps
| eval host=lower(host)
| stats count BY host
| append
  [
    input "s3://key/lookup.csv" | eval host=lower(host) | fields host count
  ]
| stats sum(count) AS total BY host
| where total=0

search source=os dept=ps
| eval host=lower(host)
| stats count BY host
| append
  [
    search file="s3://key/lookup.csv" | eval host=lower(host) | fields host count
  ]
| stats sum(count) AS total BY host
| where total=0
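
For clarity, a rough DataFrame-level equivalent of the append pipeline above (a hypothetical sketch, not existing opensearch-spark code; the table and column names come from the example):

```scala
// Rough DataFrame equivalent of the append pipeline above. Table and column
// names ("os", dept, host, count) come from the example and are hypothetical.
import org.apache.spark.sql.functions.{col, count, lower, sum}

val main = spark.table("os")
  .where(col("dept") === "ps")
  .withColumn("host", lower(col("host")))
  .groupBy("host")
  .agg(count("*").as("count"))

val fromCsv = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("s3a://key/lookup.csv")
  .withColumn("host", lower(col("host")))
  .select("host", "count")

// `append` corresponds to a row-wise union of the two result sets.
main.unionByName(fromCsv)
  .groupBy("host")
  .agg(sum("count").as("total"))
  .where(col("total") === 0)
```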
penghuo commented 2 months ago

+1 on (Preferred) Upload CSV files to external URL. One concern is how to prevent users from accessing any data in the local filesystem, as this poses a security risk.

YANG-DB commented 2 months ago

I agree with @penghuo that this is a possible security concern. I would propose a different approach: use the dashboard to load a CSV file into an index, then use that index for the lookup.

brijos commented 2 months ago

I hate to be that guy, but I know of those in the community who would want to load the CSV into their index as well as those who want to load the CSV into cloud storage. From a priority perspective, index should come first, as it is the easiest (assuming the analyst has write access to the cluster). Dealing with cloud storage introduces permissions friction.

LantaoJin commented 2 months ago

> +1 on (Preferred) Upload CSV files to external URL. One concern is how to prevent users from accessing any data in the local filesystem, as this poses a security risk.

A straightforward solution is to allow only s3:// scheme URLs in production.
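
A minimal sketch of such an allow-list check (hypothetical helper, not an existing API in this repo):

```scala
// Minimal sketch of an allow-list check (hypothetical helper, not an existing
// opensearch-spark API): reject any lookup URL whose scheme is not S3-based.
import java.net.URI

def validateLookupUrl(url: String): Unit = {
  val scheme = Option(new URI(url).getScheme).getOrElse("").toLowerCase
  val allowed = Set("s3", "s3a", "s3n")
  require(allowed.contains(scheme),
    s"Lookup files must use an S3 URL, got scheme '$scheme'")
}

validateLookupUrl("s3://bucket/lookup.csv")  // passes
// validateLookupUrl("file:///etc/passwd")   // throws IllegalArgumentException
```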

> From a priority perspective, index should come first, as it is the easiest

Yes, I get the priorities. We already have the lookup issue https://github.com/opensearch-project/opensearch-spark/issues/620 opened. This issue is for the requirement of loading data from a CSV (similar to the inputlookup command in Splunk).