mechatroner / RBQL

🦜RBQL - Rainbow Query Language: SQL-like query engine for (not only) CSV file processing. Supports SQL queries with Python and JavaScript expressions.
https://rbql.org
MIT License
276 stars 13 forks source link

Handling column names directly? #6

Closed adamroyjones closed 4 years ago

adamroyjones commented 5 years ago

Typically, a csv file will have a header row (e.g. carat,cut,color,clarity,depth,table,price,x,y,z for the diamonds data set from ggplot2).

Is there a chance of adding an option to perform RBQL queries directly with the columns (e.g. carat) instead of having to reference them by their index (e.g. a1)?

mechatroner commented 5 years ago

Hi @docbiz !

There will be a lot of corner cases:

  1. What if column names are long?
  2. What if column names contain spaces or other non-alphanumeric chars? OK we can handle this by adding a rule that all non-alphanumeric should be replaced with an underscore.
  3. Some tables do not have a header row.
  4. Many result set tables wouldn't have a header row even if the source table had it.

BTW maybe the root problem why we may need this feature is because it is hard to associate a1,a2... vars with columns? I mean it should be trivial for VSCode version since there is an alligned table, but for standalone rbql console apps it is really hard to tell what is the number of a specific column? In this case it is possible to add some terminal user interface to rbql-js and rbql-py. So basically, what is your use case? Why using column names from header for you is more convenient than a1,a2,... vars?

adamroyjones commented 5 years ago

It's certainly true that always and only referencing columns by their header row (where one exists) would be very error-prone. But I think it would be a nice option—I stress 'option'—to permit queries that reference columns by their header name. (The header names could be restricted to, say, not include the file's separator or `; in this way, you could reference column names that contained spaces by wrapping it in ` characters (much like in R).)

The advantage of queries that use the names of the columns is that they can be understood on their own without the user having to keep flicking back and forth between the query and the table(s). That is: yes, it can be a bit frustrating to associate a1, a2, etc. with specific columns!

(My personal use case is wanting to casually reshape data without firing up something like R, but I think this would be a generally useful feature.)

If you want, I'll try to find some time over the next few weeks to come up with a pull request?

mechatroner commented 5 years ago

I like your idea with wrapping column names in backtick chars, it is better than underscore replacement which I had in mind. OK, go ahead with a pull request. Please implement it just for one backend (Python or JS whichever one you like more) and then we will merge it into the other one.

RobertGiesecke commented 5 years ago

Please don't use some weird mysql-specific stuff. In SQL, you quote identifiers by enclosing them in double quotes, which also makes them case-sensitive.

mechatroner commented 5 years ago

There are actually some aspects of this problem that I didn't think about before.

  1. If column names are used in query, header row should be skipped by record iterator.
  2. Probably combining column indexes (a1, a2 etc) and column names should be disallowed in a single query because column indexes have header-agnostic semantic (do not suppose to skip the header row)
  3. Switching between column names and column indexes can be done with a special flag in RBQL user interface.
  4. There should be a way to distinguish between column names in the main table A and join table B when these tables share columns with the same names.
  5. Column names should be easily and reliable detectable by a simple parser. I.e. it should be impossible to mix them up with JS or Python string literals and/or reserved keywords. Maybe scheme like a.columnname and b.columname can be used where columname part is a column name with all non-alphanumeric characters removed. Or it can be a."Colum name" or something like that.

Anyway it looks like any possible implementation of this feature would be very tricky because of all these considerations.

mechatroner commented 5 years ago

Maybe there is actually an interesting trick to achieve this, currently RBQL uses a1, a2, ... notation. But if we declare special a and b objects in Python and JS templates and initialize them with the current record, then the following 3 valid Python/JS syntaxes would be automatically available with these special objects if properly implemented:

  1. a[1], a[2], ....,b[1], .... - array notation, looks ugly for SQL
  2. a.name, a.age, a.columnname, ... , b.phonenumber, ... - attribute notation, looks like valid SQL. Downside is that not all column names can be used as attributes, but a simple rule can be used - just discard all non-alphanumeric chars.
  3. a["name"], a["age"], a["Column name"], .... , b["Phone number"] - dictionary notation, looks ugly for SQL
peheje commented 5 years ago

Another solution could be to translate the query string with column names into a query string with a1, a2, an. Then feed that into the current engine.

select name, age where age > 2 -> select a1, a2 where a2 > 2

By reading header you can build dictionary {name:1,age:2}

No core engine requirements changed.

mechatroner commented 5 years ago

@peheje This will not work because we can't reliably "translate" the query string. Think about name collisions between column names and built-in python/js keywords, identifiers and popular modules e.g. columns named "if", "sys", "os", "or", "and" etc. So we can't just replace all tokens in a query that match column names in the csv file. One way to work around this is to require all column names to be wrapped in backticks as @docbiz proposed, then we can make the translation work reliably. But I think it is possible to totally avoid the translation by combining notations 2 and 3.

peheje commented 5 years ago

I think you misunderstood. This would all be string manipulation. No collisions chance regardless of language.

If a column was named "null" we would look at the header string. Find the column with name null, count the number let's says 2 and 'intelligently' replace that "null" from the query string with a2.

Let me know if I should make an example.

On Sat, Jun 29, 2019, 15:56 Dmitry Ignatovich notifications@github.com wrote:

@peheje https://github.com/peheje This will not work because we can't reliably "translate" the query string. Think about name collisions between column names and built-in python/js keywords, identifiers and popular modules e.g. columns named "if", "sys", "os", "or", "and" etc. So we can't just replace all tokens in a query that match column names in the csv file. One way to work around this is to require all column names to be wrapped in backticks as @docbiz https://github.com/docbiz proposed, then we can make the translation work reliably. But I think it is possible to totally avoid the translation by combining notations 2 and 3.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/mechatroner/RBQL/issues/6?email_source=notifications&email_token=ABXPXFSATDI3LOBHKMEWPATP45SXFA5CNFSM4GJHIA72YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODY3ZHHA#issuecomment-506958748, or mute the thread https://github.com/notifications/unsubscribe-auth/ABXPXFW3R4OJ4Z32KFQCTA3P45SXFANCNFSM4GJHIA7Q .

mechatroner commented 5 years ago

OK, That's how I understand your proposal (and problems associated with it):

let's say we have a table:

name,age,JSON
Andy,30,"[100, 150]"
Maria,20,"[200, 125]"

And the query (in javascript) is:

SELECT JSON.stringify({'person_name': name, 'person_age': age})
  1. we read the header and build the following dictionary: {name: a1, age: a2, JSON: a3}

  2. we patch the query using the dictionary and get:

    SELECT a3.stringify({'person_name': a1, 'person_age': a2})

    which is obviously an error, a3. stringify(...) will throw an exception because a3 is a string that doesn't have stringify() function.

So how do you suggest to alter that algorithm to avoid the problem?

peheje commented 5 years ago

I admit, I didn't think of that. However the user could be forced to specify where the column is coming from: a or b. Required in the case of join anyway.

Maybe this could be a solution?

String.prototype.replaceAll = function(search, replacement) {
    return this.replace(new RegExp(search, 'g'), replacement);
};

var dic = { "a.name": "a1", "a.age": "a2", "a.JSON": "a3" };
var query = "SELECT JSON.stringify({'person_name': a.name, 'person_age': a.age}) order by a.JSON";
var patched = query;

for (prop in dic) {
    patched = patched.replaceAll(prop, dic[prop]);
}

console.log(patched);
// SELECT JSON.stringify({'person_name': a1, 'person_age': a2}) order by a3

I'm know sure how to handle spaces in column names though, maybe unsupported?

mechatroner commented 5 years ago

Yep, I think that a.name, a.age, b.name, etc notations are the way to go. But there is no need to do replaceAll, because it is unreliable even with "a." and "b." prefixes. Original user query should be altered as little as possible. One case where it will definetely fail are string literals: should we replace a.name inside string literals? In case of SQL keywords ("SELECT", "WHERE", etc) the answer is no, and RBQL isolates string literals before it starts searching for the keywords. But in case of a.name the answer is not so obvious. Here is an absolutely artificial and contrived example, but it will fail with replaceAll:

SELECT `a.name = ${a.name}`

The output would be a1 = Andy instead of a.name = Andy. I agree that such situation would probably never happen in practice, but my general position still holds: original query should be altered as little as possible. This is also important for exceptions/errors that contain failed query expression, if the query was significantly altered it wouldn't be recognizable by the user, and it would be harder to correct that error.

So to implement this "a.name" etc notations I suggest to initialize special "a" and "b" objects (if either "a." or "b." were detected somewhere in the query). This should be relatively easy to do - almost the same code that currently initializes a1, a2, b1 etc variables. "a" and "b" can also support square bracket access like a["person name"]. Those who think that this is super ugly and non-SQLish can pretend that this feature doesn't exist and only columns without whitespaces/special characters are supported. I don't have 100% guarantee that this approach would work, but I currently don't see why it wouldn't, and the only way to be sure is to try and implement this.

fahnzmode commented 4 years ago

Doesn't address this issue directly, but something that might be worth considering is to simply allow skipping of the header row when the query is performed.

mechatroner commented 4 years ago

@fahnzmode I was planning to do this at some point. Implementing an additional "--skip-header" option should be easy, but it has to be supported in all user interfaces - this is the hard part. Also I would like to add support to NL variable simultaneously - which would be at least NR + 1 in this case (or even greater if some records span multiple lines). BTW until it's implemented you can use where NR > 1 to skip the header line.

fahnzmode commented 4 years ago

Thanks for the where NR > 1 tip @mechatroner !

mechatroner commented 4 years ago

So, I wrote the code to support a.good_alphanumeric_column_name and a["arbitrary column name!"] notation, it's available in the master branch. Took me almost 150 commits [sigh] Soon I will start deploying it to the apps. And I would really appreciated any feedback, guys! Please, let me know what you think.

PS I rewrote JS implementation: replaced all callbacks with async/await so it should be much easier to read now.

mechatroner commented 4 years ago

Column name support was added for all RBQL apps that I am supporting. For VSCode version and rbql.org I also added variable autocomplete feature - when user types "a." the app would show all possible column names. This autocomplete is in early development stage: e.g. typing "b." prefix would not show anything which is obviously wrong.

Many thanks to everyone who participated in this discussion!

I am not sure if RBQL would ever support column names without "a." and "b." prefixes, still we can open a new issue for this.