macbre / sql-metadata

Uses tokenized query returned by python-sqlparse and generates query metadata
https://pypi.python.org/pypi/sql-metadata
MIT License
796 stars 125 forks source link

Resolve subqueries in where clause #185

Open ReinierKoops opened 3 years ago

ReinierKoops commented 3 years ago

Hi,

I think I have found a bug regarding subqueries, namely a subquery that seems intentionally left out in (https://github.com/macbre/sql-metadata/blob/master/sql_metadata/keywords_lists.py):

query = "SELECT a.model FROM CAR_NAMES a JOIN CARS_DATA b ON a.MakeId = b.Id WHERE b.Weight < (SELECT avg(Weight) FROM CARS_DATA)" print(query.columns, "\n", query.columns_aliases_names, "\n", query.tables, "\n", query.tables_aliases, "\n", query.subqueries)

I expected this to result in recognition of the subquery _(SELECT avg(Weight) FROM CARSDATA), however it is not recognized, the output is: ['car_names.model', 'car_names.makeid', 'cars_data.id', 'cars_data.weight', 'weight'] [] ['car_names', 'cars_data'] {'a': 'car_names', 'b': 'cars_data'} {}

Something else, which I think is a different issue; I would expect the column 'weight' of the subquery to be recognized as 'cars_data.weight' as well.

I think it would be useful when the complete query has more than one table found, to append the table name before any column found.

collerek commented 3 years ago

For now, only column defining subqueries and join subqueries are supported -> so resolving subqueries in where clause it's an enhancement.

As for the column names right now we do not try to guess from which table the column comes if it's not explicitly provided, in your sample it's quite easy (one table) but in a lot of cases it's not:

Select col_a, col_b from table_a, table_b # (from which table is which column?)

Select col_a, col_b from (select * from table_a left join table_b on aa=bb)  sub # again which column from which table etc.

If you want explicit resolving you need to prefix the column with an alias, so a subquery like

SELECT avg(cd.Weight) FROM CARS_DATA cd
ReinierKoops commented 3 years ago

Thank you, would be nice enhancement. Are you perhaps familiar with any tool that could do the alias prefix resolving you mentioned? Currently, I have to deal with user input, which often is prone to errors like in the stated example. Normally one would for example use prepared statements and the like, however, my usecase is for teaching/learning/sandbox purposes so its supposed to be free form sql input.

collerek commented 3 years ago

What do you mean by alias resolving?

We have a bug that you reported (multiple aliases without as in from clause) Apart from this if you provide the prefix for a column it should be resolved by sql-metadata (also to subqueries and with queries).

If you mean to resolve the columns to tables without explicitly provided table names/aliases it's often impossible without access to the database (you would have to query system tables to check names in tables columns) -> .i.e. the samples I provided.

ReinierKoops commented 3 years ago

Yes, I meant resolving the columns to tables without explicitly provided table names/aliases.

Makes sense, I guess I will have to find a way around it. Thanks for your help 😄

collerek commented 3 years ago

If there is no prefix the column name has to be unique in the scope of the query -> which means that you should be able to resolve it with db access -> either by quering system tables or by auto-reflecting tables i.e. with sqlalchemy.

So you can extract all tables used with sql-metadata but then you need the actual db to fetch the list of columns per table.

valentingarnier commented 2 years ago

Hi,

I saw that in order to spot subqueries you rely on having an alias for the whole subquery. Problem with the WHERE clause is that usually subqueries inside doesn't have any alias. Any idea on how to find a way around for spotting subqueries inside WHERE clause? I was thinking about working with pairs of parentheses but subqueries are not the only one surrounded by parentheses.

Thanks!

analyst009 commented 2 years ago

Yes, I meant resolving the columns to tables without explicitly provided table names/aliases.

Makes sense, I guess I will have to find a way around it. Thanks for your help 😄

Hi Reinier,

I am also facing similar problem. Which work around worked for you? Could you please help me on this.

Best Regards, Chandan

leonardomathon commented 2 years ago

It would be nice if this feature to resolve subqueries in the where clause is implemented.