microsoft / Kusto-Query-Language

Kusto Query Language is a simple and productive language for querying Big Data.
Apache License 2.0
510 stars 97 forks source link

ColumnSymbol: tracking column renames #65

Closed davidnx closed 1 year ago

davidnx commented 2 years ago

Summary

The union operator can rename output columns to resolve conflicts when inputs have mismatched types, but the library does not surface critical info about such renames, and it becomes impossible (except by rewriting substantial parts of the internal logic) to follow such column renames, since the new ColumnSymbol's have no relationship to the original.

Example of a not-so-trivial query where this takes place:

let a = datatable(a:long, a_string:string) [ 123, 'abc' ];
let b = datatable(a:string) [ '123' ] ;
a | union b

The resulting type is a TableSymbol (a_long: long, a_string: string, a_string1: string). The mapping between original columns and output columns is non-obvious, and it would be great if the Kusto library would make it easier to follow.

Proposal

Add a new property (perhaps ReferencedColumn or OriginalColumn) to ColumnSymbol that links to the original ColumnSymbol before the rename. Consumers wanting to find all uses of a symbol could then easily follow the chain of renames.

Motivation

This would enable better intellisense experiences (highlighting an occurrence of the output column could also highlight previous occurrences of the column across renames). It would also facilitate implementing the union operator (and likely other parts of the language) in baby-kusto.

This looks (at first glance at least) like an easy change to make, and perhaps only this piece of code would be impacted:

https://github.com/microsoft/Kusto-Query-Language/blob/fb8a70cb4f8e6ef6786b0c97764330c646f976a4/src/Kusto.Language/Binder/Binder_TablesAndColumns.cs#L462-L468

davidnx commented 2 years ago

Might also be worth thinking about other ways columns can be renamed and whether we would want to do something similar there (e.g. project-rename operator, or project / extend with trivial expressions that just map one column to another). None of these other scenarios are critical to me at this time, though. If we can solve the union scenario as in the initial write-up, that would be great.

davidnx commented 1 year ago

Friendly ping. Is the team willing to accept a contribution for this (see Proposal in the issue description)? @mattwar can you help?

mattwar commented 1 year ago

I implemented a prototype of this to try it out. I was not happy with the performance and allocation cost of maintaining a possible list of originating columns per column just to potentially answer this type of question. As it turns out there are many other locations where columns may get renamed implicitly, some via operators, some via functions aggregates or plugins. There are also all the places you mention and more where columns get renamed explicitly, but it was not clear to me why the distinction of whether a column is renamed via projection (a = b) is any more useful than knowing that it was defined based on other columns (a = b + c). I got to the point of concluding that solving this problem or possible set of problems was not worth the time or effort.

davidnx commented 1 year ago

Thanks @mattwar. It makes sense that solving this in the general case for arbitrary renames might be tricky. The concrete pain point I am facing that I would like to get unblocked is illustrated in this test case for BabyKusto: test Union_DifferentAndNonMatchingSchemas2

The problem I am facing there is in implementing the union operator. I haven't found a suitable way to determine that input column v: real should produce output column v_real: real, short of duplicating code from Microsoft.Azure.Kusto.Language. Any suggestions how I might achieve that? Do you think it would make sense to add support for something like my suggestion even if it doesn't support all cases of column renames?

Add a new property (perhaps ReferencedColumn or OriginalColumn) to ColumnSymbol that links to the original ColumnSymbol before the rename

This would at least unblock this scenario, and would provide some value to intellisense as-is. Even if it isn't perfect, it would be strictly better than what can be achieved today. Thoughts?

mattwar commented 1 year ago

I will reconsider a simpler version.

mattwar commented 1 year ago

I've added a simple column tracking that works for union & join/lookup. You can access a column's OriginalColumns collection to see if it was implicitly renamed via the join operator or is a replacement/stand-in for multiple columns like with the union operator.

davidnx commented 1 year ago

Looks like that will do the trick, thank you @mattwar!

https://github.com/microsoft/Kusto-Query-Language/commit/a8def9dbdbcbede1938e347a5f353eee5d0da0f9#diff-7d9669199139c427e19af1ec3c55e261cdedac95ffe940daec5e9bf2b336f6d9R24-R29

Thoughts when I might expect an updated Nuget release? No need to expedite, just hoping to get a rough time frame. This is not a blocker nor urgent. Thanks.