microsoft / Kusto-Query-Language

Kusto Query Language is a simple and productive language for querying Big Data.
Apache License 2.0
510 stars 97 forks source link

Extract columns and their "values" exclusively used in where statements #96

Closed FaresKi closed 1 year ago

FaresKi commented 1 year ago

Hi! There are 2 questions I'm having and was hoping you could help solve. 1) Quick question about the Parser. I'm trying to extract columns that are strictly used in where statements. For example, I got a query like the following:

FunctionName_or_TableName
| where Column1="abcd"
| where Column2="1234"
...

Here's the parsing logic I tried to implement:

public static List<DataModel> GetTablesWithColumns(KustoCode code)
        {
            var columns = new List<DataModel>();
            GatherColumns(code.Syntax);
            return columns;

            void GatherColumns(SyntaxNode root)
            {
                string tableName = "default (unknown)";
                bool isWhere = false;
                SyntaxElement.WalkNodes(root,
                    fnBefore: n =>
                    {
                        /**
                        Starts by counting number of PipeExpressions (lines starting with the '|' symbol)
                        **/

                        if (n.ReferencedSymbol is FunctionSymbol f && !code.Globals.IsBuiltInFunction(f) && n.Parent is PipeExpression)
                        {
                            if (f != null)
                            {

                                if (f.Name != tableName)
                                {
                                    tableName = f.Name;

                                }

                            }
                        }
                        else if (n.ReferencedSymbol is TableSymbol t && code.Globals.IsDatabaseTable(t) && n.Parent is PipeExpression)
                        {
                            if (t != null)
                            {

                                if (t.Name != tableName)
                                {
                                    tableName = t.Name;
                                }

                            }
                        }

                        if (n is Expression e)
                        {
                            if (e is FilterOperator filter)
                            {
                                isWhere = (filter.Keyword.Text == "where");
                            }
                        }

                        if (n is Expression ex && ex is ExtendOperator extendOperator)
                        {
                            isWhere = false;
                        }

                        if (n.ReferencedSymbol is ColumnSymbol c)
                        {
                            if (isWhere)
                            {
                                var existingTable = columns.Find(x => x.Table == tableName);
                                if (existingTable != null)
                                {
                                    existingTable.Columns.Add(c.Name);
                                }
                                else
                                {
                                    var dataModel = new DataModel(tableName, new HashSet<string> { c.Name });
                                    columns.Add(dataModel);
                                }

                            }
                        }
                    });
            }
        }

In other words, I initially fetch the function name (or table name), parse the rest of the query, and when I first encounter a where FilterOperator, I enable the linking of column names to functions. Small problem, there are plenty of edge cases I don't know how to cover. Especially functions and columns that are mentioned inside other functions (toscalar, for example). How would you implement it on your side?

2) Is there a similar - or easily implementable - method that provides a way to verify if a column is amongst the Function's OutputColumns? Something equivalent to TableSymbol.GetColumn's method?

Thanks in advance for your answers! Fares

P.S: I don't know if it's worth it, but here's the definition of the DataModel class:

public class DataModel
    {
        private string table;
        private HashSet<string> columns;
        private string rule;
        public DataModel(string tableName, HashSet<string> columnList)
        {
            table = tableName;
            columns = columnList;
        }
        public string Table
        {
            get { return table; }
        }
        public HashSet<string> Columns
        {
            get { return columns; }
        }

        public string Rule
        {
            set { rule = value; }
        }
    }
javiersoriano commented 1 year ago

@mattwar are you able to assist here?

Thanks

mattwar commented 1 year ago

For #2. It is not possible to know the columns returned by a function without first analyzing the body of the function itself since the functions are not defined declaring a result schema and it may depend on the arguments given to the function when called.

Lucky for you, the parser has already done this analysis, so when you find a function that was called you just need to examine that expression node's ResultType property.

The best thing you can do for #1, is to build a map of all the functions being called and the columns they return. Then when you find those columns being referred to you can attribute them back to the function they came from.