m-kovalsky / fabric_cat_tools

Supercharge your Microsoft Fabric development with the fabric_cat_tools library
MIT License
100 stars 14 forks

Introduce a setting similar to DAX Studio's that skips querying Direct Lake tables for column statistics and just reports the memory footprint as-is. #7

Closed hmayer1980 closed 2 months ago

hmayer1980 commented 2 months ago

Is your feature request related to a problem? Please describe.
When you run the Vertipaq Analyzer, DAX Studio offers an option called "read statistics from data" that prevents it from reading actual data from DirectQuery or Direct Lake models. As noted in the function signature, it is necessary not to query the model, otherwise all columns get loaded into memory. In addition, querying via Spark can introduce very long runtimes and cost when those queries run against the lakehouse. I have to wait 20+ minutes, while I am actually only interested in the model's current memory usage (and not so much in columns that are not loaded).

Describe the solution you'd like
Please introduce a new parameter or generic configuration setting to disable the querying of Vertipaq Analyzer column statistics via Spark.

Describe alternatives you've considered
An alternative would be simply not to run those queries at all, as DAX Studio supports.

Additional context
See the problem description above.

hmayer1980 commented 2 months ago

Having worked with this for a while, I have increasingly come to the conclusion that the Vertipaq Analyzer results should be self-contained within the Power BI engine and not involve Spark at all. I wanted to get the memory footprint of all models on our capacity, but some of the models are not fully functional. DAX Studio does not fail if a column references a column that does not exist in the lakehouse / warehouse; it simply outputs the memory the model has actually loaded. Here, on the other hand, I keep running into errors because columns that were never used are still in the model, and the Spark queries fail because of that. Validation is good, but it does not belong in the Vertipaq Analyzer function; there I just want the memory footprint as-is.

m-kovalsky commented 2 months ago

This was actually already on my to-do list. A new parameter called 'read_stats_from_data' will be added to the vertipaq_analyzer function. It will default to False. Setting it to True will use Spark to query the lakehouse (for Direct Lake models) or use DAX (for non-Direct Lake models) to obtain values for Column Cardinality and Missing Rows.
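A minimal sketch of the branching behavior described above. This is not the library's actual implementation; the function name `choose_stats_source` and the return labels are hypothetical, used only to illustrate how the `read_stats_from_data` default of False avoids touching the data at all:

```python
# Hypothetical illustration of the decision logic described in the comment
# above -- NOT the actual fabric_cat_tools implementation.
def choose_stats_source(read_stats_from_data: bool, is_direct_lake: bool) -> str:
    """Return how column statistics would be gathered for the model."""
    if not read_stats_from_data:
        # Default (False): report the memory footprint as-is,
        # without running any Spark or DAX queries against the data.
        return "none"
    # Opt-in (True): query the data for Column Cardinality / Missing Rows.
    # Direct Lake models are queried via Spark; other models via DAX.
    return "spark" if is_direct_lake else "dax"

print(choose_stats_source(False, True))   # -> none
print(choose_stats_source(True, True))    # -> spark
print(choose_stats_source(True, False))   # -> dax
```

With the default of `read_stats_from_data=False`, a run over many models on a capacity never hits the lakehouse, which addresses both the 20+ minute Spark runtimes and the failures caused by columns that no longer exist in the source.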

m-kovalsky commented 2 months ago

Added to 0.3.2