sql-bi / VertiPaq-Analyzer


Enhancement - Make the number of Columns in LoadColumnStatistics configurable #131

Closed · corycundy closed 2 months ago

corycundy commented 2 months ago

Hello, I am running into a problem when trying to run VertiPaq Analyzer against a Direct Lake model. The model is much larger than the capacity SKU's memory limit, so our capacity is constantly running out of memory. This is being addressed from a modeling standpoint; in the meantime, however, I wanted to get a full view of the columns. I ran VertiPaq Analyzer from both DAX Studio and Tabular Editor 3. In DAX Studio I enabled "Read statistics from data", and in TE3 I used "Read statistics from data" and set "Statistics from Direct Lake models" to Full.

The problem is that there are many large columns that can easily fill up the memory. VertiPaq Analyzer appears to fetch column statistics for 50 columns at a time. This simply doesn't work, as running the EVALUATE UNION ( ROW ( .... query over 50 large columns exceeds the SKU memory limit for the model. I strongly believe this would work fine if I were able to adjust the 50 down to a much lower number, maybe even 1. I understand the optimization of running one EVALUATE for 50 columns at a time, but for my specific case I'd rather see it churn through 50 individual queries.

I can't imagine there is any way to configure this at the moment, as 50 is hard coded. My request is that this be made configurable, even if only via a config file rather than a UI setting in DAX Studio or TE3.

Here is the code that I believe causes 50 columns to be read at a time. For some reason the profiler showed only 49 in the DAX query.

[image: screenshot of the LoadColumnStatistics code with the hard-coded batch of 50]
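
In case the screenshot does not come through, the logic is roughly of the following shape. This is an illustrative reconstruction, not the library's actual code; the names are invented, and the ROW shape is modeled on the profiler query quoted later in this thread:

```csharp
using System.Collections.Generic;
using System.Linq;

// Illustrative reconstruction only, not the library's actual code.
// Columns are split into fixed batches of 50, and each batch becomes a
// single EVALUATE UNION(ROW(...), ROW(...), ...) cardinality query.
static class ColumnStatsSketch
{
    const int ColumnBatchSize = 50; // the hard-coded value this issue asks to make configurable

    public static IEnumerable<string> BuildCardinalityQueries(
        IReadOnlyList<(string Table, string Column)> columns)
    {
        for (int i = 0; i < columns.Count; i += ColumnBatchSize)
        {
            var rows = columns
                .Skip(i)
                .Take(ColumnBatchSize)
                .Select(c =>
                    $"ROW(\"Table\", \"{c.Table}\", \"Column\", \"{c.Column}\", " +
                    $"\"Cardinality\", DISTINCTCOUNT('{c.Table}'[{c.Column}]))");

            // Note: a final batch containing a single column would emit UNION
            // with one argument, which is invalid DAX; see the bug reports below.
            yield return "EVALUATE UNION(" + string.Join(", ", rows) + ")";
        }
    }
}
```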

Here is an example of the DAX query from the profiler.

[image: screenshot of the batched DAX query captured by the profiler]

marcosqlbi commented 2 months ago

Hi Cory, this is a good point. However, as you can imagine, there are multiple steps involved:

1. Add the parameter to the VPAX Library (@albertospelta, what do you think about it?)
2. Expose that parameter in DAX Studio (and Tabular Editor) configuration. We'll have to open a similar Enhancement request there.

Marco

albertospelta commented 2 months ago

@marcosqlbi we can definitely add the new parameter to the public API. To avoid breaking changes for current users, we'll keep the default value at 50. This way, existing setups will continue to function as they do now.
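
A hedged sketch of what that might look like at the API level (the real VPAX signature will differ; this only illustrates the optional-parameter idea):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch only; the real VPAX signature differs. The point is
// that an optional parameter with the old default keeps existing callers
// working exactly as before, while new callers can opt into smaller batches.
public static class VpaxApiSketch
{
    public static void LoadColumnStatistics(
        IReadOnlyList<string> columns,
        int columnBatchSize = 50)
    {
        Console.WriteLine($"Reading stats for {columns.Count} columns, {columnBatchSize} per query.");
    }
}

// Usage:
//   VpaxApiSketch.LoadColumnStatistics(columns);                      // batches of 50, as today
//   VpaxApiSketch.LoadColumnStatistics(columns, columnBatchSize: 1);  // one column per query
```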

corycundy commented 2 months ago

I know there is work involved and that the best solution should be implemented, but I have a few suggestions that may speed up development.

While this may not seem like the best solution in the end, it would allow immediate use with any existing client tool, and it would also let me run a full VertiPaq scan on my existing model sooner! The application could read an appSetting from the app.config. The logic could then use the configuration value instead of the hardcoded 50, and default to 50 if the setting doesn't exist in the app.config. For example, an entry could be added to DaxStudio.exe.config along the lines of the sketch below. The same could be done in Tabular Editor 3. Later this could become more of a parameter that the client tools pass in.
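
A minimal sketch of that fallback, assuming a hypothetical setting name VpaxColumnBatchSize and the standard System.Configuration API (illustrative only):

```csharp
using System.Configuration; // e.g. the System.Configuration.ConfigurationManager package

// Hypothetical entry in DaxStudio.exe.config:
//   <appSettings>
//     <add key="VpaxColumnBatchSize" value="1" />
//   </appSettings>

internal static class VpaxConfig
{
    // Read the optional override; fall back to the current hard-coded 50.
    public static int GetColumnBatchSize()
    {
        string raw = ConfigurationManager.AppSettings["VpaxColumnBatchSize"];
        return int.TryParse(raw, out int size) && size > 0 ? size : 50;
    }
}
```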

In the end, this may rarely be needed, so exposing it in the UI of the client tools would be fine but mostly unused. If it ends up there at some point, great, but for now I would be happy with this behind-the-scenes configuration approach as a starting point. This suggestion is really an effort to keep the modification simple and expedite it.

Unless I'm missing something, this may only be a few lines of code. I'm curious about others' thoughts on this idea.

marcosqlbi commented 2 months ago

@corycundy I am not sure that reading the config from a multi-platform library is possible, and even if it were, I do not like that idea much! However, @albertospelta, what do you think about that?

albertospelta commented 2 months ago

@corycundy I agree with Marco, as implementing it in a multi-platform context could be complex and difficult to maintain. The VPAX library can't always know how the host application handles its configuration, and it might not always be through an app.config file. Furthermore, modifying a configuration file in the program's installation directory usually requires admin privileges, which can be difficult for many users.

Ideally, this parameter should be available as an advanced setting in the host application's UI, but exposing it there might take some time. Meanwhile, I was wondering if you could use our command-line tool Dax.Vpax.CLI, which we recently released. We can definitely add this parameter to the CLI and get it out soon. Would you be able to use this tool to extract your VPAX files temporarily, until the new setting is available in the other tools?

otykier commented 2 months ago

@albertospelta could you let me know once the public VertiPaq Analyzer NuGet contains this parameter? Then we'll make sure to surface it in TE3's Preferences dialog.

albertospelta commented 2 months ago

@otykier Sure thing! I’ll let you know as soon as the NuGet is available. Thanks!

albertospelta commented 2 months ago

@otykier the new parameter is available in v1.7.0.

albertospelta commented 2 months ago

@corycundy The new parameter columnBatchSize has been added to the VPAX library and will be included in TE3's preferences dialog. I have also opened an issue for support in DaxStudio here.

If you would like, you can use this setting with the command-line tool, which now supports it. To get started, install the CLI by running dotnet tool install Dax.Vpax.CLI --global from your command prompt. After installation, you can use the new --column-batch-size option to configure the number of columns to query at a time. For a complete list of available options, run vpax export --help.
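
For example (the path, workspace, and model names below are placeholders):

```
dotnet tool install Dax.Vpax.CLI --global
vpax export "C:\Temp\ModelName.vpax" "Provider=MSOLAP;Data Source=powerbi://api.powerbi.com/v1.0/myorg/WorkspaceName;Initial Catalog=ModelName" --column-batch-size 1
```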

corycundy commented 2 months ago

Hello @albertospelta, this change is great and super timely. Thank you so much for implementing this! I was able to call the CLI version with the new parameter. However, I unfortunately stumbled upon a few potential bugs.

I was not able to get --column-batch-size to work with a value of 1, which is my ultimate goal; even 2 causes issues in my current Direct Lake model. We are hitting memory errors and are working with Microsoft CAT and support on this. I got the following error from the command line, running against an Adventure Works Internet Sales model.

```
Extracting VPAX metadata...Unhandled exception: Microsoft.AnalysisServices.AdomdClient.AdomdErrorResponseException: The end of the input was reached.
```

I was able to get the following from DAX Studio:

```
The end of the input was reached.

Query Text:

EVALUATE
```

Looking at the code, I don't see how this can happen unless the set of validColumns is slightly different from the results of the .Where statement that looks at both IsRowNumber and GroupByColumns. I tested multiple models with a parameter of 1 and received this error.

Secondly, on one Direct Lake model I also received the following error, which seems to be related. Looking at the code, I don't understand this one either.

```
Extracting VPAX metadata...Unhandled exception: Microsoft.AnalysisServices.AdomdClient.AdomdErrorResponseException: Query (1, 10) Too few arguments were passed to the UNION function. The minimum argument count for the function is 2.
```

I was able to get the following from DAX Studio:

```
Query (1, 10) Too few arguments were passed to the UNION function. The minimum argument count for the function is 2.

Query Text:

EVALUATE UNION( ROW("Table", "0000Date", "Column", "0001Update Timestamp", "Cardinality", DISTINCTCOUNT('Date'[Update Timestamp])))
```

I couldn't reproduce this one on a second model. It is somehow adding the UNION with only one column. I don't believe I have any Group By columns in any of the tables in the models I am testing with. I hope that the fix for the first one turns out to fix this one as well.

Please let me know if you have any questions that I can help with regarding this. I would appreciate any additional help working through these issues.

Cory

albertospelta commented 2 months ago

Hi @corycundy, I am able to reproduce the first error ("The end of the input was reached.") and have a fix for it. Regarding the second error ("Too few arguments were passed to the UNION function"), I have a couple of questions:

Thanks

corycundy commented 2 months ago

@albertospelta, thanks for the quick response. I used 3 for --column-batch-size when I received that error. Here is the command line, with paths and connection details modified:

```
C:\Users\Cory.Cundy>vpax export "C:\Temp\ModelName.vpax" "Provider=MSOLAP;Data Source=powerbi://api.powerbi.com/v1.0/myorg/WorkspaceName;Initial Catalog=ModelName" --column-batch-size 3
Extracting VPAX metadata...Unhandled exception: Microsoft.AnalysisServices.AdomdClient.AdomdErrorResponseException: Query (1, 10) Too few arguments were passed to the UNION function. The minimum argument count for the function is 2.
```

As far as Group By columns go, it turns out there was one column in the model that did have a Group By Column assigned.

Cory

otykier commented 2 months ago

@albertospelta we saw a similar error to Cory's when reducing the column batch size below 5.

[image: screenshot of the error message]

I'm guessing a check is needed somewhere, to ensure that UNION is only added when the column batch contains 2 or more columns.
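
Something along these lines, presumably (an illustrative sketch, not the actual fix that shipped):

```csharp
using System.Collections.Generic;

internal static class QueryBuilderSketch
{
    // rowExpressions holds the ROW(...) fragments for the current batch.
    // DAX's UNION requires at least two table arguments, so only wrap
    // in UNION when the batch actually contains two or more columns.
    public static string BuildBatchQuery(IReadOnlyList<string> rowExpressions)
    {
        return rowExpressions.Count >= 2
            ? "EVALUATE UNION(" + string.Join(", ", rowExpressions) + ")"
            : "EVALUATE " + rowExpressions[0];
    }
}
```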

albertospelta commented 2 months ago

@otykier @corycundy The NuGet package version 1.7.1 and the CLI tool version 1.2.2 have just been released and include a fix for the incorrect UNION syntax. Please let me know if this resolves the issues, thanks!

corycundy commented 2 months ago

@albertospelta, my testing on both issues has been successful. I am no longer getting either error.

One other observation: similar logic is used in the LoadTableStatistics method, where 50 is also hard coded (var loopTables = tableList.SplitList(50);). This is not causing an issue for me. My guess is that since the generated query uses COUNTROWS rather than DISTINCTCOUNT, it is often a metadata-only operation (unless using DirectQuery), as no SE queries were generated (Import and Direct Lake) in my limited testing.
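
Judging purely from that usage, SplitList is presumably a chunking helper along these lines (a guess from its call site, not the library's code):

```csharp
using System;
using System.Collections.Generic;

internal static class ListExtensionsSketch
{
    // Partition a list into consecutive chunks of at most `size` items,
    // e.g. 120 tables with size 50 yields chunks of 50, 50, and 20.
    public static IEnumerable<List<T>> SplitList<T>(this List<T> source, int size)
    {
        for (int i = 0; i < source.Count; i += size)
            yield return source.GetRange(i, Math.Min(size, source.Count - i));
    }
}
```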

Thanks again for the quick turnaround time on this!

otykier commented 2 months ago

@corycundy happy to let you know that the new Column Batch Size option is now available in Tabular Editor 3.17.0, using the latest VertiPaq Analyzer NuGet (1.7.1).

albertospelta commented 2 months ago

I am closing this, as the new parameter has been added to the settings in both Tabular Editor 3 and DAX Studio. We have created a dedicated issue here to track the suggestion of adding a similar parameter for the number of tables. Thanks @corycundy