microsoft / jupyter-Kqlmagic

Extension (magic) for Jupyter Notebook and JupyterLab that enables a notebook experience for working with Kusto, Application Insights, and Log Analytics data.

to_dataframe() is extremely slow. #104

Open vtsyganok-microsoft opened 1 year ago

vtsyganok-microsoft commented 1 year ago

I use Kqlmagic to train ML models, for which I need the data preferably as a pandas DataFrame. My dataset has ~15 columns and ~2 million rows, and converting the already-retrieved data to pandas takes about 2x longer than retrieving it. Approximate timings: query execution to completion (including data fetch) ~2 min; result.to_dataframe() ~5 min.

(I am also working to grow the dataset to ~45-60 columns, and for better precision I will most likely need to increase the row count to 3-5 million.) This makes to_dataframe() prohibitively expensive.

However, if instead of the built-in to_dataframe() I use pd.DataFrame.from_dict(results.to_dict()), it is significantly faster: on the 15-column, 2-million-row dataset it completes in about 40 s (roughly 7-8x faster, and faster than the data retrieval itself).
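A minimal sketch of the workaround described above, assuming (as the report implies) that results.to_dict() returns a column-oriented dict of the form {column_name: [values...]}; the sample data below is purely illustrative, not from the actual dataset:

```python
import pandas as pd

# Hypothetical stand-in for the column-oriented dict that a Kqlmagic
# result's to_dict() is assumed to produce.
columns = {
    "timestamp": ["2023-01-01", "2023-01-02", "2023-01-03"],
    "value": [1.0, 2.5, 3.7],
}

# from_dict hands the whole column dict to pandas in one call, letting it
# build each column as a single array instead of iterating row by row,
# which is the likely source of the speedup the report observes.
df = pd.DataFrame.from_dict(columns)
print(df.shape)  # (3, 2)
```

In a notebook, `columns` would be replaced by the real call, e.g. `pd.DataFrame.from_dict(results.to_dict())` as in the report.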

I see similar gains on larger datasets, though this is still a work in progress; I spent a few days figuring out how to speed up the cost-prohibitive conversion to a pandas DataFrame.

It would be helpful to optimize the built-in to_dataframe() to be at least on par with the pd.DataFrame.from_dict(results.to_dict()) approach.

Feel free to reach out to me internally on MS channels (vtsyganok), I can share notebooks and code/examples internally.

Thank you, Vadim

mbnshtck commented 5 months ago

Thank you for your comment. Will check it.