microsoft / jupyter-Kqlmagic

Extension (magic) for Jupyter Notebook and JupyterLab that enables a notebook experience for working with Kusto, Application Insights, and Log Analytics data.

to_dataframe() is extremely slow. #104

Open vtsyganok-microsoft opened 1 year ago

vtsyganok-microsoft commented 1 year ago

I use Kqlmagic to train ML models, for which I need the data preferably as a pandas DataFrame. My dataset has ~15 columns and ~2 million rows, and converting the already-retrieved data to pandas takes about 2x longer than retrieving it. Approximate timings: query execution to completion (including data fetch) ~2 min; result.to_dataframe() ~5 min.

(I am also working to grow the dataset to ~45-60 columns, and for better precision I will most likely need to increase the row count to 3-5 million.) This makes to_dataframe() prohibitively expensive.

However, if instead of the built-in to_dataframe() I use pd.DataFrame.from_dict(results.to_dict()), it is significantly faster: on the 15-column, 2-million-row dataset it completes in about 40 s (roughly 7-8x faster, and faster than the data retrieval itself).
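A minimal sketch of the workaround described above, assuming (as the report implies) that results.to_dict() returns a column-oriented dict of the form {column_name: [values...]}; the sample data below is purely illustrative, not from the actual dataset:

```python
import pandas as pd

# Hypothetical stand-in for the column-oriented dict that a Kqlmagic
# result's to_dict() is assumed to produce.
columns = {
    "timestamp": ["2023-01-01", "2023-01-02", "2023-01-03"],
    "value": [1.0, 2.5, 3.7],
}

# from_dict hands the whole column dict to pandas in one call, letting it
# build each column as a single array instead of iterating row by row,
# which is the likely source of the speedup the report observes.
df = pd.DataFrame.from_dict(columns)
print(df.shape)  # (3, 2)
```

In a notebook, `columns` would be replaced by the real call, e.g. `pd.DataFrame.from_dict(results.to_dict())` as in the report.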

I see similar gains on larger datasets, though this is still a work in progress; I spent a few days figuring out how to speed up the cost-prohibitive conversion to a pandas DataFrame.

It would be helpful to optimize the built-in to_dataframe() to be at least on par with the pd.DataFrame.from_dict(results.to_dict()) approach.

Feel free to reach out to me internally on MS channels (vtsyganok), I can share notebooks and code/examples internally.

Thank you, Vadim

mbnshtck commented 5 months ago

Thank you for your comment. Will check it.