microsoft / vscode-data-wrangler


Opening from a csv file, support Polars code execution in custom operations #342

Open · biiiipy opened 3 days ago

biiiipy commented 3 days ago

Environment data

Expected behaviour

Data Wrangler can execute Polars code in custom operations window

Actual behaviour

An exception is raised (screenshot attached; see the Logs section below for the full traceback).

I think this is because opening a plain CSV file doesn't provide any runtime context, so Data Wrangler defaults to pandas and doesn't support other libraries.

Steps to reproduce:

  1. Open a CSV file
  2. Click "Open in Data Wrangler"
  3. In the custom operations window, write Polars code such as `df = df.sort("value", reverse=True)` and execute it (see the pandas equivalent below)
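
For reference, Data Wrangler executes custom code against a pandas DataFrame, so the pandas equivalent of the call above does work; the sample data below is only illustrative.

```python
import pandas as pd

# Illustrative data only; in Data Wrangler, `df` is the DataFrame loaded from the CSV.
df = pd.DataFrame({"value": [3, 1, 2]})

# pandas has no DataFrame.sort; the pandas equivalent of the Polars call is:
df = df.sort_values("value", ascending=False)
```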

Logs

Output for Jupyter in the Output panel (View → Output, then change the drop-down in the upper-right of the Output panel to Jupyter)

```
AttributeError: 'DataFrame' object has no attribute 'sort'
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_40096\1255212943.py in ?(code, old_ns, new_ns)
     38         name = get_ipython().compile.cache(code)
     39     except Exception:
     40         name = ""
     41
---> 42     exec(compile(code, name, 'exec'), session['namespaces']["create"](new_ns, old_ns))

~\AppData\Local\Temp\ipykernel_40096\2332454470.py in ?()
      1 # Sort by value in descending order
----> 2 df = df.sort("value", reverse=True)

~\AppData\Roaming\Python\Python311\site-packages\pandas\core\generic.py in ?(self, name)
   6295         and name not in self._accessors
   6296         and self._info_axis._can_hold_identifiers_and_holds_name(name)
   6297     ):
   6298         return self[name]
-> 6299     return object.__getattribute__(self, name)

AttributeError: 'DataFrame' object has no attribute 'sort'
```

pwang347 commented 4 hours ago

Hi @biiiipy, thanks for opening this issue! We don't currently support using Polars code to manipulate the DataFrame (we only support loading from Polars DataFrames by converting them to pandas).
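
As a rough sketch, the supported path today looks like the following (variable names are just placeholders):

```python
import polars as pl

pl_df = pl.DataFrame({"value": [3, 1, 2]})  # placeholder Polars DataFrame

# Data Wrangler operates on a pandas copy; the same conversion can be done explicitly
# (requires pandas, and typically pyarrow, to be installed):
pdf = pl_df.to_pandas()
```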

For your use case, do you mostly care that the exported code is in Polars? (e.g. you interact with the DataFrame during the interactive Data Wrangler session using the built-in operations UI and pandas, and we translate the code on export so it can be used in a data pipeline written with Polars.)

Or is it more important to be able to work directly in Polars (for example, you have really large files locally that you can't effectively sample)?
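
(For illustration only, a sketch of what working directly in Polars on a large file could look like — the file path below is hypothetical:)

```python
import polars as pl

# Lazily scan the CSV instead of loading it all into memory,
# then materialize only a manageable sample of rows.
lazy = pl.scan_csv("large_file.csv")   # hypothetical path
sample = lazy.head(10_000).collect()
```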

Thanks!