microsoft / datamations

https://microsoft.github.io/datamations/

Port the salary example to Python #104

Closed jhofman closed 2 years ago

jhofman commented 2 years ago

Let's see if we can get the json for the key frames of this datamation when it's written in Python/Pandas:

"small_salary %>% 
  group_by(Degree) %>%
  summarize(mean = mean(Salary))" %>%
  datamation_sanddance()

In Python the analysis code should be something more like this:

small_salary.groupby("Degree")["Salary"].mean()

where small_salary is a Pandas dataframe.
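As a rough sketch of a closer match to the R pipeline, pandas named aggregation can also reproduce the output column name (`mean`). The data below is a made-up stand-in for small_salary, not the real CSV; only the column names Degree, Work and Salary come from this thread.

```python
import pandas as pd

# Hypothetical stand-in for small_salary; the values are invented for illustration.
small_salary = pd.DataFrame(
    {
        "Degree": ["Masters", "Masters", "PhD", "PhD"],
        "Work": ["Academia", "Industry", "Academia", "Industry"],
        "Salary": [80.0, 90.0, 85.0, 95.0],
    }
)

# Named aggregation: output column "mean" is the mean of "Salary" per Degree,
# mirroring summarize(mean = mean(Salary)) in the R code.
result = small_salary.groupby("Degree").agg(mean=("Salary", "mean"))
print(result)
```

The result is indexed by Degree; calling `.reset_index()` would turn Degree back into a regular column if the downstream code expects that.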

Challenges here:

  1. What's the right line of Pandas code to mimic the R code above (the eval() function)?
  2. How do we programmatically parse Python / Pandas code (from a string, for instance)?
  3. We need the intermediate data frames at each step.
  4. We then need to generate datamation-compatible JSON blobs of Vega-Lite specs.
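One possible way to approach challenges 2 and 3 together is to parse the pipeline string with the standard-library `ast` module, split the method chain into cumulative prefixes, and evaluate each prefix to get the intermediate object. This is only a sketch: the helper names `pipeline_steps` and `intermediate_results` are hypothetical, and it assumes the pipeline is a single chained expression (Python 3.9+ for `ast.unparse`).

```python
import ast


def pipeline_steps(code):
    """Split a chained expression into cumulative prefixes, innermost first."""
    node = ast.parse(code, mode="eval").body
    prefixes = []
    # Walk from the outermost call inward, one attribute/call/subscript at a time.
    while isinstance(node, (ast.Call, ast.Attribute, ast.Subscript)):
        if isinstance(node, (ast.Call, ast.Subscript)):
            # Only calls and subscripts yield a value worth inspecting.
            prefixes.append(ast.unparse(node))
        node = node.func if isinstance(node, ast.Call) else node.value
    prefixes.append(ast.unparse(node))  # the bare starting name, e.g. the dataframe
    return prefixes[::-1]


def intermediate_results(code, env):
    """Evaluate every prefix of the chain using the names supplied in env."""
    # eval() runs arbitrary code, so this should only see trusted input.
    return [(step, eval(step, {}, dict(env))) for step in pipeline_steps(code)]
```

Calling `intermediate_results('small_salary.groupby("Degree").mean()', {"small_salary": small_salary})` would yield the original frame, the GroupBy object, and the summarized frame in that order. Note that each prefix is re-evaluated from scratch, which is simple but wasteful for long pipelines.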

CSV of the data is here

cc @chisingh until your account is added to the repo.

jhofman commented 2 years ago

@chisingh has a first cut working where you pass a string to be eval-ed and the dataframe to operate on. it would be great if we could somehow avoid having to pass the dataframe as well, but it's not a big deal if we can't get around this.

but maybe there's a way to solve both this problem and the next step (parsing the pipeline) at once: could we somehow inherit and extend the DataFrame class to get full access to what's going on with a pandas pipeline and then just tack on a .datamate() function to the class?

if we could it might solve the problem of having to parse the pandas pipeline manually because we'd have access to internal state and could see what happens at each step.

this would help with tricky cases like the following: df.groupby("Work").mean() implicitly takes the mean of the salary column, even though it's not written out. pandas clearly knows this and it would be nice if we didn't have to discover and duplicate these things, but could instead just catch each operation from pandas.
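A minimal sketch of what the subclassing idea could look like, assuming we only care about `groupby(...).mean()`. The class names `RecordingFrame` and `RecordingGroupBy` and the `steps` attribute are hypothetical; `_constructor` and `_metadata` are the hooks pandas documents for subclassing. The GroupBy is wrapped rather than subclassed to keep the example short.

```python
import pandas as pd


class RecordingGroupBy:
    """Thin wrapper around a pandas GroupBy that logs what mean() touched."""

    def __init__(self, grouped, by, steps):
        self._grouped = grouped
        self._by = by
        self._steps = steps

    def mean(self):
        # numeric_only=True makes pandas pick the numeric columns itself.
        result = self._grouped.mean(numeric_only=True)
        out = RecordingFrame(result)
        # result.columns tells us which columns were averaged implicitly.
        out.steps = self._steps + (
            ("groupby", self._by),
            ("mean", tuple(result.columns)),
        )
        return out


class RecordingFrame(pd.DataFrame):
    """Hypothetical DataFrame subclass that records each pipeline step."""

    _metadata = ["steps"]  # ask pandas to carry this attribute along
    steps = ()  # default: nothing recorded yet

    @property
    def _constructor(self):
        return RecordingFrame

    def groupby(self, by, **kwargs):
        grouped = super().groupby(by, **kwargs)
        return RecordingGroupBy(grouped, by, self.steps)
```

With a frame holding Degree, Work and Salary columns, `RecordingFrame(df).groupby("Work").mean().steps` would report that Salary was the column averaged, without any parsing of the source string. A `.datamate()` method could then read `steps` to build the key frames.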

it's possible that we can learn something from @dodger487's dplython package or the similar pandas-ply package in terms of pandas hacking.

jhofman commented 2 years ago

helpful tips from @dodger487, who recommended we check out the following libraries:

https://docs.dask.org/en/stable/10-minutes-to-dask.html
https://www.sympy.org/en/index.html
https://github.com/machow/siuba

jhofman commented 2 years ago

@giorgi-ghviniashvili: @chisingh has python code working that generates an array of specs and is now looking into rendering animations within a jupyter notebook.

can you two discuss how to integrate the javascript code into jupyter? it looks like it might be as simple as calling the App() with the right jupyter widget/plugin, similar to R's htmlwidget?

giorgi-ghviniashvili commented 2 years ago

There are a bunch of questions on Stack Overflow about embedding JS code in a Jupyter notebook.

https://stackoverflow.com/questions/48248987/inject-execute-js-code-to-ipython-notebook-and-forbid-its-further-execution-on-p

Using display from IPython.display:

from IPython.display import display, HTML, Javascript

There are also cell magic commands: %%javascript or %%js.

@chisingh can you try to import all the dependencies in the notebook and then call App?
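Something along these lines might work as a starting point, assuming the bundled JS and its dependencies are already loaded in the notebook page. The `App(id, {specs})` call shape, the element id, and the helper names are assumptions for illustration, not the confirmed signature.

```python
import json


def build_embed_html(specs, element_id="datamation-app"):
    """Return an HTML snippet that hands the specs to a JS entry point."""
    # Escape "</" so a string inside the specs cannot close the script tag early.
    specs_json = json.dumps(specs).replace("</", "<\\/")
    return (
        f'<div id="{element_id}"></div>\n'
        "<script>\n"
        f"  const specs = {specs_json};\n"
        "  // Hypothetical call; the real App() signature may differ.\n"
        f'  App("{element_id}", {{specs: specs}});\n'
        "</script>"
    )


def show_in_notebook(specs):
    """Render the snippet in the current notebook cell (requires IPython)."""
    from IPython.display import HTML, display  # imported lazily on purpose

    display(HTML(build_embed_html(specs)))
```

Keeping the string building separate from the `display` call makes it possible to inspect or test the generated HTML outside a notebook.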
