rstudio / reticulate

R Interface to Python
https://rstudio.github.io/reticulate
Apache License 2.0

Difference in output when using pandas in Jupyter notebook and Reticulate #347

Closed engti closed 6 years ago

engti commented 6 years ago

Cross-posted from here.

Issue: The pandas backfill function gives the correct output in a Jupyter notebook, but an incorrect result when the same code is called from R through the reticulate package.

Context: I am trying to use a backfill function to do a last observation carried backwards. I used tidyr's fill function with .direction = "up" which works, but for my dataset it was taking more than an hour.

Sample Data: Minimal sample data is located here.

Jupyter Notebook Code: I used the following Python code, which reads the file, sorts it by user_id, date and hour, and then applies the backfill function to the sale_index column within each user_id group:

import pandas
def read_backfill(file):
  filltest = pandas.read_csv(file)
  filltest = filltest.sort_values(['user_id', 'date', 'hour']).reset_index()
  filltest.sale_index = filltest.groupby(['user_id']).sale_index.bfill()
  return filltest

I called the above code with the following:

df1 = read_backfill('test1.csv')
df1.loc[df1['user_id']=="zxyu"]

This gave me the correct output in 5 minutes, which is awesome. Image of output here.
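To make the expected behaviour concrete, here is a minimal sketch of the same sort-then-grouped-backfill on made-up data. The column names match the function above, but the values are invented for illustration and are not taken from the sample file.

```python
import pandas as pd

# Invented toy data (NOT the sample file from this issue).
toy = pd.DataFrame({
    "user_id": ["a", "b", "a", "b", "a"],
    "date": ["2018-01-01"] * 5,
    "hour": [2, 1, 1, 2, 3],
    "sale_index": [None, 7.0, None, None, 5.0],
})

# Sort within user, then backfill sale_index separately for each user_id.
toy = toy.sort_values(["user_id", "date", "hour"]).reset_index(drop=True)
toy["sale_index"] = toy.groupby("user_id")["sale_index"].bfill()

print(toy)
```

With this data, user "a" gets 5.0 carried backwards into both earlier hours, while the last row of user "b" stays missing because there is no later observation in that group to fill from. A value leaking from one user into another would indicate that the grouping is not being honoured.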

R Integration: So far so good, but I am trying to integrate this into a single workflow within R, since the plotting and the markdown creation happen there. I saved the function as backfill.py and used the reticulate package to call it. Here there's an issue: it didn't give me the correct output, unlike when I called it from IPython.

## load library
library(reticulate)

## py test using the python code from above example
source_python("backfill.py")
data <- read_backfill("test1.csv") 

It gave me the following output: image

Any idea what's going on? It's the same code, line for line. For some reason it seems to ignore the grouping and sorting, and perform the back fill incorrectly according to some logic I don't understand.

Any help would be most appreciated. Thanks.

Config Info: My py_config() results are below:

python:         C:\Users\uname\AppData\Local\CONTIN~1\ANACON~1\python.exe
libpython:      C:/Users/uname/AppData/Local/CONTIN~1/ANACON~1/python36.dll
pythonhome:     C:\Users\uname\AppData\Local\CONTIN~1\ANACON~1
version:        3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
Architecture:   64bit
numpy:          C:\Users\uname\AppData\Local\CONTIN~1\ANACON~1\lib\site-packages\numpy
numpy_version:  1.14.3

python versions found: 
 C:\Users\uname\AppData\Local\CONTIN~1\ANACON~1\python.exe
 C:\Users\uname\JULIA~1\v0.6\Conda\deps\usr\python.exe
 C:\Users\uname\.julia\v0.6\Conda\deps\usr\python.exe

engti commented 6 years ago

After some experimenting, it seems to work as expected when I set the Python path manually:

## load library
Sys.setenv(RETICULATE_PYTHON = "C:/Users/uname/.julia/v0.6/Conda/deps/usr/python.exe")
library(reticulate)

I am not sure why I have multiple Python paths when I only installed Anaconda once, and this laptop was only imaged recently. I also don't know why one of them exists within the Julia path.
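Since switching interpreters changed the result, one thing worth comparing is which Python executable and which pandas version each environment actually loads. The following is only a sketch: `describe_environment` is a hypothetical helper name (not part of reticulate or pandas), and it assumes the same snippet is run once in the Jupyter notebook and once in the Python session that reticulate starts.

```python
import importlib
import sys


def describe_environment(packages=("pandas", "numpy")):
    """Return the interpreter path, Python version and package versions in use."""
    info = {"executable": sys.executable, "python": sys.version.split()[0]}
    for name in packages:
        try:
            module = importlib.import_module(name)
            # Most packages expose __version__; fall back to "unknown" otherwise.
            info[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            # Package is not installed for this interpreter.
            info[name] = None
    return info


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

If the two runs report different executables or different pandas versions, the two environments are not running the same library code, even though the script is identical line for line.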