has2k1 / scikit-misc

Miscellaneous tools for data analysis and scientific computing
https://has2k1.github.io/scikit-misc/stable
BSD 3-Clause "New" or "Revised" License
37 stars 9 forks source link

Discrepancy between R and Python `loess` smoothing results #38

Open ToryDeng opened 3 months ago

ToryDeng commented 3 months ago

Thank you for your excellent package! It works great for smoothing lines.

I recently wanted to use loess to smooth a surface with two variables. I have some sample data here: sample_data.csv. The two independent variables are coord1 and coord2, and the response variable is expression.

In R, my code using the loess() function from the stats package produces reasonable results:

library(tidyverse)
library(patchwork)

# read the CSV file
data <- read_csv("./sample_data.csv")

# loess smoothing
model <- loess(expression ~ coord1 + coord2, data = data, span = 0.3)
data["fitted"] <- model$fitted

# visualize the real expression values
p1 <- ggplot(data, aes(x = coord1, y = coord2, color = expression)) +
  geom_point(size = 1) +  # change the point size for better visualization
  scale_colour_distiller(palette = "YlOrRd", direction = 1) +
  ggtitle("Real expression") +
  theme_minimal()

# visualize the fitted expression values
p2 <- ggplot(data, aes(x = coord1, y = coord2, color = fitted)) +
  geom_point(size = 1) +  # change the point size for better visualization
  scale_colour_distiller(palette = "YlOrRd", direction = 1) +
  ggtitle("Fitted values") +
  theme_minimal()

# combine the two plots
p <- p1 / p2 +
  plot_layout(widths = unit(10, 'cm'), heights = unit(c(10, 10), c('cm', 'cm')))
ggsave("plot.png", p)

The resulting plot in R looks like this:

plot

However, when I attempt to perform loess smoothing in Python, I get quite different results. Here is my Python code:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from skmisc.loess import loess

# loess smoothing
data = pd.read_csv("sample_data.csv")
model = loess(data.iloc[:, :2], data.iloc[:, 2], span=0.3, degree=2)
model.fit()
data["fitted"] = model.outputs.fitted_values

# visualize the real and fitted expression values
sns.set_theme(style="whitegrid")
fig, axs = plt.subplots(2, 1, figsize=(8, 12))
sns.scatterplot(data, x='coord1', y='coord2', hue='expression', palette='YlOrRd', ax=axs[0])
sns.move_legend(axs[0], "center left", bbox_to_anchor=(1, 0.5))
sns.scatterplot(data, x='coord1', y='coord2', hue='fitted', palette='YlOrRd', ax=axs[1])
sns.move_legend(axs[1], "center left", bbox_to_anchor=(1, 0.5))

The resulting plot in Python looks like this:

image

Some information that might be useful:

import skmisc

skmisc.show_config()
print(skmisc.__version__)
Build Dependencies:
  blas:
    detection method: pkgconfig
    found: true
    include directory: /usr/local/include
    lib directory: /usr/local/lib
    name: openblas
    openblas configuration: USE_64BITINT= DYNAMIC_ARCH=1 DYNAMIC_OLDER= NO_CBLAS=
      NO_LAPACK= NO_LAPACKE= NO_AFFINITY=1 USE_OPENMP= HASWELL MAX_THREADS=2
    pc file directory: /usr/local/lib/pkgconfig
    version: 0.3.21.dev
  lapack:
    detection method: pkgconfig
    found: true
    include directory: /usr/local/include
    lib directory: /usr/local/lib
    name: openblas
    openblas configuration: USE_64BITINT= DYNAMIC_ARCH=1 DYNAMIC_OLDER= NO_CBLAS=
      NO_LAPACK= NO_LAPACKE= NO_AFFINITY=1 USE_OPENMP= HASWELL MAX_THREADS=2
    pc file directory: /usr/local/lib/pkgconfig
    version: 0.3.21.dev
Compilers:
  c:
    commands: cc
    linker: ld.bfd
    name: gcc
    version: 10.2.1
  cython:
    commands: cython
    linker: cython
    name: cython
    version: 3.0.6
  fortran:
    commands: gfortran
    linker: ld.bfd
    name: gcc
    version: 10.2.1
Machine Information:
  build:
    cpu: x86_64
    endian: little
    family: x86_64
    system: linux
  cross-compiled: true
  host:
    cpu: x86_64
    endian: little
    family: x86_64
    system: linux
Python Information:
  path: /opt/python/cp311-cp311/bin/python
  version: '3.11'

0.3.1

The results are quite different from those in R. Although I've spent a lot of time on this, I still can't pinpoint the issue or determine if there's a bug in the package.

Could you please help me figure out what might be going wrong? Any insights or suggestions would be appreciated!

has2k1 commented 3 months ago

This looks wrong and I have reproduced it. I will look into it.

ToryDeng commented 2 months ago

Any updates?

has2k1 commented 2 months ago

For 1D loess, R and python agree. For 2D (and probably more), they disagree. So far, I have been unable to find out why.

The original loess implementation is dloess from netlib.org and both implementations are derived from that and the core code is mostly the same.

Now, that original code includes tests for 2D loess together with expected results. We use the same tests and get the exact (presumably correct) results. R has different results for the same 2D fit, yet as this issue shows, R's 2D fit looks like it is the correct one, if any one of them is!

I suspected something with row-major vs column-major order, but it didn't seem to be. I'll have to look a lot more closely.

ToryDeng commented 2 months ago

Thanks a lot for your efforts! It seems like this issue is pretty tricky to solve. I'll switch to the R implementation for now but will keep an eye out for any updates on this issue.

For 1D loess, R and python agree. For 2D (and probably more), they disagree. So far, I have been unable to find out why.

The original loess implementation is dloess from netlib.org and both implementations are derived from that and the core code is mostly the same.

Now, that original code includes tests for 2D loess together with expected results. We use the same tests and get the exact (presumably correct) results. R has different results for the same 2D fit, yet as this issue shows, R's 2D fit looks like it is the correct one, if any one of them is!

I suspected something with row-major vs column-major order, but it didn't seem to be. I'll have to look a lot more closely.