rasbt / mlxtend

A library of extension and helper modules for Python's data analysis and machine learning libraries.
https://rasbt.github.io/mlxtend/
Other
4.86k stars 857 forks source link

Draft: New feature: Paired permutation test #768

Closed fjhheras closed 3 years ago

fjhheras commented 3 years ago

Code of Conduct

Description

I would like to add the option to perform permutation tests when the two populations are paired. http://axon.cs.byu.edu/Dan/478/assignments/permutation_test.php

The current pull request is just a first proposal. If you think this feature is useful enough to be implemented in mlxtend, I can improve documentation/style and introduce non-trivial tests (currently it is not well tested, so it can still be buggy).

Related issues or pull requests

Fixes #767

Pull Request Checklist

pep8speaks commented 3 years ago

Hello @fjhheras! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! :beers:

Comment last updated at 2021-03-03 01:33:55 UTC
coveralls commented 3 years ago

Coverage Status

Coverage decreased (-0.04%) to 90.734% when pulling f51e028fd365082a4058b5e84a119c77dfc1298e on fjhheras:paired_permutation into 894dce8f12d8d766c852150553b23f9f3d79d5af on rasbt:master.

rasbt commented 3 years ago

Sorry for getting back to it so late. I had a weirdly configured email client and haven't been getting Github notifications last month.

Anyways, thanks for the PR, I really appreciate it! I think there were some issues with the methodology though. Coincidentally, I was teaching a stats course last semester (unfortunately, as per syllabus, I had to use R :P) where I implemented a paired permutation (randomization) test. As far as I know, the standard variant is just flipping the pairs. Below is my R reference implementation that matches the textbook results for a small toy dataset:

pairedSamples<-function(diff,nSim=1000000){
  d <- diff

  ld <- length(d)
  obsmeand <- mean(d)
  meandvec <- rep(NA,nSim)
  for (i in 1:nSim){
    rsigns <- 2 * rbinom(ld, 1, .5) - 1
    newd <- rsigns * d
    newmeand <- mean(newd)
    meandvec[i] <- newmeand
  }
  pl <- sum(meandvec >= obsmeand)
  pg <- sum(meandvec <= obsmeand)
  pval <-  ifelse(pl < pg, 2 * pl/nSim, 2 * pg/nSim)
  options(scipen=1)
  if (pval==0){
    pval <- paste("p <",1/nSim)
  } else {
    pval <- paste("p =",pval) 
  }
  options(scipen=0)

  hhh <- hist(meandvec, plot=FALSE)
  hist(meandvec, xlab="Mean of differences", 
       main=paste("Paired Sample Randomization Test,", pval),
       probability=TRUE, xlim=range(c(hhh$breaks, obsmeand)),
       breaks=100, col="tan")
  abline(v=obsmeand, col="red")
  mtext(paste(round(obsmeand, digits=4)), side=1, at=obsmeand)
  }

nineteen80 = c(3.67, 1.72, 3.46, 2.60, 2.03, 2.10, 3.01)
nineteen90 = c(2.11, 1.79, 2.71, 1.89, 1.69, 1.71, 2.01)

Unknown

For some reason, using the PR's code I couldn't reproduce these results. I think the problem were the arrays of different lengths. Tried to debug that but couldn't really figure out to fix it so I rewrote it using a more "naive" implementation (probably too many for-loops). However, it seems to work and reproduce the expected results now (added some unit tests).

Please let me know your feedback!

fjhheras commented 3 years ago

Well, that was embarrassing. My local version works fine, but somehow I forgot to add a np.where when doing the PR. My original implementation in the PR seems to output the right values after this change

indices_1 = np.where(rng.randn(n) > 0)[0]

instead of

indices_1 = rng.randn(n) > 0

In any case, the amended implementation you propose is much easier to follow. I checked times, and it is also a bit faster!

rasbt commented 3 years ago

No worries, and thanks for double-checking! Wow, I am positively surprised that this is actually not that slow. So, maybe we should keep the current version then ;). I add the docs.

Sandy4321 commented 3 years ago

can you share full python code example pls how to use this new option

rasbt commented 3 years ago

I think this is example 3 here: http://rasbt.github.io/mlxtend/user_guide/evaluate/permutation_test/#example-3-paired-two-sample-randomization-test