pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

DOC: Flesh out the R comparison section of docs #3980

hayd closed 10 years ago

hayd commented 11 years ago

I guess quite a lot of people come from an R background, and a conversion table for pandas vs R functions/idioms etc. might be a good addition to http://pandas.pydata.org/pandas-docs/dev/comparison_with_r.html

Perhaps this site could offer some functions to consider including: http://www.statmethods.net/management/variables.html

jreback commented 11 years ago

I guess R is famous for obfuscation (of syntax)?

hayd commented 11 years ago

I suspect it'll be a many-to-one table :)

cpcloud commented 11 years ago

ha! u guys are funny. what the heck is attach? is it like attach(x) == globals()['x'] = x?

jreback commented 11 years ago

isn't R intuitive?

cpcloud commented 11 years ago

where the heck are cyl and vs coming from? This

attach(mtcars)
aggdata <- aggregate(mtcars, by=list(cyl,vs), FUN=mean, na.rm=TRUE)
detach(mtcars)

works only if you do the attach(mtcars)? wtf are the scoping rules in R? no such thing exists in Python without a lot of magic...
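For comparison, a minimal pandas sketch of the same aggregation, where the grouping columns are named explicitly as strings rather than pulled from an attached namespace. The `cars` frame and its values are made up for illustration (it is not the real mtcars data). `groupby(...).mean()` skips missing values by default, which roughly corresponds to `na.rm=TRUE`.

```python
import pandas as pd

# Illustrative stand-in for a few mtcars-like rows (values are invented).
cars = pd.DataFrame({
    "mpg": [21.0, 22.0, 18.0, 24.0, 16.0, 20.0],
    "cyl": [6, 4, 8, 4, 8, 6],
    "vs":  [0, 1, 0, 1, 0, 1],
})

# Group by the two columns and take the mean of the remaining ones.
# as_index=False keeps cyl and vs as ordinary columns in the result.
aggdata = cars.groupby(["cyl", "vs"], as_index=False).mean()
print(aggdata)
```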

jtratner commented 11 years ago

Attach basically is like saying 'make all of the columns of the data frame global variables'

jtratner commented 11 years ago

It has a companion function, detach. I think there's also a with-like statement that scopes just to the function call. Have you seen the model syntax yet (a ~ b)? I totally get that it's useful, but it's a little unsettling when you are used to being able to explicitly trace all names in the document.
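The closest pandas analogue to that call-scoped lookup is probably `DataFrame.eval` and `DataFrame.query`, where bare names in an expression string resolve to columns of the frame for that one call only. A small sketch, with a made-up frame `df` and columns `a` and `b` chosen just for illustration:

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

# Inside the expression string, bare names refer to columns of df,
# and only for the duration of this one call.
total = df.eval("a + b")

# query() applies the same name resolution to row filtering.
big = df.query("b > 15")

print(total)
print(big)
```

Nothing is added to the surrounding namespace, so `a` and `b` stay undefined outside the expression strings.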

cpcloud commented 11 years ago

patsy + statsmodels + pandas >>>>> R

cpcloud commented 11 years ago

magic regarding scope and namespaces :-1:

cpcloud commented 11 years ago

anyway comparisons are useful to show people how awesome pandas is :)

hayd commented 11 years ago

related http://stackoverflow.com/questions/17621325/equivalent-pandas-function-to-this-r-aggregation

Anyone fancy spamming the pandas/R/.. mailing lists to see if anyone is interested in doing this?

hayd commented 11 years ago

http://stackoverflow.com/questions/18005305/implementing-r-scale-function-in-pandas-in-python

TomAugspurger commented 11 years ago

http://stackoverflow.com/questions/19237878/subsetting-a-python-dataframe

hayd commented 10 years ago

https://groups.google.com/forum/#!topic/pydata/1eNURQsflNw

A while back I started making some notes on how to do the various recipes in O'Reilly's R Cookbook (http://shop.oreilly.com/product/9780596809164.do) with Numpy, Pandas, Scipy.

I haven't had time to complete it so I'm sharing it in its current state, and trying to get some community help to fill in the gaps.

I think this could be an extremely useful resource to encourage and help transition lots of people from R to Pandas.

So here are the notes:

http://notes.lexual.com/tech/r_numpy_pandas_cookbook.html

And here's the github repo, patches more than welcome!

https://github.com/lexual/sphinx-notes/blob/master/source/tech/r_numpy_pandas_cookbook.rst

Cheers,

Lex.

These look useful. It's a shame some sections are still XXX-titled; it would be nice to have a todo list of the areas to flesh out.

8bit-pixies commented 10 years ago

I know that this section is more for pandas vs R, but I'm wondering whether it would be worthwhile to include some R functions that aren't really related to pandas, for example aaply, alply, or maybe dlply?

jreback commented 10 years ago

@chappers

want to add this: http://stackoverflow.com/questions/20905713/equivalent-of-rs-tapply-in-python-pandas

8bit-pixies commented 10 years ago

hmm, would you want this to go under the reshape/cast section, or in the with section, since it could be done in R using dcast as well:

mydf <- data.frame(
  Animal = c('Animal1', 'Animal2', 'Animal3', 'Animal2', 'Animal1', 'Animal2', 'Animal3'),
  FeedType = c('A', 'B', 'A', 'A', 'B', 'B', 'A'),
  Amount = c(10, 7, 4, 2, 5, 6, 2)
)

# Stackoverflow example
with(mydf, tapply(Amount, list(Animal, FeedType), sum))

# Using reshape
require(reshape2)
dcast(mydf, Animal ~ FeedType, sum, fill=NaN)

In either case the solution would be what's on Stack Overflow (and very similar to the solution in the reshape/cast section of the current docs).
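As a sketch of what that could look like in pandas, here is the same data passed through `pivot_table`. This is one possible translation, not necessarily the exact code from the linked answer or the docs.

```python
import pandas as pd

mydf = pd.DataFrame({
    "Animal": ["Animal1", "Animal2", "Animal3", "Animal2",
               "Animal1", "Animal2", "Animal3"],
    "FeedType": ["A", "B", "A", "A", "B", "B", "A"],
    "Amount": [10, 7, 4, 2, 5, 6, 2],
})

# Rows are animals, columns are feed types, cells are summed amounts.
# Combinations that never occur (Animal3 with feed B) come out as NaN.
table = mydf.pivot_table(values="Amount", index="Animal",
                         columns="FeedType", aggfunc="sum")
print(table)
```

Passing `fill_value=0` to `pivot_table` would replace the missing cell with zero instead of NaN.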

jreback commented 10 years ago

you can put it under the more common / useful one and put a link / statement in the other (as they are on the same page)

read it as if you are an R user doing the most common operation (eg what is normally recommended to R Users) and you want to convert to pandas

jreback commented 10 years ago

there are of course similar cases in pandas where multiple solutions exist (e.g. imagine a vectorized function vs using apply)

one solution may be faster or simpler, or they may both be appropriate
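A tiny illustration of that point, using a made-up frame with columns `a` and `b`: both forms below compute the same row sums, and which one to document is a judgment call.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Vectorized: operates on whole columns at once.
vectorized = df["a"] + df["b"]

# Row-wise apply: calls the Python function once per row.
row_wise = df.apply(lambda row: row["a"] + row["b"], axis=1)

print(vectorized.tolist())
print(row_wise.tolist())
```

The vectorized form is usually the faster of the two on large frames, while `apply` can be easier to write when the per-row logic is not a simple column expression.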

jreback commented 10 years ago

think this is closable after the multiple PRs by @chappers