pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Bug in pd.Series.mean() #6915

Closed zoof closed 10 years ago

zoof commented 10 years ago

In some cases, the mean is computed incorrectly. Numpy however does the correct calculation. There is no problem with the standard deviation calculation. The following is an example.

In [11]: np.array(stateemp.area.tolist()).mean()
Out[11]: 23785.447812211703

In [12]: stateemp.area.mean()
Out[12]: 58.927762478114879

In [13]: np.array(stateemp.area.tolist()).std()
Out[13]: 22883.862745218048

In [14]: stateemp.area.std()
Out[14]: 22883.864924811925

In [15]: pd.__version__
Out[15]: '0.13.1'

In [16]: np.__version__
Out[16]: '1.8.1'
jorisvandenbossche commented 10 years ago

Are you able to narrow it down to an example dataframe where you have the issue you can post here?

zoof commented 10 years ago

I have the dataframe that I found the problem with -- it is quite large but I can delete all but the two variables where the calculation goes awry and post the dataset somewhere.

jorisvandenbossche commented 10 years ago

If you can, please do, that would be interesting.

Do you have NaNs? And what is the result of stateemp.area.values.mean()?

zoof commented 10 years ago

After some playing around, one hypothesis is that the bug has something to do with int32 vs int64 dtypes. So initially, I exported it to csv and tried it on another computer and I got the right answer. I then saved it as an hdf5 file and I got the wrong answer. Looking at the dtypes:

In [30]: stateemp.area.dtype
Out[30]: dtype('int32')

In [31]: stateemp.area.mean()
Out[31]: 58.927762478114879

and

In [34]: stateemp.area.dtype
Out[34]: dtype('int64')

In [35]: stateemp.area.mean()
Out[35]: 23785.447812211703

The two files are at: https://copy.com/cOl2YgRicfco https://copy.com/fddCxmq1Wpmp

In answer to your questions:

In [21]: stateemp.area.values.mean()
Out[21]: 23785.447812211703

In [22]: isnan(stateemp.area).sum()
Out[22]: 0
jtratner commented 10 years ago

Could you post output of describe() both pre- and post-load?

zoof commented 10 years ago

I'm not sure what you mean by pre-load.

jreback commented 10 years ago

what @jtratner means is: show EXACTLY what you are doing with those files, every command in an ipython session, so it can simply be copy-pasted and reproduced

zoof commented 10 years ago

You mean something like:

Python 2.7.6 (default, Feb 26 2014, 12:01:28) 
Type "copyright", "credits" or "license" for more information.

IPython 2.0.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
%guiref   -> A brief reference about the graphical user interface.

In [1]: import pandas as pd

In [2]: df=pd.read_hdf('pandas_mean_error.h5','stateemp')

In [3]: df.area.describe()
Out[3]: 
count    5249571.000000
mean          58.927762
std        22883.864925
min            0.000000
25%            0.000000
50%        21500.000000
75%        38900.000000
max        97961.000000
Name: area, dtype: float64

In [4]: df=pd.read_csv('pandas_mean_error.csv.gz',compression='gzip')

In [5]: df.area.describe()
Out[5]: 
count    5249571.000000
mean       23785.447812
std        22883.864925
min            0.000000
25%            0.000000
50%        21500.000000
75%        38900.000000
max        97961.000000
Name: area, dtype: float64

In [6]: 
zoof commented 10 years ago

That said, in addition to int32 vs int64, it seems to be architecture-specific -- I've also done the same on a 64-bit machine, where the mean gets computed correctly. I constructed the datasets on a 32-bit machine. I'm running one of the flavors of Manjaro linux on all my machines.

jreback commented 10 years ago

ok; that shows the problem (but the reason we ask for ALL the info is that you didn't mention it was in an hdf file).

can you show what you are writing? (e.g. you are reading the file, but show the code to generate the frame that you are then writing).

it should be completely self-contained and copy-pastable.

zoof commented 10 years ago

I did in my comment from 4 days ago. As I also mentioned in that comment, it appeared to be a difference between int32 and int64 so it's not obvious to me why hdf5 matters -- I observed the problem before saving anything out to hdf5.

jreback commented 10 years ago

@zoof

ok then show how you created them

you are showing a symptom, but not how you created the data. So it is still impossible to even guess where the problem lies.

zoof commented 10 years ago
import pandas as pd
from numpy import nan,isnan

# set directories
rawdat = '/home/tct/rothstein/uiflows/rawdata'
scratch = '/home/tct/rothstein/uiflows/scratch'

stateemp = pd.read_csv(rawdat+'/miscbls/sm.data.1.AllData.gz',header=0,names=['seriesid','year','period','value','footnote'],converters={'footnote':str},sep='\t',compression='gzip')
stateemp['sa']=stateemp.seriesid.str[2]=='S'
stateemp['st_fips']=stateemp.seriesid.str[3:5].astype(int)
stateemp['area']=stateemp.seriesid.str[5:10].astype(int)
stateemp['supersec']=stateemp.seriesid.str[10:12].astype(int)
stateemp['ind']=stateemp.seriesid.str[10:18].astype(int)
stateemp['type']=stateemp.seriesid.str[18:20].astype(int)
stateemp['area']=stateemp.seriesid.str[5:10].astype(int)
stateemp['month']=stateemp.period.str.replace('M','').astype(float)
stateemp.loc[stateemp.month<=12,'periodicity']='M'
stateemp.loc[stateemp.month==13,'periodicity']='A'
stateemp.loc[stateemp.periodicity=='A','month']=nan
del(stateemp['period'])
stateemp['footnote_txt']=''
stateemp.loc[stateemp.footnote=='1','footnote_txt']='series break'
stateemp.loc[stateemp.footnote=='P','footnote_txt']='prelim'
stateemp['footnote']=~ (stateemp.footnote_txt=='')

areas=pd.read_csv(rawdat+'/miscbls/sm.area',header=0,names=['area','area_name'],usecols=[0,1],sep='\t')
industry=pd.read_csv(rawdat+'/miscbls/sm.industry',header=0,names=['ind','ind_name'],usecols=[0,1],sep='\t')
state=pd.read_csv(rawdat+'/miscbls/sm.state',header=0,names=['st_fips','st_name'],usecols=[0,1],sep='\t')

stateemp=stateemp.merge(areas,on='area')

As you can see, I haven't done anything with area except read in another data file and merge it into the larger dataset.

jreback commented 10 years ago

w/o the original files it's still impossible to look

that said, don't do astype(int); do astype('int64'). pandas keeps everything as int64 and float64 (you CAN use the other dtypes), however int is platform dependent (so it will be 32-bit on 32-bit platforms and 64-bit on 64-bit platforms). try to avoid that in general. you could use int32 if you really want as well.

furthermore, it's better to use convert_objects(convert_numeric=True) on the entire dataframe (or you can do it on the series). This will infer dtypes properly (and set non-valid entries to nan).
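A minimal sketch of the dtype point (the string codes here are made up, standing in for the seriesid.str[5:10] slices in the script above):

```python
import pandas as pd

# astype(int) maps to the platform C long: 32 bits on a 32-bit build,
# 64 bits on a 64-bit build. Spelling the width out avoids the ambiguity.
codes = pd.Series(['00010', '10180', '10380'])  # hypothetical area codes
area = codes.astype('int64')                    # same dtype on every platform
print(area.dtype)
```

With astype(int) the resulting dtype would depend on the machine that ran the script, which is exactly the int32-vs-int64 difference observed above.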

zoof commented 10 years ago

Ok, thanks for the workaround, but that doesn't change the fact that series means are being computed incorrectly for certain dtypes and certain architectures.

In looking at the code again, the only dataset needed is, https://copy.com/722BDiVaQ7BL, and the code I posted can be truncated following:

stateemp['area']=stateemp.seriesid.str[5:10].astype(int)
jreback commented 10 years ago

can you show:

jreback commented 10 years ago

it's possible that there is a bug in the merging, but I can't reproduce it; that's why I am asking :)

zoof commented 10 years ago

As I pointed out in the prior comment, the merge command is not necessary (i.e., it is my astype(int) that is making it int32) but nevertheless:

In [11]: areas.head()
Out[11]: 
    area                            area_name
0      0                            Statewide
1  10180                          Abilene, TX
2  10380  Aguadilla-Isabela-San Sebastian, PR
3  10420                            Akron, OH
4  10500                           Albany, GA

[5 rows x 2 columns]

In [12]: areas.dtypes
Out[12]: 
area          int64
area_name    object
dtype: object
jreback commented 10 years ago

ok, how about a sample of stateemp (with the code you have as written)

zoof commented 10 years ago

All of stateemp is here: https://copy.com/722BDiVaQ7BL in 'stateemp'. I.e., read_hdf('pandas_mean_error.h5','stateemp'). The code was posted before but again, could be further truncated to reproduce the problem.

jreback commented 10 years ago

how did you write the hdf file? you have an odd filter in there

zoof commented 10 years ago
stateemp.to_hdf('pandas_mean_error.h5','stateemp',complevel=9,complib='bzip2')

Although I can't be absolutely sure which complib I used since I was doing it interactively to produce datasets for you folks to play with.

jreback commented 10 years ago

ok... can you save w/o bzip? I don't have it installed (and it's generally not that efficient anyhow; use 'blosc')

zoof commented 10 years ago

Here is one that has been bloscified. https://copy.com/7lM72h3Br2Uy

jreback commented 10 years ago

ok. still doesn't show anything; this looks fine.

In [5]: df.dtypes
Out[5]: 
seriesid         object
year              int64
value           float64
footnote           bool
sa                 bool
st_fips           int32
area              int32
supersec          int32
ind               int32
type              int32
month           float64
periodicity      object
footnote_txt     object
dtype: object

In [6]: df.area.mean()
Out[6]: 23785.447812211703

In [7]: df.area.describe()
Out[7]: 
count    5249571.000000
mean       23785.447812
std        22883.864925
min            0.000000
25%            0.000000
50%        21500.000000
75%        38900.000000
max        97961.000000
Name: area, dtype: float64
zoof commented 10 years ago

Maybe it is very architecture-specific -- the work was initially done on one of the early Atom CPUs (embarrassed grin), and then I also tested it on a P4. The Atom's CPU flags are:

flags       : fpu vme de tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe constant_tsc arch_perfmon pebs bts aperfmperf pni dtes64 monitor ds_cpl est tm2 ssse3 xtpr pdcm movbe lahf_lm dtherm

I can get the P4 flags if that would be useful.

jreback commented 10 years ago

weird....

ok...use the methods I described above for conversions and generally keep to 64-bit dtypes

closing...reopen if you need

zoof commented 10 years ago

So why is int32 problematic? Presumably it takes less space. If I'm not wrong, there are cases where I'd even prefer to use int8, especially on RAM-constrained machines.

jreback commented 10 years ago

int32 is not problematic at all. however astype(int) IS (as it then is platform dependent).

you CAN use int8 if you want. but to be honest, unless you are saving millions and millions of rows it's not going to make any difference (and could actually be slower, as manipulating int64 is pretty optimized as a single instruction; I think int8's are not nearly so)
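A rough sketch of the space trade-off being described (raw buffer sizes only, not speed):

```python
import numpy as np

# Same element count, different widths: int8 stores 1 byte per value,
# int64 stores 8, so the raw buffer is 8x smaller.
n = 1000000
wide = np.zeros(n, dtype=np.int64)
narrow = np.zeros(n, dtype=np.int8)
print(wide.nbytes)    # 8000000
print(narrow.nbytes)  # 1000000
```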

zoof commented 10 years ago

This bug is really bothering me, so I installed a 32-bit linux (also Manjaro) on a spare partition of my 64-bit machine, and I see the same problem -- df.area.describe() gives me a mean of 58.9... So on three separate machines with 32-bit OSes, I'm observing this issue. I don't think this is an isolated case due to special hardware. I'm happy to try some alternate 32-bit flavor of linux (Ubuntu, Debian, what have you). Alternatively, if int32 is really not an important dtype, perhaps it should simply be dropped.

jreback commented 10 years ago

@zoof it certainly could be a bug on 32-bit. But we need a simple reproducible test. Try to narrow it down to: a) construct a particular frame, b) perform an operation, c) works on 64-bit, fails on 32-bit.

zoof commented 10 years ago

You have the dataset. I read it in on 32-bit OSes, ask for df.area.describe(), and I get 58.927762 for the mean. When I do the same on a 64-bit linux, I get a mean of 23785.447812. Do I need to do exhaustive tests on every conceivable piece of hardware and version of linux? BTW, I have now confirmed the problem on a live session of 32-bit Ubuntu 14.04, from which I am now typing.

jreback commented 10 years ago

you are missing the point

without a simple test i can't even begin to figure out where the problem is

in order for this to move forward you need to make a simple test

you can even read in a data set but it has to be short, preferably from a string

it could be a numpy, python, or pandas bug; it needs to be narrowed down

zoof commented 10 years ago

It does not seem to be a numpy bug:

In [8]: np.array(df.area).dtype
Out[8]: dtype('int32')

In [9]: np.array(df.area).mean()
Out[9]: 23785.447812211703

If it is a Python bug, it works in a way that does not affect numpy.

It is not clear to me why it has to be short.

Is it standard practice that if the precise bug cannot be pinned down, the issue is closed, despite it clearly being a bug? I'll say again: you have the dataset. You could reproduce the erroneous result if you tried it with a 32-bit linux distro.

jreback commented 10 years ago

it probably is a bug

but pandas has a test suite of almost 5000 tests

in order to patch anything it has to be a test

how can we include this massive data set in the test suite? it's simply not possible

you found an issue, great - but you have to narrow it down by producing a much smaller example that can be put directly in the code

someone has to run the test and see that it fails on a particular platform

and then test a fix that works and does not break anything else

what you have provided is an indication of a bug, but since it is not readily reproducible it is impossible to move forward

if you do narrow it down pls reopen the issue
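For what it's worth, a self-contained test of the kind being asked for might look like this sketch; the values are hypothetical, chosen so that the int32 sum of the series exceeds the int32 maximum of 2**31 - 1:

```python
import numpy as np
import pandas as pd

# 100,000 values of 30,000 sum to 3e9, past the int32 limit (~2.147e9).
# On a 32-bit build, where the default reduction accumulator is 32 bits
# wide, the sum wraps and the mean comes out wrong; on 64-bit it is 30000.0.
s = pd.Series(np.full(100000, 30000, dtype=np.int32))
print(s.mean())
```

Such a frame can be constructed inline in a test, with no data file attached.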

zoof commented 10 years ago

The best I can do is cut the series down to about 90,000 observations. Taking every 58th observation gives:

In [30]: df.loc[df.index/58.==df.index/58,'area'].describe()
Out[30]: 
count    90510.000000
mean    -23663.382599
std      22887.987276
min          0.000000
25%          0.000000
50%      21500.000000
75%      38900.000000
max      97961.000000
Name: area, dtype: float64

In [31]: len(df.loc[df.index/58.==df.index/58,'area'])
Out[31]: 90510

Using every 59th fails to produce an error.

jreback commented 10 years ago

pls show pd.print_versions(); I am looking to see if bottleneck is installed

zoof commented 10 years ago

bottleneck is not installed:

In [4]: pd.print_versions()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-4-be909eec5dd8> in <module>()
----> 1 pd.print_versions()

AttributeError: 'module' object has no attribute 'print_versions'

Should I install and try the calculation again?

jreback commented 10 years ago

sorry meant pd.show_versions()

in any event, this is a numpy bug, see https://github.com/numpy/numpy/issues/4638

jreback commented 10 years ago

installing bottleneck will fix this on 32-bit; alternatively, use int64 dtypes, or work on 64-bit until numpy fixes it
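The linked numpy issue comes down to the accumulator dtype of the reduction: on 32-bit platforms the sum of an int32 array accumulates in the 32-bit platform int and silently wraps. A sketch that forces the narrow accumulator explicitly, so the wraparound reproduces on any platform:

```python
import numpy as np

# True sum is 1000 * 2**22 = 4,194,304,000, which is above the int32
# maximum of 2,147,483,647, so a 32-bit accumulator wraps around.
arr = np.full(1000, 2**22, dtype=np.int32)

wrapped = arr.sum(dtype=np.int32)  # 32-bit accumulator: wraps to a wrong value
correct = arr.sum(dtype=np.int64)  # 64-bit accumulator: 4194304000

print(wrapped, correct)
```

Consistently with this, the bad mean in the thread (58.927762 * 5,249,571, roughly 3.09e8) matches the true sum (23785.447812 * 5,249,571, roughly 1.2486e11) reduced modulo 2**32 and divided by the count.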

zoof commented 10 years ago

Many thanks! Will do so.

jreback commented 10 years ago

thanks for reporting. sometimes bugs are hard to find!

jreback commented 10 years ago

@zoof ok... going to fix this on the pandas side, as it is broken both in bottleneck (for float32; surprisingly, int32 works) and in numpy. they 'won't' fix it, as it's the user's responsibility to do the upcasting (which is odd, because on 64-bit it works: there the return variable is already 64-bit, whilst on 32-bit it is not). weird.

zoof commented 10 years ago

That is weird. Seems to me that if upcasting is the user's responsibility, it should at least throw a warning; otherwise how is the user to know there is a problem? I guess I'll use only int64 and float64.

jreback commented 10 years ago

there is a way to intercept it, np.seterr(), but pandas turns this off for a variety of reasons. it was no big deal to fix once I saw the problem.

you see the importance of narrowing down the problem; you have to be able to debug it.