Are you able to narrow it down to an example dataframe where you have the issue you can post here?
I have the dataframe that I found the problem with -- it is quite large but I can delete all but the two variables where the calculation goes awry and post the dataset somewhere.
If you can, please do, that would be interesting.
Do you have NaNs? And what is the result of stateemp.area.values.mean()?
After some playing around, one hypothesis is that the bug has something to do with int32 vs int64 dtypes. Initially, I exported it to csv and tried it on another computer, and I got the right answer. I then saved it as an hdf5 file and got the wrong answer. Looking at the dtypes:
In [30]: stateemp.area.dtype
Out[30]: dtype('int32')
In [31]: stateemp.area.mean()
Out[31]: 58.927762478114879
and
In [34]: stateemp.area.dtype
Out[34]: dtype('int64')
In [35]: stateemp.area.mean()
Out[35]: 23785.447812211703
The two files are at: https://copy.com/cOl2YgRicfco https://copy.com/fddCxmq1Wpmp
In answer to your questions:
In [21]: stateemp.area.values.mean()
Out[21]: 23785.447812211703
In [22]: isnan(stateemp.area).sum()
Out[22]: 0
Could you post output of describe() both pre- and post-load?
I'm not sure what you mean by pre-load.
what @jtratner means is show EXACTLY what you are doing with those files, every command in an ipython session, so it can simply be copy pasted and reproduced
You mean something like:
Python 2.7.6 (default, Feb 26 2014, 12:01:28)
Type "copyright", "credits" or "license" for more information.
IPython 2.0.0 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
%guiref -> A brief reference about the graphical user interface.
In [1]: import pandas as pd
In [2]: df=pd.read_hdf('pandas_mean_error.h5','stateemp')
In [3]: df.area.describe()
Out[3]:
count 5249571.000000
mean 58.927762
std 22883.864925
min 0.000000
25% 0.000000
50% 21500.000000
75% 38900.000000
max 97961.000000
Name: area, dtype: float64
In [4]: df=pd.read_csv('pandas_mean_error.csv.gz',compression='gzip')
In [5]: df.area.describe()
Out[5]:
count 5249571.000000
mean 23785.447812
std 22883.864925
min 0.000000
25% 0.000000
50% 21500.000000
75% 38900.000000
max 97961.000000
Name: area, dtype: float64
In [6]:
That said, in addition to int32 vs int64, it seems to be architecture specific -- I've also done the same on a 64 bit machine where the mean gets computed correctly. I was doing the construction of the datasets using a 32 bit machine. I'm running one of the flavors of Manjaro linux on all my machines.
ok; that shows the problem (but the reason we ask for ALL the info is that you didn't mention it was in an hdf file).
can you show what you are writing? (e.g. you are reading the file, but show the code to generate the frame that you are then writing).
it should be completely self-contained and copy-pastable.
I did in my comment from 4 days ago. As I also mentioned in that comment, it appeared to be a difference between int32 and int64 so it's not obvious to me why hdf5 matters -- I observed the problem before saving anything out to hdf5.
@zoof ok, then show how you created them.
you are showing a symptom, but not how you created the data, so it is still impossible to even guess where the problem lies.
import pandas as pd
from numpy import nan,isnan
# set directories
rawdat = '/home/tct/rothstein/uiflows/rawdata'
scratch = '/home/tct/rothstein/uiflows/scratch'
stateemp = pd.read_csv(rawdat+'/miscbls/sm.data.1.AllData.gz',header=0,names=['seriesid','year','period','value','footnote'],converters={'footnote':str},sep='\t',compression='gzip')
stateemp['sa']=stateemp.seriesid.str[2]=='S'
stateemp['st_fips']=stateemp.seriesid.str[3:5].astype(int)
stateemp['area']=stateemp.seriesid.str[5:10].astype(int)
stateemp['supersec']=stateemp.seriesid.str[10:12].astype(int)
stateemp['ind']=stateemp.seriesid.str[10:18].astype(int)
stateemp['type']=stateemp.seriesid.str[18:20].astype(int)
stateemp['area']=stateemp.seriesid.str[5:10].astype(int)
stateemp['month']=stateemp.period.str.replace('M','').astype(float)
stateemp.loc[stateemp.month<=12,'periodicity']='M'
stateemp.loc[stateemp.month==13,'periodicity']='A'
stateemp.loc[stateemp.periodicity=='A','month']=nan
del(stateemp['period'])
stateemp['footnote_txt']=''
stateemp.loc[stateemp.footnote=='1','footnote_txt']='series break'
stateemp.loc[stateemp.footnote=='P','footnote_txt']='prelim'
stateemp['footnote']=~ (stateemp.footnote_txt=='')
areas=pd.read_csv(rawdat+'/miscbls/sm.area',header=0,names=['area','area_name'],usecols=[0,1],sep='\t')
industry=pd.read_csv(rawdat+'/miscbls/sm.industry',header=0,names=['ind','ind_name'],usecols=[0,1],sep='\t')
state=pd.read_csv(rawdat+'/miscbls/sm.state',header=0,names=['st_fips','st_name'],usecols=[0,1],sep='\t')
stateemp=stateemp.merge(areas,on='area')
As you can see, I haven't done anything with area but read in another datafile and merge it into the larger dataset.
w/o the original files it's still impossible to look.
that said, don't do astype(int); do astype('int64'). pandas keeps everything as int64 and float64 (you CAN use the other dtypes); however, int is platform dependent (so it will be 32-bit on 32-bit platforms and 64-bit on 64-bit platforms). try to avoid that in general. you could do int32 if you really want as well.
furthermore, it's better to use convert_objects(convert_numeric=True) on the entire dataframe (or you can do it on the series). this will infer dtypes properly (and set non-valid entries to nan).
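A minimal sketch of the two suggested conversions (hypothetical data; convert_objects was the pandas API of this era and is deprecated in later versions):
import pandas as pd
df = pd.DataFrame({'code': ['10180', '10380', 'bad']})
# astype('int64') is platform independent, but raises on entries that
# cannot be parsed as integers:
# df['code'].astype('int64')  # ValueError, because of 'bad'
# convert_objects infers numeric types and turns invalid entries into nan;
# the column comes back as float64 because of the nan:
converted = df.convert_objects(convert_numeric=True)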
Ok, thanks for the workaround, but that doesn't change the fact that series means are being computed incorrectly for certain dtypes on certain architectures.
In looking at the code again, the only dataset needed is https://copy.com/722BDiVaQ7BL, and the code I posted can be truncated after the line:
stateemp['area']=stateemp.seriesid.str[5:10].astype(int)
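For reference, a condensed version of that reproduction (paths and filenames as in the original script) would be roughly:
import pandas as pd
rawdat = '/home/tct/rothstein/uiflows/rawdata'
# read the raw BLS series file and extract the area code from the series id
stateemp = pd.read_csv(rawdat + '/miscbls/sm.data.1.AllData.gz', header=0,
                       names=['seriesid', 'year', 'period', 'value', 'footnote'],
                       converters={'footnote': str}, sep='\t', compression='gzip')
stateemp['area'] = stateemp.seriesid.str[5:10].astype(int)  # int32 on a 32-bit OS
stateemp.area.mean()  # wrong on the reporter's 32-bit machines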
can you show:
areas.head()
areas.dtypes
it's possible that there is a bug in the merging, but I can't reproduce it; that's why I am asking :)
As I pointed out in the prior comment, the merge command is not necessary (i.e., it is my astype(int) that is making it int32), but nevertheless:
In [11]: areas.head()
Out[11]:
area area_name
0 0 Statewide
1 10180 Abilene, TX
2 10380 Aguadilla-Isabela-San Sebastian, PR
3 10420 Akron, OH
4 10500 Albany, GA
[5 rows x 2 columns]
In [12]: areas.dtypes
Out[12]:
area int64
area_name object
dtype: object
ok, how about a sample of stateemp (with the code you have as written)?
All of stateemp is here: https://copy.com/722BDiVaQ7BL in 'stateemp'. I.e., read_hdf('pandas_mean_error.h5','stateemp'). The code was posted before but, again, it could be further truncated to reproduce the problem.
how did you write the hdf file? you have an odd filter in there
stateemp.to_hdf('pandas_mean_error.h5','stateemp',complevel=9,complib='bzip2')
Although I can't be absolutely sure which complib I used since I was doing it interactively to produce datasets for you folks to play with.
ok...can you save w/o bzip? i don't have it installed (and it's generally not that efficient anyhow; use 'blosc')
Here is one that has been bloscified. https://copy.com/7lM72h3Br2Uy
ok. still doesn't show anything; this looks fine.
In [5]: df.dtypes
Out[5]:
seriesid object
year int64
value float64
footnote bool
sa bool
st_fips int32
area int32
supersec int32
ind int32
type int32
month float64
periodicity object
footnote_txt object
dtype: object
In [6]: df.area.mean()
Out[6]: 23785.447812211703
In [7]: df.area.describe()
Out[7]:
count 5249571.000000
mean 23785.447812
std 22883.864925
min 0.000000
25% 0.000000
50% 21500.000000
75% 38900.000000
max 97961.000000
Name: area, dtype: float64
Maybe it is very architecture specific -- the work was initially done on one of the early atom cpus (embarrassed grin) and then I also tested it on a P4. The atom's cpu flags are:
flags : fpu vme de tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe constant_tsc arch_perfmon pebs bts aperfmperf pni dtes64 monitor ds_cpl est tm2 ssse3 xtpr pdcm movbe lahf_lm dtherm
I can get the P4 flags if that would be useful.
weird....
ok...use the methods I described above for conversions and generally keep to 64-bit dtypes
closing...reopen if you need
So why is int32 problematic? Presumably it takes less space. If I'm not wrong, there are cases where I'd even prefer to use int8, especially on RAM-constrained machines.
int32 is not problematic at all. however, astype(int) IS (as it is then platform dependent).
you CAN use int8 if you want. but to be honest, unless you are saving millions and millions of rows it's not going to make any difference (and could actually be slower, as manipulating int64 is pretty optimized as a single instruction; i think int8's are not nearly so).
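For scale, a quick way to see the space trade-off being discussed (an editorial sketch; the row count is this dataset's):
import numpy as np
# bytes per element for each integer dtype
np.dtype('int8').itemsize   # 1
np.dtype('int32').itemsize  # 4
np.dtype('int64').itemsize  # 8
# so the 5,249,571-row area column costs roughly 5 MB as int8,
# 21 MB as int32, or 42 MB as int64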
This bug is really bothering me so I installed a 32 bit linux (also Manjaro) on a spare partition of my 64 bit machine and same problem -- df.area.describe() gives me a mean of 58.9... So on three separate machines with 32 bit OSes, I'm observing this issue. I don't think this is an isolated case due to special hardware. I'm happy to try some alternate 32 bit flavor of linux (Ubuntu, Debian, what have you). Alternatively, if int32 is really not an important dtype, perhaps it should simply be dropped.
@zoof it certainly could be a bug on 32-bit. But we need a simple reproducible test. Try to narrow it down to: a) construct a particular frame, b) perform an operation, c) works on 64-bit, fails on 32-bit.
You have the dataset. I read it in on 32 bit OSes, ask for df.area.describe() and I get 58.927762 for the mean. When I do the same on a 64 bit linux, I get a mean of 23785.447812. Do I need to do exhaustive tests on every conceivable piece of hardware and version of linux? BTW, I have now confirmed the problem on a live session of 32 bit Ubuntu 14.04, from which I am now typing.
you are missing the point.
without a simple test i can't even begin to figure out where the problem is.
in order for this to move forward you need to make a simple test.
you can even read in a data set, but it has to be short, preferably from a string.
it could be a numpy, python, or pandas bug; it needs to be narrowed down.
It does not seem to be a numpy bug:
In [8]: np.array(df.area).dtype
Out[8]: dtype('int32')
In [9]: np.array(df.area).mean()
Out[9]: 23785.447812211703
If it is a Python bug, it works in a way that does not affect numpy.
It is not clear to me why it has to be short.
Is it standard operating procedure that if the precise bug cannot be pinned down, the issue is closed, despite the fact that it is clearly a bug? I'll say again: you have the dataset. You could produce the erroneous result if you tried it with a 32 bit linux distro.
it probably is a bug.
but pandas has a test suite of almost 5000 tests.
in order to patch anything, it has to become a test.
how can we include this massive data set in the test suite? it's simply not possible.
you found an issue, great, but you have to narrow it down by producing a much smaller example that can be put directly in the code.
someone has to run the test and see that it fails on a particular platform,
and then test a fix that works and does not break anything else.
what you have provided is an indication of a bug, but since it is not readily reproducible it is impossible to move forward.
if you do narrow it down, pls reopen the issue.
The best I can do is cut the series down to about 90,000 observations. Taking every 58th observation gives:
In [30]: df.loc[df.index/58.==df.index/58,'area'].describe()
Out[30]:
count 90510.000000
mean -23663.382599
std 22887.987276
min 0.000000
25% 0.000000
50% 21500.000000
75% 38900.000000
max 97961.000000
Name: area, dtype: float64
In [31]: len(df.loc[df.index/58.==df.index/58,'area'])
Out[31]: 90510
Using every 59th fails to produce an error.
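(A plausible explanation for the 58 vs 59 split, assuming the subsamples share the full column's true mean of about 23,785: the 90,510-row subsample has a true sum of roughly 90510 * 23785 ≈ 2.153e9, just over 2**31 - 1 ≈ 2.147e9, so a 32-bit running sum wraps. Indeed (90510 * 23785) - 2**32 ≈ -2.142e9, and -2.142e9 / 90510 ≈ -23668, close to the -23663 reported. Every 59th row leaves about 88,976 rows with a sum of roughly 2.116e9, which still fits in an int32.)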
pls show pd.print_versions(). I am looking to see if bottleneck is installed.
bottleneck is not installed:
In [4]: pd.print_versions()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-4-be909eec5dd8> in <module>()
----> 1 pd.print_versions()
AttributeError: 'module' object has no attribute 'print_versions'
Should I install and try the calculation again?
sorry meant pd.show_versions()
in any event, this is a numpy bug, see https://github.com/numpy/numpy/issues/4638. installing bottleneck will fix this on 32-bit; or use int64 dtypes, or work on 64-bit until numpy fixes it.
Many thanks! Will do so.
thanks for reporting. sometimes bugs are hard to find!
@zoof ok...going to fix on the pandas side, as this is broken both in bottleneck (for float32; surprisingly, int32 works) and in numpy on both platforms. they 'won't fix' as it's the user's responsibility to do the upcasting (which is odd, because on 64-bit it works: the return variable is already 64-bit, whilst on 32-bit it is not). weird.
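A plausible mechanism, consistent with what the thread shows (np.array(df.area).mean() was right while df.area.mean() was not): np.sum accumulates integer arrays in the platform's default integer unless given an explicit dtype, while np.mean always uses a float64 accumulator for integer input. A minimal sketch, with hypothetical values:
import numpy as np
arr = np.full(100000, 30000, dtype=np.int32)  # true sum = 3e9 > 2**31 - 1
arr.mean()               # correct everywhere: np.mean accumulates in float64
arr.sum(dtype=np.int64)  # correct everywhere: explicit 64-bit accumulator
arr.sum()                # platform-integer accumulator; wraps negative on 32-bit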
That is weird. It seems to me that if upcasting is the user's responsibility, it should at least throw a warning; otherwise, how is the user to know that there is a problem? I guess use only int64 and float64.
there is a way to intercept it, np.seterr(), but pandas turns this off for a variety of reasons.
it's no big deal to fix once I saw the problem.
you see the importance of narrowing down the problem; you have to be able to debug it.
In some cases, the mean is computed incorrectly. Numpy, however, does the correct calculation. There is no problem with the standard deviation calculation. The following is an example.
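The example itself did not survive in this copy of the thread; a hypothetical sketch of the symptom as described (an int32 column, on a 32-bit platform, whose sum exceeds 2**31 - 1):
import numpy as np
import pandas as pd
s = pd.Series(np.full(100000, 30000, dtype=np.int32))
s.values.mean()  # numpy gives the correct mean, 30000.0
s.mean()         # on the affected 32-bit setups, pandas returned a wrong value
s.std()          # the standard deviation was reported correct in both runs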