Open LoneStar134 opened 9 years ago
Can you post a copy-pastable example and the output of pd.show_versions()?
Thanks for the quick response.
See the following Git: https://github.com/LoneStar134/read_fwf_issue
show_versions() output as requested:
In[4]: pd.show_versions()
commit: None
python: 2.7.7.final.0
python-bits: 64
OS: Darwin
OS-release: 14.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.16.2
nose: 1.3.3
Cython: 0.20.1
numpy: 1.8.1
scipy: 0.14.0
statsmodels: 0.5.0
IPython: 2.1.0
sphinx: 1.2.2
patsy: 0.2.1
dateutil: 1.5
pytz: 2014.3
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.3.1
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.5.5
lxml: 3.3.5
bs4: 4.3.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 0.9.4
pymysql: None
psycopg2: None
need you to paste code inline that I can just copy paste to try
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 500)
def conv_str(x):
    return str(x)
col_widths = [3, 2, 2, 2, 2, 11, 2, 2, 3, 3, 2, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]
col_names = ['AGE', 'AGE_NEONATE', 'AMONTH', 'AWEEKEND', 'DIED', 'DISCWT', 'DISPUNIFORM', 'DQTR', 'DRG', 'DRG24', 'DRGVER', 'DRG_NoPOA', 'DX1', 'DX2', 'DX3', 'DX4', 'DX5', 'DX6', 'DX7', 'DX8', 'DX9', 'DX10', 'DX11', 'DX12', 'DX13', 'DX14', 'DX15', 'DX16', 'DX17', 'DX18', 'DX19', 'DX20', 'DX21', 'DX22', 'DX23', 'DX24', 'DX25']
col_conv = {'AGE': conv_str, 'AGE_NEONATE': conv_str, 'AMONTH': conv_str, 'AWEEKEND': conv_str, 'DIED': conv_str, 'DISCWT': conv_str, 'DISPUNIFORM': conv_str, 'DQTR': conv_str, 'DRG': conv_str, 'DRG24': conv_str, 'DRGVER': conv_str, 'DRG_NoPOA': conv_str, 'DX1': conv_str, 'DX2': conv_str, 'DX3': conv_str, 'DX4': conv_str, 'DX5': conv_str, 'DX6': conv_str, 'DX7': conv_str, 'DX8': conv_str, 'DX9': conv_str, 'DX10': conv_str, 'DX11': conv_str, 'DX12': conv_str, 'DX13': conv_str, 'DX14': conv_str, 'DX15': conv_str, 'DX16': conv_str, 'DX17': conv_str, 'DX18': conv_str, 'DX19': conv_str, 'DX20': conv_str, 'DX21': conv_str, 'DX22': conv_str, 'DX23': conv_str, 'DX24': conv_str, 'DX25': conv_str}
print 'THE FOLLOWING EXAMPLE FAILS TO PROPERLY COERCE THE COLUMNS WITH BLANK DATA VALUES STARTING AT COLUMN DX8'
df1 = pd.read_fwf('file:./sample_fwf-1.txt', widths=col_widths, names=col_names, converters=col_conv)
print df1
for k in df1.keys():
    print k + ' / ' + str(df1[k][0].__class__)
print 'THE FOLLOWING EXAMPLE FAILS TO PROPERLY COERCE THE COLUMNS WITH BLANK DATA VALUES STARTING AT COLUMN DX16'
df2 = pd.read_fwf('file:./sample_fwf-2.txt', widths=col_widths, names=col_names, converters=col_conv)
print df2
for k in df2.keys():
    print k + ' / ' + str(df2[k][0].__class__)
You will still need to get the sample data files off of the git repo I indicated in the previous post.
I would say that your col widths are off. This is the tricky thing with a fixed format: you need to get it exactly right, and w/o headers it's hard.
In [35]: pd.read_fwf(StringIO(data), widths=col_widths, names=col_names, converters=col_conv)
Out[35]:
AGE AGE_NEONATE AMONTH AWEEKEND DIED DISCWT DISPUNIFORM DQTR DRG DRG24 DRGVER DRG_NoPOA \
0 24- 9 1 0 0 4.9999028 1 17 653 712 97 656
1 0 1 11 0 0 4.9999028 1 4 794 390 30 794
2 0 1 6 0 0 4.9999028 1 2 794 390 29 794
3 79 -9 3 0 0 4.9999028 1 1 287 125 29 287
4 55 -9 5 1 0 4.9999028 7 2 948 464 29 948
5 77 -9 12 0 0 4.9999028 6 4 947 463 30 947
6 58 -9 8 0 0 4.9999028 1 3 331 149 29 331
7 70 -9 12 0 0 4.9999028 1 4 287 124 30 287
8 63 -9 11 0 0 4.9999028 1 4 310 138 30 310
9 30 -9 9 0 0 4.9999028 1 3 775 373 29 775
10 64 -9 3 0 0 4.9999028 1 1 417 493 29 417
11 73 -9 4 0 0 4.9999028 1 2 872 576 29 872
12 82 -9 10 0 0 4.9999028 1 4 281 121 30 281
13 0 0 2 0 0 4.9999028 1 1 153 70 29 153
14 74 -9 5 0 0 4.9999028 1 2 208 566 29 208
DX1 DX2 DX3 DX4 DX5 DX6 DX7 DX8 DX9 DX10 DX11 DX12 DX13 \
0 60016 47616 52216 48910 549 4 9390V 270 NaN NaN NaN NaN NaN NaN
1 V3001 V7219 V053 77989 6039 NaN NaN NaN NaN NaN NaN NaN NaN
2 V3000 V290 76408 76529 7746 75732 NaN NaN NaN NaN NaN NaN NaN
3 42731 4019 53081 7840 NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 33819 33829 72252 4019 2768 NaN NaN NaN NaN NaN NaN NaN NaN
5 78097 34831 3099 2809 V4364 71590 53081 27651 NaN NaN NaN NaN NaN
6 56211 5531 58-9 8 0 0 4. 99990 28 1 33311 49293 31562 11553 1 58 -9 8
7 4255 4282 70-91 2 0 0 4.9 99902 8 1 4 28712 43028 74255 4282 70-9 12 0
8 42731 25000 4019 4280 27801 60784 V1582 V8537 NaN NaN NaN NaN NaN
9 64891 64821 2859 V0251 V270 NaN NaN NaN NaN NaN NaN NaN NaN
10 5750 0389 42732 99590 2930 42731 28860 NaN NaN NaN NaN NaN NaN
11 0389 5849 5990 99591 27651 4019 2768 NaN NaN NaN NaN NaN NaN
12 41071 5849 41400 V4581 25000 V5867 V5866 53081 2724 40490 5859 V142 27651
13 3829 49390 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
14 51881 4282 74-9 5 0 0 4.9 99902 8 1 2 20856 62920 85188 14282 74-9 5 0
DX14 DX15 DX16 DX17 DX18 DX19 DX20 DX21 DX22 DX23 DX24 DX25
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6 0 0 4.999 9028 1 333 11492 93315 62115 531 58-9 8 0 0 4.9000 99902
7 0 4. 99990 28 1 42871 24302 87425 5 428 2 70- 912 0 0 4 0.9999 028 1
8 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
11 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
12 412 30000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
13 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
14 0 4. 99990 28 1 22085 66292 8518 81428 2 74- 9 5 0 0 4 0.9999 028 1
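Since fixed-width parsing depends entirely on the offsets, one way to rule out a width or data mismatch before looking at the parser is to slice a single record by hand. The following is a minimal standard-library sketch, assuming Python 3 (the thread itself uses Python 2.7). The helper names `slice_fixed_width` and `width_mismatch`, and the sample record, are hypothetical and made up for illustration.

```python
from itertools import accumulate


def slice_fixed_width(line, widths):
    """Split one record into raw fields using cumulative column offsets."""
    ends = list(accumulate(widths))
    starts = [0] + ends[:-1]
    return [line[start:end] for start, end in zip(starts, ends)]


def width_mismatch(line, widths):
    """Return len(line) - sum(widths); nonzero hints at misaligned data."""
    return len(line.rstrip("\n")) - sum(widths)


# Hypothetical record and widths, not taken from the real data file.
sample_widths = [3, 2, 2, 5]
sample_line = " 24-9 160016"

fields = slice_fixed_width(sample_line, sample_widths)
print(fields)
print(width_mismatch(sample_line, sample_widths))
```

If the mismatch is nonzero for some records, whitespace was probably lost or added while copying the data, which would explain shifted columns like the ones in the output above.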
I can see that your column widths are off for some reason. My results are properly aligned all the way out to 141 columns. I truncated this data set just to demonstrate the issue.
It's best if you have a small reproducible example. Otherwise it's hard to see what the issue is.
I put the code out there in Git. I've run it on two platforms (OS X and Ubuntu, both with Python 2.7) and I get the same erroneous results.
Can you show what is erroneous? I don't understand what the problem is.
FYI The idea is to make it easy to see a bug, if people have to jump thru hoops they won't be bothered.
Is it possible that your data file got misaligned with all the cutting and pasting you were doing? I will post the output as soon as I figure out how to post up here in a fixed font :)
that's why it's best to post an in-line example, e.g. something like
data = """AAABBBCD
ajslfljaslj
"""
pd.read_fwf(StringIO(data), ....)
then it's just a copy-paste
Ok. Let me see what I can do to make this even easier.
@LoneStar134 the easier you make it the better.
you don't need conv_str, just use str (it's a function as well)
import pandas as pd
import StringIO as io
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 500)
data1 = ' 24-9 1 0 0 4.9999028 1 176537129765660016476165221648910549 49390V270 \n 0 111 0 0 4.9999028 1 479439030794V3001V7219V053 779896039 \n'
data2 = ' 82-910 0 0 4.9999028 1 428112130281410715849 41400V458125000V5867V5866530812724 404905859 V142 27651412 30000 \n 0 0 2 0 0 4.9999028 1 1153 70291533829 49390 \n'
col_widths = [3, 2, 2, 2, 2, 11, 2, 2, 3, 3, 2, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]
col_names = ['AGE', 'AGE_NEONATE', 'AMONTH', 'AWEEKEND', 'DIED', 'DISCWT', 'DISPUNIFORM', 'DQTR', 'DRG', 'DRG24', 'DRGVER', 'DRG_NoPOA', 'DX1', 'DX2', 'DX3', 'DX4', 'DX5', 'DX6', 'DX7', 'DX8', 'DX9', 'DX10', 'DX11', 'DX12', 'DX13', 'DX14', 'DX15', 'DX16', 'DX17', 'DX18', 'DX19', 'DX20', 'DX21', 'DX22', 'DX23', 'DX24', 'DX25']
col_conv = {'AGE': str, 'AGE_NEONATE': str, 'AMONTH': str, 'AWEEKEND': str, 'DIED': str, 'DISCWT': str, 'DISPUNIFORM': str, 'DQTR': str, 'DRG': str, 'DRG24': str, 'DRGVER': str, 'DRG_NoPOA': str, 'DX1': str, 'DX2': str, 'DX3': str, 'DX4': str, 'DX5': str, 'DX6': str, 'DX7': str, 'DX8': str, 'DX9': str, 'DX10': str, 'DX11': str, 'DX12': str, 'DX13': str, 'DX14': str, 'DX15': str, 'DX16': str, 'DX17': str, 'DX18': str, 'DX19': str, 'DX20': str, 'DX21': str, 'DX22': str, 'DX23': str, 'DX24': str, 'DX25': str}
print 'THE FOLLOWING EXAMPLE FAILS TO PROPERLY SET THE DATA TYPES OF THE COLUMNS WITH BLANK DATA VALUES STARTING AT COLUMN DX8'
df1 = pd.read_fwf(io.StringIO(data1), widths=col_widths, names=col_names, converters=col_conv)
print df1
for k in df1.keys():
    print k + ' / ' + str(df1[k][0].__class__)
print 'THE FOLLOWING EXAMPLE FAILS TO PROPERLY SET THE DATA TYPES OF THE COLUMNS WITH BLANK DATA VALUES STARTING AT COLUMN DX16'
df2 = pd.read_fwf(io.StringIO(data2), widths=col_widths, names=col_names, converters=col_conv)
print df2
for k in df2.keys():
    print k + ' / ' + str(df2[k][0].__class__)
ok, so what does 'fails to properly coerce' mean? You have blank data in those positions, so you get a NaN. What are you expecting?
I am expecting a String type, not Float.
ok, we could have skipped this whole discussion then. You always want to say: here is what I am getting, but I am expecting something else. NaN is the marker for missing values. If the values are all missing, the column will be all NaN. If it's all NaN it would normally be float, but you specified a converter, so the column dtype should be object.
So I believe this is what you are doing, and it seems to work. So see if you can show why your example is different.
In [9]: pd.read_fwf(StringIO("A \nB "),widths=[1,1],names=['first','second'],converters={'first' : str, 'second' : str})
Out[9]:
first second
0 A NaN
1 B NaN
In [10]: pd.read_fwf(StringIO("A \nB "),widths=[1,1],names=['first','second'],converters={'first' : str, 'second' : str}).dtypes
Out[10]:
first object
second object
dtype: object
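The dtypes above describe the column as a whole, while the loop in the report prints the class of the first element of each column. These can differ: an object column can hold strings and NaN side by side, and NaN itself is a Python float. A small sketch of that distinction follows, assuming Python 3 and a recent pandas. The Series is built by hand as a stand-in for a parsed column; it is not output from read_fwf.

```python
import numpy as np
import pandas as pd

# Hand-built column standing in for a parsed DX field: one code, one blank.
col = pd.Series(["V270", np.nan], dtype=object, name="DX8")

print(col.dtype)                         # dtype of the whole column
print([type(v).__name__ for v in col])   # class of each stored element
print(col.isna().tolist())               # which cells are missing

# Look only at non-missing cells when asking "what type did I get?"
present_types = {type(v).__name__ for v in col.dropna()}
print(present_types)
```

Checking `.dtypes` or the non-missing cells, rather than the class of row 0, avoids mistaking a missing-value marker for a numeric conversion.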
I'm OK with the NaNs. What I care about is the actual type associated with each column, which my code demonstrates when it lists the types for each column. You will see in the first example that the type is String through DX7 and becomes Float at DX8. In the second example it is String through DX15 and becomes Float at DX16.
In the 1st example, I make the first line of the data1 an example that has data in all columns up to DX7. The result is that all fields are set to String up to DX7 as expected, but then every field after that is a Float because the first row of data had blanks for DX8 and beyond.
In the 2nd example, I make the first line of data2 an example that has data in all columns up to DX15. Consistent with the first example, all fields are set to String up to DX15 as expected, but every field after that is a Float because the first row of data had blanks for DX16 and beyond.
I expect all fields to be set to String based on my converters dictionary specifying that every column should be converted by the 'str' function.
If you run my code, it clearly demonstrates this. You must look at the vertical listing of columns that are output after each data frame because it displays the types of each of the columns for the data frame above it.
THE FOLLOWING EXAMPLE FAILS TO PROPERLY SET THE DATA TYPES OF THE COLUMNS WITH BLANK DATA VALUES STARTING AT COLUMN DX8. SHOULD BE SET TO STRING BUT INSTEAD WAS SET TO FLOAT.
AGE AGE_NEONATE AMONTH AWEEKEND DIED DISCWT DISPUNIFORM DQTR DRG DRG24 DRGVER DRG_NoPOA DX1 DX2 DX3 DX4 DX5 DX6 DX7 DX8 DX9 DX10 DX11 DX12 DX13 DX14 DX15 DX16 DX17 DX18 DX19 DX20 DX21 DX22 DX23 DX24 DX25
0 24 -9 1 0 0 4.9999028 1 1 765 371 29 765 66001 64761 65221 64891 0549 49390 V270 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 0 1 11 0 0 4.9999028 1 4 794 390 30 794 V3001 V7219 V053 77989 6039 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
AGE / <type 'str'>
AGE_NEONATE / <type 'str'>
AMONTH / <type 'str'>
AWEEKEND / <type 'str'>
DIED / <type 'str'>
DISCWT / <type 'str'>
DISPUNIFORM / <type 'str'>
DQTR / <type 'str'>
DRG / <type 'str'>
DRG24 / <type 'str'>
DRGVER / <type 'str'>
DRG_NoPOA / <type 'str'>
DX1 / <type 'str'>
DX2 / <type 'str'>
DX3 / <type 'str'>
DX4 / <type 'str'>
DX5 / <type 'str'>
DX6 / <type 'str'>
DX7 / <type 'str'>
DX8 / <type 'float'>
DX9 / <type 'float'>
DX10 / <type 'float'>
DX11 / <type 'float'>
DX12 / <type 'float'>
DX13 / <type 'float'>
DX14 / <type 'float'>
DX15 / <type 'float'>
DX16 / <type 'float'>
DX17 / <type 'float'>
DX18 / <type 'float'>
DX19 / <type 'float'>
DX20 / <type 'float'>
DX21 / <type 'float'>
DX22 / <type 'float'>
DX23 / <type 'float'>
DX24 / <type 'float'>
DX25 / <type 'float'>
THE FOLLOWING EXAMPLE FAILS TO PROPERLY SET THE DATA TYPES OF THE COLUMNS WITH BLANK DATA VALUES STARTING AT COLUMN DX16. SHOULD BE SET TO STRING BUT INSTEAD WAS SET TO FLOAT.
AGE AGE_NEONATE AMONTH AWEEKEND DIED DISCWT DISPUNIFORM DQTR DRG DRG24 DRGVER DRG_NoPOA DX1 DX2 DX3 DX4 DX5 DX6 DX7 DX8 DX9 DX10 DX11 DX12 DX13 DX14 DX15 DX16 DX17 DX18 DX19 DX20 DX21 DX22 DX23 DX24 DX25
0 82 -9 10 0 0 4.9999028 1 4 281 121 30 281 41071 5849 41400 V4581 25000 V5867 V5866 53081 2724 40490 5859 V142 27651 412 30000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 0 0 2 0 0 4.9999028 1 1 153 70 29 153 3829 49390 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
AGE / <type 'str'>
AGE_NEONATE / <type 'str'>
AMONTH / <type 'str'>
AWEEKEND / <type 'str'>
DIED / <type 'str'>
DISCWT / <type 'str'>
DISPUNIFORM / <type 'str'>
DQTR / <type 'str'>
DRG / <type 'str'>
DRG24 / <type 'str'>
DRGVER / <type 'str'>
DRG_NoPOA / <type 'str'>
DX1 / <type 'str'>
DX2 / <type 'str'>
DX3 / <type 'str'>
DX4 / <type 'str'>
DX5 / <type 'str'>
DX6 / <type 'str'>
DX7 / <type 'str'>
DX8 / <type 'str'>
DX9 / <type 'str'>
DX10 / <type 'str'>
DX11 / <type 'str'>
DX12 / <type 'str'>
DX13 / <type 'str'>
DX14 / <type 'str'>
DX15 / <type 'str'>
DX16 / <type 'float'>
DX17 / <type 'float'>
DX18 / <type 'float'>
DX19 / <type 'float'>
DX20 / <type 'float'>
DX21 / <type 'float'>
DX22 / <type 'float'>
DX23 / <type 'float'>
DX24 / <type 'float'>
DX25 / <type 'float'>
Process finished with exit code 0
You can better visualize the alignment of the sample data sets (2 rows in each) in the following. This is as if you were viewing the data in an editor, with the first line on top and the 2nd line on the bottom.
data1:
24-9 1 0 0 4.9999028 1 176537129765660016476165221648910549 49390V270
0 111 0 0 4.9999028 1 479439030794V3001V7219V053 779896039
data2:
82-910 0 0 4.9999028 1 428112130281410715849 41400V458125000V5867V5866530812724 404905859 V142 27651412 30000
0 0 2 0 0 4.9999028 1 1153 70291533829 49390
FIXED! It just occurred to me that maybe the blanks were not being handled properly by pandas at some point in the process. So I rewrote conv_str so that, if the value is blank, it returns a single space ' '. I consider this a work-around for some unintuitive behavior in pandas. Pandas should comply with the conversions specified by the converters regardless of the data passed. But it appears that somewhere in the process the spaces are being converted to the empty string '', and pandas then chooses to ignore the converter for that column. Unless someone can explain to me why a String should become a Float in this scenario, I think this should be looked at as a bug.
Code and result follows:
import pandas as pd
import StringIO as io
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 500)
def conv_str(x):
    if x.replace(' ','') == '':
        return ' '
    else:
        return str(x)
data1 = ' 24-9 1 0 0 4.9999028 1 176537129765660016476165221648910549 49390V270 \n 0 111 0 0 4.9999028 1 479439030794V3001V7219V053 779896039 \n'
data2 = ' 82-910 0 0 4.9999028 1 428112130281410715849 41400V458125000V5867V5866530812724 404905859 V142 27651412 30000 \n 0 0 2 0 0 4.9999028 1 1153 70291533829 49390 \n'
col_widths = [3, 2, 2, 2, 2, 11, 2, 2, 3, 3, 2, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]
col_names = ['AGE', 'AGE_NEONATE', 'AMONTH', 'AWEEKEND', 'DIED', 'DISCWT', 'DISPUNIFORM', 'DQTR', 'DRG', 'DRG24', 'DRGVER', 'DRG_NoPOA', 'DX1', 'DX2', 'DX3', 'DX4', 'DX5', 'DX6', 'DX7', 'DX8', 'DX9', 'DX10', 'DX11', 'DX12', 'DX13', 'DX14', 'DX15', 'DX16', 'DX17', 'DX18', 'DX19', 'DX20', 'DX21', 'DX22', 'DX23', 'DX24', 'DX25']
col_conv = {'AGE': conv_str, 'AGE_NEONATE': conv_str, 'AMONTH': conv_str, 'AWEEKEND': conv_str, 'DIED': conv_str, 'DISCWT': conv_str, 'DISPUNIFORM': conv_str, 'DQTR': conv_str, 'DRG': conv_str, 'DRG24': conv_str, 'DRGVER': conv_str, 'DRG_NoPOA': conv_str, 'DX1': conv_str, 'DX2': conv_str, 'DX3': conv_str, 'DX4': conv_str, 'DX5': conv_str, 'DX6': conv_str, 'DX7': conv_str, 'DX8': conv_str, 'DX9': conv_str, 'DX10': conv_str, 'DX11': conv_str, 'DX12': conv_str, 'DX13': conv_str, 'DX14': conv_str, 'DX15': conv_str, 'DX16': conv_str, 'DX17': conv_str, 'DX18': conv_str, 'DX19': conv_str, 'DX20': conv_str, 'DX21': conv_str, 'DX22': conv_str, 'DX23': conv_str, 'DX24': conv_str, 'DX25': conv_str}
print 'THE FOLLOWING EXAMPLE FAILS TO PROPERLY COERCE THE COLUMNS WITH BLANK DATA VALUES STARTING AT COLUMN DX8'
df1 = pd.read_fwf(io.StringIO(data1), widths=col_widths, names=col_names, converters=col_conv)
print df1
for k in df1.keys():
    print k + ' / ' + str(df1[k][0].__class__)
print 'THE FOLLOWING EXAMPLE FAILS TO PROPERLY COERCE THE COLUMNS WITH BLANK DATA VALUES STARTING AT COLUMN DX16'
df2 = pd.read_fwf(io.StringIO(data2), widths=col_widths, names=col_names, converters=col_conv)
print df2
for k in df2.keys():
    print k + ' / ' + str(df2[k][0].__class__)
OUTPUT:
THE FOLLOWING EXAMPLE FAILS TO PROPERLY COERCE THE COLUMNS WITH BLANK DATA VALUES STARTING AT COLUMN DX8
AGE AGE_NEONATE AMONTH AWEEKEND DIED DISCWT DISPUNIFORM DQTR DRG DRG24 DRGVER DRG_NoPOA DX1 DX2 DX3 DX4 DX5 DX6 DX7 DX8 DX9 DX10 DX11 DX12 DX13 DX14 DX15 DX16 DX17 DX18 DX19 DX20 DX21 DX22 DX23 DX24 DX25
0 24 -9 1 0 0 4.9999028 1 1 765 371 29 765 66001 64761 65221 64891 0549 49390 V270
1 0 1 11 0 0 4.9999028 1 4 794 390 30 794 V3001 V7219 V053 77989 6039
AGE / <type 'str'>
AGE_NEONATE / <type 'str'>
AMONTH / <type 'str'>
AWEEKEND / <type 'str'>
DIED / <type 'str'>
DISCWT / <type 'str'>
DISPUNIFORM / <type 'str'>
DQTR / <type 'str'>
DRG / <type 'str'>
DRG24 / <type 'str'>
DRGVER / <type 'str'>
DRG_NoPOA / <type 'str'>
DX1 / <type 'str'>
DX2 / <type 'str'>
DX3 / <type 'str'>
DX4 / <type 'str'>
DX5 / <type 'str'>
DX6 / <type 'str'>
DX7 / <type 'str'>
DX8 / <type 'str'>
DX9 / <type 'str'>
DX10 / <type 'str'>
DX11 / <type 'str'>
DX12 / <type 'str'>
DX13 / <type 'str'>
DX14 / <type 'str'>
DX15 / <type 'str'>
DX16 / <type 'str'>
DX17 / <type 'str'>
DX18 / <type 'str'>
DX19 / <type 'str'>
DX20 / <type 'str'>
DX21 / <type 'str'>
DX22 / <type 'str'>
DX23 / <type 'str'>
DX24 / <type 'str'>
DX25 / <type 'str'>
THE FOLLOWING EXAMPLE FAILS TO PROPERLY COERCE THE COLUMNS WITH BLANK DATA VALUES STARTING AT COLUMN DX16
AGE AGE_NEONATE AMONTH AWEEKEND DIED DISCWT DISPUNIFORM DQTR DRG DRG24 DRGVER DRG_NoPOA DX1 DX2 DX3 DX4 DX5 DX6 DX7 DX8 DX9 DX10 DX11 DX12 DX13 DX14 DX15 DX16 DX17 DX18 DX19 DX20 DX21 DX22 DX23 DX24 DX25
0 82 -9 10 0 0 4.9999028 1 4 281 121 30 281 41071 5849 41400 V4581 25000 V5867 V5866 53081 2724 40490 5859 V142 27651 412 30000
1 0 0 2 0 0 4.9999028 1 1 153 70 29 153 3829 49390
AGE / <type 'str'>
AGE_NEONATE / <type 'str'>
AMONTH / <type 'str'>
AWEEKEND / <type 'str'>
DIED / <type 'str'>
DISCWT / <type 'str'>
DISPUNIFORM / <type 'str'>
DQTR / <type 'str'>
DRG / <type 'str'>
DRG24 / <type 'str'>
DRGVER / <type 'str'>
DRG_NoPOA / <type 'str'>
DX1 / <type 'str'>
DX2 / <type 'str'>
DX3 / <type 'str'>
DX4 / <type 'str'>
DX5 / <type 'str'>
DX6 / <type 'str'>
DX7 / <type 'str'>
DX8 / <type 'str'>
DX9 / <type 'str'>
DX10 / <type 'str'>
DX11 / <type 'str'>
DX12 / <type 'str'>
DX13 / <type 'str'>
DX14 / <type 'str'>
DX15 / <type 'str'>
DX16 / <type 'str'>
DX17 / <type 'str'>
DX18 / <type 'str'>
DX19 / <type 'str'>
DX20 / <type 'str'>
DX21 / <type 'str'>
DX22 / <type 'str'>
DX23 / <type 'str'>
DX24 / <type 'str'>
DX25 / <type 'str'>
Process finished with exit code 0
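As a possible alternative to returning ' ' from the converter, the sketch below asks the reader for strings and switches off the default missing-value markers, so blank fields come back as empty strings. It assumes Python 3 and a recent pandas in which read_fwf accepts the `dtype` and `keep_default_na` keyword arguments; behavior on pandas 0.16.2, as used in this thread, may differ. The two-row sample, its widths, and its values are made up for illustration and only borrow a few column names from the report.

```python
import io

import pandas as pd

# Hypothetical two-row sample; widths and values are not from the real file.
widths = [3, 2, 5, 5]
names = ["AGE", "DIED", "DX1", "DX2"]
rows = [
    " 24" + " 0" + "60016" + "     ",  # DX2 is blank in the first row
    "  0" + " 0" + "V3001" + "V7219",
]
data = "\n".join(rows) + "\n"

df = pd.read_fwf(
    io.StringIO(data),
    widths=widths,
    names=names,
    dtype=str,              # request strings instead of type inference
    keep_default_na=False,  # do not treat '' (or 'NA', 'NaN', ...) as missing
)

# Same check as in the report: class of the first element of each column.
for name in df.columns:
    print(name, repr(df[name].iloc[0]), type(df[name].iloc[0]).__name__)
```

The trade-off is that missing data is no longer marked as NaN, so later code has to test for '' instead of using isnull().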
I am trying to read a fixed width file using read_fwf function and coerce the column data types by using the 'converters' parameter. I created a dictionary that specifies a function that converts to the type that I desire for each of the columns in the data file.
The problem is that not all columns are set to the type I specify. It appears that any column for which the first few rows of the data set are blank is inferred or defaulted, rather than coerced to the data type I explicitly specified in the dictionary passed via the converters parameter.
As a test, I tried to coerce all column types to a string by passing a dictionary that looked something like this:
type_dict = {0: conv_str, 1: conv_str, 2: conv_str, ..., (n-1): conv_str}, where n = number of columns
and conv_str is a function that is defined as follows:
def conv_str(x): return str(x)
As previously explained, in the resulting data frame, all columns get converted to type string as desired, with the exception of the columns that had blank values for the first few rows of the data set. Those columns that had blank values get defaulted to the 'float' data type.