read_fwf will not convert columns to specified data type when initial row instances are blank

LoneStar134 commented 9 years ago

I am trying to read a fixed width file using read_fwf function and coerce the column data types by using the 'converters' parameter. I created a dictionary that specifies a function that converts to the type that I desire for each of the columns in the data file.

The problem is that not all columns are getting set to the type I specify and it appears that any column for which the first few rows of the data set are blank will be inferred or defaulted and not coerced to the data type I explicitly specified in the dictionary passed via the converters parameter.

As a test, I tried to coerce all column types to a string by passing a dictionary that looked something like this:

type_dict = {0: conv_str, 1: conv_str, 2: conv_str, ..., (n-1): conv_str}, where n = number of columns

and conv_str is a function that is defined as follows:

def conv_str(x): return str(x)

As previously explained, in the resulting data frame, all columns get converted to type string as desired, with the exception of the columns that had blank values for the first few rows of the data set. Those columns that had blank values get defaulted to the 'float' data type.
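
The all-string converter mapping described above can be built programmatically rather than typed out by hand. A minimal sketch in Python 3, where `n_columns` is a hypothetical placeholder for the number of columns in the file (37 matches the example posted later in this thread):

```python
def conv_str(x):
    """Return the raw field as a string (same idea as the converter described above)."""
    return str(x)


# Hypothetical column count; replace with the real number of fields in the file.
n_columns = 37

# Map every positional column index to the same converter.
type_dict = {i: conv_str for i in range(n_columns)}

print(len(type_dict))
print(type_dict[0]("4019"))
```

Keying the dictionary by column name instead of position works the same way: iterate over the list of names rather than `range(n_columns)`.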

jreback commented 9 years ago

can u post a copy pastable example and pd.show_versions()

LoneStar134 commented 9 years ago

Thanks for the quick response.

See the following Git: https://github.com/LoneStar134/read_fwf_issue

show_versions() output as requested:

In[4]: pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.7.final.0
python-bits: 64
OS: Darwin
OS-release: 14.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.16.2
nose: 1.3.3
Cython: 0.20.1
numpy: 1.8.1
scipy: 0.14.0
statsmodels: 0.5.0
IPython: 2.1.0
sphinx: 1.2.2
patsy: 0.2.1
dateutil: 1.5
pytz: 2014.3
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.3.1
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.5.5
lxml: 3.3.5
bs4: 4.3.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 0.9.4
pymysql: None
psycopg2: None

jreback commented 9 years ago

need you to paste code inline that I can just copy paste to try

LoneStar134 commented 9 years ago

import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 500)

def conv_str(x):
    return str(x)

col_widths = [3, 2, 2, 2, 2, 11, 2, 2, 3, 3, 2, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]
col_names = ['AGE', 'AGE_NEONATE', 'AMONTH', 'AWEEKEND', 'DIED', 'DISCWT', 'DISPUNIFORM', 'DQTR', 'DRG', 'DRG24', 'DRGVER', 'DRG_NoPOA', 'DX1', 'DX2', 'DX3', 'DX4', 'DX5', 'DX6', 'DX7', 'DX8', 'DX9', 'DX10', 'DX11', 'DX12', 'DX13', 'DX14', 'DX15', 'DX16', 'DX17', 'DX18', 'DX19', 'DX20', 'DX21', 'DX22', 'DX23', 'DX24', 'DX25']
col_conv = {'AGE': conv_str, 'AGE_NEONATE': conv_str, 'AMONTH': conv_str, 'AWEEKEND': conv_str, 'DIED': conv_str, 'DISCWT': conv_str, 'DISPUNIFORM': conv_str, 'DQTR': conv_str, 'DRG': conv_str, 'DRG24': conv_str, 'DRGVER': conv_str, 'DRG_NoPOA': conv_str, 'DX1': conv_str, 'DX2': conv_str, 'DX3': conv_str, 'DX4': conv_str, 'DX5': conv_str, 'DX6': conv_str, 'DX7': conv_str, 'DX8': conv_str, 'DX9': conv_str, 'DX10': conv_str, 'DX11': conv_str, 'DX12': conv_str, 'DX13': conv_str, 'DX14': conv_str, 'DX15': conv_str, 'DX16': conv_str, 'DX17': conv_str, 'DX18': conv_str, 'DX19': conv_str, 'DX20': conv_str, 'DX21': conv_str, 'DX22': conv_str, 'DX23': conv_str, 'DX24': conv_str, 'DX25': conv_str}

print 'THE FOLLOWING EXAMPLE FAILS TO PROPERLY COERCE THE COLUMNS WITH BLANK DATA VALUES STARTING AT COLUMN DX8'
df1 = pd.read_fwf('file:./sample_fwf-1.txt', widths=col_widths, names=col_names, converters=col_conv)

print df1
for k in df1.keys():
    print k + ' / ' + str(df1[k][0].__class__)

print 'THE FOLLOWING EXAMPLE FAILS TO PROPERLY COERCE THE COLUMNS WITH BLANK DATA VALUES STARTING AT COLUMN DX16'
df2 = pd.read_fwf('file:./sample_fwf-2.txt', widths=col_widths, names=col_names, converters=col_conv)

print df2
for k in df2.keys():
    print k + ' / ' + str(df2[k][0].__class__)

LoneStar134 commented 9 years ago

You will still need to get the sample data files off of the git repo I indicated in the previous post.

jreback commented 9 years ago

I would say that your col widths are off. This is the tricky thing with a fixed format, you need to get it exactly right, and w/o headers its hard.

In [35]: pd.read_fwf(StringIO(data), widths=col_widths, names=col_names, converters=col_conv)
Out[35]: 
    AGE AGE_NEONATE AMONTH AWEEKEND DIED     DISCWT DISPUNIFORM DQTR  DRG DRG24 DRGVER DRG_NoPOA  \
0   24-           9      1        0    0  4.9999028           1   17  653   712     97       656   
1     0           1     11        0    0  4.9999028           1    4  794   390     30       794   
2     0           1      6        0    0  4.9999028           1    2  794   390     29       794   
3    79          -9      3        0    0  4.9999028           1    1  287   125     29       287   
4    55          -9      5        1    0  4.9999028           7    2  948   464     29       948   
5    77          -9     12        0    0  4.9999028           6    4  947   463     30       947   
6    58          -9      8        0    0  4.9999028           1    3  331   149     29       331   
7    70          -9     12        0    0  4.9999028           1    4  287   124     30       287   
8    63          -9     11        0    0  4.9999028           1    4  310   138     30       310   
9    30          -9      9        0    0  4.9999028           1    3  775   373     29       775   
10   64          -9      3        0    0  4.9999028           1    1  417   493     29       417   
11   73          -9      4        0    0  4.9999028           1    2  872   576     29       872   
12   82          -9     10        0    0  4.9999028           1    4  281   121     30       281   
13    0           0      2        0    0  4.9999028           1    1  153    70     29       153   
14   74          -9      5        0    0  4.9999028           1    2  208   566     29       208   

      DX1    DX2    DX3    DX4    DX5    DX6    DX7    DX8    DX9   DX10   DX11   DX12   DX13  \
0   60016  47616  52216  48910  549 4  9390V    270    NaN    NaN    NaN    NaN    NaN    NaN   
1   V3001  V7219   V053  77989   6039    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
2   V3000   V290  76408  76529   7746  75732    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
3   42731   4019  53081   7840    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
4   33819  33829  72252   4019   2768    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
5   78097  34831   3099   2809  V4364  71590  53081  27651    NaN    NaN    NaN    NaN    NaN   
6   56211   5531   58-9    8 0  0  4.  99990   28 1  33311  49293  31562  11553  1  58   -9 8   
7    4255   4282  70-91  2 0 0    4.9  99902  8 1 4  28712  43028  74255   4282   70-9   12 0   
8   42731  25000   4019   4280  27801  60784  V1582  V8537    NaN    NaN    NaN    NaN    NaN   
9   64891  64821   2859  V0251   V270    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
10   5750   0389  42732  99590   2930  42731  28860    NaN    NaN    NaN    NaN    NaN    NaN   
11   0389   5849   5990  99591  27651   4019   2768    NaN    NaN    NaN    NaN    NaN    NaN   
12  41071   5849  41400  V4581  25000  V5867  V5866  53081   2724  40490   5859   V142  27651   
13   3829  49390    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
14  51881   4282   74-9  5 0 0    4.9  99902  8 1 2  20856  62920  85188  14282   74-9    5 0   

     DX14   DX15  DX16   DX17   DX18   DX19   DX20   DX21   DX22   DX23    DX24   DX25  
0     NaN    NaN   NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN     NaN    NaN  
1     NaN    NaN   NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN     NaN    NaN  
2     NaN    NaN   NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN     NaN    NaN  
3     NaN    NaN   NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN     NaN    NaN  
4     NaN    NaN   NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN     NaN    NaN  
5     NaN    NaN   NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN     NaN    NaN  
6     0 0  4.999  9028  1 333  11492  93315  62115    531   58-9  8 0 0  4.9000  99902  
7   0  4.  99990  28 1  42871  24302  87425  5 428  2 70-  912 0   0  4  0.9999  028 1  
8     NaN    NaN   NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN     NaN    NaN  
9     NaN    NaN   NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN     NaN    NaN  
10    NaN    NaN   NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN     NaN    NaN  
11    NaN    NaN   NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN     NaN    NaN  
12    412  30000   NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN     NaN    NaN  
13    NaN    NaN   NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN     NaN    NaN  
14  0  4.  99990  28 1  22085  66292   8518  81428  2 74-  9 5 0   0  4  0.9999  028 1

LoneStar134 commented 9 years ago

I can see that your column widths are off for some reason. My results are properly aligned all the way out to 141 columns. I truncated this data set just to demonstrate the issue.

jreback commented 9 years ago

best if you have a small reproducible example. Otherwise it's hard to see what the issue is.

LoneStar134 commented 9 years ago

I put the code out there in Git. I've run it on two platforms (OS X and Ubuntu, both with Python 2.7) and I get the same erroneous results.

jreback commented 9 years ago

can you show what is erroneous. I dont understand what the problem is.

jreback commented 9 years ago

FYI The idea is to make it easy to see a bug, if people have to jump thru hoops they won't be bothered.

LoneStar134 commented 9 years ago

Is it possible that your data file got misaligned with all the cutting and pasting you were doing? I will post the output as soon as I figure out how to post up here in a fixed font :)

jreback commented 9 years ago

that's why its best to post an in-line example e.g. something like

data = """AAABBBCD
ajslfljaslj
"""

pd.read_fwf(StringIO(data), ....)

then its just a copy-paste

LoneStar134 commented 9 years ago

Ok. Let me see what I can do to make this even easier.

jreback commented 9 years ago

@LoneStar134 the easier you make it the better.

jreback commented 9 years ago

you don't need conv_str, just use str (its a function as well)

LoneStar134 commented 9 years ago

import pandas as pd
import StringIO as io
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 500)

data1 = ' 24-9 1 0 0  4.9999028 1 176537129765660016476165221648910549 49390V270                                                                                           \n  0 111 0 0  4.9999028 1 479439030794V3001V7219V053 779896039                                                                                                     \n'
data2 = ' 82-910 0 0  4.9999028 1 428112130281410715849 41400V458125000V5867V5866530812724 404905859 V142 27651412  30000                                                  \n  0 0 2 0 0  4.9999028 1 1153 70291533829 49390                                                                                                                   \n'

col_widths = [3, 2, 2, 2, 2, 11, 2, 2, 3, 3, 2, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]
col_names = ['AGE', 'AGE_NEONATE', 'AMONTH', 'AWEEKEND', 'DIED', 'DISCWT', 'DISPUNIFORM', 'DQTR', 'DRG', 'DRG24', 'DRGVER', 'DRG_NoPOA', 'DX1', 'DX2', 'DX3', 'DX4', 'DX5', 'DX6', 'DX7', 'DX8', 'DX9', 'DX10', 'DX11', 'DX12', 'DX13', 'DX14', 'DX15', 'DX16', 'DX17', 'DX18', 'DX19', 'DX20', 'DX21', 'DX22', 'DX23', 'DX24', 'DX25']
col_conv = {'AGE': str, 'AGE_NEONATE': str, 'AMONTH': str, 'AWEEKEND': str, 'DIED': str, 'DISCWT': str, 'DISPUNIFORM': str, 'DQTR': str, 'DRG': str, 'DRG24': str, 'DRGVER': str, 'DRG_NoPOA': str, 'DX1': str, 'DX2': str, 'DX3': str, 'DX4': str, 'DX5': str, 'DX6': str, 'DX7': str, 'DX8': str, 'DX9': str, 'DX10': str, 'DX11': str, 'DX12': str, 'DX13': str, 'DX14': str, 'DX15': str, 'DX16': str, 'DX17': str, 'DX18': str, 'DX19': str, 'DX20': str, 'DX21': str, 'DX22': str, 'DX23': str, 'DX24': str, 'DX25': str}

print 'THE FOLLOWING EXAMPLE FAILS TO PROPERLY SET THE DATA TYPES OF THE COLUMNS WITH BLANK DATA VALUES STARTING AT COLUMN DX8'
df1 = pd.read_fwf(io.StringIO(data1), widths=col_widths, names=col_names, converters=col_conv)

print df1
for k in df1.keys():
    print k + ' / ' + str(df1[k][0].__class__)

print 'THE FOLLOWING EXAMPLE FAILS TO PROPERLY SET THE DATA TYPES OF THE COLUMNS WITH BLANK DATA VALUES STARTING AT COLUMN DX16'
df2 = pd.read_fwf(io.StringIO(data2), widths=col_widths, names=col_names, converters=col_conv)

print df2
for k in df2.keys():
    print k + ' / ' + str(df2[k][0].__class__)

jreback commented 9 years ago

ok, so what does 'fails to properly coerce' mean? you have blank data in those positions. so you get a NaN. what are you expecting?

LoneStar134 commented 9 years ago

I am expecting a String type, not Float.

jreback commented 9 years ago

ok, we could have skipped this whole discussion then. You always want to say: here is what I am getting, but I am expecting something else. Well, NaN is the marker for missing values. If they are all missing, the column will be all NaN. If it's all NaN it would normally be float, but you specified a converter, so it should be object.

So I believe this is what you are doing, and it seems to work. So see if you can show why your example is different.

In [9]: pd.read_fwf(StringIO("A \nB "),widths=[1,1],names=['first','second'],converters={'first' : str, 'second' : str})
Out[9]: 
  first second
0     A    NaN
1     B    NaN

In [10]: pd.read_fwf(StringIO("A \nB "),widths=[1,1],names=['first','second'],converters={'first' : str, 'second' : str}).dtypes
Out[10]: 
first     object
second    object
dtype: object
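
Note that `.dtypes` reports the dtype of the whole column, whereas a check such as `df[k][0].__class__` looks at a single element. A small Python 3 sketch of how the two can differ, assuming (this is an interpretation, not something confirmed in the thread) that a blank field ends up stored as NaN inside an object column. The Series below is built by hand to stand in for such a column:

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for a column produced by a str converter:
# one real string and one missing value.
col = pd.Series(["4019", np.nan], dtype=object)

column_dtype = col.dtype          # dtype of the column as a whole
first_type = type(col.iloc[0])    # Python type of a populated cell
missing_type = type(col.iloc[1])  # Python type of the NaN marker

print(column_dtype)   # object
print(first_type)     # <class 'str'>
print(missing_type)   # <class 'float'>
```

Under that assumption, an object column can report `float` for an individual cell simply because NaN is itself a float; `col.isna()` is a more direct way to spot missing cells.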

LoneStar134 commented 9 years ago

I'm OK with the NaNs. What I care about is the actual types associated with each of the columns, which is demonstrated by my code when it lists the types for each of the columns. You will see in the first example that it is String up until DX7 and then it becomes Float from DX8 on. In the second example you will see that it is String up until DX15 and then it becomes Float from DX16 on.

In the 1st example, I make the first line of the data1 an example that has data in all columns up to DX7. The result is that all fields are set to String up to DX7 as expected, but then every field after that is a Float because the first row of data had blanks for DX8 and beyond.

In the 2nd example, I make the first line of the data2 an example that has data in all columns up to DX15. And consistent with the first example, all fields are set to String up to DX15 as expected, but then every field after that is a Float because the first row of data had blanks for DX16 and beyond.

I expect all fields to be set to String based on my converters dictionary specifying that every column should be converted by the 'str' function.

If you run my code, it clearly demonstrates this. You must look at the vertical listing of columns that are output after each data frame because it displays the types of each of the columns for the data frame above it.

THE FOLLOWING EXAMPLE FAILS TO PROPERLY SET THE DATA TYPES OF THE COLUMNS WITH BLANK DATA VALUES STARTING AT COLUMN DX8.  SHOULD BE SET TO STRING BUT INSTEAD WAS SET TO FLOAT.
  AGE AGE_NEONATE AMONTH AWEEKEND DIED     DISCWT DISPUNIFORM DQTR  DRG DRG24 DRGVER DRG_NoPOA    DX1    DX2    DX3    DX4   DX5    DX6   DX7  DX8  DX9 DX10 DX11 DX12 DX13 DX14 DX15 DX16 DX17 DX18 DX19 DX20 DX21 DX22 DX23 DX24 DX25
0  24          -9      1        0    0  4.9999028           1    1  765   371     29       765  66001  64761  65221  64891  0549  49390  V270  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
1   0           1     11        0    0  4.9999028           1    4  794   390     30       794  V3001  V7219   V053  77989  6039    NaN   NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
AGE / <type 'str'>
AGE_NEONATE / <type 'str'>
AMONTH / <type 'str'>
AWEEKEND / <type 'str'>
DIED / <type 'str'>
DISCWT / <type 'str'>
DISPUNIFORM / <type 'str'>
DQTR / <type 'str'>
DRG / <type 'str'>
DRG24 / <type 'str'>
DRGVER / <type 'str'>
DRG_NoPOA / <type 'str'>
DX1 / <type 'str'>
DX2 / <type 'str'>
DX3 / <type 'str'>
DX4 / <type 'str'>
DX5 / <type 'str'>
DX6 / <type 'str'>
DX7 / <type 'str'>
DX8 / <type 'float'>
DX9 / <type 'float'>
DX10 / <type 'float'>
DX11 / <type 'float'>
DX12 / <type 'float'>
DX13 / <type 'float'>
DX14 / <type 'float'>
DX15 / <type 'float'>
DX16 / <type 'float'>
DX17 / <type 'float'>
DX18 / <type 'float'>
DX19 / <type 'float'>
DX20 / <type 'float'>
DX21 / <type 'float'>
DX22 / <type 'float'>
DX23 / <type 'float'>
DX24 / <type 'float'>
DX25 / <type 'float'>
THE FOLLOWING EXAMPLE FAILS TO PROPERLY SET THE DATA TYPES OF THE COLUMNS WITH BLANK DATA VALUES STARTING AT COLUMN DX16.  SHOULD BE SET TO STRING BUT INSTEAD WAS SET TO FLOAT.
  AGE AGE_NEONATE AMONTH AWEEKEND DIED     DISCWT DISPUNIFORM DQTR  DRG DRG24 DRGVER DRG_NoPOA    DX1    DX2    DX3    DX4    DX5    DX6    DX7    DX8   DX9   DX10  DX11  DX12   DX13 DX14   DX15 DX16 DX17 DX18 DX19 DX20 DX21 DX22 DX23 DX24 DX25
0  82          -9     10        0    0  4.9999028           1    4  281   121     30       281  41071   5849  41400  V4581  25000  V5867  V5866  53081  2724  40490  5859  V142  27651  412  30000  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
1   0           0      2        0    0  4.9999028           1    1  153    70     29       153   3829  49390    NaN    NaN    NaN    NaN    NaN    NaN   NaN    NaN   NaN   NaN    NaN  NaN    NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
AGE / <type 'str'>
AGE_NEONATE / <type 'str'>
AMONTH / <type 'str'>
AWEEKEND / <type 'str'>
DIED / <type 'str'>
DISCWT / <type 'str'>
DISPUNIFORM / <type 'str'>
DQTR / <type 'str'>
DRG / <type 'str'>
DRG24 / <type 'str'>
DRGVER / <type 'str'>
DRG_NoPOA / <type 'str'>
DX1 / <type 'str'>
DX2 / <type 'str'>
DX3 / <type 'str'>
DX4 / <type 'str'>
DX5 / <type 'str'>
DX6 / <type 'str'>
DX7 / <type 'str'>
DX8 / <type 'str'>
DX9 / <type 'str'>
DX10 / <type 'str'>
DX11 / <type 'str'>
DX12 / <type 'str'>
DX13 / <type 'str'>
DX14 / <type 'str'>
DX15 / <type 'str'>
DX16 / <type 'float'>
DX17 / <type 'float'>
DX18 / <type 'float'>
DX19 / <type 'float'>
DX20 / <type 'float'>
DX21 / <type 'float'>
DX22 / <type 'float'>
DX23 / <type 'float'>
DX24 / <type 'float'>
DX25 / <type 'float'>

Process finished with exit code 0

LoneStar134 commented 9 years ago

You can better visualize the alignment of the sample data sets (2 rows in each) in the following. This is as if you were viewing the data in an editor with the first line on top and the 2nd line on the bottom.

data1:

 24-9 1 0 0  4.9999028 1 176537129765660016476165221648910549 49390V270                                                                                           
  0 111 0 0  4.9999028 1 479439030794V3001V7219V053 779896039

data2:

 82-910 0 0  4.9999028 1 428112130281410715849 41400V458125000V5867V5866530812724 404905859 V142 27651412  30000                                                  
  0 0 2 0 0  4.9999028 1 1153 70291533829 49390
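
To check by hand which characters fall into which field, the widths can be applied to a raw line with plain string slicing. A rough Python 3 sketch, independent of pandas; `split_fixed` is a hypothetical helper written only for this illustration, and the line is the first row of data1 with its trailing blanks restored by padding:

```python
from itertools import accumulate

col_widths = [3, 2, 2, 2, 2, 11, 2, 2, 3, 3, 2, 3] + [5] * 25
col_names = ['AGE', 'AGE_NEONATE', 'AMONTH', 'AWEEKEND', 'DIED', 'DISCWT',
             'DISPUNIFORM', 'DQTR', 'DRG', 'DRG24', 'DRGVER', 'DRG_NoPOA']
col_names += ['DX%d' % i for i in range(1, 26)]


def split_fixed(line, widths):
    """Slice a line into fields by width and strip the padding (hypothetical helper)."""
    ends = list(accumulate(widths))
    starts = [0] + ends[:-1]
    return [line[s:e].strip() for s, e in zip(starts, ends)]


# First row of data1, padded with spaces to the full record length.
line = ' 24-9 1 0 0  4.9999028 1 176537129765660016476165221648910549 49390V270'
line = line.ljust(sum(col_widths))

fields = dict(zip(col_names, split_fixed(line, col_widths)))
for name in ('AGE', 'DISCWT', 'DX5', 'DX7', 'DX8'):
    print(name, repr(fields[name]))
```

Fields that come back as empty strings here (DX8 onward for this row) are the blank positions discussed in this thread.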

LoneStar134 commented 9 years ago

FIXED! It just occurred to me that maybe the blanks were not being handled properly by pandas at some point in the process. So I rewrote conv_str so that, if the value is blank, it returns a single space ' '. I consider this a workaround for some unintuitive behavior in pandas. Pandas should apply the conversions specified by the converters regardless of the data passed. But it appears that somewhere in the process the spaces are being converted to the empty string '', and pandas then chooses to ignore the converter for that column. Unless someone can explain to me why a String should become a Float in this scenario, I think this should be looked at as a bug.

Code and result follows:

import pandas as pd
import StringIO as io
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 500)

def conv_str(x):
    if x.replace(' ','') == '':
        return ' '
    else:
        return str(x)

data1 = ' 24-9 1 0 0  4.9999028 1 176537129765660016476165221648910549 49390V270                                                                                           \n  0 111 0 0  4.9999028 1 479439030794V3001V7219V053 779896039                                                                                                     \n'
data2 = ' 82-910 0 0  4.9999028 1 428112130281410715849 41400V458125000V5867V5866530812724 404905859 V142 27651412  30000                                                  \n  0 0 2 0 0  4.9999028 1 1153 70291533829 49390                                                                                                                   \n'

col_widths = [3, 2, 2, 2, 2, 11, 2, 2, 3, 3, 2, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]
col_names = ['AGE', 'AGE_NEONATE', 'AMONTH', 'AWEEKEND', 'DIED', 'DISCWT', 'DISPUNIFORM', 'DQTR', 'DRG', 'DRG24', 'DRGVER', 'DRG_NoPOA', 'DX1', 'DX2', 'DX3', 'DX4', 'DX5', 'DX6', 'DX7', 'DX8', 'DX9', 'DX10', 'DX11', 'DX12', 'DX13', 'DX14', 'DX15', 'DX16', 'DX17', 'DX18', 'DX19', 'DX20', 'DX21', 'DX22', 'DX23', 'DX24', 'DX25']
col_conv = {'AGE': conv_str, 'AGE_NEONATE': conv_str, 'AMONTH': conv_str, 'AWEEKEND': conv_str, 'DIED': conv_str, 'DISCWT': conv_str, 'DISPUNIFORM': conv_str, 'DQTR': conv_str, 'DRG': conv_str, 'DRG24': conv_str, 'DRGVER': conv_str, 'DRG_NoPOA': conv_str, 'DX1': conv_str, 'DX2': conv_str, 'DX3': conv_str, 'DX4': conv_str, 'DX5': conv_str, 'DX6': conv_str, 'DX7': conv_str, 'DX8': conv_str, 'DX9': conv_str, 'DX10': conv_str, 'DX11': conv_str, 'DX12': conv_str, 'DX13': conv_str, 'DX14': conv_str, 'DX15': conv_str, 'DX16': conv_str, 'DX17': conv_str, 'DX18': conv_str, 'DX19': conv_str, 'DX20': conv_str, 'DX21': conv_str, 'DX22': conv_str, 'DX23': conv_str, 'DX24': conv_str, 'DX25': conv_str}

print 'THE FOLLOWING EXAMPLE FAILS TO PROPERLY COERCE THE COLUMNS WITH BLANK DATA VALUES STARTING AT COLUMN DX8'
df1 = pd.read_fwf(io.StringIO(data1), widths=col_widths, names=col_names, converters=col_conv)

print df1
for k in df1.keys():
    print k + ' / ' + str(df1[k][0].__class__)

print 'THE FOLLOWING EXAMPLE FAILS TO PROPERLY COERCE THE COLUMNS WITH BLANK DATA VALUES STARTING AT COLUMN DX16'
df2 = pd.read_fwf(io.StringIO(data2), widths=col_widths, names=col_names, converters=col_conv)

print df2
for k in df2.keys():
    print k + ' / ' + str(df2[k][0].__class__)

OUTPUT:

THE FOLLOWING EXAMPLE FAILS TO PROPERLY COERCE THE COLUMNS WITH BLANK DATA VALUES STARTING AT COLUMN DX8
  AGE AGE_NEONATE AMONTH AWEEKEND DIED     DISCWT DISPUNIFORM DQTR  DRG DRG24 DRGVER DRG_NoPOA    DX1    DX2    DX3    DX4   DX5    DX6   DX7 DX8 DX9 DX10 DX11 DX12 DX13 DX14 DX15 DX16 DX17 DX18 DX19 DX20 DX21 DX22 DX23 DX24 DX25
0  24          -9      1        0    0  4.9999028           1    1  765   371     29       765  66001  64761  65221  64891  0549  49390  V270                                                                                        
1   0           1     11        0    0  4.9999028           1    4  794   390     30       794  V3001  V7219   V053  77989  6039                                                                                                     
AGE / <type 'str'>
AGE_NEONATE / <type 'str'>
AMONTH / <type 'str'>
AWEEKEND / <type 'str'>
DIED / <type 'str'>
DISCWT / <type 'str'>
DISPUNIFORM / <type 'str'>
DQTR / <type 'str'>
DRG / <type 'str'>
DRG24 / <type 'str'>
DRGVER / <type 'str'>
DRG_NoPOA / <type 'str'>
DX1 / <type 'str'>
DX2 / <type 'str'>
DX3 / <type 'str'>
DX4 / <type 'str'>
DX5 / <type 'str'>
DX6 / <type 'str'>
DX7 / <type 'str'>
DX8 / <type 'str'>
DX9 / <type 'str'>
DX10 / <type 'str'>
DX11 / <type 'str'>
DX12 / <type 'str'>
DX13 / <type 'str'>
DX14 / <type 'str'>
DX15 / <type 'str'>
DX16 / <type 'str'>
DX17 / <type 'str'>
DX18 / <type 'str'>
DX19 / <type 'str'>
DX20 / <type 'str'>
DX21 / <type 'str'>
DX22 / <type 'str'>
DX23 / <type 'str'>
DX24 / <type 'str'>
DX25 / <type 'str'>
THE FOLLOWING EXAMPLE FAILS TO PROPERLY COERCE THE COLUMNS WITH BLANK DATA VALUES STARTING AT COLUMN DX16
  AGE AGE_NEONATE AMONTH AWEEKEND DIED     DISCWT DISPUNIFORM DQTR  DRG DRG24 DRGVER DRG_NoPOA    DX1    DX2    DX3    DX4    DX5    DX6    DX7    DX8   DX9   DX10  DX11  DX12   DX13 DX14   DX15 DX16 DX17 DX18 DX19 DX20 DX21 DX22 DX23 DX24 DX25
0  82          -9     10        0    0  4.9999028           1    4  281   121     30       281  41071   5849  41400  V4581  25000  V5867  V5866  53081  2724  40490  5859  V142  27651  412  30000                                                  
1   0           0      2        0    0  4.9999028           1    1  153    70     29       153   3829  49390                                                                                                                                        
AGE / <type 'str'>
AGE_NEONATE / <type 'str'>
AMONTH / <type 'str'>
AWEEKEND / <type 'str'>
DIED / <type 'str'>
DISCWT / <type 'str'>
DISPUNIFORM / <type 'str'>
DQTR / <type 'str'>
DRG / <type 'str'>
DRG24 / <type 'str'>
DRGVER / <type 'str'>
DRG_NoPOA / <type 'str'>
DX1 / <type 'str'>
DX2 / <type 'str'>
DX3 / <type 'str'>
DX4 / <type 'str'>
DX5 / <type 'str'>
DX6 / <type 'str'>
DX7 / <type 'str'>
DX8 / <type 'str'>
DX9 / <type 'str'>
DX10 / <type 'str'>
DX11 / <type 'str'>
DX12 / <type 'str'>
DX13 / <type 'str'>
DX14 / <type 'str'>
DX15 / <type 'str'>
DX16 / <type 'str'>
DX17 / <type 'str'>
DX18 / <type 'str'>
DX19 / <type 'str'>
DX20 / <type 'str'>
DX21 / <type 'str'>
DX22 / <type 'str'>
DX23 / <type 'str'>
DX24 / <type 'str'>
DX25 / <type 'str'>

Process finished with exit code 0
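
Another option that may avoid a custom converter is to turn off the default missing-value markers, so that blank fields stay as empty strings instead of becoming NaN. A hedged sketch in Python 3, assuming a pandas version in which `read_fwf` forwards `keep_default_na` to the underlying parser. It was not tested against the pandas 0.16.2 used in this thread, and the two-row input is a toy stand-in for the real file:

```python
import io

import pandas as pd

# Toy fixed-width input: two one-character columns, the second one blank.
data = "A \nB "

df = pd.read_fwf(
    io.StringIO(data),
    widths=[1, 1],
    names=["first", "second"],
    converters={"first": str, "second": str},
    keep_default_na=False,  # do not treat '' (or 'NA', 'NaN', ...) as missing
)

# Collect the Python type names found in each column.
cell_types = {name: {type(v).__name__ for v in df[name]} for name in df.columns}
print(cell_types)
```

The trade-off is that literal markers such as 'NA' are then also kept as text, so any genuine missing-value handling has to be done explicitly afterwards.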

pandas-dev / pandas

read_fwf will not convert columns to specified data type when initial row instances are blank #10616
