rnelsonchem / simpledbf

A simple DBF file converter for Python3
BSD 3-Clause "New" or "Revised" License
30 stars 18 forks source link

No records on conversion #8

Open brianckeegan opened 6 years ago

brianckeegan commented 6 years ago

I'm working with the National Highway Transportation and Safety Administration's Fatality Analysis Reporting Systems which reports its data in DBF format from 1975-2016. The data starting in 2010 ( ftp://ftp.nhtsa.dot.gov/fars/2010/DBF/FARS2010.zip ) is read without error and includes appropriate-looking Dbf5 attributes, but none of the conversion methods produce data.

import pandas as pd
from simpledbf import Dbf5
dbf = Dbf5('accident.dbf',codec='latin1')
print("Number of records (raw): {0:,}".format(_dbf.numrec))
# 29,867
print("Number of records (DataFrame): {0:,}".format(len(_dbf.to_dataframe())))
# 0

Using other methods like to_csv and other codec encodings produces the same behavior: no data produced. I've reproduced this issue on both Mac and PC on Anaconda 4 running Python 3.6 in Jupyter Notebook.

rnelsonchem commented 6 years ago

@brianckeegan I confirmed your observation, and I found the source of your problems.

If you look at the _get_recs method the first byte is tested to see if the record is deleted. According to the specs, that first byte should be a space. If not, it is assumed that the record was deleted.

I modified my local version of the simpledbf file to print('No rec') if that flag is triggered, and when I try to convert that particular DBF file to a DataFrame, I get a bunch of "No rec"s printed.

My suggestion would be to comment out those two lines of your local simpledbf install; however, you might want to ping the folks that made that file to find out why all of the records are marked as deleted... Hope that helps.

rnelsonchem commented 6 years ago

If modifying your install is not in the cards, I was able to open that file with Excel 2016, and it seemed like all of the records were there. It is interesting, though, that Excel does not give any indication that the records were marked for deletion as per the specs...

Perhaps I'll need to add a "show deleted records" flag to the code... Once we switch over to 30h days, I'll use that new time to make some updates :)

brianckeegan commented 6 years ago

Thanks for looking into this. I don't think I'll be able to convince anyone at NHTSA to alter these data, but it would be great if there was an option when initializing in Dbf5 or using to_dataframe() to include all records, even those flagged as deleted like this case.