pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

famafrench, pandas.io.data.DataReader error for files that do not end with line of length 2 #20610

Closed sarah-schroeder closed 6 years ago

sarah-schroeder commented 6 years ago

I ran into a minor issue reading data from the Fama-French data library in pandas 0.16.2: for some files in the data library, the file's second edge is not detected. For example, DataReader does not return the data from the file 'F-F_Research_Data_5_Factors_2x3_daily_TXT' on Ken French's website:

import pandas as pd
import pandas.io.data as web

five_factors = web.DataReader("F-F_Research_Data_5_Factors_2x3_daily_TXT", "famafrench")

The above returns an empty dictionary, and no error is raised. This seems to happen because DataReader fails to detect the last line of the file as an edge, since the file does not end with a line of length 2. Hence, only the first edge of the file is detected.

Here is the relevant piece of code from within the "get_data_famafrench(name)" function:

line_lengths = np.array(lmap(len, data))
file_edges = np.where(line_lengths == 2)[0]

Example showing only a single edge is detected:

print file_edges
(array([2], dtype=int64),)
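To illustrate the behaviour described above, here is a minimal pure-Python sketch of the same length-2 edge test. It is not the pandas implementation: the sample lines are made up, the assumption that a length-2 line is a bare "\r\n" is mine, and find_edges_with_eof is a hypothetical helper showing one possible way to treat the end of the file as a closing edge.

```python
def find_edges(lines):
    # Indices of lines whose length is exactly 2, mirroring
    # np.where(line_lengths == 2)[0] from the snippet above.
    return [i for i, line in enumerate(lines) if len(line) == 2]


def find_edges_with_eof(lines):
    # Hypothetical variant: if the file does not end with a
    # length-2 line, use the end of the file as the closing edge.
    edges = find_edges(lines)
    if not edges or edges[-1] != len(lines) - 1:
        edges.append(len(lines))
    return edges


# Made-up sample lines, not real Fama-French data.
sample = [
    "title line\r\n",
    "description line\r\n",
    "\r\n",
    "row-1 0.1 0.2\r\n",
    "row-2 0.3 0.4\r\n",
]

print(find_edges(sample))             # [2]
print(find_edges(sample + ["\r\n"]))  # [2, 5]
print(find_edges_with_eof(sample))    # [2, 5]
```

With only one edge there is no closing boundary for the data block, which would be consistent with the empty dictionary reported here.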

Here is the output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.16.2
nose: 1.3.7
Cython: 0.22.1
numpy: 1.8.2
scipy: 0.14.0
statsmodels: 0.6.1
IPython: 3.2.1
sphinx: 1.3.1
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.4
bottleneck: 0.8.0
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.3
openpyxl: 2.0.2
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.7.3
lxml: 3.4.4
bs4: 4.3.2
html5lib: 0.999
httplib2: 0.8
apiclient: None
sqlalchemy: 1.0.6
pymysql: None
psycopg2: None

TomAugspurger commented 6 years ago

0.16.2 is quite old. Since then, the remote data access parts have been split into a separate package: https://pypi.org/project/pandas-datareader/

Could you try with a newer pandas and pandas-datareader? It could be fixed already.

If not, it could be reported on https://github.com/pydata/pandas-datareader/issues
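For anyone retrying with the separate package, a rough sketch of the equivalent call might look like the following. It assumes pandas-datareader is installed and that the dataset is named without the _TXT suffix in that package, which I have not verified against this issue; read_five_factors is a hypothetical wrapper, not part of either library.

```python
def read_five_factors(name="F-F_Research_Data_5_Factors_2x3_daily"):
    # Hypothetical wrapper. The default dataset name (without "_TXT")
    # is an assumption and may need adjusting.
    # Imported lazily so defining the function does not need the package.
    import pandas_datareader.data as web

    return web.DataReader(name, "famafrench")


# Usage (needs network access and `pip install pandas-datareader`):
# five_factors = read_five_factors()
# print(five_factors.keys())
```

If the result is still empty with current versions, that would be the case to report on the pandas-datareader tracker.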

jreback commented 6 years ago

out of scope for pandas proper