socialfoundations / folktables

Datasets derived from US census data
MIT License

Problem with get_data in Quick start example #15

Closed mrtobie closed 2 years ago

mrtobie commented 2 years ago

Hi there, I am very much looking forward to studying the data, and therefore to using your package.

I have a problem with your quick start example. It seems to be an issue with the downloaded .zip files (for example csv_pca.zip). I ran the code exactly as suggested.

I tried to manually download the .zip file and set download=False, but that led to a FileNotFoundError.

Here is my .ipynb with the error message that occurs:

pip install folktables
Requirement already satisfied: folktables in d:\programmierung\jupyter\envs\master_thesis_pre\lib\site-packages (0.0.11)
Note: you may need to restart the kernel to use updated packages.

Requirement already satisfied: requests in d:\programmierung\jupyter\envs\master_thesis_pre\lib\site-packages (from folktables) (2.27.1)
Requirement already satisfied: numpy in d:\programmierung\jupyter\envs\master_thesis_pre\lib\site-packages (from folktables) (1.21.5)
Requirement already satisfied: pandas in d:\programmierung\jupyter\envs\master_thesis_pre\lib\site-packages (from folktables) (1.4.0)
Requirement already satisfied: sklearn in d:\programmierung\jupyter\envs\master_thesis_pre\lib\site-packages (from folktables) (0.0)
Requirement already satisfied: python-dateutil>=2.8.1 in d:\programmierung\jupyter\envs\master_thesis_pre\lib\site-packages (from pandas->folktables) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in d:\programmierung\jupyter\envs\master_thesis_pre\lib\site-packages (from pandas->folktables) (2021.3)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in d:\programmierung\jupyter\envs\master_thesis_pre\lib\site-packages (from requests->folktables) (1.26.8)
Requirement already satisfied: certifi>=2017.4.17 in d:\programmierung\jupyter\envs\master_thesis_pre\lib\site-packages (from requests->folktables) (2021.10.8)
Requirement already satisfied: idna<4,>=2.5 in d:\programmierung\jupyter\envs\master_thesis_pre\lib\site-packages (from requests->folktables) (3.3)
Requirement already satisfied: charset-normalizer~=2.0.0 in d:\programmierung\jupyter\envs\master_thesis_pre\lib\site-packages (from requests->folktables) (2.0.11)
Requirement already satisfied: scikit-learn in d:\programmierung\jupyter\envs\master_thesis_pre\lib\site-packages (from sklearn->folktables) (1.0.2)
Requirement already satisfied: six>=1.5 in d:\programmierung\jupyter\envs\master_thesis_pre\lib\site-packages (from python-dateutil>=2.8.1->pandas->folktables) (1.16.0)
Requirement already satisfied: joblib>=0.11 in d:\programmierung\jupyter\envs\master_thesis_pre\lib\site-packages (from scikit-learn->sklearn->folktables) (1.1.0)
Requirement already satisfied: scipy>=1.1.0 in d:\programmierung\jupyter\envs\master_thesis_pre\lib\site-packages (from scikit-learn->sklearn->folktables) (1.5.4)
Requirement already satisfied: threadpoolctl>=2.0.0 in d:\programmierung\jupyter\envs\master_thesis_pre\lib\site-packages (from scikit-learn->sklearn->folktables) (3.1.0)
from folktables import ACSDataSource, ACSEmployment

data_source = ACSDataSource(survey_year='2018', horizon='1-Year', survey='person')
acs_data = data_source.get_data(states=["CA"], download=True)
features, label, group = ACSEmployment.df_to_numpy(acs_data)
Downloading data for 2018 1-Year person survey for CA...

data\2018\1-Year\csv_pca.zip may be corrupted. Please try deleting it and rerunning this command.

Exception:  File is not a zip file

---------------------------------------------------------------------------

FileNotFoundError                         Traceback (most recent call last)

Input In [18], in <module>
      1 from folktables import ACSDataSource, ACSEmployment
      3 data_source = ACSDataSource(survey_year='2018', horizon='1-Year', survey='person')
----> 4 acs_data = data_source.get_data(states=["CA"], download=True)
      5 features, label, group = ACSEmployment.df_to_numpy(acs_data)

File D:\Programmierung\Jupyter\envs\master_thesis_pre\lib\site-packages\folktables\acs.py:34, in ACSDataSource.get_data(self, states, density, random_seed, join_household, download)
     32 def get_data(self, states=None, density=1.0, random_seed=0, join_household=False, download=False):
     33     """Get data from given list of states, density, and random seed. Optionally add household features."""
---> 34     data = load_acs(root_dir=self._root_dir,
     35                     year=self._survey_year,
     36                     states=states,
     37                     horizon=self._horizon,
     38                     survey=self._survey,
     39                     density=density,
     40                     random_seed=random_seed,
     41                     download=download)
     42     if join_household:
     43         orig_len = len(data)

File D:\Programmierung\Jupyter\envs\master_thesis_pre\lib\site-packages\folktables\load_acs.py:120, in load_acs(root_dir, states, year, horizon, survey, density, random_seed, serial_filter_list, download)
    116 first = True
    118 for file_name in file_names:
--> 120     with open(file_name, 'r') as f:
    122         if first:
    123             sample.write(next(f))

FileNotFoundError: [Errno 2] No such file or directory: 'data\\2018\\1-Year\\psam_p06.csv'

millerjohnp commented 2 years ago

Hi, it looks like you're using Windows, which may break some of the downloading and path manipulation logic in the repo. We don't support Windows, but it might help to look at how os.path.join is used in folktables/load_acs.py and adapt it to work on your OS. I'm closing this issue, but if you believe the downloading problem isn't Windows-specific, please re-open it.
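For anyone who downloads the archive by hand and passes download=False: the traceback above suggests the loader opens the extracted CSV (psam_p06.csv) from the same folder that holds csv_pca.zip, so the archive probably has to be valid and unpacked there first. The following is a minimal sketch under that assumption. The helper name extract_if_valid is hypothetical and is not part of folktables.

```python
import zipfile
from pathlib import Path


def extract_if_valid(zip_path):
    """Extract zip_path into its own folder.

    Returns the list of extracted member names, or None if the file is
    missing or is not a valid zip archive (hypothetical helper, not
    part of folktables).
    """
    zip_path = Path(zip_path)
    if not zip_path.is_file() or not zipfile.is_zipfile(zip_path):
        return None
    with zipfile.ZipFile(zip_path) as archive:
        # Assumption: the loader expects the CSV next to the archive,
        # as the paths in the traceback suggest.
        archive.extractall(zip_path.parent)
        return archive.namelist()


if __name__ == "__main__":
    # Relative path taken from the error message in this issue.
    names = extract_if_valid(Path("data") / "2018" / "1-Year" / "csv_pca.zip")
    if names is None:
        print("Missing or not a valid zip archive; delete it and download it again.")
    else:
        print("Extracted:", names)
```

If the check reports an invalid archive, the saved file is likely not the real download, which would match the "File is not a zip file" message above.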

JJEW22 commented 2 years ago

Hi, as a fellow Windows user I ran into this same issue. I thought I would post the solution I came up with for any other Windows users who want to use this dataset.

As @millerjohnp suggested, find the function initialize_and_download in folktables/load_acs.py and modify the line that sets the URL (line 73 at the time of writing):

From: url = os.path.join(base_url, remote_fname)
To: url = base_url + "/" + remote_fname

The reason for this change is that os.path.join on Windows puts a '\' between the path components, but in a URL the separator should always be '/', regardless of the OS.

Note: Other libraries exist for combining URL paths and could be used instead, but this worked for my purposes.
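The difference can be reproduced on any OS with a short sketch. It uses ntpath, the Windows flavour of os.path from the standard library, to show what os.path.join does on Windows, and compares it with plain concatenation and two standard-library alternatives. The base URL below is a placeholder chosen for illustration, not the URL folktables actually uses; only the file name csv_pca.zip comes from this thread.

```python
import ntpath      # Windows flavour of os.path, importable on any OS
import posixpath   # POSIX flavour of os.path, always joins with '/'
from urllib.parse import urljoin

# Placeholder base URL for illustration only (an assumption).
base_url = "https://example.com/data/2018/1-Year"
remote_fname = "csv_pca.zip"

# What os.path.join does on Windows: a backslash ends up in the URL.
windows_style = ntpath.join(base_url, remote_fname)

# The fix from this thread: concatenate with a forward slash.
concatenated = base_url + "/" + remote_fname

# Standard-library alternatives that also produce a forward slash.
posix_style = posixpath.join(base_url, remote_fname)
# urljoin replaces the last path segment unless the base ends with '/',
# so the trailing slash is added explicitly here.
url_joined = urljoin(base_url + "/", remote_fname)

print(windows_style)
print(concatenated)
print(posix_style)
print(url_joined)
```

Only the first printed URL contains a backslash; the other three are identical. Plain concatenation has the fewest surprises here, since urljoin silently drops the final path segment of the base when the trailing slash is missing.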