stephbuon / digital-history

Instructional repository for "Text Mining as Historical Method"
GNU General Public License v3.0

Add specified cells to access_data.ipynb #5

Closed stephbuon closed 3 years ago

stephbuon commented 3 years ago

Add four new cells to digital-history/access-data/access_data.ipynb, in the section for accessing congressional data:

Text cell:

The data is divided into many .txt files.

Speeches are stored in separate .txt files (names beginning with speeches) from metadata such as dates (names beginning with descr). The following code iterates over every speeches file and every descr file and builds one DataFrame per file type.

Code cell:

```python
import csv
import glob

import pandas as pd

directory = '/scratch/group/oit_research_data/stanford_congress/hein-bound/'
file_type = 'txt'
separator = '|'

# csv.QUOTE_NONE treats quote characters as ordinary text.
# on_bad_lines="skip" needs pandas >= 1.3; older versions use error_bad_lines=False.
speeches_df = pd.concat([pd.read_csv(f, sep=separator, encoding="ISO-8859-1", on_bad_lines="skip", quoting=csv.QUOTE_NONE) for f in glob.glob(directory + "speeches*" + file_type)])

descr_df = pd.concat([pd.read_csv(f, sep=separator, encoding="ISO-8859-1", on_bad_lines="skip", quoting=csv.QUOTE_NONE) for f in glob.glob(directory + "descr*" + file_type)])
```
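For anyone without access to the cluster path, the loading pattern can be tried on throwaway files. This is only a minimal sketch: the file names, the `speech` column, the sample rows, and the helper `load_pipe_files` are invented for illustration and are not taken from the dataset. It uses `csv.QUOTE_NONE` and `on_bad_lines="skip"` (which assumes pandas 1.3 or later) in place of `error_bad_lines=False`.

```python
import csv
import glob
import os
import tempfile

import pandas as pd


def load_pipe_files(pattern):
    """Read every file matching `pattern` as pipe-delimited text and stack the rows.

    Hypothetical helper for illustration; not part of the notebook.
    """
    paths = sorted(glob.glob(pattern))  # sorted so row order is reproducible
    frames = [
        pd.read_csv(
            path,
            sep="|",
            encoding="ISO-8859-1",
            quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary text
            on_bad_lines="skip",     # assumes pandas >= 1.3
        )
        for path in paths
    ]
    # ignore_index gives one continuous index instead of repeating per-file indexes
    return pd.concat(frames, ignore_index=True)


# Build two tiny made-up speech files in a temporary directory.
tmp_dir = tempfile.mkdtemp()
samples = {
    "speeches_001.txt": 'speech_id|speech\n1|Mr. Speaker, I rise "today".\n2|I yield.\n',
    "speeches_002.txt": "speech_id|speech\n3|The bill is passed.\n4|too|many|fields\n",
}
for name, text in samples.items():
    with open(os.path.join(tmp_dir, name), "w", encoding="ISO-8859-1") as fh:
        fh.write(text)

demo_speeches_df = load_pipe_files(os.path.join(tmp_dir, "speeches*txt"))
print(demo_speeches_df)
```

The last sample row has more fields than the header, so the parser skips it instead of raising an error. Passing `ignore_index=True` is a design choice of the sketch; the cell above keeps each file's own row index.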

Text cell:

Now we can create one large DataFrame that combines the speeches with the metadata by matching rows on speech_id. Missing values are then filled with 0.

Code cell:

all_data = pd.merge(speeches_df, descr_df, on='speech_id').fillna(0)
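As a quick check on what this merge does, here is a small sketch on invented rows. `pd.merge` defaults to an inner join, so a speech with no matching descr row is dropped, and `fillna(0)` then replaces any remaining missing values with 0. The toy values and the `speech` and `date` columns are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins for speeches_df and descr_df; all values are invented.
toy_speeches = pd.DataFrame(
    {"speech_id": [1, 2, 3], "speech": ["I rise today.", "I yield.", "The bill is passed."]}
)
toy_descr = pd.DataFrame(
    {"speech_id": [1, 2, 4], "date": [18730304.0, None, 18730306.0]}
)

# Same call as in the cell above: inner join on speech_id, then fill gaps with 0.
toy_all = pd.merge(toy_speeches, toy_descr, on="speech_id").fillna(0)
print(toy_all)

# An outer join with indicator=True shows which ids appear on only one side.
audit = pd.merge(toy_speeches, toy_descr, on="speech_id", how="outer", indicator=True)
unmatched_ids = sorted(audit.loc[audit["_merge"] != "both", "speech_id"])
print(unmatched_ids)
```

Running the audit step before the real merge would show how many speeches lack metadata, which the inner join otherwise drops silently.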

alexanderr commented 3 years ago

Fixed in ef878f858269f9f876983856fdfa06fa7a0870e6