Add four new cells to digital-history/access-data/access_data.ipynb, in the section for accessing congressional data:
Text cell:
The data is divided into many .txt files.
Speeches are stored in separate .txt files (labeled speeches) from metadata such as dates (labeled descr). The following code reads every speeches file and every descr file and builds one DataFrame for each of these file types.
Code cell:
`import glob
import os
import csv
import pandas as pd

directory = '/scratch/group/oit_research_data/stanford_congress/hein-bound/'
file_type = 'txt'
seperator = '|'

# Read every speeches file into one DataFrame. csv.QUOTE_NONE treats quote characters as ordinary text.
# on_bad_lines="skip" drops malformed rows; on pandas older than 1.3, use error_bad_lines=False instead.
speeches_df = pd.concat([pd.read_csv(f, sep=seperator, encoding="ISO-8859-1", on_bad_lines="skip", quoting=csv.QUOTE_NONE) for f in glob.glob(directory + "speeches*"+file_type)])
# Read every descr (metadata) file the same way.
descr_df = pd.concat([pd.read_csv(f, sep=seperator, encoding="ISO-8859-1", on_bad_lines="skip", quoting=csv.QUOTE_NONE) for f in glob.glob(directory + "descr*"+file_type)])`
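Note for reviewers: the glob-and-concatenate pattern can be checked without access to the cluster directory. The following is a minimal sketch, not part of the cells to add. The temporary directory, file names, columns, and sample rows are all made up for illustration; only the pipe separator and the encoding mirror the cell above, and quoting is disabled with `csv.QUOTE_NONE`, the constant the `csv` module provides.

```python
import csv
import glob
import os
import tempfile

import pandas as pd

# Hypothetical sample data: two small pipe-delimited files standing in
# for the real speeches files.
tmp_dir = tempfile.mkdtemp()
samples = {
    "speeches_001.txt": 'speech_id|speech\n1|Mr. Speaker, I rise today.\n2|He said "no" twice.\n',
    "speeches_002.txt": "speech_id|speech\n3|I yield back.\n",
}
for name, text in samples.items():
    with open(os.path.join(tmp_dir, name), "w", encoding="ISO-8859-1") as handle:
        handle.write(text)

# Same pattern as the notebook cell: one DataFrame per file, then concatenate.
paths = sorted(glob.glob(os.path.join(tmp_dir, "speeches*txt")))
frames = [
    pd.read_csv(path, sep="|", encoding="ISO-8859-1", quoting=csv.QUOTE_NONE)
    for path in paths
]
# ignore_index=True renumbers rows from 0 instead of repeating each file's own index.
sample_df = pd.concat(frames, ignore_index=True)
print(sample_df)
```

Because quoting is turned off, the quotation marks in the second sample row are kept as literal text rather than being interpreted as field delimiters.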
Text cell:
Now we can create a large DataFrame that combines the speeches with the metadata.
Code cell:
`all_data = pd.merge(speeches_df, descr_df, on='speech_id').fillna(0)`
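As a rough illustration of what this merge does, the sketch below uses two tiny made-up DataFrames. The `speech` and `date` columns and their values are assumptions for the example, not the real file schema. `pd.merge` defaults to an inner join, so a speech with no matching metadata row is dropped, and `fillna(0)` then replaces any remaining missing values with 0.

```python
import pandas as pd

# Hypothetical stand-ins for speeches_df and descr_df.
speeches_df = pd.DataFrame(
    {
        "speech_id": [1, 2, 3],
        "speech": ["First speech.", "Second speech.", "Third speech."],
    }
)
descr_df = pd.DataFrame(
    {
        "speech_id": [1, 2],        # no metadata row for speech 3
        "date": [18730304, None],   # speech 2 has a missing date
    }
)

# Inner join on speech_id, then replace missing values with 0.
all_data = pd.merge(speeches_df, descr_df, on="speech_id").fillna(0)
print(all_data)
```

Speech 3 does not appear in the result because it has no metadata row, and the missing date for speech 2 becomes 0. If unmatched speeches should be kept, passing `how='left'` to `pd.merge` is the option to consider.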