uga-libraries / congressional-mail

Providing basic access to metadata from congressional correspondent system exports.
Creative Commons Attribution Share Alike 4.0 International

CSS Archiving Format #1

Open amhanson9 opened 2 weeks ago

amhanson9 commented 2 weeks ago

Remove columns with PII; the most granular identifying detail that remains should be the zip code. Also make an additional copy of the data, split into one spreadsheet per Congress Year (a two-year span starting on an odd year), so that large data sets can be opened in spreadsheet programs.

The CSS Archiving Format stores all metadata in a single tab-delimited table with 32 fields. See the layout.txt file for more details.

amhanson9 commented 1 week ago

Columns removed for PII: Prefix, First Name, Middle Name, Last Name, Suffix, Appellation, Title, Organization Name, Address Line 1-4.

If any of these columns is not present, the script proceeds without an error. It also prints the columns that remain so an archivist can review them for anything else that might need to be removed.
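A minimal sketch of the column removal described above, using only the standard library. It is not the script's actual code: the header labels in `PII_COLUMNS` are assumed from the list in this comment (the export may name them differently), and `remove_pii_columns`, the `Zip Code` column, and the sample row are hypothetical.

```python
import csv
import io

# Assumed header labels, copied from the list above; the real export may differ.
PII_COLUMNS = {
    "Prefix", "First Name", "Middle Name", "Last Name", "Suffix",
    "Appellation", "Title", "Organization Name",
    "Address Line 1", "Address Line 2", "Address Line 3", "Address Line 4",
}


def remove_pii_columns(rows, pii_columns=PII_COLUMNS):
    """Drop PII columns from a list of dict rows.

    Columns listed in pii_columns that are absent from the data are
    simply ignored, so a missing column does not raise an error.
    """
    if not rows:
        return [], []
    remaining = [name for name in rows[0] if name not in pii_columns]
    cleaned = [{name: row[name] for name in remaining} for row in rows]
    return cleaned, remaining


# Made-up sample data with only some of the PII columns present.
sample = "First Name\tLast Name\tZip Code\tin_date\nPat\tDoe\t30602\t20010315\n"
rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
cleaned, remaining = remove_pii_columns(rows)

# Print what is left so an archivist can review it.
print("Columns remaining:", remaining)
```

Filtering by set membership, rather than deleting each column by name, is what lets the sketch proceed when a listed column is missing.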

amhanson9 commented 1 week ago

Currently, if a row has a parsing error (the delimiter also appears within the data, so the row cannot be split into columns properly), the row is not included. If there is an encoding error, the affected character is not included.
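A rough sketch of this behavior with the standard library, under stated assumptions: the export is treated as UTF-8, a "parsing error" is modeled as a row with more fields than the header, and `read_export` and the sample bytes are hypothetical. The actual script may detect these cases differently.

```python
import csv


def read_export(raw_bytes, delimiter="\t"):
    """Return (header, rows, skipped_count) from raw export bytes."""
    # Assumption: UTF-8. errors="ignore" drops bytes that cannot be decoded,
    # so the bad character is left out rather than stopping the read.
    text = raw_bytes.decode("utf-8", errors="ignore")
    reader = csv.reader(text.splitlines(), delimiter=delimiter,
                        quoting=csv.QUOTE_NONE)
    header = next(reader)
    rows, skipped = [], 0
    for row in reader:
        if not row:
            continue  # ignore blank lines
        if len(row) > len(header):
            # An extra delimiter inside the data produced too many fields.
            skipped += 1
            continue
        rows.append(row)
    return header, rows, skipped


# Made-up sample: one undecodable byte and one row with an extra tab.
raw = b"city\tzip\nAthens\t30602\nRo\xffme\t30161\nMacon\tGA\t31201\n"
header, rows, skipped = read_export(raw)
print(rows, "skipped:", skipped)
```

Returning the skipped count is a design choice of the sketch, so that the number of dropped rows can be reported instead of the rows disappearing silently.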

amhanson9 commented 1 week ago

Currently, data is split by in_date into Congress Year. If a row has no year (the column is blank or contains text instead of an actual date), it is saved in a separate CSV as undated. Text in the date column typically means the row has extra data or is missing data and is therefore not lined up with the columns correctly.
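The grouping rule can be sketched as follows. This is an illustration, not the script's code: the `YYYYMMDD` date format for in_date is an assumption (check layout.txt for the real format), and `congress_year`, `split_by_congress_year`, and the sample rows are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def congress_year(in_date, date_format="%Y%m%d"):
    """Return a label such as '2001-2002', or 'undated' if in_date is not a date.

    date_format is an assumption about how in_date is stored.
    """
    try:
        year = datetime.strptime(in_date.strip(), date_format).year
    except (ValueError, AttributeError):
        # Blank values, stray text, or None all count as undated.
        return "undated"
    # A Congress Year spans two years and starts on the odd year.
    start = year if year % 2 == 1 else year - 1
    return f"{start}-{start + 1}"


def split_by_congress_year(rows):
    """Group dict rows by Congress Year label based on their in_date value."""
    groups = defaultdict(list)
    for row in rows:
        groups[congress_year(row.get("in_date", ""))].append(row)
    return dict(groups)


# Made-up rows: two in the same Congress Year, one blank, one misaligned.
rows = [
    {"in_date": "20010315"},
    {"in_date": "20021231"},
    {"in_date": ""},
    {"in_date": "Athens GA"},
]
groups = split_by_congress_year(rows)
print({label: len(items) for label, items in groups.items()})
```

Each group could then be written to its own CSV, with the `undated` group kept separate for review.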