marcderbauer / un_sec_council

Spike non-standard Meeting Transcripts #1

Open marcderbauer opened 6 months ago

marcderbauer commented 6 months ago

Roughly 80-90% of meeting transcripts should now be extracted correctly.

There are some which need to be handled separately:

Almost all of these can be inferred from the filename. A high-level if/else chain of regex checks could handle each case separately, possibly inside the process_document function. In the spirit of incrementality, we could also just skip these cases for now.
The only ones that can't be inferred from the filename are the Official Communiqués. For these you need to open the file first. Nevertheless, they can be identified with the _is_communique_of_closed_meeting function.
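A minimal sketch of that filename-first dispatch, assuming an ordered list of regex rules. Only the letter pattern appears in this thread; the agenda and resumption patterns, the type strings, and the `classify` helpers are hypothetical. The `is_communique` argument stands in for `_is_communique_of_closed_meeting`, whose signature is not shown here.

```python
import re
from pathlib import Path
from typing import Callable, Optional

# Ordered (pattern, type) rules. Only the letter pattern comes from this thread;
# the agenda and resumption patterns are hypothetical placeholders.
FILENAME_RULES = [
    (re.compile(r"^S_\d{4}_\d+\.pdf$"), "letter"),
    (re.compile(r"^S_Agenda_\d+\.pdf$"), "agenda"),                 # hypothetical
    (re.compile(r"^S_PV\.\d+_Resumption\d+\.pdf$"), "resumption"),  # hypothetical
]


def classify_by_filename(path: str) -> Optional[str]:
    """Return the type of the first matching rule, or None if no rule matches."""
    name = Path(path).name
    for pattern, pdf_type in FILENAME_RULES:
        if pattern.match(name):
            return pdf_type
    return None


def classify(path: str, is_communique: Callable[[str], bool]) -> str:
    """Try the filename first; only open the file when the name is inconclusive."""
    pdf_type = classify_by_filename(path)
    if pdf_type is not None:
        return pdf_type
    # Stand-in for _is_communique_of_closed_meeting (assumed to take a path).
    if is_communique(path):
        return "communique"
    return "transcript"
```

Keeping the rules in a list rather than a literal if/else chain makes it easy to skip a case for now by simply leaving its rule out.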

marcderbauer commented 6 months ago

Issue

Not all PDFs I understood to be letters are actually letters.

S_2020_914.pdf is very much a letter

S_2020_994.pdf is a Report

Also S_2020_1236.pdf seems to be broken

Update

S_2020_994.pdf seems to be the only report. Will just skip / remove this one and assume that filenames matching S_\d{4}_\d+\.pdf are letters.
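One way this assumption could look in code, as a sketch: a full-match regex plus an explicit exclusion set for the one known report. The names `LETTER_RE`, `KNOWN_NON_LETTERS`, and `is_letter` are hypothetical. Broken files could be added to the same set if they should be skipped too.

```python
import re

# Letter filenames: "S_", a four-digit year, "_", a running number, ".pdf".
LETTER_RE = re.compile(r"S_\d{4}_\d+\.pdf")

# Files that match the pattern but are not letters (the one report found so far).
KNOWN_NON_LETTERS = {"S_2020_994.pdf"}


def is_letter(filename: str) -> bool:
    """True if the bare filename looks like a letter and is not a known exception."""
    if filename in KNOWN_NON_LETTERS:
        return False
    return LETTER_RE.fullmatch(filename) is not None
```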

marcderbauer commented 6 months ago

GENERAL

Each extracted file should have a high-level field, type, which indicates the type of the source PDF (pdf_type).
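A possible shape for such a record, sketched under assumptions: the `PdfType` values mirror the document kinds discussed in this issue, while the class name `ExtractedDocument` and the `filename` and `content` fields are hypothetical and not taken from the repository.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum


class PdfType(str, Enum):
    """Document kinds discussed in this issue (string values are assumptions)."""
    TRANSCRIPT = "transcript"
    LETTER = "letter"
    AGENDA = "agenda"
    RESUMPTION = "resumption"
    COMMUNIQUE = "communique"


@dataclass
class ExtractedDocument:
    filename: str
    type: PdfType                                # the high-level type field
    content: dict = field(default_factory=dict)  # type-specific extracted data

    def to_json(self) -> str:
        """Serialize the record, writing the type as its plain string value."""
        data = asdict(self)
        data["type"] = self.type.value
        return json.dumps(data, ensure_ascii=False)


doc = ExtractedDocument(filename="S_2020_914.pdf", type=PdfType.LETTER)
print(doc.to_json())
```

Using an enum rather than free-form strings keeps the set of allowed types in one place, so a typo in a type name fails early instead of producing a silently mislabeled file.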

Letters

Information that can be extracted

Agendas

Might just skip these for now. There are only 5 agendas downloaded, and they don't carry much information.

Resumptions

Same as regular transcripts. Maybe best to keep them separate for now. Just need to make sure the file / naming system can handle the filename.

Communiqués

Information that can be extracted