marcderbauer commented 6 months ago

80-90% of meeting transcripts should be extracted correctly now

There are some which need to be handled separately:

Official Communiqués of Closed Meetings (e.g. S_PV.4547.pdf)
Resumptions (e.g. S_PV.4049.pdf, S_PV.4049 (Resumption 1).pdf, S_PV.4049 (Resumption 2).pdf)
Letters (e.g. S_2020_527.pdf, S_2021_22.pdf)
Agendas (e.g. S_Agenda_4219.pdf)
Typos (e.g. S_ PV.6385.pdf → Space in name, seems to be a one-off case, can maybe skip)
Corrigenda (e.g. S_PV.4355 (Resumption 1)_Corr.1.pdf) → Also a one-off, can be fixed manually

Almost all of these can be inferred from the filename. A high-level if/else regex mechanism could deal with each of them separately, and it could live in the process_document function. In the spirit of incrementality, we could also just skip these cases for now.
The only type that can't be inferred from the filename is the Official Communiqué. For these you need to open the file first. Once the file is open, they can be identified with the _is_communique_of_closed_meeting function.
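
A minimal sketch of what that filename-based dispatch could look like. The function name classify_filename, the type labels, and the exact patterns are assumptions derived from the example filenames above, not existing code in the repo. Communiqués share the plain S_PV.<number>.pdf name, so they come back as "transcript" here and still need the content check.

```python
import re

# Hypothetical patterns, ordered from most to least specific.
# They are inferred from the example filenames in this issue only.
_PATTERNS = [
    ("corrigendum", re.compile(r"^S_PV\.\d+( \(Resumption \d+\))?_Corr\.\d+\.pdf$")),
    ("resumption", re.compile(r"^S_PV\.\d+ \(Resumption \d+\)\.pdf$")),
    ("transcript", re.compile(r"^S_PV\.\d+\.pdf$")),
    ("agenda", re.compile(r"^S_Agenda_\d+\.pdf$")),
    ("letter", re.compile(r"^S_\d{4}_\d+\.pdf$")),
]


def classify_filename(filename: str) -> str:
    """Return a coarse document type based on the filename alone."""
    for pdf_type, pattern in _PATTERNS:
        if pattern.match(filename):
            return pdf_type
    # Typos such as "S_ PV.6385.pdf" end up here and can be skipped or fixed by hand.
    return "unknown"
```

Returning "unknown" rather than raising keeps the skip-for-now option open: the caller can log and move on for anything it does not yet handle.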

marcderbauer commented 6 months ago

Issue

Not all PDFs I took to be letters are actually letters.

S_2020_914.pdf is very much a letter

S_2020_994.pdf is a Report

Also S_2020_1236.pdf seems to be broken

Update

S_2020_994.pdf seems to be the only report. Will just skip / remove this one and assume that S_\d{4}_\d+\.pdf signifies letters

marcderbauer commented 6 months ago

GENERAL

Each extracted file should have a high-level field, type, which indicates the pdf_type

Letters

Information that can be extracted

when the letter was dated (different from when it was published online)
who signed the letter (name of the President of the Security Council; similar president entry as with transcripts)
text
original language
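
As a rough sketch, the letter fields listed above could be collected in a small record that also carries the high-level type field. The class name LetterRecord and all field names are hypothetical; the actual output schema may well differ.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class LetterRecord:
    """Hypothetical container for the fields extracted from a letter PDF."""

    type: str = "letter"  # high-level type field shared by all extracted files
    dated: Optional[str] = None  # date written on the letter, not the online publication date
    signed_by: Optional[str] = None  # name of the signing President of the Security Council
    original_language: Optional[str] = None
    text: str = ""


def to_dict(record: LetterRecord) -> dict:
    """Convert the record to a plain dict, e.g. for JSON serialisation."""
    return asdict(record)
```

Fields default to None so that a partially parsed letter can still be written out and inspected.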

Agendas

Might just skip doing this for now. There are only 5 agendas downloaded and they don't really carry much info

Resumptions

Same as regular transcripts. Maybe best to keep them separate for now. Just need to make sure the file / naming system can handle the filename.
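
One way the naming system could cope with resumption filenames is to split them into a meeting number and a resumption index. This is only a sketch: parse_meeting_filename and MeetingFile are made-up names, and the pattern is based solely on the S_PV.4049 examples from this issue.

```python
import re
from typing import NamedTuple, Optional


class MeetingFile(NamedTuple):
    meeting_number: int
    resumption: int  # 0 for the original sitting, 1+ for resumptions


# Matches "S_PV.4049.pdf" and "S_PV.4049 (Resumption 2).pdf".
_PV_PATTERN = re.compile(r"^S_PV\.(\d+)(?: \(Resumption (\d+)\))?\.pdf$")


def parse_meeting_filename(filename: str) -> Optional[MeetingFile]:
    """Return the meeting number and resumption index, or None if the name does not match."""
    match = _PV_PATTERN.match(filename)
    if match is None:
        return None
    # The resumption group is None for the original sitting, so fall back to 0.
    return MeetingFile(int(match.group(1)), int(match.group(2) or 0))
```

Keeping the resumption index separate means the parts stay in separate records for now but can be grouped by meeting_number later if they should be merged.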

Communiqués

Information that can be extracted

When the meeting was held
Original Language
Rest of text as full block

The formatting here is slightly different from that of transcripts

marcderbauer / un_sec_council

Spike non-standard Meeting Transcripts #1
