Open marcderbauer opened 6 months ago
Not all PDFs I understood to be letters are actually letters. `S_2020_914.pdf` is very much a letter, but `S_2020_994.pdf` is a report. Also, `S_2020_1236.pdf` seems to be broken.
`S_2020_994.pdf` seems to be the only report. Will just skip / remove this one and assume that `S_\d{4}_\d+.pdf` signifies letters.
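A minimal sketch of that filename check, assuming Python's `re` module. The names `LETTER_PATTERN`, `KNOWN_NON_LETTERS` and `is_letter` are hypothetical, and putting the broken file on the skip list is an assumption, not something decided above.

```python
import re

# Assumed pattern: "S_<4-digit year>_<number>.pdf", with the dot escaped.
LETTER_PATTERN = re.compile(r"S_\d{4}_\d+\.pdf")

# Exceptions named in this issue: the one report, and the file that seems
# broken (skipping the broken file is an assumption).
KNOWN_NON_LETTERS = {"S_2020_994.pdf", "S_2020_1236.pdf"}


def is_letter(filename: str) -> bool:
    """Return True if the name matches the letter pattern and is not a known exception."""
    if filename in KNOWN_NON_LETTERS:
        return False
    return LETTER_PATTERN.fullmatch(filename) is not None
```

Using `fullmatch` rather than `search` keeps names with extra suffixes from being counted as letters.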
Each extracted file should have a high-level field `type`, which indicates the `pdf_type`.
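One way such a field could look, as a rough sketch. The `PdfType` members, the `ExtractedDocument` class and the `to_record` helper are all hypothetical names; the set of types is only a guess based on the document kinds mentioned in this issue.

```python
from dataclasses import dataclass, asdict, field
from enum import Enum


class PdfType(Enum):
    # Hypothetical set of types, taken from the kinds discussed in this issue.
    LETTER = "letter"
    REPORT = "report"
    AGENDA = "agenda"
    TRANSCRIPT = "transcript"
    COMMUNIQUE = "communique"


@dataclass
class ExtractedDocument:
    filename: str
    type: PdfType
    content: dict = field(default_factory=dict)


def to_record(doc: ExtractedDocument) -> dict:
    """Convert to a plain dict with the type stored as a string."""
    record = asdict(doc)
    record["type"] = doc.type.value
    return record
```

Storing the type as a plain string in the output keeps the extracted files easy to filter later, for example by `record["type"] == "letter"`.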
Information that can be extracted
Might just skip doing this for now. There's only 5 Agendas downloaded and they don't really carry much info
Same as regular transcriptions. Maybe best to keep them separately for now. Just need to make sure the file / naming system can handle the filename.
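A possible sketch of filename handling, assuming this note is about resumption files such as `S_PV.4049 (Resumption 1).pdf`. The `MeetingFile` tuple and `parse_meeting_filename` are hypothetical names, not part of the existing code.

```python
import re
from typing import NamedTuple, Optional


class MeetingFile(NamedTuple):
    meeting: int
    resumption: Optional[int]  # None for the main transcript


# "S_PV.<meeting>.pdf", optionally with " (Resumption <n>)" before the extension.
MEETING_PATTERN = re.compile(r"S_PV\.(\d+)(?: \(Resumption (\d+)\))?\.pdf")


def parse_meeting_filename(filename: str) -> Optional[MeetingFile]:
    """Split a transcript filename into meeting and resumption numbers."""
    match = MEETING_PATTERN.fullmatch(filename)
    if match is None:
        return None  # unexpected name, e.g. a corrigendum or a typo
    meeting, resumption = match.groups()
    return MeetingFile(int(meeting), int(resumption) if resumption else None)
```

Keeping the resumption number separate from the meeting number would let the main transcript and its resumptions be stored side by side without name clashes.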
Information that can be extracted
80-90% of meeting transcripts should be extracted correctly now
There are some which need to be handled separately:
- `S_PV.4547.pdf`
- `S_PV.4049.pdf`, `S_PV.4049 (Resumption 1).pdf`, `S_PV.4049 (Resumption 2).pdf`
- `S_2020_527.pdf`, `S_2021_22.pdf`
- `S_Agenda_4219.pdf`
- `S_ PV.6385.pdf` → Space in name, seems to be a one-off case, can maybe skip
- `S_PV.4355 (Resumption 1)_Corr.1.pdf` → Also one-off, can manually fix

Almost all of these can be inferred by filename. There could be a nice high-level if/else regex mechanism to deal separately with each of those. This could go in the `process_document` function. In the spirit of incrementality, we could also just skip these cases for now.
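The regex mechanism could be sketched roughly as below. The rule table, the labels and `classify_by_filename` are hypothetical; how this would plug into the real `process_document` is not shown, since its signature is not given here.

```python
import re
from typing import Optional

# Ordered (pattern, label) rules. The labels are assumptions for illustration.
RULES = [
    (re.compile(r"S_Agenda_\d+\.pdf"), "agenda"),
    (re.compile(r"S_PV\.\d+ \(Resumption \d+\)\.pdf"), "resumption"),
    (re.compile(r"S_PV\.\d+\.pdf"), "transcript"),
    (re.compile(r"S_\d{4}_\d+\.pdf"), "letter"),
]


def classify_by_filename(filename: str) -> Optional[str]:
    """Return the label of the first rule that matches the whole filename."""
    for pattern, label in RULES:
        if pattern.fullmatch(filename):
            return label
    # One-off names (stray spaces, corrigenda) fall through and can be skipped.
    return None
```

A table of rules reads much like an if/else chain but keeps each case on one line, so new filename shapes can be added without touching the control flow.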
The only ones which can't be inferred by title are the Official Communiqués. For these you need to open the file first. Nevertheless, they can be identified with the `_is_communique_of_closed_meeting` function.
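A small sketch of how the content-based check could sit behind the filename check. The signature of `_is_communique_of_closed_meeting` is not shown in this issue, so the check is passed in as a generic callable here; `classify_transcript` and its labels are hypothetical.

```python
from typing import Callable


def classify_transcript(path: str, is_communique: Callable[[str], bool]) -> str:
    """Decide between communiqué and regular transcript for an S_PV file.

    `is_communique` stands in for the existing content-based check; it is
    assumed to take a path and return a bool, which may not match the real
    function.
    """
    # The filename alone cannot separate the two, so the content check decides.
    return "communique" if is_communique(path) else "transcript"
```

Because opening the file is the expensive step, this would only need to run for names that already look like meeting transcripts.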