okfn-brasil / querido-diario-data-processing

Text processing repository to free Brazilian municipal gazettes from closed file formats for the Querido Diário project.
MIT License
20 stars 17 forks

Segmentation problem for the Alagoas association in the gazette of December 14, 2023 #67

Open ogecece opened 11 months ago

ogecece commented 11 months ago

Error logs:

Dec 14 22:04:22: WARNING:root:Could not process gazette: 2700000/2023-12-14/ab3183d921f7bb119a6a253e6dc59dc6fb07a367.pdf. Cause: 'Couldn\'t find info for "albarrasaomiguelal"'
Dec 14 22:04:22: ERROR:root:'Couldn\'t find info for "albarrasaomiguelal"'
Dec 14 22:04:22: Traceback (most recent call last):
Dec 14 22:04:22:   File "/mnt/code/tasks/gazette_text_extraction.py", line 32, in extract_text_from_gazettes
Dec 14 22:04:22:     document_ids = try_process_gazette_file(
Dec 14 22:04:22:   File "/mnt/code/tasks/gazette_text_extraction.py", line 69, in try_process_gazette_file
Dec 14 22:04:22:     territory_segments = segmenter.get_gazette_segments(gazette)
Dec 14 22:04:22:   File "/mnt/code/segmentation/segmenters/al_associacao_municipios.py", line 24, in get_gazette_segments
Dec 14 22:04:22:     gazette_segments = [
Dec 14 22:04:22:   File "/mnt/code/segmentation/segmenters/al_associacao_municipios.py", line 25, in <listcomp>
Dec 14 22:04:22:     self.build_segment(territory_slug, segment_text, gazette).__dict__
Dec 14 22:04:22:   File "/mnt/code/segmentation/segmenters/al_associacao_municipios.py", line 65, in build_segment
Dec 14 22:04:22:     territory_data = get_territory_data(territory_slug, self.territories)
Dec 14 22:04:22:   File "/mnt/code/tasks/utils/territories.py", line 28, in get_territory_data
Dec 14 22:04:22:     raise KeyError(f"Couldn't find info for \"{territory_slug}\"")
Dec 14 22:04:22: KeyError: 'Couldn\'t find info for "albarrasaomiguelal"'

It would probably be enough to change the segmenter's _normalize_territory_name() so that it handles this case:

image
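
A minimal sketch of the kind of change suggested above, since the attached image is not reproduced here. Everything in it is an assumption made for illustration: the raw header text "BARRA SÃO MIGUEL AL", the canonical name "BARRA DE SÃO MIGUEL", the NAME_SUBSTITUTIONS table, and the build_slug() helper are hypothetical and are not taken from the repository. The real _normalize_territory_name() may be structured differently. The only detail taken from the logs is the failing slug "albarrasaomiguelal".

```python
import re
import unicodedata

# Hypothetical table mapping header variants found in the gazette text
# to the canonical municipality name. The entry below is an assumption.
NAME_SUBSTITUTIONS = {
    "BARRA SÃO MIGUEL AL": "BARRA DE SÃO MIGUEL",
}


def normalize_territory_name(territory_name: str) -> str:
    """Collapse whitespace and apply known name substitutions (sketch)."""
    clean_name = " ".join(territory_name.split())
    return NAME_SUBSTITUTIONS.get(clean_name.upper(), clean_name)


def build_slug(state_code: str, territory_name: str) -> str:
    """Hypothetical slug builder: state code + ASCII-only lowercase name."""
    ascii_name = (
        unicodedata.normalize("NFKD", territory_name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    return state_code.lower() + re.sub(r"[^a-z0-9]", "", ascii_name.lower())


raw_name = "BARRA SÃO  MIGUEL AL\n"  # assumed raw header text
slug_without_fix = build_slug("AL", " ".join(raw_name.split()))
slug_with_fix = build_slug("AL", normalize_territory_name(raw_name))
print(slug_without_fix)
print(slug_with_fix)
```

Under these assumptions, the unnormalized name yields the slug that get_territory_data() failed to find, while the normalized name yields a slug built from the canonical municipality name. Whether a substitution table or a more general rule (for example, stripping a trailing state abbreviation) is the better fit depends on how the segmenter extracts the headers.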