Closed tsepton closed 1 year ago
Fix issue #18
Stage geocode-reports is broken by the new modification:
Running stage 'geocode-reports':
> python scripts/process_report_data.py data/raw/nuforc_reports.json data/external/cities.csv --output-file data/processed/nuforc_reports.csv
Traceback (most recent call last):
  File "/mnt/c/Users/tsept/Documents/Dev/nuforc_sightings_data/scripts/process_report_data.py", line 282, in <module>
    main() # Click injects the arguments.
  File "/home/tsepton-wsl/anaconda3/envs/nuforc/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/tsepton-wsl/anaconda3/envs/nuforc/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/tsepton-wsl/anaconda3/envs/nuforc/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/tsepton-wsl/anaconda3/envs/nuforc/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/mnt/c/Users/tsept/Documents/Dev/nuforc_sightings_data/scripts/process_report_data.py", line 278, in main
    writer.writerow(report)
  File "/home/tsepton-wsl/anaconda3/envs/nuforc/lib/python3.9/csv.py", line 154, in writerow
    return self.writer.writerow(self._dict_to_list(rowdict))
  File "/home/tsepton-wsl/anaconda3/envs/nuforc/lib/python3.9/csv.py", line 149, in _dict_to_list
    raise ValueError("dict contains fields not in fieldnames: "
ValueError: dict contains fields not in fieldnames: 'country'
ERROR: failed to reproduce 'geocode-reports': failed to run: python scripts/process_report_data.py data/raw/nuforc_reports.json data/external/cities.csv --output-file data/processed/nuforc_reports.csv, exited with 1
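For anyone hitting the same error: csv.DictWriter raises this ValueError when a row dict has a key that is missing from the writer's fieldnames. The sketch below reproduces it in isolation and shows two possible ways around it. It is not the actual code from process_report_data.py; every column name other than 'country' and the write_row helper are hypothetical and only serve the illustration.

```python
import csv
import io

# Hypothetical column list: only 'country' is known from the traceback.
fieldnames = ["summary", "city", "state"]
report = {
    "summary": "Light in the sky",
    "city": "Austin",
    "state": "TX",
    "country": "USA",  # the new key that the writer does not know about
}


def write_row(columns, row, **writer_kwargs):
    """Write a header plus one row to an in-memory CSV and return the text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, **writer_kwargs)
    writer.writeheader()
    writer.writerow(row)
    return buffer.getvalue()


# Reproduce the failure: 'country' is in the row but not in fieldnames.
try:
    write_row(fieldnames, report)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Option 1: declare the new column so it ends up in the output file.
with_country = write_row(fieldnames + ["country"], report)

# Option 2: keep the old columns and let the writer drop unknown keys.
without_country = write_row(fieldnames, report, extrasaction="ignore")
```

Option 1 keeps the new column in the processed data, which is probably what a schema change calls for. Option 2 hides the mismatch, so it is only appropriate when the extra key really should be discarded.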
Everything seems to be working fine now. I'll try to scrape all NUFORC data later today.
Awesome! Thanks so much. I've blocked some time on Monday to take a look myself, so if there are any outstanding issues then I'll iron them out and merge.
PR looks good. I am running locally to compare with the previous run to see what that new column looks like in the final data. Once I've confirmed everything looks good I will merge. I'll probably upload the newly refreshed data to data.world too while I'm at it.
So this is interesting - it looks like the older reports that I had previously scraped are no longer properly linking to the site - the report link structure has changed. This has introduced duplicates.
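One way to check for this kind of duplication is to compare reports on their content while ignoring the link column. The sketch below is only an illustration under assumptions: the field names (report_link, summary, city), the example.com URLs, and the helper functions are all made up and do not reflect the project's real schema or code.

```python
# Hypothetical name of the column holding the report URL.
LINK_FIELD = "report_link"


def content_key(report):
    """Identify a report by everything except its link."""
    return tuple(sorted((k, v) for k, v in report.items() if k != LINK_FIELD))


def deduplicate(reports):
    """Keep one report per content key; later (newer) entries win."""
    seen = {}
    for report in reports:
        seen[content_key(report)] = report
    return list(seen.values())


# Made-up sample data: the same sighting scraped under two link formats.
old = [
    {"summary": "Light in the sky", "city": "Austin",
     "report_link": "http://example.com/old/1.html"},
]
new = [
    {"summary": "Light in the sky", "city": "Austin",
     "report_link": "http://example.com/new/?id=1"},
    {"summary": "Triangle", "city": "Reno",
     "report_link": "http://example.com/new/?id=2"},
]

merged = deduplicate(old + new)

# True when every old report also appears in the new scrape.
is_superset = {content_key(r) for r in old} <= {content_key(r) for r in new}
```

With this sample data, merged holds two reports and the Austin one carries the newer link. Exact matching on all other fields is strict, so any change in text formatting between scrapes would still show up as a separate record.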
Alright, it looks like the new data is a superset of the old data, so I'm good with merging this change. Thanks so much for fixing this, @tsepton!
My pleasure, thank you for sharing this work
Updated the scraper to match the new format of the NUFORC tables containing the reports