timothyrenner / nuforc_sightings_data

Data collection and processing for the National UFO Reporting Center (NUFORC) database.
MIT License
35 stars 9 forks

Adapt to new table formatting #19

Closed tsepton closed 1 year ago

tsepton commented 1 year ago

Updated the scraper to match the new format of the NUFORC tables that contain the reports.

tsepton commented 1 year ago

Fix issue #18

tsepton commented 1 year ago

The geocode-reports stage is broken by the new modification:

Running stage 'geocode-reports':
> python scripts/process_report_data.py data/raw/nuforc_reports.json data/external/cities.csv --output-file data/processed/nuforc_reports.csv
Traceback (most recent call last):
  File "/mnt/c/Users/tsept/Documents/Dev/nuforc_sightings_data/scripts/process_report_data.py", line 282, in <module>
    main()  # Click injects the arguments.
  File "/home/tsepton-wsl/anaconda3/envs/nuforc/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/tsepton-wsl/anaconda3/envs/nuforc/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/tsepton-wsl/anaconda3/envs/nuforc/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/tsepton-wsl/anaconda3/envs/nuforc/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/mnt/c/Users/tsept/Documents/Dev/nuforc_sightings_data/scripts/process_report_data.py", line 278, in main
    writer.writerow(report)
  File "/home/tsepton-wsl/anaconda3/envs/nuforc/lib/python3.9/csv.py", line 154, in writerow
    return self.writer.writerow(self._dict_to_list(rowdict))
  File "/home/tsepton-wsl/anaconda3/envs/nuforc/lib/python3.9/csv.py", line 149, in _dict_to_list
    raise ValueError("dict contains fields not in fieldnames: "
ValueError: dict contains fields not in fieldnames: 'country'
ERROR: failed to reproduce 'geocode-reports': failed to run: python scripts/process_report_data.py data/raw/nuforc_reports.json data/external/cities.csv --output-file data/processed/nuforc_reports.csv, exited with 1
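
For reference, the `ValueError` above is what `csv.DictWriter` raises when a row contains a key that is missing from its `fieldnames`. The snippet below is a minimal sketch of that behaviour and of two possible ways around it. The column list and the sample report are hypothetical and are not taken from `process_report_data.py`; only the `country` key comes from the traceback.

```python
import csv
import io

# Hypothetical column list that predates the new "country" field.
fieldnames = ["city", "state", "shape"]

# Hypothetical report row that now carries the extra "country" key.
report = {"city": "Austin", "state": "TX", "country": "USA", "shape": "light"}

# Reproduce the failure: DictWriter rejects keys that are not in fieldnames.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
try:
    writer.writerow(report)
    error = None
except ValueError as exc:
    error = str(exc)

# Option 1: add the new column to fieldnames so it is written out.
fixed_buffer = io.StringIO()
fixed_writer = csv.DictWriter(fixed_buffer, fieldnames=fieldnames + ["country"])
fixed_writer.writeheader()
fixed_writer.writerow(report)

# Option 2: keep the old columns and silently drop unknown keys.
lenient_buffer = io.StringIO()
lenient_writer = csv.DictWriter(
    lenient_buffer, fieldnames=fieldnames, extrasaction="ignore"
)
lenient_writer.writerow(report)
```

Option 1 keeps the new column in the output, while option 2 discards it, so which one fits depends on whether the processed data should expose `country`.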

tsepton commented 1 year ago

Everything seems to be working fine now. I'll try to scrape all the NUFORC data later today.

timothyrenner commented 1 year ago

Awesome! Thanks so much. I've blocked some time on Monday to take a look myself, so if there are any outstanding issues then I'll iron them out and merge.

timothyrenner commented 1 year ago

PR looks good. I am running locally to compare with the previous run to see what that new column looks like in the final data. Once I've confirmed everything looks good I will merge. I'll probably upload the newly refreshed data to data.world too while I'm at it.

timothyrenner commented 1 year ago

So this is interesting: it looks like the older reports that I had previously scraped no longer link properly to the site, because the report link structure has changed. This has introduced duplicates.

timothyrenner commented 1 year ago

Alright, it looks like the new data is a superset of the old data, so I'm good merging this change. Thanks so much for fixing this, @tsepton!
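
The thread does not say how the old and new runs were compared. One way such a check could be sketched, assuming reports are dictionaries and that a few content fields identify a report independently of its link, is shown below. The field names in `KEY_FIELDS` and the sample rows are hypothetical.

```python
# Hypothetical fields that identify a report without relying on its URL,
# since the report link structure changed between runs.
KEY_FIELDS = ("date_time", "city", "state", "shape")


def report_key(report):
    """Build a link-independent identity for a report."""
    return tuple(report.get(field) for field in KEY_FIELDS)


def is_superset(new_reports, old_reports):
    """Return True if every old report also appears in the new data."""
    new_keys = {report_key(r) for r in new_reports}
    old_keys = {report_key(r) for r in old_reports}
    return old_keys <= new_keys


def drop_duplicates(reports):
    """Keep the first report seen for each key, preserving order."""
    seen = set()
    unique = []
    for report in reports:
        key = report_key(report)
        if key not in seen:
            seen.add(key)
            unique.append(report)
    return unique


# Hypothetical sample data: the same sighting under an old and a new link.
old_run = [
    {"date_time": "2020-01-01T21:00", "city": "Austin", "state": "TX",
     "shape": "light", "report_link": "old-style-link"},
]
new_run = [
    {"date_time": "2020-01-01T21:00", "city": "Austin", "state": "TX",
     "shape": "light", "report_link": "new-style-link"},
    {"date_time": "2023-05-05T22:30", "city": "Dallas", "state": "TX",
     "shape": "disk", "report_link": "another-new-style-link"},
]

superset = is_superset(new_run, old_run)
merged = drop_duplicates(new_run + old_run)
```

Keying on content rather than on the link means a report scraped under both link formats counts once, which is the situation that produced the duplicates described above.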

tsepton commented 1 year ago

My pleasure, and thank you for sharing this work.