Wastewater extract is failing with ValueError due to time data.

klenwell commented 9 months ago

The failures started around the start of the month. This is the error:

ValueError: time data '' does not match format '%m/%d/%Y'

This is the trace:

Traceback (most recent call last):
  File "app.py", line 14, in <module>
    app.run()
  File "/var/lib/jenkins/pyenv/versions/3.8.1/lib/python3.8/site-packages/cement/core/foundation.py", line 916, in run
    return_val = self.controller._dispatch()
  File "/var/lib/jenkins/pyenv/versions/3.8.1/lib/python3.8/site-packages/cement/ext/ext_argparse.py", line 808, in _dispatch
    return func()
  File "/var/lib/jenkins/jobs/OC COVID-19 Wastewater Export/workspace/covid-19/covid_app/controllers/oc_controller.py", line 94, in wastewater
    export.to_csv()
  File "/var/lib/jenkins/jobs/OC COVID-19 Wastewater Export/workspace/covid-19/covid_app/exports/oc_wastewater.py", line 73, in to_csv
    for dated in reversed(self.dates):
  File "/var/lib/jenkins/jobs/OC COVID-19 Wastewater Export/workspace/covid-19/covid_app/exports/oc_wastewater.py", line 44, in dates
    return sorted(self.extract.dates)
  File "/var/lib/jenkins/pyenv/versions/3.8.1/lib/python3.8/functools.py", line 966, in __get__
    val = self.func(instance)
  File "/var/lib/jenkins/jobs/OC COVID-19 Wastewater Export/workspace/covid-19/covid_app/extracts/cdph/oc_detailed_wastewater_extract.py", line 177, in dates
    for n in range(int((self.ends_on - self.starts_on).days) + 1):
  File "/var/lib/jenkins/jobs/OC COVID-19 Wastewater Export/workspace/covid-19/covid_app/extracts/cdph/oc_detailed_wastewater_extract.py", line 190, in ends_on
    return self.report_dates[-1]
  File "/var/lib/jenkins/pyenv/versions/3.8.1/lib/python3.8/functools.py", line 966, in __get__
    val = self.func(instance)
  File "/var/lib/jenkins/jobs/OC COVID-19 Wastewater Export/workspace/covid-19/covid_app/extracts/cdph/oc_detailed_wastewater_extract.py", line 168, in report_dates
    for row in self.oc_rows:
  File "/var/lib/jenkins/pyenv/versions/3.8.1/lib/python3.8/functools.py", line 966, in __get__
    val = self.func(instance)
  File "/var/lib/jenkins/jobs/OC COVID-19 Wastewater Export/workspace/covid-19/covid_app/extracts/cdph/oc_detailed_wastewater_extract.py", line 122, in oc_rows
    row['date'] = self.date_str_to_date(date)
  File "/var/lib/jenkins/jobs/OC COVID-19 Wastewater Export/workspace/covid-19/covid_app/extracts/cdph/oc_detailed_wastewater_extract.py", line 288, in date_str_to_date
    return datetime.strptime(date_sub, format).date()
  File "/var/lib/jenkins/pyenv/versions/3.8.1/lib/python3.8/_strptime.py", line 568, in _strptime_datetime
    tt, fraction, gmtoff_fraction = _strptime(data_string, format)
  File "/var/lib/jenkins/pyenv/versions/3.8.1/lib/python3.8/_strptime.py", line 349, in _strptime
    raise ValueError("time data %r does not match format %r" %
ValueError: time data '' does not match format '%m/%d/%Y'
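
The trace shows `datetime.strptime` being handed an empty string. A minimal sketch of a guard that would tolerate blank date fields follows. The function name `parse_report_date` is hypothetical and is not the existing `date_str_to_date` method; only the `%m/%d/%Y` format comes from the error message.

```python
from datetime import date, datetime
from typing import Optional

# Format taken from the error message in the trace.
DATE_FORMAT = "%m/%d/%Y"


def parse_report_date(date_str: Optional[str]) -> Optional[date]:
    """Parse a date string, returning None for blank or missing values.

    Hypothetical helper for illustration; not the repo's actual method.
    """
    cleaned = (date_str or "").strip()
    if not cleaned:
        # A blank field is what raised the ValueError, so skip it here.
        return None
    return datetime.strptime(cleaned, DATE_FORMAT).date()
```

Callers would then need to drop rows where the result is `None` before sorting or taking the last report date.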

klenwell commented 9 months ago

Link to CDPH data source:

https://data.ca.gov/dataset/b8c6ee3b-539d-4d62-8fa2-c7cd17c16656/resource/16bb2698-c243-4b66-a6e8-4861ee66f8bf/download/master-covid-public.csv

klenwell commented 9 months ago

Next step: analyze source file. Compare current one with last known good version.

Originally, I was thinking of doing something with sed/awk but I think I'll add a class method to the extract class instead. Some details I want to compare:

row count
column headers
set of zips
OC zips
set of labs (try to identify unique sources for readings)
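
A rough sketch of how that comparison could look, assuming the file can be read with `csv.DictReader` and has `zipcode` and `lab_id` columns. The function names and the `oc_zips` parameter are hypothetical, and this is written as standalone functions rather than the class method mentioned above.

```python
import csv
import io


def summarize_wastewater_csv(csv_text, oc_zips):
    """Collect the details listed above from one version of the source file."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    zips = {row.get("zipcode", "") for row in rows}
    return {
        "row_count": len(rows),
        "headers": reader.fieldnames or [],
        "zips": zips,
        # oc_zips is supplied by the caller; it is an assumption here.
        "oc_zips": zips & set(oc_zips),
        "labs": {row.get("lab_id", "") for row in rows},
    }


def diff_summaries(old, new):
    """Return only the summary fields that differ, as (old, new) pairs."""
    return {key: (old[key], new[key]) for key in old if old[key] != new[key]}
```

Running `summarize_wastewater_csv` on the last known good file and on the current one, then passing both results to `diff_summaries`, should show which of these details changed.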

klenwell commented 9 months ago

I discovered that there are now 3 OC zip codes reported where there used to be only one.

zip codes: {'92629', '92677', '92708'}

Most recent row for each:

# ['zipcode', 'wwtp_name', 'facility_name', 'sample_collect_date', 'lab_id', 'sample_id', 'site_id']
Laguna Niguel: ['92677', 'Regional Treatment Plant', 'Regional Treatment Plant', '09/25/2023', 'VLT', '158-230925', '06059-001-02-00-00']
Dana Point: ['92629', 'JB Latham Treatment Plant', 'JB Latham Treatment Plant', '09/25/2023', 'VLT', '157-230925', '06059-002-01-00-00']
Fountain Valley: ['92708', 'OCSD_P1', 'OC San (Orange County Sanitation District) Reclamation Plant No. 1', '12/29/2022', 'CAL3', 'OCSD_P1449240.318', '06059-003-01-00-00']

Maybe Dana Point is the most reliable source?
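
For reference, a sketch of how the most recent row per zip code could be picked out, assuming the rows are dicts keyed by the column names listed above. `latest_row_by_zip` is a hypothetical name, and rows with a blank `sample_collect_date` are skipped as an assumption.

```python
from datetime import datetime


def latest_row_by_zip(rows):
    """Map each zipcode to its row with the latest sample_collect_date."""
    latest = {}
    for row in rows:
        raw_date = (row.get("sample_collect_date") or "").strip()
        if not raw_date:
            # Blank dates cannot be compared, so ignore those rows.
            continue
        collected = datetime.strptime(raw_date, "%m/%d/%Y").date()
        zipcode = row["zipcode"]
        if zipcode not in latest or collected > latest[zipcode][0]:
            latest[zipcode] = (collected, row)
    # Drop the parsed dates and return only the rows.
    return {zipcode: row for zipcode, (_, row) in latest.items()}
```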

klenwell commented 9 months ago

New discovery: the state CSV file includes readings not only for Covid but also for other viruses such as Norovirus and RSV. So I've updated my export to keep only the Covid data. I've also reformatted it to normalize the data rows and include more info for easier processing.

To test:

$ python app.py oc wastewater --mock
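
The Covid-only filter described above could be sketched roughly like this. The column name `pcr_target` and the label `sars-cov-2` are assumptions about the CDPH file, not values confirmed in this thread, and `covid_rows` is a hypothetical name.

```python
def covid_rows(rows, target_column="pcr_target", covid_label="sars-cov-2"):
    """Keep only rows whose target column matches the Covid label.

    Both defaults are assumptions; check them against the actual CSV headers.
    """
    return [
        row
        for row in rows
        if (row.get(target_column) or "").strip().lower() == covid_label
    ]
```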

klenwell commented 9 months ago

Resolved

It appears that around Oct 1, CDPH changed the format of its wastewater data file. I rewrote my extract class to parse it more sanely. Along the way I discovered that the file contains data for different types of viruses, not just Covid. I modified my export to distinguish the different OC sample sites and report types. Then I updated the extracts that depend on it to use the new export. Altogether, I believe the code is simpler and will be easier to modify, if needed, in the future.

klenwell / covid-19

Wastewater extract is failing with ValueError due to time data. #97

Resolved