vkoves / electrify-chicago

Learn about Chicago's most polluting buildings, and what they can do to clean up their act!
http://electrifychicago.net
9 stars 5 forks source link

Issue 41 add automated tests for cleaning file for all years #68

Closed alexkcode closed 3 months ago

alexkcode commented 5 months ago

Description

This adds unit tests for Python script that cleans source data that contains building data for all years. The actual transformations tested should be explained by the title of each test.

Fixes # 41

Testing Instructions

Run python3 -m pytest test/data/scripts/unit/test_clean_all_years.py from base project directory. All tests should pass as indicated by console pytest output.

Checklist:

netlify[bot] commented 5 months ago

Deploy Preview for radiant-cucurucho-d09bae ready!

Name Link
Latest commit 031e4eaf5c76d087f1fa5492e72b2b32a34e2b5b
Latest deploy log https://app.netlify.com/sites/radiant-cucurucho-d09bae/deploys/65f0ff889f257f0008af28eb
Deploy Preview https://deploy-preview-68--radiant-cucurucho-d09bae.netlify.app
Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

alexkcode commented 4 months ago

Current test cases only include buildings from 'United Center', 'Crown Hall', 'Art Institute'

alexkcode commented 4 months ago

Current runtime is at around 21 seconds

vkoves commented 4 months ago

Issue: It looks like the actual data process now doesn't run, maybe due to the new imports?

alexkcode commented 4 months ago

@vkoves fixed all the pathing issues!

vkoves commented 4 months ago

QA list for myself:

alexkcode commented 4 months ago

Currently producing a slightly different output csv's than original script. Not sure why since the types and numbers are the same, but I'm still investigating.

vkoves commented 4 months ago

@alexkcode - looks like there's an issue creating the debug JSON, and some numeric columns become repeating numbers, it looks like it's the latitude and longitude column:

Screenshot from 2024-02-20 20-48-01

It's easiest to debug this by looking at the line by line diff, and then your editor will highlight the specific part of each line that changed.

alexkcode commented 4 months ago

After I changed the string type conversion in the script some of the columns like ChicagoEnergyRating are now empty spaces, instead of Pandas type nan, in the produced .csv files. I think this is actually the correct behavior, but please let me know if it is not.

alexkcode commented 4 months ago

For specific string/column formats, we should check against the original source raw CSV (or as close as possible to it). I will change the Latitude, Longitude column rests and the other tests that test against the original values for the different direct type conversions that take place in the scripts.

alexkcode commented 4 months ago

Rendered websites (production vs. local) appear the same, i.e. the repeating decimals don't show up, unless my eyes missed something.

alexkcode commented 4 months ago

Could potentially be Pandas issue (from an update or something), will look into this.

alexkcode commented 3 months ago

Found decimal issue origin: before the conversion to string type in the Python scripts an interpretation of the data types is already by the get_and_clean_csv function.

alexkcode commented 3 months ago

@vkoves Please re-review!

vkoves commented 3 months ago

@alexkcode - I re-ran the data processing to make sure everything is up to date, but the test now fails with this one error:

Screenshot from 2024-03-12 20-16-04

I think that is related to the issue you filed, so I'll still merge this, since this isn't running in CI, but I'd love your help to fix that. I wonder if we could fix this just by rounding these floats to a certain relevant precision.