sfcpc / housing-dashboard


[Schemaless] diff bug: when reading the same file, duplicate values are printed #163

Open sbuss opened 4 years ago

sbuss commented 4 years ago

The two outputs below should be identical, because both runs use the same inputs and the second create_schemaless run diffs against the output of the first. Instead, out2.csv comes out 18 lines longer than out1.csv.

python3 -m schemaless.create_schemaless \
    --planning_file testdata/planning-one.csv \
    --parcel_data_file=data/assessor/2020-02-18-parcels.csv.xz \
    out1.csv
python3 -m schemaless.create_schemaless \
    --planning_file testdata/planning-one.csv \
    --parcel_data_file=data/assessor/2020-02-18-parcels.csv.xz \
    --diff out1.csv \
    out2.csv
wc -l out*
  3274 out1.csv
  3292 out2.csv
  6566 total

Looking at the diff, a number of description rows are repeated at the bottom of the file:

...
planning_2002.0809,planning,2020-03-11,description,"03/06/2003 Shadow Study Prop. K   The proposed project would demolish the 22 three-story buildings and construct 17 three story affordable housing structures containing a total of 247 units, plus community space and child care designed as a PUD with exceptions from parking and loading requirements."
...
planning_2002.0809,planning,2020-03-11,description,"03/06/2003 Shadow Study Prop. K   The proposed project would demolish the 22 three-story buildings and construct 17 three story affordable housing structures containing a total of 247 units, plus community space and child care designed as a PUD with exceptions from parking and loading requirements."
...

These lines appear identical. Needs investigation.
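
One way to check whether rows that look identical really are byte-for-byte identical is to group them by a whitespace-normalized copy and report any group that contains more than one raw variant. This is only a sketch of a possible investigation step: the helper name `whitespace_variants` is hypothetical, and the idea that whitespace might be the difference is an untested guess, not a diagnosis.

```python
from collections import defaultdict


def whitespace_variants(rows):
    """Return groups of rows that differ only in whitespace.

    `rows` is any iterable of field lists, e.g. a csv.reader over out2.csv.
    Each returned key is the row with runs of whitespace collapsed in every
    field; each value is the set of distinct raw rows mapping to that key.
    """
    groups = defaultdict(set)
    for row in rows:
        # Collapse runs of whitespace and trim the ends of each field.
        normalized = tuple(" ".join(field.split()) for field in row)
        groups[normalized].add(tuple(row))
    # Rows that are exactly identical collapse to a single raw variant,
    # so only groups with two or more variants are interesting here.
    return {key: raw for key, raw in groups.items() if len(raw) > 1}
```

If this returns an empty dict for out2.csv, the repeated rows are exact duplicates and whitespace is not the cause.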

sbuss commented 4 years ago

Of note: only description values are duplicated -- no other keys are present.
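
For anyone wanting to reproduce that observation, a minimal sketch along these lines tallies exact duplicate rows by key. It assumes the key name is the fourth column, as in the sample rows above (id, source, date, key, value); the function name `duplicate_keys` and the `key_column` parameter are hypothetical and not part of the schemaless module.

```python
from collections import Counter


def duplicate_keys(rows, key_column=3):
    """Count surplus copies of exactly repeated rows, grouped by key.

    `rows` is any iterable of field lists, e.g. a csv.reader over out2.csv.
    `key_column` is the index of the key name (assumed to be column 3).
    """
    row_counts = Counter(tuple(row) for row in rows)
    per_key = Counter()
    for row, count in row_counts.items():
        # Only rows seen more than once contribute; count the extra copies.
        if count > 1 and len(row) > key_column:
            per_key[row[key_column]] += count - 1
    return per_key
```

Feeding it `csv.reader(open("out2.csv", newline=""))` should yield a counter whose only key is `description` if the observation above holds.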