thegreenwebfoundation / datasets

Open datasets & methodologies for carbon emissions from different activities. Forked from OpenAMEE, and npm installable
MIT License

Deal with `units` column in data.csv file #17

Open mamhoff opened 5 years ago

mamhoff commented 5 years ago

All of our data.csv files have a units column. However, it's unclear why it is there, since the itemdef.csv files also have unit and per_unit columns that carry the same information.

In most datasets, the units column in the data.csv file is completely empty. In others, it duplicates the information in the itemdef.csv file.

Here's a Ruby script that deletes the column where it's entirely unnecessary, and otherwise prints the distinct values found in it:

#!/usr/bin/env ruby

require 'csv'

# Placeholder values that carry no real unit information.
PLACEHOLDER_UNITS = %w[none dummy n/a].freeze

data_csvs = Dir.glob('**/data.csv')

# True when every row's units value is empty, a placeholder,
# or an "x or y" style alternative.
def unnecessary_units_column?(csv)
  csv.all? do |row|
    units = row['units']
    units.nil? ||
      PLACEHOLDER_UNITS.include?(units.downcase) ||
      units.downcase.match?(/ or /)
  end
end

data_csvs.each do |file|
  begin
    csv = CSV.read(file, headers: true)
    if unnecessary_units_column?(csv)
      # Rewrite the file in place without the units column.
      CSV.open(file, 'wb') do |new_csv|
        new_csv << csv.headers.reject { |h| h == 'units' }
        csv.each do |row|
          new_csv << row.reject { |k, _v| k == 'units' }.map { |_k, v| v }
        end
      end
    else
      # Leave the file untouched and report the distinct unit values.
      puts "#{file}: #{csv.map { |r| r['units'] }.uniq}"
    end
  rescue => e
    # Files that cannot be parsed are reported, not modified.
    puts "ERROR in #{file}: #{e}"
  end
end

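
The script reports files it cannot parse and leaves them alone. A more tolerant reader could be sketched roughly as follows. Everything here is an assumption rather than something the repository provides: `read_csv_tolerantly` is a hypothetical helper, Windows-1252 is only a guess at what a non-UTF-8 file might be encoded in, and normalising line endings may or may not be the right fix for a given file.

```ruby
require 'csv'

# Hypothetical helper, not part of the original script.
# Reads the file as raw bytes, treats them as UTF-8 when they are valid
# UTF-8, and otherwise reinterprets them in a fallback encoding.
# The Windows-1252 default is an assumption, not a known fact about the data.
def read_csv_tolerantly(path, fallback: 'Windows-1252')
  raw = File.binread(path)
  text = raw.dup.force_encoding('UTF-8')
  unless text.valid_encoding?
    text = raw.force_encoding(fallback)
              .encode('UTF-8', invalid: :replace, undef: :replace)
  end
  # Normalise "\r\n" and lone "\r" to "\n" before parsing (assumption:
  # stray carriage returns are line-ending noise, not field content).
  text = text.gsub(/\r\n?/, "\n")
  CSV.parse(text, headers: true)
end
```

The result is a `CSV::Table`, so it could be passed to `unnecessary_units_column?` in place of the `CSV.read` result. Any file read through the fallback path should be checked by eye before it is rewritten, since a wrong encoding guess silently produces wrong characters.
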
mamhoff commented 5 years ago

Here's the output of the script. The task would be to work out, for each data.csv listed, whether its units column is unnecessary because the same information could be encoded in the unit or per_unit column of the corresponding itemdef.csv file. Also, four files could not be parsed: three contain invalid UTF-8 byte sequences and one has a line break inside an unquoted field.

business/energy/us/subregion/data.csv: ["kg/kWh"]
business/energy/electricity/india/byGrid/data.csv: ["kg/kWh"]
business/energy/electricity/china/byGrid/data.csv: ["kg/kWh"]
business/processes/production/adipicAcid/data.csv: ["kgN2O/kgacid"]
business/processes/production/pulpAndPaper/directEmissions/data.csv: ["kgCO2/kgChemical"]
business/processes/production/ammonia/data.csv: ["none", "%w/w"]
business/processes/production/lime/production/data.csv: ["fraction"]
business/processes/production/lime/carbonate/data.csv: ["fraction"]
business/processes/production/aluminium/soderberg/data.csv: ["kg cyclo/kg Al"]
business/processes/production/aluminium/pfc/slope/data.csv: ["fraction"]
business/processes/production/aluminium/pfc/defaults/data.csv: ["kg/kgAl"]
business/processes/production/aluminium/pfc/overvoltage/data.csv: ["fraction"]
business/processes/production/aluminium/defaults/data.csv: ["kgCO2/kgAl"]
business/processes/production/aluminium/prebake/pitchCooking/data.csv: ["percent"]
business/processes/production/nitricAcid/data.csv: ["kgN20/kgHNO"]
business/processes/production/hcfc22/productionDataOnly/data.csv: ["kgHFC-23/kgHCFC-22"]
business/processes/production/cement/epa/data.csv: ["percent"]
business/processes/production/ironandsteel/ironAndSteel/data.csv: ["kg C/kg material"]
business/processes/production/ironandsteel/coke/data.csv: ["kg C/kg material"]
business/processes/production/ironandsteel/sinter/data.csv: ["kg C/kg material"]
business/buildings/hotel/generic/data.csv: ["kgCO2/night"]
planet/country/uk/average/appliances/data.csv: ["kgCO2/year"]
planet/country/uk/average/travel/data.csv: ["kgCO2/year"]
planet/country/uk/average/home/data.csv: ["kgCO2/year"]
planet/country/uk/aggregate/actonco2/peoplelikeme/appliances/data.csv: ["kgCO2/year"]
planet/country/uk/aggregate/actonco2/peoplelikeme/travel/data.csv: ["kg/year"]
planet/country/uk/aggregate/actonco2/peoplelikeme/home/data.csv: ["kgCO2/year"]
planet/co2Sinks/rainforest/data.csv: ["kg/km^2"]
ERROR in personal/generic/data.csv: invalid byte sequence in UTF-8
transport/taxi/generic/perpassenger/data.csv: ["kgCO2/km.passenger"]
transport/taxi/generic/data.csv: ["kgCO2/km "]
transport/train/generic/data.csv: ["kgCO2/km.passenger"]
transport/minibus/generic/data.csv: ["kgCO2/km"]
transport/plane/generic/data.csv: ["kgCO2/pass.journey ", "kgCO2/pass.km", "N/A"]
transport/plane/generic/freight/defra/data.csv: ["kgCO2e/tkm"]
transport/plane/generic/defra/data.csv: ["kgCO2e/pkm", "kgCO2/pkm", "N/A"]
transport/plane/generic/airports/all/codes/data.csv: ["degrees N and E"]
transport/plane/generic/airports/all/countries/data.csv: ["degrees N and E"]
transport/plane/generic/airports/codes/data.csv: ["degrees N and E"]
transport/plane/generic/airports/countries/data.csv: ["degrees N and E"]
transport/plane/generic/passengerclass/data.csv: ["kgCO2/pass.km"]
transport/other/data.csv: ["kgCO2/km", "kgCO2/launch"]
transport/van/generic/data.csv: ["kgCO2/km "]
transport/car/generic/data.csv: ["kgCO2/km ", "kgCO2/km"]
transport/car/generic/electric/data.csv: ["kWh/km"]
transport/car/bands/ireland/data.csv: ["kgCO2/km "]
transport/ship/generic/data.csv: ["kgCO2/km.passenger"]
transport/ship/generic/freight/data.csv: ["kgCO2/kg.km"]
transport/bus/generic/data.csv: ["kgCO2/km.passenger"]
ERROR in home/water/data.csv: Unquoted fields do not allow \r or \n (line 4).
home/water/defra/data.csv: ["kgCO2/m^3"]
home/water/reductions/data.csv: ["litres/day"]
home/appliances/kitchen/generic/data.csv: ["kWh/year", "kWh/cycle", "N/A"]
home/appliances/energystar/kitchen/refrigerators/data.csv: ["kWh/year"]
home/appliances/energystar/kitchen/freezers/data.csv: ["kWh/year"]
home/appliances/energystar/kitchen/clothesWashers/data.csv: ["kWh/year"]
home/appliances/energystar/kitchen/dishwashers/data.csv: ["kWh/year"]
home/appliances/energystar/office/computers/desktopsAndIntegrated/data.csv: ["kWh/year"]
home/appliances/energystar/office/computers/workstations/data.csv: ["kWh/year"]
home/appliances/energystar/office/computers/notebooksAndTablets/data.csv: ["kWh/year"]
home/appliances/energystar/office/imageEquipment/faxMachines/data.csv: ["kWh/year"]
home/appliances/energystar/office/imageEquipment/printers/data.csv: ["kWh/year"]
home/appliances/energystar/office/imageEquipment/digitalDuplicators/data.csv: ["kWh/year"]
home/appliances/energystar/office/imageEquipment/copiers/data.csv: ["kWh/year"]
home/appliances/energystar/office/imageEquipment/multiFunctionDevices/data.csv: ["kWh/year"]
home/appliances/energystar/entertainment/setTopBoxes/data.csv: ["kWh/year"]
home/appliances/energystar/entertainment/televisionsAndCombinationUnits/data.csv: ["kWh/year"]
home/appliances/cooking/us/data.csv: ["kWh/year"]
home/appliances/cooking/oven/data.csv: ["kWh/year"]
home/appliances/cooking/hob/data.csv: ["kWh/year"]
home/appliances/entertainment/generic/data.csv: ["kWh/year", "N/A"]
home/appliances/televisions/generic/ranges/data.csv: ["kW"]
home/appliances/computers/generic/data.csv: ["kWh/Year", "kWh/year", "N/A"]
home/energy/us/price/data.csv: ["kgCO2/USD"]
home/energy/us/state/data.csv: ["kgCO2/kWh"]
ERROR in home/energy/uk/price/data.csv: invalid byte sequence in UTF-8
home/energy/uk/reductions/data.csv: ["kgCO2", "kgco2"]
home/energy/uk/suppliers/data.csv: ["kgCO2/kWh"]
ERROR in home/energy/electricity/data.csv: invalid byte sequence in UTF-8
home/energy/electricity/realTimeElectricity/fuelEmissionFactors/data.csv: ["kgCO2/kWh"]
home/energy/electricity/realTimeElectricity/data.csv: ["kWh"]
home/energy/insulation/data.csv: ["N/A", "u1"]
home/energy/electricityiso/data.csv: ["kgCO2/kWh"]
home/energy/ireland/suppliers/data.csv: ["kgCO2/kWh"]
home/heating/us/data.csv: ["kWh/year"]
home/heating/uk/renewable/data.csv: ["kWh/year"]
home/heating/uk/floorareas/data.csv: ["metres squared"]
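
Several entries above differ only in trailing whitespace or letter case, for example "kgCO2/km " next to "kgCO2/km", "kWh/Year" next to "kWh/year", and "kgCO2" next to "kgco2". A minimal sketch for spotting such variants before comparing against itemdef.csv might look like this. `normalize_unit` and `spelling_variants` are hypothetical helper names, not part of the repository.

```ruby
# Hypothetical helper: trims and collapses whitespace and lowercases the
# value, so that spelling variants of one unit compare equal.
def normalize_unit(value)
  return nil if value.nil?

  value.strip.gsub(/\s+/, ' ').downcase
end

# Groups raw unit strings by their normalised form and keeps only the
# groups that contain more than one distinct spelling.
def spelling_variants(raw_units)
  raw_units.compact.uniq
           .group_by { |unit| normalize_unit(unit) }
           .select { |_normalized, spellings| spellings.size > 1 }
end

# Usage with values taken from the output above.
variants = spelling_variants(['kWh/Year', 'kWh/year', 'N/A'])
variants.each do |normalized, spellings|
  puts "#{normalized}: #{spellings.inspect}"
end
```

For the sample input, the only group reported is the one for "kwh/year", with both original spellings. Lowercasing is a blunt tool: it would also merge units that differ meaningfully by case, so the normalised form is better suited to flagging candidates for review than to rewriting the data.
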