jvns / pandas-cookbook

Recipes for using Python's pandas library

Chapter 5 - the url template is outdated leading to 404: Not Found #50

Open jakobkolb opened 8 years ago

jakobkolb commented 8 years ago

Apparently, the site hosting the Canadian historical weather data has changed, so the URL template in Chapter 5 now returns 404: Not Found.

GillesMoyse commented 8 years ago

Two things to fix in the notebook:

Sent a PR.

andreas-h commented 8 years ago

Also, the encoding='latin1' argument should go (at least on Python 3).

hsuanie commented 7 years ago

Hello. I tried the updated code, but I got the following error: File b'data/eng-hourly-03012012-03312012.csv' does not exist

Please help, thanks!

Enkerli commented 6 years ago

At this point (July 2018), the following works in Python 3:

In[]:

url_template = "http://climate.weather.gc.ca/climate_data/bulk_data_e.html?stationID=5415&Year={year}&Month={month}&format=csv&timeframe=1&submit=%20Download+Data"

and:

In[]:

url = url_template.format(month=3, year=2012)
weather_mar2012 = pd.read_csv(url, skiprows=15, index_col='Date/Time', parse_dates=True, encoding='utf-8', header=0)

An important change, apart from the URL itself, is that header accepts an integer (row number) instead of a boolean.
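
To illustrate how skiprows and an integer header interact, here is a minimal offline sketch. The file contents below are made up for the example (the real download has a longer preamble), so treat it as an illustration of the parameters rather than of the actual data.

```python
import io

import pandas as pd

# Made-up file contents: two preamble lines, then the real header row.
raw = (
    "Station Name,EXAMPLE\n"
    "Legend,ignored\n"
    "Date/Time,Temp (C)\n"
    "2012-03-01 00:00,-5.5\n"
    "2012-03-01 01:00,-5.7\n"
)

# skiprows drops the preamble; header=0 then treats the first remaining
# row as the column names.
sample = pd.read_csv(
    io.StringIO(raw),
    skiprows=2,
    header=0,
    index_col="Date/Time",
    parse_dates=True,
)
print(sample)
```

Note that header=0 counts from the first row left after skiprows is applied, not from the top of the file.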

Because of the encoding change, we need to change this as well:

In[]:

weather_mar2012[u"Temp (°C)"].plot(figsize=(15, 5))
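
A small sketch of why the column label depends on the encoding, assuming the file is served as UTF-8 (the thread implies this but does not confirm it). Decoding the same bytes as latin1 turns the degree sign into two characters, so the lookup key would no longer match.

```python
label = "Temp (°C)"
utf8_bytes = label.encode("utf-8")

# Decoding with the matching codec recovers the label unchanged.
as_utf8 = utf8_bytes.decode("utf-8")

# Decoding the same bytes as latin1 yields one character per byte,
# so the two-byte degree sign becomes two characters.
as_latin1 = utf8_bytes.decode("latin1")

print(repr(as_utf8))
print(repr(as_latin1))
```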

Also, the “Data Quality” column disappeared. This requires tweaks while working with columns.

In[]:

weather_mar2012.columns = [
    u'Year', u'Month', u'Day', u'Time', u'Temp (C)',
    u'Temp Flag', u'Dew Point Temp (C)', u'Dew Point Temp Flag',
    u'Rel Hum (%)', u'Rel Hum Flag', u'Wind Dir (10s deg)', u'Wind Dir Flag',
    u'Wind Spd (km/h)', u'Wind Spd Flag', u'Visibility (km)', u'Visibility Flag',
    u'Stn Press (kPa)', u'Stn Press Flag', u'Hmdx', u'Hmdx Flag', u'Wind Chill',
    u'Wind Chill Flag', u'Weather']

In[]:

weather_mar2012 = weather_mar2012.drop(['Year', 'Month', 'Day', 'Time'], axis=1)

In[]:

def download_weather_month(year, month):
    if month == 1:
        year += 1
    url = url_template.format(year=year, month=month)
    weather_data = pd.read_csv(url, skiprows=15, index_col='Date/Time', parse_dates=True, header=0)
    weather_data = weather_data.dropna(axis=1)
    weather_data.columns = [col.replace('\xb0', '') for col in weather_data.columns]
    weather_data = weather_data.drop(['Year', 'Day', 'Month', 'Time'], axis=1)
    return weather_data

mvresh commented 5 years ago

When I use the URL template and the weather data to compare temperatures with the bikes data, the code does not seem to work. I modified the URL template and made the required changes in the later parts, and everything runs without errors. But when I try to output the first few rows of the data, it shows nothing.

mvresh commented 5 years ago

Here's the code:

```python
import pandas as pd

# Getting weather data to look at temps
def get_weather_data(year):
    # Airport station is 5415, hence that was used
    url_template = "http://climate.weather.gc.ca/climate_data/bulk_data_e.html?stationID=5415&Year={year}&Month={month}&format=csv&timeframe=1&submit=%20Download+Data"

    data_by_month = []
    for month in range(1, 13):
        url = url_template.format(year=year, month=month)
        weather_data = pd.read_csv(url, skiprows=15, index_col='Date/Time', parse_dates=True, encoding='utf-8', header=0)
        # '\xb0' is the degree symbol
        weather_data.columns = [col.replace('\xb0', '') for col in weather_data.columns]
        weather_data = weather_data.drop(['Year', 'Day', 'Month', 'Time'], axis=1)
        data_by_month.append(weather_data.dropna())

    return pd.concat(data_by_month).dropna(axis=1, how='all').dropna()

weather_data = get_weather_data(2012)

weather_data[:5]
```
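
One possible explanation for the empty output, which nobody in the thread confirms: a bare dropna() removes every row that has a missing value in any column. If the flag columns are mostly empty, that leaves no rows at all. The sketch below uses a made-up frame to show the difference between dropping rows and dropping all-empty columns.

```python
import numpy as np
import pandas as pd

# Made-up sample: one column is entirely empty, as a flag column might be.
df = pd.DataFrame({
    "Temp (C)": [-5.5, -5.7, -5.4],
    "Temp Flag": [np.nan, np.nan, np.nan],
    "Weather": ["Snow", np.nan, "Fog"],
})

# Default dropna(): drop rows with ANY missing value. Every row has a
# missing "Temp Flag", so nothing survives.
rows_any = df.dropna()

# Drop only the columns that are entirely missing; all rows are kept.
cols_all = df.dropna(axis=1, how="all")

print(len(rows_any))
print(list(cols_all.columns))
```

If this is the cause, dropping all-empty columns before any row-wise dropna() would be one way to keep the data, but that depends on what the downloaded file actually contains.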

kbridge commented 2 years ago

url_template = "http://climate.weather.gc.ca/climate_data/bulk_data_e.html?stationID=5415&Year={year}&Month={month}&format=csv&timeframe=1&submit=%20Download+Data"
# url_template = 'https://raw.githubusercontent.com/kbridge/weather-data/main/weather_data_{year}_{month}.csv'
url = url_template.format(month=3, year=2012)
weather_mar2012 = pd.read_csv(url, index_col='Date/Time (LST)', parse_dates=True, encoding='utf-8-sig')
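
The switch to encoding='utf-8-sig' is not explained in the thread. Presumably the file now starts with a byte order mark (BOM); that is an assumption. The sketch below, using made-up bytes, shows what the two codecs do with a leading BOM.

```python
# Made-up header line with a UTF-8 byte order mark in front.
raw = b"\xef\xbb\xbfDate/Time (LST),Temp\n"

# Plain utf-8 keeps the BOM as an invisible first character.
plain = raw.decode("utf-8")

# utf-8-sig strips a leading BOM if one is present.
stripped = raw.decode("utf-8-sig")

print(repr(plain))
print(repr(stripped))
```

A leftover BOM at the start of the first column name can make a lookup such as index_col='Date/Time (LST)' miss, which may be why the codec matters here.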

Summary:

kbridge commented 2 years ago

Before renaming the columns to eliminate ° characters, drop some unexpected new columns first:

weather_mar2012 = weather_mar2012.drop(['Longitude (x)', 'Latitude (y)', 'Station Name', 'Climate ID', 'Precip. Amount (mm)', 'Precip. Amount Flag'], axis=1)

And the renaming code becomes

weather_mar2012.columns = [
    u'Year', u'Month', u'Day', u'Time', u'Temp (C)', 
    u'Temp Flag', u'Dew Point Temp (C)', u'Dew Point Temp Flag', 
    u'Rel Hum (%)', u'Rel Hum Flag', u'Wind Dir (10s deg)', u'Wind Dir Flag', 
    u'Wind Spd (km/h)', u'Wind Spd Flag', u'Visibility (km)', u'Visibility Flag',
    u'Stn Press (kPa)', u'Stn Press Flag', u'Hmdx', u'Hmdx Flag', u'Wind Chill', 
    u'Wind Chill Flag', u'Weather']

The Data Quality column is removed from the list because the new data no longer contains it.

This also renames the column Time (LST) to Time.
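
As an alternative to assigning a hard-coded list, the same cleanup can be sketched so that it tolerates columns coming and going. This is only a sketch on a made-up one-row frame with a subset of the columns, not the approach used in the notebook.

```python
import pandas as pd

# Made-up frame with a subset of the column names mentioned above.
columns = [
    "Longitude (x)", "Latitude (y)", "Station Name", "Climate ID",
    "Year", "Month", "Day", "Time (LST)", "Temp (°C)", "Temp Flag", "Weather",
]
frame = pd.DataFrame([[0] * len(columns)], columns=columns)

unwanted = [
    "Longitude (x)", "Latitude (y)", "Station Name", "Climate ID",
    "Precip. Amount (mm)", "Precip. Amount Flag",
]
# errors="ignore" skips names that are not present in this frame.
frame = frame.drop(columns=unwanted, errors="ignore")

# Rename by name instead of by position, then strip the degree sign.
frame = frame.rename(columns={"Time (LST)": "Time"})
frame.columns = [col.replace("°", "") for col in frame.columns]

print(list(frame.columns))
```

Renaming by name avoids the length mismatch that a fixed list would hit whenever the site adds or removes a column.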

kbridge commented 2 years ago

No need to drop the column Data Quality anymore:

-weather_mar2012 = weather_mar2012.drop(['Year', 'Month', 'Day', 'Time', 'Data Quality'], axis=1)
+weather_mar2012 = weather_mar2012.drop(['Year', 'Month', 'Day', 'Time'], axis=1)

kbridge commented 2 years ago

temperatures.head is a method now, so you should call it:

-print(temperatures.head)
+print(temperatures.head())
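
A quick sketch of the difference, using a made-up Series in place of the notebook's temperatures:

```python
import pandas as pd

# Made-up stand-in for the notebook's temperatures Series.
temperatures = pd.Series(range(10), name="Temp (C)")

# Without parentheses, print shows the bound method itself, not any rows.
print(temperatures.head)

# With parentheses, head() returns the first five rows by default.
first_rows = temperatures.head()
print(first_rows)
```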

kbridge commented 2 years ago

Change download_weather_month to this:

# mirror
# url_template = 'https://raw.githubusercontent.com/kbridge/weather-data/main/weather_data_{year}_{month}.csv'

def download_weather_month(year, month):
    url = url_template.format(year=year, month=month)
    weather_data = pd.read_csv(url, index_col='Date/Time (LST)', parse_dates=True, encoding='utf-8-sig')
    weather_data = weather_data.dropna(axis=1)
    weather_data.columns = [col.replace('\xb0', '') for col in weather_data.columns]
    weather_data = weather_data.drop([
        'Year',
        'Day',
        'Month',
        'Time (LST)',
        'Longitude (x)',
        'Latitude (y)',
        'Station Name',
        'Climate ID',
    ], axis=1)
    return weather_data

which was

def download_weather_month(year, month):
    if month == 1:
        year += 1
    url = url_template.format(year=year, month=month)
    weather_data = pd.read_csv(url, skiprows=15, index_col='Date/Time', parse_dates=True, header=True)
    weather_data = weather_data.dropna(axis=1)
    weather_data.columns = [col.replace('\xb0', '') for col in weather_data.columns]
    weather_data = weather_data.drop(['Year', 'Day', 'Month', 'Time', 'Data Quality'], axis=1)
    return weather_data

kbridge commented 2 years ago

Sorry for using this issue as if it were my own memo, but I'll be glad if my comments help you.