Track anomalies explanation

krober10nd commented 7 months ago

Hi, thank you for this open source work. I wanted to know if someone could perhaps explain why these odd looking trajectories are occurring?

I ran with 2016-2021 ERA5 with no changes to the codebase. You can see in the image below these almost satellite-esque swath-like trajectories that are clearly non-physical

tracks_NA_2016_2021

Thank you!

linjonathan commented 6 months ago

Are those tracks possibly ones with two points in time? If so, these probably could arise from too large of a covariance (in magnitude) that can result from not daily-averaging winds. I will push a new version in the coming days that should account for this.

Jonathan

krober10nd commented 6 months ago

Thanks Jonathan for the explanation. I will try with the latest release. For the record, those tracks were composed actually of many points in time. I replotted the same figure but with markers for each point along the trajectory. The anomalous tracks have many points. tracks_NA_2016_2021

linjonathan commented 6 months ago

Hm, interesting. I don't see this kinds of track when I run the model. Could you try the new version (v1.1) and let me know if you have the same issues?

krober10nd commented 6 months ago

Hi Jonathan, I've now installed the latest version that you've recently posted and I'm running with python3.10.

I still experience a similar problem.

I do receive a warning now that I did not prior

_/Users/keithroberts/Areas/OpenSourceTCDownscaling/tropical_cyclone_risk/track/bamtrack.py:91: UserWarning: Converting non-nanosecond precision datetime values to nanosecond precision. This behavior can eventually be relaxed in xarray, as it is an artifact from pandas which is now beginning to support non-nanosecond precision values. This warning is caused by passing non-nanosecond np.datetime64 or np.timedelta64 values to the DataArray or Variable constructor; it can be silenced by converting the values to nanosec

In addition, I've tested now basin = GL and basin=NA for the same time interval (2016-2021) and got quite significantly different quantity of synthetic tracks.

For example, the GL basin and plotting only tracks labeled in the NA basin.

tracks_GL_2016_2021 output_netcdf_files_v1.zip

versus using the NA basin and plotting everything

tracks_NA_2016_2021_v2

For reference here is the track visualization code and I'm provided the output NetCDF files.

output_netcdf_files_v1.zip

'''
Visualize produced tracks
'''
import xarray as xr 
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import cartopy.feature as cfeature

ds = xr.open_dataset('data/era5/test2/tracks_NA_era5_201601_202112.nc')
print(ds)

# select tracks for 2016 
#ds_2016_NA = ds.copy()
# select only North Atlantic tracks
ds_2016_NA = ds.sel(basin='NA')
print(ds_2016_NA) 
# make longitude -180 to 180 from 0 to 360
#ds_2016_NA['lon_trks'] = (ds_2016_NA['lon_trks'] + 180) % 360 - 180

fig, ax = plt.subplots(subplot_kw={'projection': ccrs.PlateCarree()})
# loop over all tracks using the n_trk dimension
for i in range(ds_2016_NA['n_trk'].size):
    # select the i-th track
    trk = ds_2016_NA.sel(n_trk=i)
    print(trk.time.values[-1]-trk.time.values[0])
    # plot the track
    ax.plot(trk['lon_trks'], trk['lat_trks'], transform=ccrs.PlateCarree(),marker='o',markersize=0.3,linewidth=0.1)
ax.coastlines()
# show the ocean and land features
ax.add_feature(cfeature.OCEAN)
ax.add_feature(cfeature.LAND)
# zoom into the North Atlantic
ax.set_extent([-100, 0, 0, 60])
# label the title of the plot
plt.title('Tracks of North Atlantic, 2016-2021')
plt.savefig('data/era5/test2/tracks_NA_2016_2021_v2.png', dpi=300, bbox_inches='tight')

linjonathan commented 6 months ago

FYI, the basin variable is actually 'tc_basins', so that you should access NA tracks only using the following code:

lon_trks = ds['lon_trks'].where(ds['tc_basins'] == 'NA', drop = True)
lat_trks = ds['lat_trks'].where(ds['tc_basins'] == 'NA', drop = True)

And the following to plot:

ax.plot(lon_trks[i, :], lat_trks[i, :], transform=ccrs.PlateCarree(),marker='o',markersize=0.3,linewidth=0.1)

I've traced the issue to the wind steering. In your dataset, any tracks that occur during the years of 2020 and 2021 have this issue. You can check this using the 'tc_years' variable. I would suggest looking at your data files for the 2020 and 2021 period, and make sure they follow the file formats. For example, does each year have (at-least) daily level winds? My directory for 2020, for example, looks like the following:

era5_q_monthly_2020.nc   era5_sst_monthly_2020.nc  era5_u_daily_2020.nc
era5_sp_monthly_2020.nc  era5_t_monthly_2020.nc    era5_v_daily_2020.nc

You can also take a look at the _env_wnd_era5_201601202112.nc file that is output by the post-processing. Do the mean u250/u850/v250/v850 wind fields look reasonable?

I haven't been able to reproduce this issue, and given that this only occurs for tracks in the years 2020/2021 from the dataset you've given me, I don't think this is an issue with the code.

krober10nd commented 6 months ago

Thanks, I'll try this out and follow back. I can confirm that for 2021 the u-component of the daily winds did not download so this is indeed likely the issue.

krober10nd commented 6 months ago

Okay this checks out. Thanks.

I'm curious though how you can get the inter annual track frequency variability if you have to specify the number of tracks per year in the namelist.py? Each year produced exactly 20 tracks. How do you approach this in practice?

tracks_2015 tracks_2016 tracks_2017 tracks_2018 tracks_2019

linjonathan commented 6 months ago

The model is set so that each year has a fixed number of tracks. You can convert this to interannual variability of TC frequency under the assumption of constant seeding rate (per year) by normalizing the number of TCs per year by the number of seeds. In the below code, _seeds_peryear is the number of seeds per year.

drop_vars = ["lon_trks", "lat_trks", "u250_trks", "v250_trks", "u850_trks", "v850_trks",
                       "v_trks", "m_trks", "vmax_trks", "tc_month", "tc_years"]
ds_seeds = xr.open_mfdataset(fn_tracks, concat_dim = "year", combine = "nested",
                             data_vars="minimal", drop_variables = drop_vars)
seeds_per_month = ds_seeds['seeds_per_month']
seeds_per_year = seeds_per_month.groupby('year').mean(['year', 'month']

krober10nd commented 6 months ago

Thanks for the response. Sorry for the confusion but I think you meant to say if the seeding rate is variable (per year) then you can normalize otherwise you are dividing two constants (e.g., number of tracks and seeding rate per year). Still however how is that ratio used? Do you drop modeled tracks randomly somehow?

It could be helpful to have a helper function that reads in the dataset and normalizes the modeled tracks to produce realistic inter annual frequencies.

linjonathan commented 6 months ago

If you assume that in the real world the seeding rate is constant ever year, you can divide the number of TC events you have by the total number of seeds every year, to get a survival rate (# events per seed). Then, you have to multiple by an arbitrary constant (which is the total number of seeds every year), to get a TC frequency.

Yes, that's something in plan for the future.

krober10nd commented 6 months ago

Okay thanks but I think this is worth going into more detail..

As an example, I simulated 2016 for the North Atlantic basin. I have specified 20 TC events per the namelist. Following your logic, I divide 20 TC events by the total number of seeds every year (2,232 seeds in 2016 in my case), this gives me an average survival rate 0.00896057347 TC events per seed, then I'd multiply the seed rate per month to get the number of TC events in each month. This however produces the following figure which shows TC events occurring during far out of season months and the numbers are floats (number of events per month) which I guess you round to the nearest integer? I feel like something is missing.

events_per_month

import xarray as xr
import matplotlib.pyplot as plt

# Define file path and variables to drop
file_path = "data/era5/test4/tracks_NA_era5_201501_202012.nc"
variables_to_drop = ["lon_trks", "lat_trks", "u250_trks", "v250_trks", "u850_trks", "v850_trks",
                     "v_trks", "m_trks", "vmax_trks", "tc_month", "tc_years"]

# Load the dataset, concatenating along the 'year' dimension
ds_tracks = xr.open_mfdataset(file_path, concat_dim="year", combine="nested",
                              data_vars="minimal", drop_variables=variables_to_drop)

# Select seed data for the North Atlantic basin and year 2016
seeds_per_month_2016 = ds_tracks.sel(basin='NA', year=2016)['seeds_per_month']
seeds_per_year_2016 = seeds_per_month_2016.sum(dim='month')

# Define the number of events per year and calculate survival rate
events_per_year = 20
survival_rate = events_per_year / seeds_per_year_2016

# Calculate the number of events per month
events_per_month_2016 = seeds_per_month_2016 * survival_rate

# Convert to a pandas Series for plotting
events_per_month_series = events_per_month_2016.to_series()

# Plot the number of synthetic TC events per month
events_per_month_series.plot(kind='bar', color='skyblue')
plt.xticks(range(0, 12), ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.ylabel('Number of TC Events')
plt.xlabel('Month in 2016')
plt.title('Number of Synthetic TC Events per Month\nNorth Atlantic 2016')

# Save and show the figure
plt.savefig('events_per_month.png', dpi=300, bbox_inches='tight')
plt.show()

tracks_NA_era5_201501_202012.nc.zip

krober10nd commented 6 months ago

I see now there is a tc_month variables produce a reasonable inter monthly variability.

krober10nd commented 6 months ago

Could you clarify your previous comment? Based on my previous posts, I have followed the instructions step by step as you suggested. It's unclear what you are trying to communicate.

linjonathan commented 6 months ago

The variable seeds_per_year in the above code gives the number of seeds needed to obtain the specified number of tracks, or n_tracks (in your case, 20). The quantity n_tracks / seeds_per_year gives the number of tracks per number of seeds every year. Since we don't know what the true number of seeds is per year, we multiply by an arbitrary constant with units [number of seeds] to yield a storm count every year. That constant is chosen such that the average of n_tracks / seeds_per_year is equal to the observed number of storms. The seed survival rate varies by basin, but in my experience is on average around 5-10%.

If you want to look at the seasonal cycle of seed survival rate, you need to accumulate the number of storms for each month, and divide that by the number of seeds for that given month.

H = np.histogram(ds_tracks['tc_month'], bins = np.arange(0.5, 12.6, 1))
plt.plot(H[0] / seeds_per_month.sum('year'))

krober10nd commented 5 months ago

Thanks! So then I assume you just select randomly N tracks per year based on the chosen constant multiplied by the n_tracks / seeds_per_year to get a realistic synthetic climatology (in terms of frequency).

linjonathan commented 5 months ago

That sounds about right.

krober10nd commented 1 month ago

To summarize, this script subsets the NetCDF file with all the tracks based on a computed inter-annual variability. It also makes plots showing the intra and inter annual variability. monthly_distribution_AU_1979_2020_v1 inter_annual_variability_AU_1979_2020_v1

import xarray as xr
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# import matplotlib
# matplotlib.use('TkAgg')

######################### Load the dataset and calculate the intra-annual variability #########################
file_path = "data/era5/australia/tracks_AU_era5_197901_202012.nc"
n_tracks = 20  # Each year has a default of 20 tracks
BASIN = "AU"  # The basin to consider
START_YEAR = 1979  # The start year of the dataset
END_YEAR = 2020  # The end year of the dataset
CLIMO_TRACKS_PER_YEAR = (
    6.5  # The number of events per year on avg. based on a climatology
)
######################### Load the dataset and calculate the intra-annual variability #########################
# Load the dataset, concatenating along the 'year' dimension
ds_tracks = xr.open_dataset(file_path)

# Restrict to the basin
ds_tracks = ds_tracks.where(ds_tracks["basin"] == BASIN, drop=True)

# compute the seeds_per_year by summing up the seeds_per_month
seeds_per_year = ds_tracks["seeds_per_month"].sum(dim="month")

mean_seeds_per_year = seeds_per_year.mean().values

# compute the seeds_per_month by averaging the seeds_per_month
seeds_per_month = ds_tracks["seeds_per_month"].mean(dim="year")

survival_rate = n_tracks / seeds_per_year

mean_survival_rate_per_year = survival_rate.mean().values

# What number times the mean_survival_rate * mean_seeds_per_year
#  should be multiplied by to get the number of events per year (CLIMO_TRACKS_PER_YEAR)?
constant_multi_per_year = (
    mean_seeds_per_year * mean_survival_rate_per_year
) / CLIMO_TRACKS_PER_YEAR

# Each seed_per_year should be multiplied by the constant_mult to get the number of events per year
inter_annual_variability = (
    seeds_per_year * (1 / constant_multi_per_year) * mean_survival_rate_per_year
)

# round to integers
inter_annual_variability = np.round(inter_annual_variability)

# For each year only retain the number of tracks in the inter_annual_variability
all_modified_ds = []
for i, year in enumerate(ds_tracks["year"]):
    # only retain the number of tracks in the inter_annual_variability
    ds_for_year = ds_tracks.sel(year=year, basin=BASIN, drop=True)

    # only retain tc_years = year
    ds_for_year = ds_for_year.where(ds_for_year["tc_years"] == year, drop=True)

    number_of_tracks_to_retain = int(inter_annual_variability.sel(year=year).values[0])

    # TODO: fix this
    # create number_of_tracks_to_retain random indices between 0 to 19  without replacement
    if number_of_tracks_to_retain > n_tracks:
        number_of_tracks_to_retain = n_tracks

    min_bound = (year.values - START_YEAR) * n_tracks
    max_bound = min_bound + n_tracks
    random_indices = np.random.choice(
        range(min_bound, max_bound), number_of_tracks_to_retain, replace=False
    )

    #  using the n_trk index to retain the number of tracks
    ds_for_year = ds_for_year.where(
        ds_for_year["n_trk"].isin(random_indices), drop=True
    )

    # put the year back
    ds_for_year["year"] = year

    all_modified_ds.append(ds_for_year)

# concatenate all the modified datasets
ds_tracks_modified = xr.concat(all_modified_ds, dim="year")

# Plot the intra-annual variability (per year)
fig, ax = plt.subplots(figsize=(10, 5))
inter_annual_variability.plot(ax=ax, marker="o", linestyle="none", color="k")
inter_annual_variability.rolling(year=5).mean().plot(
    ax=ax, color="k", linestyle="--", label="5-year rolling mean"
)
ax.set_xlabel("Year")
ax.set_ylabel("Number of Syn. Cyclones per year)")
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
# plot the sum of all the cyclones in the top right corner and indicate the basin
ax.text(
    0.9,
    0.9,
    f"Total: {int(inter_annual_variability.sum().values)}",
    transform=ax.transAxes,
    fontsize=12,
)
ax.text(0.9, 0.85, f"{BASIN} basin", transform=ax.transAxes, fontsize=12)
plt.minorticks_on()
plt.xticks(range(START_YEAR, END_YEAR, 4))
plt.xlim(START_YEAR, END_YEAR)
plt.xticks(rotation=45)
plt.yticks(range(0, 25))
ax.grid(True)
plt.title("Inter-annual variability of synthetic cyclones per year")
plt.axhline(
    y=CLIMO_TRACKS_PER_YEAR,
    color="r",
    linestyle="--",
    label=f"{START_YEAR}-{END_YEAR} Climatology",
)
ax.legend()
plt.savefig(
    f"inter_annual_variability_{BASIN}_{START_YEAR}_{END_YEAR}_v1.png",
    dpi=300,
    bbox_inches="tight",
)
plt.close()

# Plot the monthly distribution of synthetic cyclones
bins = np.arange(0.5, 12.6, 1)
bin_midpoints = bins[:-1] + 0.5
H = np.histogram(ds_tracks_modified.isel(time=0)['tc_month'], bins=bins)

nyears = (END_YEAR - START_YEAR) + 1
fig, ax = plt.subplots(figsize=(10, 5))
ax.bar(
    bin_midpoints, H[0] / nyears, color="k", width=0.8,
)
# above each bin indicate the number of events
for i, val in enumerate(H[0] / nyears):
    ax.text(
        bin_midpoints[i],
        val + 0.1,
        f"{val:.2f}",
        ha="center",
        va="bottom",
        color="k",
        bbox=dict(facecolor="white", alpha=0.5, pad=2),
    )
ax.set_xticks(range(1, 13))
ax.set_ylabel("Avg. Number of Synthetic Cyclones per month")
# put an xlabel indicating the basin 
ax.text(0.9, 0.9, f"{BASIN} basin", transform=ax.transAxes, fontsize=12)
ax.set_xticklabels(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
)
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.yaxis.grid(True)
ax.set_title("Avg. Monthly distribution of synthetic cyclones \n 1979-2020")
plt.savefig(
    f"monthly_distribution_{BASIN}_{START_YEAR}_{END_YEAR}_v1.png",
    dpi=300,
    bbox_inches="tight",
)
plt.close()

# write out the modified dataset
ds_tracks_modified.to_netcdf(
    f"tracks_{BASIN}_{START_YEAR}_{END_YEAR}_modified_v1.nc"
)
print("Done")

linjonathan / tropical_cyclone_risk

Track anomalies explanation #4