Open kdorheim opened 4 years ago
- [x] Plot the new netcdf in Panoply. Based on what you know about `fldmean`, was this what you were expecting?
- In R, set up a script (or, if you want, a markdown document) to answer the following; use `summary` to check out summary stats.

Up next, we will start playing around with executing cdo via the R `system2` call.
Problem: CDO doesn't seem to be able to open netCDF-4 files. The files that I did get to open were CF-convention files. I read online that CDO unzips netCDF-4 files, and I am wondering if that isn't happening for some reason?
NetCDF-4 is ten years old and I'm pretty sure CDO handles it fine. Can you provide a reproducible example?
@skygering Was the issue just with the example data I had you pull from fldgen? Does it work with this data? (I think you are going to have to unzip it first.)
tas_Amon_NorESM2-LM_1pctCO2_r1i1p1f1_gn_000101-001012.nc.zip
Answers to above questions:
The `fldmean` cdo operator calculates the mean value over all of the grid cells in a field, weighting each grid cell by its area.
```
cdo fldmean tas_annual_ipsl-cm5a-lr_rcp8p5_xxx_2006-2099.nc tas_annual_mean.nc
```

The above command created a new file where each timestep has a mean surface temperature. Based on what `fldmean` does, this is the average temperature of the entire land area. There are no lat/lon variables anymore in the new file, just the time step and the average temperature. Here is the graph I got! The world is heating up as expected...
I started by storing the data in a variable `f`. When I ran `print(f)`, I got a detailed list of the variables and dimensions. The variable for mean global temperature has the dimensions latitude, longitude, and time. There were 720 longitude values, 360 latitude values, and 94 time values, which works out to 24,364,800 individual data points for the global mean temperature from 2006-2099. Longitude has units degrees east, latitude has units degrees north, time has units %Y%m%d as a long string, and tas has units Kelvin. There is also another variable, time_bnds, which seems to be all of the time units in list form. The file is an object of class ncdf4 and the other variables are all arrays.
I found the global annual temperature using `apply`, since this let me split the 3D array into 'slices' along time and then average over each slice. I typed `mean_tas <- apply(tas, 3, function(x) mean(x, na.rm = TRUE))`.
I don't know how to use `lapply`, since it needs a list. This suggests to me that the data needs to be modified before using it, maybe with the function `split()`, but I wasn't sure. While this data looks really similar in shape to the previous data, it isn't exactly the same: the maximum temperature is lower on the map made in R. The data in R isn't weighted by a land map like the CDO output is; for this I think I need a land map.
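One way to approximate CDO's area weighting in plain R is a cosine-of-latitude weight (a sketch on a toy array, not the real data; CDO uses actual grid-cell areas, so this shortcut is only an approximation):

```r
# Toy stand-in for the real tas array: 4 lon x 3 lat x 2 time
set.seed(42)
tas <- array(rnorm(4 * 3 * 2, mean = 288, sd = 5), dim = c(4, 3, 2))
lat <- c(-45, 0, 45)  # made-up latitude values (degrees north)

# Unweighted field mean per time step, as in the text above
mean_tas <- apply(tas, 3, function(x) mean(x, na.rm = TRUE))

# Approximate area weighting: weight each latitude band by cos(latitude),
# since grid cells cover less area toward the poles
w <- cos(lat * pi / 180)
weighted_tas <- apply(tas, 3, function(slice) {
  weighted_slice <- sweep(slice, 2, w, `*`)   # scale each lat column by its weight
  sum(weighted_slice) / (nrow(slice) * sum(w))
})

length(weighted_tas)  # one value per time step
```

The unweighted and weighted series differ slightly, which is the same kind of gap seen between the R and CDO graphs.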
I am also struggling with using ggplot2 to plot without using a pipeline, because I need to input a data frame but I don't quite understand what that is. I just want to put in the x and y values, but clearly ggplot2 needs a bit more finesse. Since `apply` returns just a vector, and it seems like I need a more complicated data structure for ggplot2, I am not sure what to do.
I will ask Ben questions tomorrow morning at our meeting!
@skygering great work!
You are right, `apply` is the best function to work with here; later on, though, we will be using `lapply`, because it will let us use a single function to process data from lots of different models. BTW, you used `apply` perfectly! I hope that didn't take you too long to figure out (I had intentionally left out some info in hopes that you'd come ask for help, but it looks like you were able to figure it out).
As for getting the data into a data frame, you can make a data frame out of vectors!
So something like this:
```r
library(ncdf4)
# `data` is the tas array and `nc` the connection returned by nc_open()

# Calculate the mean tas
mean_tas <- apply(data, 3, function(x) mean(x, na.rm = TRUE))

# Extract the time information
time <- ncvar_get(nc, 'time')

# Format the time and temp vectors into a data.frame
df <- data.frame(time = time,
                 value = mean_tas)
```
Something that might be helpful to work on, as @bpbond mentioned, would be to set up reproducible samples of code; this can help with debugging. Also, for the scripts that you write related to these learning activities, let's save them to this repo in a directory called `scratch`, to give you some practice with working with git, pull requests, and reviews. As always, I am happy to talk about any questions you may have, and @bpbond is a great person to talk this over with as well.
I finished everything and made the directory in my repo! Here are the graphs for the weighted (from CDO) and unweighted (from R) data.
Even though it was clear from the graphs that the two data sets are not the same, I still used `identical()` and `all.equal()` to check, to make sure I knew how to use them. All of my work from this exercise is in the `tas_cdo.R` file!
I do have a GitHub question. When I made the scratch directory, I made a branch and then cloned the repository onto my local desktop to put my files into it. I now realize that the GitHub URL is the same no matter which branch, so when I pushed my files up they went into the master branch rather than my new branch. While that doesn't really matter in this case, I was wondering if there is a way to clone or push to a specific branch?
Ready for the next part!
👏 nice work @skygering - love the graphs
`git clone` clones an entire repository, including all branches (which may have different remotes, though I've never done this). Pushing is by definition branch-specific.
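A small self-contained demo of branch-specific cloning and pushing; the repository paths and the `scratch` branch name here are made up for illustration:

```shell
set -e
tmp=$(mktemp -d)

# Set up a throwaway "remote" repository with a scratch branch
git init -q "$tmp/origin"
cd "$tmp/origin"
git config user.email demo@example.com
git config user.name demo
echo hello > file.txt
git add file.txt
git commit -qm "initial commit"
git branch scratch

# git clone -b checks out the named branch right after cloning
cd "$tmp"
git clone -q -b scratch origin clone
cd clone
git rev-parse --abbrev-ref HEAD   # prints: scratch

# Pushing is branch-specific: push the current branch to the matching remote branch
git push -q origin scratch
```

The same `-b` flag works against a GitHub URL, and `git push origin <branch>` is the usual way to make sure your commits land on the branch you intend.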
Background Notes
CDO
Before we get to the tasks, here is a bit of information about the different ways one can use cdo.
Via the R `system2` function (see the system2 documentation). I prefer to execute the cdo commands in R; in my opinion this setup is more reproducible and allows for defensive programming, diagnostic tests, and validity checks. If you are more comfortable with setting up a bash script we can talk about doing that, but for now I would like us to focus on understanding cdo from the command line and in R.
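A minimal sketch of what such a `system2` call could look like, with the kind of defensive checks mentioned above; the file names are hypothetical, and the call is skipped when cdo is not on the PATH:

```r
# Hypothetical file names, for illustration only
in_file  <- "tas_annual.nc"
out_file <- "tas_annual_mean.nc"

# Only attempt the call when cdo is installed and the input exists
if (nzchar(Sys.which("cdo")) && file.exists(in_file)) {
  # system2 returns the exit status (0 on success) when stdout is not captured
  status <- system2("cdo", args = c("fldmean", in_file, out_file))

  # Defensive checks: stop if cdo failed or produced no output file
  stopifnot(status == 0, file.exists(out_file))
} else {
  message("cdo or the input file is not available; skipping the call")
}
```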
cdo uses the following syntax:

```
cdo operator_name in_file_nc out_file_nc
```

where

- `operator_name` is the cdo operator name; there are a boatload of these and they are all listed in the documentation manual. The one we will be working with the most is called `fldmean`.
- `in_file_nc` is the path to the netcdf (or nc) file to process.
- `out_file_nc` is the nc file that will be generated.

Netcdfs and R
Hints for working with netcdfs in R:

- From the `ncdf4` package, the functions `nc_open`, `ncvar_get`, and `ncatt_get` are useful.
- Use `str`, `dim`, `length`, and `head` to get an idea of what the data looks like without printing out the whole thing.
- The extracted netcdf data is organized in lists; the `apply` family of functions (cough cough `lapply`) makes applying a function to a list really easy.

(Please come to me with questions about any of these functions; Stack Overflow has lots of helpful information. Documentation for R functions is available online, and in R use `help("function_name")` for R documentation.)

Plotting in R
Most intro stats classes use base R plots to visualize results. We will be using ggplot2; the grammar rules can be funky at first, so let me know if you have questions. Steph is also a good ggplot resource (she is a data viz wizard). FYI, ggplot syntax works best with long-formatted data.
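As a sketch of what long-formatted data looks like to ggplot2 (made-up numbers; the plotting call is guarded so the snippet runs even when ggplot2 is not installed):

```r
# Made-up annual global mean temperatures (Kelvin), already in long format:
# one row per (time, value) observation
df <- data.frame(time  = 2006:2010,
                 value = c(287.1, 287.3, 287.2, 287.6, 287.8))

# Guarded so this still runs without ggplot2 installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(df, aes(x = time, y = value)) +
    geom_line() +
    labs(x = "Year", y = "Global mean tas (K)")
  print(p)
}
```

This is exactly the shape of data frame built earlier from `time` and `mean_tas`, which is why the vectors-to-data.frame step matters.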