hurlbertlab / core-transient

Data and code for NSF funded research on core vs transient species

general question about converting dates to decimal-year #23

Closed mauriemma closed 9 years ago

mauriemma commented 9 years ago

In converting months+years to decimal-year by dividing the month by 12 and then adding the value to the actual year, am I supposed to leave December observations as 1, thus adding 1 to the year of the observation?

ahhurlbert commented 9 years ago

@mauriemma I would add (month - 1)/12 to the year, so that Jan = 2014.0
Feb = 2014.08
Mar = 2014.17
...
Dec = 2014.92
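A minimal sketch of the convention @ahhurlbert describes (the function name and rounding to two decimals are illustrative assumptions, not project code):

```python
def to_decimal_year(year: int, month: int) -> float:
    """Convert a year and month (1-12) to a decimal year: year + (month - 1)/12."""
    if not 1 <= month <= 12:
        raise ValueError("month must be in 1..12")
    return round(year + (month - 1) / 12, 2)

# Matches the values listed above:
# to_decimal_year(2014, 1)  -> 2014.0
# to_decimal_year(2014, 2)  -> 2014.08
# to_decimal_year(2014, 12) -> 2014.92
```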

ethanwhite commented 9 years ago

This is probably something that's been decided elsewhere, in which case my apologies for jumping in, but this sort of conversion of year and month into a single decimal value seems like it might be a bit confusing. Is there a reason not to either store month and year in separate columns or use an ISO standard date format, e.g., '2014-01', '2014-02'...

bsevansunc commented 9 years ago

My thinking on the decimal years essentially comes down to ease of working with date data. I can see using YYYY-MM if samples were compiled monthly (or YYYY for annually sampled sites). There are datasets, however, in which sites were sampled seasonally (e.g., dataset 223, Lightfoot's small mammal study). In these cases, coding as year-month would be problematic, and sampling data need to be summarized by season (with dates rounded to season). I could see coding it as, for example, "2003.fall", but then our dates would not be in the same format across datasets. In other instances, datasets are sampled at a sub-month temporal grain (such as bi-weekly or weekly samples). The Landis insect dataset (2014), in which data were collected weekly but on different days, is an example of this; in cases such as this, data need to be summarized by week (with dates rounded to week).

To me, using decimal dates provides a way to format dates uniformly across datasets. I can see, however, that if someone wanted to use our formatted datasets differently from our current use, this would render them not very usable, and they would have to go to the raw data to get the original dates. Alternatively, we could write the dates using the standard format, but then the function that creates the proportional occurrence data frame would need to summarize dates differently based on the temporal grain of the study. Thoughts?
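A rough sketch of the "rounding to coarser temporal grains" idea above; the quarter boundaries and the 52-week convention are illustrative assumptions, not project code:

```python
from datetime import date

def seasonal_decimal_year(d: date) -> float:
    """Round a date to its season (quarter), coded as a decimal year."""
    quarter = (d.month - 1) // 3          # 0 = Jan-Mar, ..., 3 = Oct-Dec
    return d.year + quarter * 0.25        # e.g., a fall sample -> YYYY.75

def weekly_decimal_year(d: date) -> float:
    """Round a date to its week of the year, coded as a decimal year."""
    week = (d.timetuple().tm_yday - 1) // 7   # 0..52
    return round(d.year + week / 52, 3)

# seasonal_decimal_year(date(2003, 10, 15)) -> 2003.75 (a "2003.fall" sample)
```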

brymz commented 9 years ago

I would suggest that you reconsider condensing so much info into one column. With computing power where it is today, the number of columns will not have a large impact on analysis run time. I would recommend something like: three columns for year, sampling frequency, and sampling event (numbered from 1 to the total number of sampling events), OR two columns for year and sampling day (in Julian days). Julian days again step away from the ISO standard, but they are useful because 365 is a common denominator when combining monthly and weekly data. Either way, the goal would be to have data that are suitable for sharing, while still carrying the info you need to run your analysis.
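One way to picture the three-column alternative suggested above; the field names are illustrative, not project code:

```python
from typing import NamedTuple

class SamplingEvent(NamedTuple):
    year: int
    frequency: str   # e.g. "monthly", "weekly", "seasonal"
    event: int       # 1 .. total sampling events within the year

records = [
    SamplingEvent(2014, "monthly", 1),   # first monthly sample of 2014
    SamplingEvent(2014, "weekly", 6),    # sixth weekly sample of 2014
]

# Temporal ordering within a dataset is recovered by sorting on (year, event),
# without packing year and sub-year period into a single decimal value.
sorted_records = sorted(records, key=lambda r: (r.year, r.event))
```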


ahhurlbert commented 9 years ago

Thanks for the feedback; I appreciate @ethanwhite's and @brymz's intention of making things easily sharable for others and others' uses. That said, these formatted datasets are already a bit down the pipeline toward some specific analyses we have in mind, for which we have already had to make decisions that should not be embraced without some thought. This means that anyone wanting to use these datasets for different purposes should probably be using the raw data and thinking through these decisions themselves, rather than blindly using our formatted datasets.

Any particular sampling event might have a date associated with it, but one decision we are making is to analyze assemblages aggregated over some temporal scale, e.g. Portal data into 6-month bins. So what is the date you want to assign to Jan-June versus July-December? The coding of this will be arbitrary, and we've chosen to do this in decimal years. What is the downside? It may be slightly confusing, but any other approach invites some confusion as well. For example, in using a YYYY-MM-DD format, will datasets with annual samples be listed as occurring on January 1st?
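A minimal sketch of the 6-month binning described above, where Jan-June maps to YYYY.0 and July-December to YYYY.5 (the coding itself being the arbitrary decision under discussion; the function name is illustrative):

```python
def half_year_bin(year: int, month: int) -> float:
    """Assign a month to a half-year bin, coded as a decimal year."""
    return year + (0.0 if month <= 6 else 0.5)

# half_year_bin(2014, 3)  -> 2014.0  (Jan-June bin)
# half_year_bin(2014, 11) -> 2014.5  (July-Dec bin)
```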

We're happy to revisit this issue and hear further arguments, but I see downsides to every approach mentioned so far and view this as pretty low priority.

ethanwhite commented 9 years ago

I guess the take home here is that these don't really represent dates in any meaningful sense since they are simultaneously integrating a variety of different things. For the purposes of our analyses I guess they're really just the sampling periods that we have defined for each dataset. Is that fair to say?

ahhurlbert commented 9 years ago

Yes, fair to say. Having said all this, @bsevansunc just suggested having our "formatted datasets" be something a bit more generic and upstream in the workflow than they currently are which would maintain all info at the finest spatial and temporal grains of sampling (including raw dates and site ids). In this case, the "formatting" would be cleaning up some very basic issues with the data, and these datasets could be of use to the broader community.

The downside is that for our analysis purposes, we then have two separate sets of scripts that are dataset specific: each dataset's "cleaning script", and then each dataset's "scale decisions" script. In many cases, the latter doesn't necessarily have to be a script, but can be scale arguments from the metadata table that are passed into our functions for calculating proportional occupancy. But a number of datasets are configured in such a way that a generic solution here would be quite difficult. For example, how would you know which Sevilleta plant quadrats are grouped together into the same plot, which plots into the same trapping web, etc.? You're going to need a dataset-specific script.
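For concreteness, the proportional-occupancy idea referred to here can be sketched as the fraction of distinct sampling periods in which each species was observed; this is a generic illustration under that assumption, not the project's actual function:

```python
from collections import defaultdict

def proportional_occupancy(records):
    """records: (species, sampling_period) pairs; returns the fraction of
    distinct sampling periods in which each species was observed."""
    periods = {p for _, p in records}
    seen = defaultdict(set)
    for sp, p in records:
        seen[sp].add(p)
    return {sp: len(ps) / len(periods) for sp, ps in seen.items()}

# With 4 sampling periods, a species seen in all 4 leans "core" (occupancy 1.0)
# and one seen in only 1 leans "transient" (occupancy 0.25):
recs = [("A", 1), ("A", 2), ("A", 3), ("A", 4), ("B", 2)]
# proportional_occupancy(recs) -> {"A": 1.0, "B": 0.25}
```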

So, really what we're talking about is creating a secondary intermediate data product relative to what we're doing now, primarily for uses outside of this project. I can see this as a plausible way forward, but am a little fuzzy on the costs/benefits.

ethanwhite commented 9 years ago

Since we're really just using date here as sampling period, and we're having a script do that conversion for us based on the raw data (which makes it reproducible/understandable), I think it's fine for the sampling-period column to have something arbitrary in it. The key is just for all of us to keep track of the fact that that column is really more ordered sampling periods than dates per se. I should have thought through this a little more carefully before jumping in.

brymz commented 9 years ago

I'll echo @ethanwhite. It's good for us to be on the same page. No reason to make more work beyond that.