swulfing / STOICH.Aim1


Suspected problem with tidy data set #4

Open · diatomdaniel opened this issue 2 years ago

diatomdaniel commented 2 years ago

Hi all (and particularly @LinneaRock and project head-honcho @mcstreamy), I have discovered an issue with the tidy version of the data set (i.e. all results in one column (RESULT) with another column providing the variable name (VARIABLE)): the data are not uniquely identified in this format. When I tried to pivot from long to wide format, duplicate entries per site and time were created that need to be summarised/averaged. For example, after loading the ALL_CNP_VARS data set, compare the following code (requires the tidyverse package to be loaded):

master_wide <- ALL_CNP_VARS %>%
  pivot_wider(id_cols = c(DATE_COL, SITE_ID, UNIT, LAT, LON, ECO_TYPE),
              names_from = VARIABLE, values_from = RESULT)

compared to

master_wide <- ALL_CNP_VARS %>%
  pivot_wider(id_cols = c(DATE_COL, SITE_ID, UNIT, LAT, LON, ECO_TYPE),
              names_from = VARIABLE, values_from = RESULT, values_fn = mean)

I suspect this behaviour occurs because the date column contains YY-MM-DD but no hourly information? Regardless, this is an issue we should address soon. Please note that I am not trying to criticise previous work/efforts or the validity of the tidy data format; I just think this is a data issue we need to address for consistent results going forward. Also, maybe I am being an idiot and this is already accounted for. If so, please let me know.
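To make the problem concrete, here is a minimal duplicate check, just a sketch using the column names above (assumes the tidyverse is loaded):

library(tidyverse)

# Count rows that share the same site, sampling date, and variable.
# Any n > 1 means pivot_wider() cannot place that RESULT value uniquely
# and will either return list-columns or need a values_fn to collapse them.
dupe_check <- ALL_CNP_VARS %>%
  count(SITE_ID, DATE_COL, VARIABLE) %>%
  filter(n > 1) %>%
  arrange(desc(n))

dupe_check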

Thanks

LinneaRock commented 2 years ago

Hey! Ahh, I actually have not noticed this issue. But it totally makes sense that duplicates could be in there for some of the sites, especially LTER sites I'm guessing. I think that taking the average of values over the same date seems appropriate. Thank you for sharing this info!

diatomdaniel commented 2 years ago

Ok, let's bring this up at the meeting today. I don't like the idea of averaging over the long-term data; I think we should treat all observations as equal and avoid averaging where possible, especially if we want to utilise the temporal aspect of some of the data. Neglecting it would be a major caveat, especially if these lakes and rivers have undergone significant trends. Perhaps adding another variable called SEASON (or some other information) would help us avoid this issue?
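Something like this is what I have in mind, just a sketch (the season cut-offs are arbitrary placeholders, and it assumes DATE_COL is already stored as a Date; parse it first otherwise):

library(tidyverse)
library(lubridate)

# Derive a SEASON column from the sampling date (meteorological seasons for
# the Northern Hemisphere; the grouping is only illustrative).
ALL_CNP_VARS_season <- ALL_CNP_VARS %>%
  mutate(SEASON = case_when(
    month(DATE_COL) %in% c(12, 1, 2) ~ "winter",
    month(DATE_COL) %in% 3:5         ~ "spring",
    month(DATE_COL) %in% 6:8         ~ "summer",
    TRUE                             ~ "autumn"
  ))

# Note: SEASON alone will not make rows unique when duplicates fall on the
# same date, but it keeps within-year structure instead of averaging it away.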

LinneaRock commented 2 years ago

Oh, I thought you meant duplicates within the same date?

diatomdaniel commented 2 years ago

I think so, but I need to do some more digging to confirm; will do over the next week and then either close or update this issue.

diatomdaniel commented 2 years ago

Found duplicates per site and sampling date. The number of duplicates ranges from 2 to 38, and they are mainly streams/rivers; 57% have 2 duplicates and approx. 5% have more than 10 observations. The distribution of the data for the duplicates is weird, but see for yourselves in Code/Checking duplicates.R. You'll need to update file paths though...
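For anyone who doesn't want to open the script, the summary boils down to something like this (a sketch only, not the actual contents of Code/Checking duplicates.R; assumes the tidyverse is loaded):

library(tidyverse)

# Rough summary of the duplicate structure: how many observations each
# site/date/variable combination has, and how that splits by ecosystem type.
dupes <- ALL_CNP_VARS %>%
  count(SITE_ID, DATE_COL, VARIABLE, ECO_TYPE, name = "n_obs") %>%
  filter(n_obs > 1)

dupes %>%
  summarise(
    prop_two      = mean(n_obs == 2),  # share of duplicated combinations with exactly 2 rows
    prop_over_ten = mean(n_obs > 10),  # share with more than 10 rows
    max_dupes     = max(n_obs)
  )

dupes %>% count(ECO_TYPE, sort = TRUE)  # which ecosystem types are affected most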

diatomdaniel commented 2 years ago

@mcstreamy can you add this to Friday's discussion please? I'll send you an html with the figures from the analysis if we are going to discuss.

mcstreamy commented 2 years ago

yes sire @diatomdaniel

LinneaRock commented 2 years ago

@diatomdaniel @mcstreamy -- I may have missed this conversation when I was late to our last BGC team meeting, but did this get resolved?

mcstreamy commented 2 years ago

Ah, whoops. I actually didn't bring it up during the last meeting. Have you guys already determined a way to move through this or should we discuss Friday? @diatomdaniel @LinneaRock @dnguyen2017