CT.Rmd - Githubissues

jens-daniel-mueller commented 4 years ago

Hi Lara,

I'll post tasks here that should be coded within CT.Rmd. All of the following refers to data extracted for the NGS region (lat 57.5-58.8). Let's start with:

[x] Extract Finnmaid pCO2, S, and T data on standard track E with route label E, F, G, W, X (see explanation in pdf document), not only route label E as we did before in SST.Rmd
[x] Extract windspeed from GETM, need to be calculated from u and v component
[x] Calculate CT with code I provided in CT.Rmd. This requires Finmmaid pCO2, S, T
[x] Calculate air-sea CO2 fluxes with code I provided in CT.Rmd. This requires Finmmaid pCO2, T and GETM windspeed
[x] Add synchronized dygraphs for wind, CT, and CO2 flux

More to come, have fun & THANK YOU!

LSBurchardt commented 4 years ago

Hi Jens,

I went through the files again to check, what might need comments. Not as much as I thought in the end, especially not for the current audience. A few question marks in the CT file, that's it..

On the issue of headings and chunk names being the same: we could just number the chunks, if you want to avoid doublings, this wouldn't compromise understanding in my eyes the way the code works right now.

Another thing: could you tell me the ngs limits again? Currently, there are two different sets of numbers in the code (57.5 - 58.5 and 57.5-58.8)

jens-daniel-mueller commented 4 years ago

Hi Lara,

I've commented on your question marks in the code. However, I guess this doesn't tell you much, because all comments are related to calculations of seawater chemistry. Just accept those lines as black box for now.

Let's keep the chunk labels as they are. Numbering the chunks is not helpful, because figures that are automatically saved in docs/figure when knitting the document, are named after the chunk name.

Thanks for hinting me to the different lat limits. Let's stick to neither of both options and use 58.5 - 59.0 from now on. However, these limits will anyways be revised at a later stage.

jens-daniel-mueller commented 4 years ago

Task list for the comming 2 weeks as discussed today.

[x] Correct latitude limits for NGS region to 58.5 - 59 N throughout all Rmd files
[ ] Familiarize with tsibble objects / the tsibble package and explore the tidyverts package family. Implement functionality if helpful for our tasks.
some URLs to get started: https://tidyverts.org/ https://tsibble.tidyverts.org/ https://cran.rstudio.com/web/packages/tsibble/vignettes/intro-tsibble.html https://robjhyndman.com/hyndsight/tsibbles/

Within CT.Rmd

[x] Restrict further analysis to period March - September of each year
[x] Identify continous periods with without long gaps between observations (implement a variable threshold for the gaps, start with one week). Let’s name those periods without gaps "deployments" and assign a unique counter to each deployment. I've used this code before to identify deployments: deployment = cumsum(c(TRUE,diff(date_time)>=30) This code should increase the counter variable "deployment" by 1 after each gap of 30 sec
[x] To get an overview of the data coverage, plot: time of the year (Mar-Sep, x-axis) vs year (2003-2018, y-axis) + color or points to show deployment periods
[x] Identify periods with negative CT development, refered to as primary production periods "PPP"
Criteria for PPP:
- CT must be below CT_equi (CT_equi calculation was just implemented)
- must be within one deployment
- start at the earliest day followed by week in which CT decreased at least by 20
- end at the day in the week before which CT still decreased by at least 20
Please create a counter variable for PPP, similar to the counter for the deployments.
lead() and lag() might be useful functions
if required, we could switch to a daily interpolated CT data set, but let's discuss this first, if necesary.
[x] Plot CT timeseries for each year with color highlighting the identified PPP

Good luck ;-)

LSBurchardt commented 4 years ago

Hi Jens,

some follow-up questions on the tasks: I implemented the deployment values and tried to plot the data as we discussed. This is not trivial at all, the version you see now is closest to what we discussed. What do you think about it? Would only work out the details ( axis limits, colors, etc.) when you think this sufficient or adequate in showing what we want.

So far I implemented this without the tsibble package, but I am still checking it out, especially for the next task we might need it.

jens-daniel-mueller commented 4 years ago

Hi Lara, your general approaches were OK. Here is what I changed and some general recommendations for future tasks

Use tidyverse write_csv() and read_csv() rather than write.csv() and read.csv(). Writing and reading objectes works very well when both functions are used consistently, in particular column types are recognized well . Also, it creates tibble rather than data.frame objectes, which is what we want.
[x] Switch to write_csv() and read_csv() throughout all Rmd's, use vroom only to read large data files and when no problems result

For consistency, use tidyverse pipe and tibble handling whenever possible, eg:

gt_mld_fm_pco2_ngs <- gt_mld_fm_pco2_ngs %>% 
mutate(date = as.Date(date))]

rahter than

gt_mld_fm_pco2_ngs$date <- as.Date(gt_mld_fm_pco2_ngs$date)

also when creating new data sets, use tibble rather than data.frame
I did not understand why you created and merged with generic_date when identifying the deployments. I only commented out the respective code, but if you agree that's not required, it can be deleted.
plot is OK now, you can continue with the PPP identification

Good luck ;-)

jens-daniel-mueller commented 4 years ago

Hi Lara,

I think your approach to identify PPP could work, however, here are some comments:

I would have expected the analysis to start with something likedf %>% group_by(year, deployment) %>% … rather than your loops, but this difference might only be “style”
There is a difference in how we interpret the end condition. In my initial instruction, I wrote “end at the day in the week before which CT still decreased by at least 20”, and you translated it to “- ends at day where this is not true for the next week”. Do you see the difference? My idea was that the end criterion should be looking backwards in time. The PPP should last from the day when the start criterion was first met until the day when the end criterion is finally achieved.
In the end there should be two independent sets of criteria for start and end conditions (e.g. decrease_start, decrease_end, timespan_start, timespan_end)
Days for which the start criterion holds true are currently labeled with x, so even two consecutive days which meet the start criterion get different values, although they belong to the same PPP. I hope this can be changed in the post-processing when NA’s will be removed.

Let’s talk if the comments are not entirely clear to you!

LSBurchardt commented 4 years ago

Hi Jens,

thanks for your thoughts, a few comments on that:

group_by or loop: I wanted to use group_by first, but turned out to be easier and faster to get what we want with loops
I added a second block for the end criterion, so criteria could differ
we now get a data table including all observations, that belong to a ppp in our respective time frame; problem: the enumeration is still "random", do you see an easy way to use cumsum to enumerate from this? (I did not yet fully understand the cumsum function, so it might be faster if you already have an idea)
functionality is given, it is reasonably fast but there is definitely smarter ways to get to the results, let me know how much time we want to invest in getting a more "pretty" way

Let me know what you think.

jens-daniel-mueller commented 4 years ago

Hi Lara, I’ve added some lines of code to create interim data frames and plots, which help to diagnose the outcome of your code. The revised Rmd was knitted to html and pushed, so should be online by now. Here are a few questions that arose:

Why does fm_ppp have so many (60.000 out of 65.000) duplicates?
Rows in fm_ppp do not appear ordered by date. Is this supposed to be like that?
After removing duplicates and arranging by date, I find NA in between identical PPP numbers. This is not supposed to be the case. One PPP can only start and end, but not be interrupted by NA.
Also, PPP does not increase with year. Is that OK?
Plotting CT time series colored by PPP, I’ve the impression that the identification does not work correctly, and I assume this has to do with the answers to the questions above.

Let’s try to address this questions first. Once this is done, the conversion of your PPP to meaningful numbers shouldn’t be a problem. /Jens

LSBurchardt commented 4 years ago

Hi Jens,

thanks for your comments. The plots look nice, some of the colorings will probably change, ones the ppps are properly enumerated. To your questions:

Duplicates: The way the last loop works right now, a duplicate is written for every row in a deployment, forwards and backward. Since removing duplicates workes so well and fast, I left it like that for the time being, one could probably also match the rows when binding together.
that is also why dates are not ordered at first, because of the duplicates for the same deployments
NAs between ppps are ok, as I said, the numbers of ppps now are just the index of the row in the deployment they start, so 2 and 2 are not the same ppp when separated by NAs (should be different deployments also)
this is also the reason, that ppp numbers don't necessarily increase with year.

I hope that answers your questions and we can start the conversion. /Lara

LSBurchardt commented 4 years ago

Hi Jens,

a few - if not all - of the problems we talked about are now solved.

ppps are enumerated correctly
the minimum CT value in a 7 day look forward is used as identifier, not the CT value of the latest existing date in 7 day look forward
end-criteria are only tested, when there was a start

The data looks more reasonable now. I carefully checked 2016, all seems to be correct. There is some weird point formations left, for example in 2004. They do meet all the criteria, but with very fast and strong changes up and down within a week, all five points are regarded as one ppp. Have a look and let me know, whether we need another criteria to prevent something like that.

Hope everything is clear, otherwise, lets talk. Cheers Lara

jens-daniel-mueller commented 4 years ago

Hi Lara,

your latest version does not produce exactly the same PPP as the version I knitted yesterday, which is still online as html. It seems that some period which were seperate PPP in yesterdays version, now fall into one group (e.g. 2018). Maybe this has to do with the new positioning of the rm() function. Can you please run your current version again locally, compare the outcome to the online version and figure out which approach is correct?

Concerning your appraoch to re-assign ppp for fm_ppp_final: Isn't it possible that two PPPs are identified, that do not have a gap > 7 days between the end of PPP1 and start of PPP2? In this case your approach would combine two seperate PPPs, right?

Also, can you please try to assign PPP that start from 1 each year? This will make it easier to distuinguish colors in plots. (Use group_by(year) during assignment; compare to assignement of deployment numbers)

LSBurchardt commented 4 years ago

Hi Jens,

I checked the data again. No, it has nothing to do with the rm() function, but with a calculation in the end condition, that was wrong before. We had switched to Minimum CT in 7 days forward, but it was still maximum date in 7 days backward, which didn't made sense anyway. I changed it to minimum CT in 7 days backward for the last push, which I now realize also doesn't make sense, so no it is maximum CT in 7 days backward for the end criterion, to match the start criterion. These results should be correct.

To the reassignment of PPPs: yes, that is true. I didn't think about that before. I tried some other options, that did not work and now just re-ran the code with a 4-day gap instead of 7 to check what the difference would be. There is a difference in the later years with long deployments and many days within ppps, maybe check the differences again and we need to just decide for a ppp_gap here. I can't think of another way to do it right now...

Apart from that, PPPs start at 1 for each year now.

jens-daniel-mueller commented 4 years ago

Hi Lara, two comments/requests after revising, knitting and pushing your last changes:

when you use group() always ungroup(), otherwise the data set remains in grouped modus for subsequent tasks which can easily produce severe errors.
results with the new ppp_gap look fine at the moment, but do not allow the flexibility I want for later analysis when I'll play around with start and end conditions a lot. So, the aim must be to find a solution without ppp_gap. The underlying problem is the labeling procedure in the first identification step, which labels one ppp with different numeric values. Would it somehow be possible to assign one unique ppp in the first step? This would make life much easier afterwords.

You can always call to discuss approaches to this task.

LSBurchardt commented 4 years ago

Hi Jens,

the new idea is pushed. Should enumerate reliably now, if there are still concerns, lets talk later.

jens-daniel-mueller commented 4 years ago

Hi Lara, thanks for the update, technically it looks clean now!

In the attached file I've marked a few periods, where I don't really understand why they are identified or not. Let's discuss this before we go on.

CT_timeseries_yearly_PPP-1_manual_check

jens-daniel-mueller commented 4 years ago

Hi Lara, I'm happy that you agreed to make final revision of the code. I expect that after that revision we're at a stage perfect to hand things over.

Here is an update after our last call:

Remaining issues

single points identified as PPP
interupted ppp with identical counter
too early stop of PPP

Tasks/questions (potentially) affecting functionality

581, 582, and 584: Can those lines be deleted?
carefully check if()-conditions: Do we always select the correct start and end combination?
614: Avoid to look back further than start of PPP! Maybe do only assign df_ppp_temp$ppp[x] = as.numeric(a)

Tasks/questions consering code style:

avoid naming objects as "_temp" or "df", try to be as precise about content as possible
527, 534: unnecessary repetition of a <- 1
525: replace by pull(max(deployment))
532: replace 1:deployment by unique(df_ppp$deployment), than avoid 539
547: should be identical to if (end == "stop"), because start can only be either "stop" or "go"
564: Could be replaced by unique identifier, to avoid the (unlikely) case that the exact CT value exists twice

CT_timeseries_yearly_PPP-1_2010-04-23

jens-daniel-mueller commented 4 years ago

Hi Lara, thank you so much for revising the code again! Indeed, taking a first look, the output seems perfect now. I'm curious about the outcome when I'll change the ppp conditions...

Two quick questions:

584: if (start == "go" | start == "stop" & end == "go"| start == "go" & end == "go") isn't this the same as if (start == "go" | end == "go")?
do you intend to do also the "cosmetic" revision of the code, or should I do it? This is really just a question to avoid merge conflicts, not a plea...

Cheers!

LSBurchardt commented 4 years ago

Hi Jens,

awesome! Happy to hear that.

584: yes, that should be true. This can be changed.
I will leave the cosmetics to you.

If anything else comes up, let me know and we can see how deal with it.

Cheers!

jens-daniel-mueller / Baltic_Productivity

CT.Rmd #3