Open jens-daniel-mueller opened 4 years ago
Hi Jens,
I went through the files again to check, what might need comments. Not as much as I thought in the end, especially not for the current audience. A few question marks in the CT file, that's it..
On the issue of headings and chunk names being the same: we could just number the chunks, if you want to avoid doublings, this wouldn't compromise understanding in my eyes the way the code works right now.
Another thing: could you tell me the ngs limits again? Currently, there are two different sets of numbers in the code (57.5 - 58.5 and 57.5-58.8)
Hi Lara,
I've commented on your question marks in the code. However, I guess this doesn't tell you much, because all comments are related to calculations of seawater chemistry. Just accept those lines as black box for now.
Let's keep the chunk labels as they are. Numbering the chunks is not helpful, because figures that are automatically saved in docs/figure when knitting the document, are named after the chunk name.
Thanks for hinting me to the different lat limits. Let's stick to neither of both options and use 58.5 - 59.0 from now on. However, these limits will anyways be revised at a later stage.
Task list for the comming 2 weeks as discussed today.
[x] Correct latitude limits for NGS region to 58.5 - 59 N throughout all Rmd files
[ ] Familiarize with tsibble objects / the tsibble package and explore the tidyverts package family. Implement functionality if helpful for our tasks.
some URLs to get started: https://tidyverts.org/ https://tsibble.tidyverts.org/ https://cran.rstudio.com/web/packages/tsibble/vignettes/intro-tsibble.html https://robjhyndman.com/hyndsight/tsibbles/
Within CT.Rmd
[x] Restrict further analysis to period March - September of each year
[x] Identify continous periods with without long gaps between observations (implement a variable threshold for the gaps, start with one week).
Let’s name those periods without gaps "deployments" and assign a unique counter to each deployment.
I've used this code before to identify deployments: deployment = cumsum(c(TRUE,diff(date_time)>=30)
This code should increase the counter variable "deployment" by 1 after each gap of 30 sec
[x] To get an overview of the data coverage, plot: time of the year (Mar-Sep, x-axis) vs year (2003-2018, y-axis) + color or points to show deployment periods
[x] Identify periods with negative CT development, refered to as primary production periods "PPP"
Criteria for PPP:
Please create a counter variable for PPP, similar to the counter for the deployments.
lead() and lag() might be useful functions
if required, we could switch to a daily interpolated CT data set, but let's discuss this first, if necesary.
[x] Plot CT timeseries for each year with color highlighting the identified PPP
Good luck ;-)
Hi Jens,
some follow-up questions on the tasks: I implemented the deployment values and tried to plot the data as we discussed. This is not trivial at all, the version you see now is closest to what we discussed. What do you think about it? Would only work out the details ( axis limits, colors, etc.) when you think this sufficient or adequate in showing what we want.
So far I implemented this without the tsibble package, but I am still checking it out, especially for the next task we might need it.
Hi Lara, your general approaches were OK. Here is what I changed and some general recommendations for future tasks
Use tidyverse write_csv()
and read_csv()
rather than write.csv()
and read.csv()
. Writing and reading objectes works very well when both functions are used consistently, in particular column types are recognized well . Also, it creates tibble rather than data.frame objectes, which is what we want.
[x] Switch to write_csv() and read_csv() throughout all Rmd's, use vroom only to read large data files and when no problems result
For consistency, use tidyverse pipe and tibble handling whenever possible, eg:
gt_mld_fm_pco2_ngs <- gt_mld_fm_pco2_ngs %>%
mutate(date = as.Date(date))]
rahter than
gt_mld_fm_pco2_ngs$date <- as.Date(gt_mld_fm_pco2_ngs$date)
also when creating new data sets, use tibble rather than data.frame
I did not understand why you created and merged with generic_date
when identifying the deployments. I only commented out the respective code, but if you agree that's not required, it can be deleted.
plot is OK now, you can continue with the PPP identification
Good luck ;-)
Hi Lara,
I think your approach to identify PPP could work, however, here are some comments:
I would have expected the analysis to start with something likedf %>% group_by(year, deployment) %>%
… rather than your loops, but this difference might only be “style”
There is a difference in how we interpret the end condition. In my initial instruction, I wrote “end at the day in the week before which CT still decreased by at least 20”, and you translated it to “- ends at day where this is not true for the next week”. Do you see the difference? My idea was that the end criterion should be looking backwards in time. The PPP should last from the day when the start criterion was first met until the day when the end criterion is finally achieved.
In the end there should be two independent sets of criteria for start and end conditions (e.g. decrease_start, decrease_end, timespan_start, timespan_end)
Days for which the start criterion holds true are currently labeled with x, so even two consecutive days which meet the start criterion get different values, although they belong to the same PPP. I hope this can be changed in the post-processing when NA’s will be removed.
Let’s talk if the comments are not entirely clear to you!
Hi Jens,
thanks for your thoughts, a few comments on that:
Let me know what you think.
Hi Lara, I’ve added some lines of code to create interim data frames and plots, which help to diagnose the outcome of your code. The revised Rmd was knitted to html and pushed, so should be online by now. Here are a few questions that arose:
fm_ppp
have so many (60.000 out of 65.000) duplicates?fm_ppp
do not appear ordered by date. Is this supposed to be like that?Let’s try to address this questions first. Once this is done, the conversion of your PPP to meaningful numbers shouldn’t be a problem. /Jens
Hi Jens,
thanks for your comments. The plots look nice, some of the colorings will probably change, ones the ppps are properly enumerated. To your questions:
I hope that answers your questions and we can start the conversion. /Lara
Hi Jens,
a few - if not all - of the problems we talked about are now solved.
The data looks more reasonable now. I carefully checked 2016, all seems to be correct. There is some weird point formations left, for example in 2004. They do meet all the criteria, but with very fast and strong changes up and down within a week, all five points are regarded as one ppp. Have a look and let me know, whether we need another criteria to prevent something like that.
Hope everything is clear, otherwise, lets talk. Cheers Lara
Hi Lara,
your latest version does not produce exactly the same PPP as the version I knitted yesterday, which is still online as html. It seems that some period which were seperate PPP in yesterdays version, now fall into one group (e.g. 2018). Maybe this has to do with the new positioning of the rm() function. Can you please run your current version again locally, compare the outcome to the online version and figure out which approach is correct?
Concerning your appraoch to re-assign ppp for fm_ppp_final: Isn't it possible that two PPPs are identified, that do not have a gap > 7 days between the end of PPP1 and start of PPP2? In this case your approach would combine two seperate PPPs, right?
Also, can you please try to assign PPP that start from 1 each year? This will make it easier to distuinguish colors in plots. (Use group_by(year) during assignment; compare to assignement of deployment numbers)
Hi Jens,
I checked the data again. No, it has nothing to do with the rm() function, but with a calculation in the end condition, that was wrong before. We had switched to Minimum CT in 7 days forward, but it was still maximum date in 7 days backward, which didn't made sense anyway. I changed it to minimum CT in 7 days backward for the last push, which I now realize also doesn't make sense, so no it is maximum CT in 7 days backward for the end criterion, to match the start criterion. These results should be correct.
To the reassignment of PPPs: yes, that is true. I didn't think about that before. I tried some other options, that did not work and now just re-ran the code with a 4-day gap instead of 7 to check what the difference would be. There is a difference in the later years with long deployments and many days within ppps, maybe check the differences again and we need to just decide for a ppp_gap here. I can't think of another way to do it right now...
Apart from that, PPPs start at 1 for each year now.
Hi Lara, two comments/requests after revising, knitting and pushing your last changes:
when you use group()
always ungroup()
, otherwise the data set remains in grouped modus for subsequent tasks which can easily produce severe errors.
results with the new ppp_gap
look fine at the moment, but do not allow the flexibility I want for later analysis when I'll play around with start and end conditions a lot. So, the aim must be to find a solution without ppp_gap
. The underlying problem is the labeling procedure in the first identification step, which labels one ppp
with different numeric values. Would it somehow be possible to assign one unique ppp
in the first step? This would make life much easier afterwords.
You can always call to discuss approaches to this task.
Hi Jens,
the new idea is pushed. Should enumerate reliably now, if there are still concerns, lets talk later.
Hi Lara, thanks for the update, technically it looks clean now!
In the attached file I've marked a few periods, where I don't really understand why they are identified or not. Let's discuss this before we go on.
Hi Lara, I'm happy that you agreed to make final revision of the code. I expect that after that revision we're at a stage perfect to hand things over.
Here is an update after our last call:
Remaining issues
Tasks/questions (potentially) affecting functionality
Tasks/questions consering code style:
Hi Lara, thank you so much for revising the code again! Indeed, taking a first look, the output seems perfect now. I'm curious about the outcome when I'll change the ppp conditions...
Two quick questions:
584: if (start == "go" | start == "stop" & end == "go"| start == "go" & end == "go")
isn't this the same as if (start == "go" | end == "go")
?
do you intend to do also the "cosmetic" revision of the code, or should I do it? This is really just a question to avoid merge conflicts, not a plea...
Cheers!
Hi Jens,
awesome! Happy to hear that.
If anything else comes up, let me know and we can see how deal with it.
Cheers!
Hi Lara,
I'll post tasks here that should be coded within CT.Rmd. All of the following refers to data extracted for the NGS region (lat 57.5-58.8). Let's start with:
More to come, have fun & THANK YOU!