Break point analysis - GitHub issues

IslaMS commented 5 years ago

Hey @izzyrich

Here is a website explaining how to do a break point analysis for a time series as we discussed in our meeting yesterday. I think there are a few different packages that you can use to do this, but this one has nice code examples. This could be a way to statistically and visually test whether the area of your different land cover types changes in association with your SEP events.

https://rpubs.com/MarkusLoew/12164

Isla

izzyrich commented 5 years ago

Hi @IslaMS,

Thanks for the link. I've been trying it out, but I'm not sure it makes sense for my data:

It doesn't seem like it works with a LMM structure? I have summed the area for each year instead. I visualised the change in different regions and, at least for abandoned, it is the same general trend.
As you can see, there are potentially a lot of breakpoints - it's very up and down, with sharp increases and decreases. If I apply the breakpoints as the 2 events, a lot of the information is lost and the relationship is simplified.

Do you recommend: (a) going forward with the breakpoint analysis, and visually looking to choose the breakpoints e.g. not just having the events? (b) proceeding with the moving window method we spoke about before (c) something else?

I like the visualisation aspect of the tutorial, but I'm not sure it makes the most sense statistically. Let me know what you think. Just for clarification, I am trying to use that for the time lag question.

I have also redone the statistics for the first 2 questions and have found some interesting and exciting results! You were correct in that switching to this method fixed my issues with meeting the assumptions. My model outputs can be found here.

It seems like the most intriguing story is in relation to intensive land, so I've decided to focus more on that. One sort of curious thing is that abandoned to intensive and intensive to abandoned undergo change in the same direction, which seems odd. This is true with transitions to/from extensive and intensive too. I've had a think about it and I think it could make sense ecologically and socially, but it definitely wasn't what I was expecting!

I'll finish up my methods now and then start fixing my graphs and writing my results for questions 1 and 2. Let me know what you think about how best to approach the last question about time lags.

Thank you and happy Sunday!

Izzy

IslaMS commented 5 years ago

Hi @izzyrich

That is exciting about your new results! I think that it is a good idea to explore the intensive land transitions in greater detail, but don't forget to respond to your initial hypotheses about the other land cover types as well through your analysis.

Break point analysis is a separate analysis - it is not a hierarchical analysis so it does not replace the mixed models. It works on any individual time series. It will definitely work with your data! It is a separate analysis to your mixed models and tells you about the timing of changes in the land cover. I would do the break point analysis in addition to your mixed model analyses as discussed yesterday because it is telling you something slightly different, that complements the mixed model analysis in relation to your research questions. I would read up a bit more on the analysis so that you better understand what it does.

One could argue that it is the analysis that responds to a fourth question that you hadn't yet explicitly articulated. I would really encourage you to explicitly think about each targeted question you want to ask and which statistical analysis helps you to answer which question. And that you do that before you run the statistical analyses. From my perspective, the combination of both the mixed model and break point analyses together will give you more complete answers to your questions.

For the break point analysis, because it is not a hierarchical analysis, you need to summarise your data to the Latvia scale and then run the analyses. Alternatively, you could summarise to your larger grid scale, run one break point analysis for each grid, and then compare where the break points are found and how often they fall at or near your SEPs. Remember, in break point analysis you determine how many break points to ask the function to fit. So in your case you would probably ask it to fit two because you have two SEPs, and you would see if the resulting break points line up with your two SEPs.
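As a rough illustration of the "fit two break points per grid, then compare them with the SEP years" idea, here is a minimal base-R sketch. It is not the function from the linked tutorial: it simply tries every pair of break years and keeps the pair with the lowest residual sum of squares from separate straight-line fits. The data, the grid names `C` and `NE`, and the helpers `make_series`, `segment_rss` and `find_two_breaks` are all made up for the example; only the SEP years 1991 and 2004 come from this thread.

```r
# Toy data only: one hypothetical series per grid, with shifts placed by hand.
years <- 1985:2015
seps  <- c(1991, 2004)                       # SEP years mentioned in the thread

make_series <- function(b1, b2) {
  level <- ifelse(years <= b1, 10, ifelse(years <= b2, 6, 9))
  level + 0.1 * sin(years)                   # small deterministic wiggle
}
series <- list(C  = make_series(1991, 2004), # shifts right at the SEPs
               NE = make_series(1993, 2006)) # shifts two years after each SEP

# Residual sum of squares of one straight line fitted to one segment.
segment_rss <- function(y, x) sum(resid(lm(y ~ x))^2)

# Exhaustive search for the two break years that minimise the total RSS.
# A "break year" is the last year of the segment before the change.
find_two_breaks <- function(y, x, min_len = 3) {
  n <- length(y)
  best <- list(rss = Inf, breaks = c(NA, NA))
  for (i in min_len:(n - 2 * min_len)) {
    for (j in (i + min_len):(n - min_len)) {
      rss <- segment_rss(y[1:i], x[1:i]) +
        segment_rss(y[(i + 1):j], x[(i + 1):j]) +
        segment_rss(y[(j + 1):n], x[(j + 1):n])
      if (rss < best$rss) best <- list(rss = rss, breaks = x[c(i, j)])
    }
  }
  best$breaks
}

# One row per grid, one column per estimated break year.
est <- t(sapply(series, find_two_breaks, x = years))
colnames(est) <- c("break1", "break2")

# Lag of each estimated break relative to the matching SEP year.
lags <- sweep(est, 2, seps)
print(est)
print(lags)
```

In this toy data the `C` series shifts exactly at the SEP years and the `NE` series two years later, so `lags` comes out as 0 for `C` and 2 for `NE`. With real, noisier data the estimated years would be less clear-cut, which is why comparing them across grids is useful.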

izzyrich commented 5 years ago

Hi @IslaMS,

I'm getting a bit (big understatement!) concerned. I've been trying to do the moving window analysis to answer the time lag and something very strange is happening - for every land use type and every transition the results are EXACTLY the same for each time step, as well as the original one. I'm not sure what is going on! I'm pretty sure my method is correct because if I change the start date to something random, the results change. Keeping the original period the same though, the results are the same. I guess it could be possible, but this is SO unlikely in my opinion for so many different transitions! It would be great if you could check out my script (lines 551-810 in my statistics_datavis script).

I tried to change the response variable to percent coverage per cell. I'm not sure this makes sense to me though. It makes more sense to me to do something like that per big grid cell, but still, I feel like this representation is not accurate as it ignores the fact that there are other land cover types in each grid and cell. I think I will just keep it as area in kilometres squared. What do you think about this?

In terms of data visualisation, I would like to clarify that my ideas are good:

effect size bar graph for each transition/area change for before and after
overall visualisation for each question similar to my Q1barfig
for the time lag question, I'm thinking of having a line graph with the area/transition area change over time as a general visualisation as well

In terms of the breakpoint analysis, I went through and did this for abandoned land. I feel like it doesn't make sense for/go along with my questions in terms of the fact that I'm assuming that 1991 and 2004 are the actual breakpoints, when in reality, I think there may be a lag for at least some land use types. I tried looking into tutorials that estimated unknown breakpoints, as I think this would be a more interesting addition to my analysis - i.e. maybe there's a breakpoint 5 years after an event, or before! What do you think about this/do you agree? I can't find a great tutorial on this though, but I'll keep looking. I think, most importantly, I need to answer my first 3 questions well first.

I look forward to your reply and being less overwhelmed!

Izzy

IslaMS commented 5 years ago

Hi @izzyrich

I have taken a look at your code and I found a few errors and I think they might solve at least some of your problems, but I am not totally sure. Maybe you can pull the new code and see if the errors that you describe above are sorted. Basically with your filtering you always need to include the 'year ==' bit or you aren't filtering properly. So instead of this:

dplyr::filter(year == 1989 | 1990 | 1992 | 1993)

it should be:

dplyr::filter(year == 1989 | year == 1990 | year == 1992 | year == 1993)

You can always check your filtering by making sure that only the years you want remain after the filter line of your pipe. Doing those kinds of checks throughout is probably a good idea.
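To see why the first version keeps every row, here is a small base-R sketch on made-up years (the vector `year` and the names `wrong`, `right` and `short` are just for illustration). A bare number such as `1990` on its own counts as TRUE, so the whole condition is TRUE for every year. The same logical vector is what `dplyr::filter()` would receive. The `%in%` form is a shorter way to write the corrected filter.

```r
# Made-up years, not the real data.
year <- 1985:1995
wanted <- c(1989, 1990, 1992, 1993)

# Buggy condition: the bare numbers are treated as TRUE.
wrong <- year == 1989 | 1990 | 1992 | 1993

# Corrected condition, and an equivalent shorthand.
right <- year == 1989 | year == 1990 | year == 1992 | year == 1993
short <- year %in% wanted

all(wrong)                # TRUE: no year is filtered out
year[right]               # 1989 1990 1992 1993
identical(right, short)   # TRUE

# The kind of check suggested above: only the wanted years should remain.
stopifnot(setequal(unique(year[short]), wanted))
```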

Also I don't think you want cell as a random effect as there is only one number per cell per land use type per year right? So it doesn't need to be a grouping factor.

I am not sure your percentage calculation makes total sense to me. I meant for you to use the area of the cell as the denominator, however big that is, and not necessarily the sum of all of the land cover types that you are using, in case those don't add up to the total area of the cell. Personally, I don't think it makes any statistical difference whether you use area or percentage area; what you want to think about is which units will be most logical for your reader.
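A tiny sketch of the two denominators, with invented numbers for a single cell (the cell area and the cover values are assumptions, not taken from the data):

```r
# Hypothetical numbers for one cell; none of these come from the real data.
cell_area_km2 <- 25
cover_km2 <- c(abandoned = 4, extensive = 6, intensive = 10)  # sums to 20, not 25

# Denominator = area of the whole cell.
pct_of_cell <- 100 * cover_km2 / cell_area_km2

# Denominator = sum of the mapped land cover types only.
pct_of_classes <- 100 * cover_km2 / sum(cover_km2)

print(pct_of_cell)      # 16 24 40
print(pct_of_classes)   # 20 30 50
```

The two versions only agree when the land cover types add up to the full cell area.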

Also, I would encourage you to plot each model at the same time as you do the statistical test. This is a good logical test that the stats are working: does a box plot of your two time periods look different or not? Are the effect sizes what you would expect them to be? A really quick way to plot the data going into one of your models is the following code.

plot(average ~ before_after, data = abandonedlag2)
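One way to pair the plot with the numbers is sketched below on simulated data (the data frame `toy` and its values are invented; the column names follow the line above). For a model with a single two-level factor, the reported effect size should equal the difference between the two group means, which is also what the box plot shows visually.

```r
# Simulated stand-in for one of the model data frames; not the real data.
set.seed(1)
toy <- data.frame(
  before_after = factor(rep(c("first", "second"), each = 20)),
  average = c(rnorm(20, mean = 12), rnorm(20, mean = 9))
)

# Box plot of the two time periods (same idea as the plot() call above).
boxplot(average ~ before_after, data = toy)

# Fit the simple model and pull out the effect of the second period.
fit <- lm(average ~ before_after, data = toy)
effect <- unname(coef(fit)["before_aftersecond"])

# Sanity check: the effect equals the difference in group means.
group_means <- tapply(toy$average, toy$before_after, mean)
mean_diff <- unname(group_means["second"] - group_means["first"])
print(c(effect = effect, mean_diff = mean_diff))
```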

Unfortunately, I can't test out your break point analysis as my operating system is too old and the package won't load. I am planning to sort this out shortly, but I do think that a break point analysis works for me logically and I can see from some of your figures that it should work with your data. In the break point function, I think you ask for two break points, not when those break points will be located. So if there is a lag, the break points will be fit after your actual SEP years (1991 and 2004); if there is no lag, the break points will fall on the actual SEP years. If there is no change in the land cover then there will be no break points fit. Thus, in a sense the break point analysis tests all of your questions in one go, though it is not possible to implement it in a hierarchical analysis. But if, as I said before, you ran it at the level of your regional grid, then you could test how often among regions there are two break points fit around the time of the SEPs.

So hopefully that stuff helps you out a bit.

Isla

izzyrich commented 5 years ago

Hi Isla,

That helps a lot! Thank you.

I just want to check that I understand the model outputs when grid is included as a fixed effect.

                   Estimate Std. Error t value Pr(>|t|)
(Intercept)         12.0551     1.4306   8.426 2.92e-15
before_aftersecond  -0.6766     1.1981  -0.565   0.5727
gridNE               4.4999     1.8054   2.493   0.0133
gridNW              -4.3125     2.0215  -2.133   0.0339
gridSE               9.1192     1.7187   5.306 2.49e-07
gridSW               5.4254     2.1298   2.547   0.0115 *

C is the grid that is the reference, so does this mean that it is the intercept? However, when I add -1 to the end of the model, it shows before_afterfirst as the first line and gridC is nowhere to be seen! I checked and there are definitely values for Grid C in this model. I'm not sure if I'm misinterpreting this, but I'm not sure how to get the slope for Grid C.

Also, if the reference is the first time period, do I need to adjust the slopes of the lines accordingly? Does this affect the error and t value? Also, would I need to report R2 for each individual factor?

If I include grid as a fixed effect, my whole project changes, as I never originally asked what is happening by region. I run the risk of "bad science" - changing my project for it to be more "interesting"/significant. I'm not sure what you think of this. I was thinking of introducing region in the breakpoint question? Let me know what you think.

Thank you! Izzy

izzyrich commented 5 years ago

Due to the days running out and the time change issue, I've made a decision on my own and I hope you agree!

Use LMM for the transitions and areas and honour the fact that I was never going to look at regions as a fixed effect
Use moving window to see if time lag or not
Use break point analysis to see turning point in land use change and transitions
Include graphs about relationship by grid in appendix and state that this could be an important predictor to examine in future

IslaMS commented 5 years ago

I don't think that looking at region as a fixed effect means you are no longer answering the questions you set out to answer, and I don't think it is "bad science" per se. You can also extract this information from the random effects table from your original models, so technically you are testing it already. If you said something was "significant" overall when it was only significant in one region or something, that would be bad science. You don't have to explore everything and you don't have to explore region if you don't want, but you had already started down that path by graphing things in your code, so I thought you might want to know that you can easily statistically model those relationships to ponder them more.

I think it is a part of trying to understand your results - which I think you should work on doing now. You are feeling stressed out by your timeline, but what is key is to not let mistakes and such creep into your dissertation. Thinking about each result and breaking down the information helps you to understand your findings and make sure you have done what you thought you had done and that there are no mistakes in the data like the year filtering issue. I am not going to pick up on all mistakes as I am not as deep into your project, so you are in the best place to make these additional checks.

If I were you I would pick your key findings and then try to break them down, graph them and understand them in a few different ways to make sure they are robust. You don't need to present all of those different ways in your dissertation, but it is a good check that your data are what you expect and that the model results make sense. Are your results generalizable to all regions across Latvia or are they stronger for certain regions versus others? You started to do this by graphing, which is why I threw in the linear model with a region as a fixed effect. For example, if the overall model has a difference in effect size before and after an SEP then at the region level, this effect should also be clear in a majority of regions.

Doing that also highlighted, I think, that you are missing a key element of understanding model results tables. This is the trickiest bit; we somewhat covered it in Data Science and it is in the Coding Club tutorials, but not with this level of model complexity. The model tables are relative, so the first category alphabetically is what all other categories are relative to, but in your model you now have two fixed effects. So the effect size for each region is reported relative to the first, but for both the before_after and the region category together. That is why the C category and the before category are not explicit in the model table. Adding the -1 adds the before category back in, but not the C category. I would stick with the relative table if I were you, and try to understand that further if you want, rather than setting intercepts to zero. Or just use that model with region as a fixed effect as a check for yourself of your overall models and the regional patterns that you graphed, without worrying too much about figuring out the relativeness of the fixed effects at this stage.
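Here is a small sketch of that relativeness on simulated data (the data frame `toy`, its values and the object names are invented; the factor levels mirror the table earlier in this thread). It shows which coefficient names appear with and without the -1, and uses `predict()` to get the fitted value for every period and grid combination, including grid C, without doing the arithmetic by hand.

```r
# Simulated data with the same factor levels as in the thread; values are made up.
set.seed(42)
toy <- expand.grid(before_after = c("first", "second"),
                   grid = c("C", "NE", "NW", "SE", "SW"),
                   rep = 1:6)
grid_shift <- c(C = 0, NE = 4, NW = -4, SE = 9, SW = 5)
toy$average <- 12 - 0.7 * (toy$before_after == "second") +
  grid_shift[as.character(toy$grid)] + rnorm(nrow(toy))

# Relative (default) table: intercept = first period in grid C.
m_rel <- lm(average ~ before_after + grid, data = toy)
# Same model with -1: both periods appear, gridC still does not.
m_noint <- lm(average ~ before_after + grid - 1, data = toy)

print(names(coef(m_rel)))
print(names(coef(m_noint)))

# Fitted value for every combination, so nothing has to be added up by hand.
newdata <- expand.grid(before_after = levels(toy$before_after),
                       grid = levels(toy$grid))
newdata$predicted <- predict(m_rel, newdata)
print(newdata)
```

In both versions the fitted values are identical; only the way the table is parameterised changes. The intercept of the relative model is the estimate for grid C in the first period, and each `grid` row is the difference from grid C.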

What I was noticing for that model is how the effect sizes were negative for the graphed comparisons where there was a greater drop after the SEP. So that helped to convince me that the model structure is working overall and that you do have some regional differences at play. Whether those regional differences are "socioeconomicalpolitically and ecologically" relevant is another question, and not necessarily one you need to address, but I think it is good to see how your overall pattern is represented within the spatial structure of your data.

In general, for every major finding in your dissertation, you want to have a statistical model that tests that finding. But you may also want to have some additional statistical models in your appendix to provide greater context and understanding of your findings and dataset. Not everything needs to be statistically modeled and you don't need to present every model that you run, but sometimes running a few additional models will improve your understanding, as will trying out different model structures, such as moving random effects to fixed effects and vice versa. At this stage you are aiming for as complete an understanding as possible, particularly of your main results.

izzyrich commented 5 years ago

Hi @IslaMS,

Thank you for your thorough response and sorry for my late one! As you probably sensed, I had a bit of a meltdown/stress overload. Your comments really helped ground me and I feel a lot more equipped to go on.

I have now completed a draft of my methods and results. I feel like (hopefully) my results are balanced in the message conveyed regarding region vs. Latvia as a whole. My most up to date draft is in my writing folder. It would be great if you could have a quick look at it to see if it is along the right lines. Specifically, I'm curious as to where you think my breakpoint section should go, e.g. at the beginning of my results or at the end. Currently, I have it at the end, but I think it may be a good basis for the rest of my questions, so putting it at the beginning may help to give context to the results that follow. Also, I was wondering if you think I need to specifically have another question addressing the breakpoint analysis. I don't think I do, as it helps to answer all my other questions rather than asking a new one (in my opinion). Let me know what you think.

I will start to plug away at my discussion now! Thank you for all your help and all your support/understanding - it is very appreciated!

Best, Izzy

IslaMS commented 5 years ago

Hi @izzyrich ,

That all sounds great. Because I am your marker, I am not going to take a look at the dissertation text, though I can read a brief section as per at least the Geography guidelines if you want me to. I don't think you need a separate question for the break point analysis necessarily. I think there are multiple appropriate ways it could be incorporated successfully. Best of luck with the final stages!

Isla

izzyrich / dissertation

Break point analysis #8