mablab / sftraj-proposal

Repository for sftraj project proposal for ISC
https://www.r-consortium.org/projects/call-for-proposals
Creative Commons Attribution 4.0 International

Trajectory as LINESTRING #1

basille closed this issue 5 years ago

basille commented 5 years ago

I defined a trajectory as a path (curve), formed by steps as elementary units. I thus propose to use LINESTRINGs with two vertices (start and end point of each step) as basic elements of a sftraj object.
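As a rough illustration of this definition, here is a minimal sketch in plain Python (not sf, to keep it self-contained). The `(x, y, t)` fix layout and the names `fixes_to_steps` and `step_to_wkt` are assumptions made for the example, not part of the proposal.

```python
from typing import List, Tuple

# Hypothetical layout for one fix: (x, y, t). Not taken from the proposal.
Fix = Tuple[float, float, float]
Step = Tuple[Fix, Fix]  # (start fix, end fix)


def fixes_to_steps(fixes: List[Fix]) -> List[Step]:
    """Sort fixes by time and pair each one with its successor."""
    ordered = sorted(fixes, key=lambda fix: fix[2])
    return list(zip(ordered, ordered[1:]))


def step_to_wkt(step: Step) -> str:
    """Render the geometry of one step as a two-vertex WKT LINESTRING."""
    (x0, y0, _), (x1, y1, _) = step
    return f"LINESTRING ({x0:g} {y0:g}, {x1:g} {y1:g})"


fixes = [(0.0, 0.0, 0.0), (1.0, 1.0, 10.0), (3.0, 1.0, 20.0)]
steps = fixes_to_steps(fixes)
wkt = [step_to_wkt(s) for s in steps]
```

With this pairing, n fixes yield n - 1 steps, and a single fix yields no step at all.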

basille commented 5 years ago

From Bart Kranstauber @bart1

I have considered LINESTRING but am not very keen on it for a few reasons (I do however think it should be easy to convert between both representations).

These arguments basically boil down to the fact that I think lines are already a derived property and I would say it is more important to facilitate the original format.

basille commented 5 years ago

From Edzer Pebesma

Points (fixes) will always be your starting point, your observations, and any movement between points involves assumptions and/or modelling; even a two-point LINESTRING makes such an assumption. If it made the assumption of some smoothing, it doesn't let you reconstruct the original observations.

With LINESTRINGs of more than two points, you will typically have mostly self-intersecting lines, which renders most of the geometry operations as they are in sf (from GEOS) (do two trajectories cross?) useless.

basille commented 5 years ago

From Ioannis Kosmidis

By the way, while working on methods that coerce trackeRdata objects to sf, I have used LINESTRING, which seems to work well. I think LINESTRING + timestamp is a good choice as a basis of the standard, because it looks to me to be the most-tedious in terms of retaining the original measurements.

bart1 commented 5 years ago

Thinking a little bit more about this, I also realized that removing locations from a LINESTRING-based trajectory is probably quite a bit more complicated, because two LINESTRINGs need to be modified at once to create a new third one. What would be the strong reasons for basing such a class on LINESTRING instead of POINT?
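To make the comparison concrete, here is a small sketch in plain Python with hypothetical helper names (`drop_location_points`, `drop_location_steps`). It assumes steps are stored as `(start, end)` pairs and only handles interior locations.

```python
def drop_location_points(points, i):
    """Point representation: removing location i is a plain deletion."""
    return points[:i] + points[i + 1:]


def drop_location_steps(steps, i):
    """Step representation: location i ends steps[i - 1] and starts steps[i].

    Both steps are replaced by one merged step that skips the location.
    """
    if not 0 < i < len(steps):
        raise IndexError("this sketch only handles interior locations")
    merged = (steps[i - 1][0], steps[i][1])
    return steps[:i - 1] + [merged] + steps[i + 1:]


points = [(0, 0), (1, 0), (2, 0), (3, 0)]
steps = list(zip(points, points[1:]))

kept_points = drop_location_points(points, 1)
kept_steps = drop_location_steps(steps, 1)
```

Both routes describe the same trajectory afterwards; the step version simply has to touch two elements instead of one.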

basille commented 5 years ago

To me, this all comes from the data model. I tried to detail my reasoning in the proposal and in this figure:

[Figure: Data model]

Essentially, the main reason is that we are considering movement and not locations anymore. This might sound like pedantic semantics, but all conceptual models of movement (e.g. Turchin 1998, Nathan et al. 2008) are based on steps as elementary units. Movement analyses generally rely on this step model (e.g. random walks, Step Selection Functions, State-Space Models/Hidden Markov Models).

This does not mean we're not considering points: as Edzer rightly pointed out, we start from point data. If all we needed was raw data however, we wouldn't use a data model at all, we could work directly (in this case) on POINTS. But that makes plotting and analyses less direct than considering steps right away.

Let me try here to address all your points:

Lines are already making inferences about the movement between two locations that are frequently not valid; for me, this representation is already a derivative of the known information.

Yes, by design of using a model. This is not a bug, this is a feature. Turchin considers straight-line movement as artificial, as an "idealization". Note that we're intuitively doing this idealization when we're plotting trajectories.

Most movement data is recorded as positions over time (I would be happy for counterexamples) and most algorithms deal with positions over time. I think converting back and forth is unneeded and might induce conversion errors and inaccuracies.

Yes, we start with point data (tracking data to be precise, i.e. (x,y[,z],t)). But this is only the raw data, not our trajectory model. There shouldn't be much need to go back and forth, if we're working in a movement framework.

A lot of information is attached to the positional observations (GPS accuracy, tracking device diagnostics, etc.). When storing as points this is much more easily captured.

Yes, and data.frames allow this information to be stored. Note that information can also be attached to the step itself, beyond its geometrical properties. Examples include population density along the step, proportion of open habitat along the step, difference in elevation along the step, etc. Note that sf objects have an agr attribute that defines the attribute-geometry relationship, i.e. how attributes relate to their geometry (constant, aggregate, or identity). We may be able to use it to define whether ancillary data is attached to the initial location or to the step.

There is information attached to the segments (speed, duration, distance) but this information is mostly derived. I would opt for calculating this on the fly: the information is then always up to date, most of it is quick enough to compute anyway, and it is known which algorithm was used (e.g. haversine or rhumb line distance). When there is a need to store this, it can be done quite easily by adding an NA to the vector.

Yes again, agree 100%. Step geometrical attributes (speed, duration, distance) can be computed on the fly with specific functions. Users can store this information if they wish so. However, starting from LINESTRINGs makes this operation a lot easier.
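As an illustration of computing step attributes on the fly, here is a minimal sketch in plain Python. It assumes fixes stored as `(lon, lat, t)` with `t` in seconds, a spherical Earth, and the haversine formula; the function names and the returned keys are hypothetical.

```python
import math

EARTH_RADIUS_M = 6371008.8  # mean Earth radius in metres (spherical assumption)


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two lon/lat pairs in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def step_attributes(step):
    """Derive distance, duration and speed from one (start, end) step."""
    (lon1, lat1, t1), (lon2, lat2, t2) = step
    distance = haversine_m(lon1, lat1, lon2, lat2)
    duration = t2 - t1
    # Speed is undefined for a zero or negative duration.
    speed = distance / duration if duration > 0 else float("nan")
    return {"distance_m": distance, "duration_s": duration, "speed_m_s": speed}


# One degree of latitude covered in one hour.
attrs = step_attributes(((0.0, 0.0, 0.0), (0.0, 1.0, 3600.0)))
```

Because both end points live in the same record, each attribute needs only the step itself, with no lookup of the neighbouring row.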

I think there is a need to include auxiliary information from other sensors (heart rate, acceleration, pressure loggers). When using point information this can be easily incorporated using a POINT EMPTY. Given that these measurements are mostly recorded with a single timestamp and not a period, it is easier to combine them with point observations.

Definitely, we will need to figure out a way to associate trajectory data to auxiliary information from other sensors. However, these are very different objects, with most of the time a different sampling rate. Following the principles of tidy data, I would not mix those in the same data.frame as steps, as they are different observational units. That said, we will need functions to aggregate or summarize these data along steps (I think this should come in a secondary stage of the project).

Removing locations from a LINESTRING-based trajectory is probably quite a bit more complicated, because two LINESTRINGs need to be modified at once to create a new third one.

I see your point… but why would a point be removed after converting to trajectories? In my opinion, this should happen beforehand (or am I missing something?). In any case, that could be a rare case where there would be a need to convert back to POINTS.

edzer commented 5 years ago

Are you envisioning a constraint to continuous trajectories, where location 2 of step i is identical to location 1 of step i+1? How would you deal with a trajectory consisting of a single point?

bart1 commented 5 years ago

Thank you for your elaborate reply. I'll reply where relevant.

Essentially, the main reason is that we are considering movement and not locations anymore. This might sound like pedantic semantics, but all conceptual models of movement (e.g. Turchin 1998, Nathan et al. 2008) are based on steps as elementary units. Movement analyses generally rely on this step model (e.g. random walks, Step Selection Functions, State-Space Models/Hidden Markov Models).

I agree that movement is a collection of steps and that it is important to consider it as such. But then steps are just as non-independent as the points are. Along those lines one could argue for one big LINESTRING containing all points in a track.

I feel to some extent that storing segments is a midway compromise that suffers from the disadvantages of both. Each segment is connected to the next one also by properties like the location error. Therefore segments are not very satisfactory to me either. For me the segmented structure is ensured by having a class at the higher trajectory level. By moving from POINTS to segments you just move this discrepancy one level up.

This does not mean we're not considering points: as Edzer rightly pointed out, we start from point data. If all we needed was raw data however, we wouldn't use a data model at all, we could work directly (in this case) on POINTS. But that makes plotting and analyses less direct than considering steps right away.

Yes, we start with point data (tracking data to be precise, i.e. (x,y[,z],t)). But this is only the raw data, not our trajectory model. There shouldn't be much need to go back and forth, if we're working in a movement framework.

I think here it is important to acknowledge that most analyses (which I think such a class should facilitate, otherwise it kind of becomes an academic exercise) are designed to deal with the locations. This goes both for older but still used methods like MCP and kernels, and for newer methods. For example, CTMM and BBMM can be estimated by a Kalman filter looping over all locations and estimating the likelihood of the trajectory at once. One counterexample might be step selection functions, where the analysis is on the basis of steps. But information from the previous step is still frequently included (to calculate turn angles). This is also an example where filtering is often needed to regularize the track. FPT might be approached by intersecting circles with segments, but I would guess that calculating distances is quicker. I think most implementations of analyses need a lot of special conditions to deal with a segment-wise version, in particular to deal with the last location.

Most packages dealing with movement that I'm aware of (and you probably know them better from the review) store the track as locations (e.g. adehabitatLT, CTMM, move and trajectories). That is not necessarily a reason to do so and stick to the status quo, but I think it is a reason to carefully consider doing otherwise.

Yes, and data.frames allow this information to be stored. Note that information can also be attached to the step itself, beyond its geometrical properties. Examples include population density along the step, proportion of open habitat along the step, difference in elevation along the step, etc. Note that sf objects have an agr attribute that defines the attribute-geometry relationship, i.e. how attributes relate to their geometry (constant, aggregate, or identity). We may be able to use it to define whether ancillary data is attached to the initial location or to the step.

In the scenario where the basis is LINESTRINGs, I feel you always have to deal with either the first or last location. I feel this would frequently need to be dealt with as a special condition. One option would be a zero-length LINESTRING, but this is also a kind of hack.

Yes again, agree 100%. Step geometrical attributes (speed, duration, distance) can be computed on the fly with specific functions. Users can store this information if they wish so. However, starting from LINESTRINGs makes this operation a lot easier.

Definitely, we will need to figure out a way to associate trajectory data to auxiliary information from other sensors. However, these are very different objects, with most of the time a different sampling rate. Following the principles of tidy data, I would not mix those in the same data.frame as steps, as they are different observational units. That said, we will need functions to aggregate or summarize these data along steps (I think this should come in a secondary stage of the project).

I agree to some extent with this argument and see the point of separating the data from different data streams. One approach could be to nest all data streams on one individual/entity together, accommodating different tracking technologies, tracking episodes and auxiliary sensors. One reason to facilitate this to some extent is that if one filters for a time period, it would be nice for this operation to be applied to all data streams. In that sense it might be worth exploring the tsibble package for time series tibbles, although I have not looked into the details.

I see your point… but why would a point be removed after converting to trajectories? In my opinion, this should happen beforehand (or am I missing something?). In any case, that could be a rare case where there would be a need to convert back to POINTS.

For me such an sfTraj class should also facilitate cleaning up trajectories, which for many types of data collection involves marking or removing points from trajectories. I envision that this should be part of the class, instead of doing all of that beforehand. If this is not easily possible, one needs to implement a lot of separate data handling, whereas such a class would have most of it already implemented. A frequently used filter where a movement class is useful is when you want to filter locations that have a 180 degree turn angle in combination with a high speed or GPS location error.

Another problem I see is that any manipulation of the locations has to be done with a lot of extra care to not invalidate the trajectory. Assuming one requirement is that the end point of the previous segment should be the same as the start point of the next, this needs to be continuously dealt with and validated, while with a point representation this is automatically accounted for and valid (the same would be the case if the track is one long LINESTRING).

This duplication of start and end points, for me, violates some of the basic ideas of how to store data, since some basic data is duplicated.
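The continuity requirement mentioned above could be checked along these lines. This is only a sketch in plain Python, assuming steps are stored as `(start, end)` pairs; `is_continuous` and `find_breaks` are hypothetical names.

```python
def is_continuous(steps):
    """True if every step starts exactly where the previous one ended."""
    return all(prev[1] == nxt[0] for prev, nxt in zip(steps, steps[1:]))


def find_breaks(steps):
    """Indices of steps whose start differs from the previous step's end."""
    return [
        i
        for i, (prev, nxt) in enumerate(zip(steps, steps[1:]), start=1)
        if prev[1] != nxt[0]
    ]


connected = [((0, 0), (1, 0)), ((1, 0), (2, 0)), ((2, 0), (3, 0))]
# Dropping the middle step leaves a gap between (1, 0) and (2, 0).
gapped = [connected[0], connected[2]]
```

A point sequence never needs this check, which is the asymmetry described above; a step table needs it after every edit if connectivity is to be enforced.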

edzer commented 5 years ago

I can see a lot of reasonable arguments on both sides. As a resolution: why not go for both, and make the exploration of what works best part of the project. My feeling is that (i) although segments are needed at some stage (plotting...) most operations you want to do are much easier to carry out with a point representation (with time an attribute, not in M), (ii) users will find it easier to work with the points, so uptake/reuse of that will be much stronger compared to LINESTRING with M=time representations. Having both and conversion from one to the other should be trivial.

basille commented 5 years ago

@edzer

Are you envisioning a constraint to continuous trajectories, where location 2 of step i is identical to location 1 of step i+1?

I have myself too often been in a situation where I needed disconnected steps to enforce it… There is the problem with NAs (discussed here), but beyond this, I think we should allow users to remove steps in between other steps, or, rather, to select only some disconnected steps (imagine you want to only extract steps longer than a threshold, or associated with a given mode for instance).

How would you deal with a trajectory consisting of a single point?

Probably NA. By definition, a trajectory requires at least two points. But admittedly, we'll have to assess how to deal with it in practice!

I can see a lot of reasonable arguments on both sides. As a resolution: why not go for both, and make the exploration of what works best part of the project.

I like this idea very much! We can definitely present both perspectives in the proposal, and explain that we're going to evaluate both (basically a data model based on steps, and possible practical constraints that would lead to points). I think one approach will be to collect use cases (workflows from raw data to analysis) from various people, to see what would meet most requests.

basille commented 5 years ago

@bart1

I agree that movement is a collection of steps and that it is important to consider it as such. But then steps are just as non-independent as the points are. Along those lines one could argue for one big LINESTRING containing all points in a track.

Except that one big LINESTRING does not allow for step characteristics. However, this LINESTRING approach is very much valid, and that will be the point of nesting (which I realize I did not detail at all in the current proposal; I'll try to work on that soon).

I feel to some extent that storing segments is a midway compromise that suffers from the disadvantages of both.

The idea of the step is that it is nevertheless a LINESTRING, so it would be rather straightforward to nest to a higher level (say a big LINESTRING per individual). In this sense, it is not a compromise, but the elementary unit in LINESTRING thinking, which can be scaled up. This is essentially a "mental" switch from point locations to steps/lines (but as said above, I would be happy to evaluate both from a practical perspective).

Each segment is connected to the next one also by properties like the location error.

When you have a chance, would you elaborate on this? I'm not sure I see what you mean.

I think here it is important to acknowledge that most analyses (which I think such a class should facilitate, otherwise it kind of becomes an academic exercise) are designed to deal with the locations.

Here, I would like to exclude non-movement analyses, such as MCP or kernels, which do not care about the sequential nature of locations. These are point analyses, and should not require a trajectory to start with.

That said, there is also a need to evaluate whether analyses deal with locations for practical reasons, given that it's what's available, or for algorithmic reasons. What I mean is that if the input is points, but the method actually builds steps, then it's inherently working on step data.

I agree to some extent with this argument and see the point of separating the data from different data streams. […] In that sense it might be worth exploring the tsibble package for time series tibbles, although I have not looked into the details.

I will look into tsibble, thanks! What is sure is that we will need (probably not in the first stage of the project though) ways to summarize other data at the scale of the step or trajectory; in other words, an easy way to deal with the temporal dimension of the trajectory.

For me such an sfTraj class should also facilitate cleaning up trajectories, which for many types of data collection involves marking or removing points from trajectories. I envision that this should be part of the class, instead of doing all of that beforehand. If this is not easily possible, one needs to implement a lot of separate data handling, whereas such a class would have most of it already implemented. A frequently used filter where a movement class is useful is when you want to filter locations that have a 180 degree turn angle in combination with a high speed or GPS location error.

I see that now, thanks. Yes, this should be made easy by such a class, by design. In particular, how to deal with turn angle is not straightforward to me — it is an attribute of a step, but not independently of other steps, which introduces dependence in the observations…

Another problem I see is that any manipulation of the locations has to be done with a lot of extra care to not invalidate the trajectory. Assuming one requirement is that the end point of the previous segment should be the same as the start point of the next, this needs to be continuously dealt with and validated, while with a point representation this is automatically accounted for and valid (the same would be the case if the track is one long LINESTRING).

See my answer to @edzer above: I don't think we should enforce steps to be connected — or we need to allow a way to not have them connected. As a consequence, we also need a way to reconnect them if we wish so.

This duplication of start and end points, for me, violates some of the basic ideas of how to store data, since some basic data is duplicated.

True, which is why it should be considered only if the benefits outweigh the costs. And it will have to be tested on very large datasets too (say >1M records).

Thanks a lot for the discussion, this is very informative and going in the right direction I think.