Closed lorenzznerol closed 3 years ago
As to the seemingly missing time in the first 10020 rows:
It turns out that at obs = 0, it is just midnight, and every traj starts at midnight. At obs = 1, it is 1 AM, and so on.
SELECT obs, traj, mpa, distance, land, lat, lon, temp, "time", z
FROM public.trajectories_stokes_subset_10000 where traj = 10020 order by obs limit 100;
That is why this does not need any data cleansing.
Aggregation was done by a short look at monthly level, see first exploratory analysis. and by generating monthly CSVs to start with.
To do: how to deal with
“The temperature at positions under land influence (land > 0) is wrong, because there is a temperature of 0 degrees Celsius assumed for each land point”
Todo: split each trajectory in x parts and take the mean of x called lon_mean and lat_mean. x can be 20, 10 or 5. Then group by lon_mean and lat_mean and take the count of mpa column. This aggregation is needed to get a faster DTW in #10. We could even think of rounding to one or two decimals and group by on that with count, so that we roughly see where the most trajectories pass. Of course, rounding is not a good idea for professional analysis, then we should distance based clustering like DBSCAN again, or just kmeans again on that. With these clusters, we would see which area had the most trajectories.
Also interesting: group each trajectory just by mpa and take the count of it, as well as the avg, max and min of the lon and lat. By this, we should see which traj has seen which mpa in its lifetime.
Also interesting: group the whole dataset by mpa and aggregate on the max, min and avg of lon and lat. By this, we see where mpa areas are.
Mean values of Lat and Long after group/filter the data in March by MPA. It's interesting that group 2 and 3 are located pretty close to each other.