Open charlesollion opened 1 year ago
First the cut-off point for the ratio trash_count/num_gps_points
was investigated. The idea is that we need the point offering the best tradeoff between disabling faulty campaigns, and keeping the valid ones (that might just have a lot of trash on few GPS points).
This ratio trash_count/num_gps_points
without outliers has the following distribution parameters:
It was discussed to choose 7.72 as a cut-off point, as this represents the 75th percentile (and would allow to disable all the outliers). However, the number of trash points in this upper quantile is 28 864, representing ≈75% of the trash points... Choosing the cut-off point to be 10 would disable ≈67% of the trash points.
@charlesollion any thoughts on this?
Update from teams conversation :
we notice that for some campaign distance = 0 in bi.campaign_river even if there are correct gps point and correct distance in bi.campaign (exemple campaign : 751ff2bd-2b2b-4f4f-bc07-7143e6a0ed2e )
the data in the_geom column is indeed different for bi.campaign and bi.campaign_river for the same id (there is only one dot in bi.campaign_river). What's the difference between these two tables?
EDIT from charles: we found that the river 20305 has incomplete referential, therefore the campaign GPS points are mapped to a single point in the river, hence the single point in campaign_river, and distance = 0)
I would like to add a brief remark: if we automatically disable campaigns with distance = 0, is there a risk of ghosting campaigns where it has only one observation, for exemple, of an accumulation zone (which is one of the project's goal) ?
Update:
The data analysis done earlier was using the GPS points in the_geom
, which represent the total amount of GPS points automatically retrieved (independent of the trash points), instead of taking the GPS points associated to trash points. The analysis was re-done using the latter and the cut-off point was updated. The 75th percentile of the ratio is ~6.5. Using this as a cut-off point would remove most outliers but it would also disable many campaigns and a big amount of trash points. Taking a cut-off point of 8 revolves in disabling ~41% of campaigns, containing all in all ~44% of the trash points. This seems like a lot but from visual analysis of the campaigns we saw a lot of outliers. Additionally there are some test campaigns containing a lot of trash points, thus making the number of trash points in the disabled campaigns seem excessive.
if we automatically disable campaigns with distance = 0, is there a risk of ghosting campaigns where it has only one observation, for exemple, of an accumulation zone
(As discussed in the Teams chat): This should be configured differently in future campaigns, so it would affect only the past ones. From what we observed most of the campaigns having distance = 0 would still be disabled using the current ratio. Adding num_different_locations <= 1
disables 169 additional campaigns, containing 0.009% of the total amount of trash points so it does not seem like we loose too much valuable information.
We want to update
bi.campaign_river
to set the columndisabled
to true in the following cases: