surfriderfoundationeurope / etl

ETL (Extract Transform Load) Data Management process
MIT License
2 stars 0 forks source link

Mise à jour des scripts BI pour inclure des filtres sur les campagnes "disabled" #29

Open charlesollion opened 1 year ago

charlesollion commented 1 year ago

We want to update bi.campaign_river to set the column disabled to true in the following cases:

cambouvy commented 1 year ago

First the cut-off point for the ratio trash_count/num_gps_points was investigated. The idea is that we need the point offering the best tradeoff between disabling faulty campaigns, and keeping the valid ones (that might just have a lot of trash on few GPS points).

This ratio trash_count/num_gps_points without outliers has the following distribution parameters:

image

It was discussed to choose 7.72 as a cut-off point, as this represents the 75th percentile (and would allow to disable all the outliers). However, the number of trash points in this upper quantile is 28 864, representing ≈75% of the trash points... Choosing the cut-off point to be 10 would disable ≈67% of the trash points.

@charlesollion any thoughts on this?

sabineallousurfrider commented 1 year ago

Update from teams conversation :

we notice that for some campaign distance = 0 in bi.campaign_river even if there are correct gps point and correct distance in bi.campaign (exemple campaign : 751ff2bd-2b2b-4f4f-bc07-7143e6a0ed2e )

the data in the_geom column is indeed different for bi.campaign and bi.campaign_river for the same id (there is only one dot in bi.campaign_river). What's the difference between these two tables?

EDIT from charles: we found that the river 20305 has incomplete referential, therefore the campaign GPS points are mapped to a single point in the river, hence the single point in campaign_river, and distance = 0)

sabineallousurfrider commented 1 year ago

I would like to add a brief remark: if we automatically disable campaigns with distance = 0, is there a risk of ghosting campaigns where it has only one observation, for exemple, of an accumulation zone (which is one of the project's goal) ?

cambouvy commented 1 year ago

Update: The data analysis done earlier was using the GPS points in the_geom, which represent the total amount of GPS points automatically retrieved (independent of the trash points), instead of taking the GPS points associated to trash points. The analysis was re-done using the latter and the cut-off point was updated. The 75th percentile of the ratio is ~6.5. Using this as a cut-off point would remove most outliers but it would also disable many campaigns and a big amount of trash points. Taking a cut-off point of 8 revolves in disabling ~41% of campaigns, containing all in all ~44% of the trash points. This seems like a lot but from visual analysis of the campaigns we saw a lot of outliers. Additionally there are some test campaigns containing a lot of trash points, thus making the number of trash points in the disabled campaigns seem excessive.

cambouvy commented 1 year ago

if we automatically disable campaigns with distance = 0, is there a risk of ghosting campaigns where it has only one observation, for exemple, of an accumulation zone

(As discussed in the Teams chat): This should be configured differently in future campaigns, so it would affect only the past ones. From what we observed most of the campaigns having distance = 0 would still be disabled using the current ratio. Adding num_different_locations <= 1 disables 169 additional campaigns, containing 0.009% of the total amount of trash points so it does not seem like we loose too much valuable information.