nielshintzen / vmstools

Automatically exported from code.google.com/p/vmstools

Suggestion of extra rule for splitAmongPings #15

Closed yreecht closed 1 year ago

yreecht commented 5 years ago

Hi,

I was recently linking some VMS and logbook (LB) data for a given fleet in order to calculate some CPUEs, and in this context I had a closer look at how the function splitAmongPings() behaves, in particular how it prioritizes the allocation of the catch. It looks like with level = "day" it:

1) allocates the catch among pings matching the day, ICES rectangle, trip ID and vessel;
2) allocates the remainder to pings matching the ICES rectangle, trip ID and vessel;
3) splits what cannot be matched on ICES rectangle among all pings of the trip.

(which is sequentially repeated without trip ID and vessel ID if conserve = TRUE).
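The three steps above can be sketched in base R on the first trip discussed below. This is only a simplified illustration of the cascade as I understand it, not the code of splitAmongPings(): the helper allocate_original() is hypothetical, the catch is split equally among matching pings, and the number of pings per day and rectangle (two) is made up. Only the dates, rectangles and logbook weights come from the example.

```r
# Toy pings for trip 319275689: two pings per day/rectangle seen in the VMS.
# The ping counts are an assumption; dates and rectangles are from the example.
pings <- data.frame(
  date = rep(c("17/09/2017", "18/09/2017", "19/09/2017",
               "19/09/2017", "20/09/2017"), each = 2),
  rect = rep(c("28E7", "28E9", "28E9", "29E9", "30E9"), each = 2),
  stringsAsFactors = FALSE
)

# Logbook records of the same trip (kg).
lb <- data.frame(
  date = c("18/09/2017", "19/09/2017", "19/09/2017", "20/09/2017"),
  rect = c("28E9", "28E9", "29E9", "29E9"),
  kg   = c(4860, 1188, 1152, 7560),
  stringsAsFactors = FALSE
)

# Hypothetical helper: allocate each logbook record to the first tier
# that has at least one matching ping, with an equal split among them.
allocate_original <- function(lb, pings) {
  pings$kg <- 0
  for (i in seq_len(nrow(lb))) {
    same_day  <- pings$date == lb$date[i]
    same_rect <- pings$rect == lb$rect[i]
    tiers <- list(same_day & same_rect,    # 1) day + rectangle
                  same_rect,               # 2) rectangle only
                  rep(TRUE, nrow(pings)))  # 3) whole trip
    for (hit in tiers) {
      if (any(hit)) {
        pings$kg[hit] <- pings$kg[hit] + lb$kg[i] / sum(hit)
        break
      }
    }
  }
  pings
}

res <- allocate_original(lb, pings)
aggregate(kg ~ date + rect, data = res, FUN = sum)
```

With these toy pings, the 7560 kg logged in 29E9 on 20/09 finds no ping at step 1 and lands at step 2 on the 29E9 pings of 19/09 (1152 + 7560 = 8712 kg), while the 30E9 pings of 20/09 receive nothing.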

This means that if there is a day and rectangle mismatch between the LB and VMS (after step 1), the catch is shared among pings of other days (in the case of a multi-day trip) in the same rectangle, or ultimately among all the pings of the trip if no rectangle matches. However, my intuition is that the ICES rectangle information in the LB may be less reliable than the day of the catch, in which case one may prefer to share the remaining catch of the day among all pings of that day.

For illustration, in the example below (merged aggregated eflalo and tacsatEflalo), because of a mismatch in the ICES rectangle (likely misspelled in the LB), the whole catch of 20/09 is allocated to the same rectangle on another day (19/09, the only day on which the vessel actually fished in rectangle 29E9 during this trip):

> LBsub <- aggregate(formula = LE_KG_TOT ~ FT_REF + LE_RECT + LE_CDAT,
+                    data = subset(eflalo, FT_REF == "319275689",
+                                  select = c("FT_REF", "LE_CDAT", "LE_RECT", "LE_KG_TOT")),
+                    FUN = sum, na.rm = TRUE)

> VMSsub <- aggregate(formula = LE_KG_TOT ~ FT_REF + LE_RECT + SI_DATE,
+                     data = subset(tacsatEflalo, FT_REF == "319275689",
+                                   select = c("FT_REF", "SI_DATE", "LE_RECT", "LE_KG_TOT")),
+                     FUN = sum, na.rm = TRUE)

> merge(x = LBsub, y = VMSsub, all = TRUE,
+       by.x = c("FT_REF", "LE_CDAT", "LE_RECT"),
+       by.y = c("FT_REF", "SI_DATE", "LE_RECT"),
+       suffixes = c(".LB", ".VMS"))
     FT_REF    LE_CDAT LE_RECT LE_KG_TOT.LB LE_KG_TOT.VMS
1 319275689 17/09/2017    28E7           NA             0
2 319275689 18/09/2017    28E9         4860          4860
3 319275689 19/09/2017    28E9         1188          1188
4 319275689 19/09/2017    29E9         1152          8712  ## <- VMS: same as misspelled rectangle but on another day => where all the unmatched catch ends up.
5 319275689 20/09/2017    29E9         7560            NA  ## <- LB with misspelled rect.
6 319275689 20/09/2017    30E9           NA             0  ## <- VMS: where it should end up (same day)... zero catch instead

In another example, one event recorded in what appears to be a misspelled rectangle (27E9, on 06/10; according to the VMS, the vessel did not fish in it at all during the trip) was allocated among all pings of the trip rather than being kept on that day (1296 + 2880 = 4176 kg in the LB, against only 3148 kg allocated to the VMS for that day):

     FT_REF    LE_CDAT LE_RECT LE_KG_TOT.LB LE_KG_TOT.VMS
1 319816571 03/10/2017    28E9          504      526.3448
2 319816571 04/10/2017    28E9         4716     4984.1379
3 319816571 05/10/2017    28E9         4536     4804.1379
4 319816571 06/10/2017    27E9         1296            NA  ## <- LB with misspelled rect. Catch split among all pings of the trip
5 319816571 06/10/2017    28E9         2880     3148.1379  ## <- VMS: catch does not add up to 4176 kg.
6 319816571 07/10/2017    28E9         3852     4120.1379  ## <- VMS: all other days end up with more catch allocated.
7 319816571 08/10/2017    28E9         4248     4449.1034

Whether this behaviour is generally better than conserving the catch per day is open to debate, but I think it is not in my particular situation. So I have written a function splitAmongPings3, based on the original one, with an extra step (between the original first and second steps) that allocates the catch of a day that could not be matched on rectangle to the pings of that same day. The matching on rectangle only is still done afterwards, for those days in the LB with no matching VMS. I have limited this extra step to entries with at least a vessel ID (when conserve = TRUE), as I reckon that if no vessel information can be matched, the ICES rectangle is still the best spatial information we have (several vessels may be all over the place on the same day).

My two former examples now become respectively:

     FT_REF    LE_CDAT LE_RECT LE_KG_TOT.LB LE_KG_TOT.VMS
1 319275689 17/09/2017    28E7           NA             0
2 319275689 18/09/2017    28E9         4860          4860
3 319275689 19/09/2017    28E9         1188          1188
4 319275689 19/09/2017    29E9         1152          1152
5 319275689 20/09/2017    29E9         7560            NA  ## <- LB: misspelled rect.
6 319275689 20/09/2017    30E9           NA          7560  ## <- VMS: catch conserved for the day

and

     FT_REF    LE_CDAT LE_RECT LE_KG_TOT.LB LE_KG_TOT.VMS
1 319816571 03/10/2017    28E9          504           504
2 319816571 04/10/2017    28E9         4716          4716
3 319816571 05/10/2017    28E9         4536          4536
4 319816571 06/10/2017    27E9         1296            NA  ## <- LB: misspelled rect.
5 319816571 06/10/2017    28E9         2880          4176  ## <- VMS: 4176 = 2880 + 1296
6 319816571 07/10/2017    28E9         3852          3852  ## <- VMS: no side effect on other days
7 319816571 08/10/2017    28E9         4248          4248

which seems a bit tidier and conserves the catch for the day.
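For readers who want to see the logic in isolation, here is a minimal base R sketch of the extra step on the second trip. It is not the code of splitAmongPings3: the helper allocate() and its priority_day argument are hypothetical, the catch is split equally among matching pings, and the ping counts per day (1, 12, 12, 12, 12, 9) are an assumption chosen only because an equal split over them reproduces the figures of the original output.

```r
days <- sprintf("%02d/10/2017", 3:8)

# Assumed ping counts per day (not taken from the data); all pings in 28E9.
pings <- data.frame(date = rep(days, times = c(1, 12, 12, 12, 12, 9)),
                    rect = "28E9", stringsAsFactors = FALSE)

# Logbook records of trip 319816571 (kg), including the 27E9 entry on 06/10.
lb <- data.frame(
  date = c(days[1:3], days[4], days[4], days[5:6]),
  rect = c("28E9", "28E9", "28E9", "27E9", "28E9", "28E9", "28E9"),
  kg   = c(504, 4716, 4536, 1296, 2880, 3852, 4248),
  stringsAsFactors = FALSE
)

# Hypothetical helper: each record goes to the first tier with matching pings.
allocate <- function(lb, pings, priority_day = FALSE) {
  pings$kg <- 0
  for (i in seq_len(nrow(lb))) {
    same_day  <- pings$date == lb$date[i]
    same_rect <- pings$rect == lb$rect[i]
    tiers <- list(same_day & same_rect)                 # 1) day + rectangle
    if (priority_day) {
      tiers <- c(tiers, list(same_day))                 # extra step: day only
    }
    tiers <- c(tiers, list(same_rect),                  # 2) rectangle only
               list(rep(TRUE, nrow(pings))))            # 3) whole trip
    for (hit in tiers) {
      if (any(hit)) {
        pings$kg[hit] <- pings$kg[hit] + lb$kg[i] / sum(hit)
        break
      }
    }
  }
  pings
}

per_day <- function(x) tapply(x$kg, x$date, sum)
orig <- per_day(allocate(lb, pings, priority_day = FALSE))
prio <- per_day(allocate(lb, pings, priority_day = TRUE))
round(cbind(LB = per_day(lb), original = orig, priorityDay = prio), 1)
```

In this sketch the daily totals with priority_day = TRUE equal the logbook totals for every day, whereas without it the 1296 kg of 06/10 are spread over the whole trip. The trip total is the same in both cases.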

The function is used like the original and simply has an extra optional parameter, priorityDay. The default, priorityDay = FALSE, behaves like the original function, and priorityDay = TRUE only has an effect if level = "day". The overall catch allocated to the VMS data is consistent with the original function when conserve = TRUE (I think it should be regardless); only the way it is allocated among pings differs.

Do you think this would be a more sensible approach for allocating the catch among pings for the general case? Or that the user should be given the choice? If you are interested, I can send a pull request to incorporate the code in the package.

Best wishes, Yves

nielshintzen commented 1 year ago

Upgraded with new functionality