Function to remove duplicates

dfalster commented 5 years ago

@ehwenk has been looking for duplicates and suggested a more systematic check. Seems we have ~40k!!

To get this number I pasted together trait_name, species_name, and value, and then looked for matches.

x$traits %>%
   mutate(check = paste(trait_name, species_name, value), dup = duplicated(check)) %>%
   arrange(check) %>%
   filter(check %in% .$check[.$dup]) -> z
z
# A tibble: 80,632 x 12
   dataset_id species_name           site_name observation_id trait_name          value unit  value_type  replicates original_name          check                                           dup
   <chr>      <chr>                  <chr>     <chr>          <chr>               <chr> <chr> <fct>       <chr>      <chr>                  <chr>                                           <lgl>
 1 ANBG_2018  Acacia cupularis       NA        ANBG_2018_0048 dispersal_appendage aril  NA    expert_mean NA         Acacia cupularis       dispersal_appendage Acacia cupularis aril       FALSE
 2 Manea_2011 Acacia cupularis       NA        Manea_2011_021 dispersal_appendage aril  NA    expert_mean NA         Acacia cupularis       dispersal_appendage Acacia cupularis aril       TRUE
 3 ANBG_2018  Acacia cyclops         NA        ANBG_2018_0049 dispersal_appendage aril  NA    expert_mean NA         Acacia cyclops         dispersal_appendage Acacia cyclops aril         FALSE
 4 Manea_2011 Acacia cyclops         NA        Manea_2011_022 dispersal_appendage aril  NA    expert_mean NA         Acacia cyclops         dispersal_appendage Acacia cyclops aril         TRUE
 5 ANBG_2018  Acacia jibberdingensis NA        ANBG_2018_0084 dispersal_appendage aril  NA    expert_mean NA         Acacia jibberdingensis dispersal_appendage Acacia jibberdingensis aril FALSE
 6 Manea_2011 Acacia jibberdingensis NA        Manea_2011_045 dispersal_appendage aril  NA    expert_mean NA         Acacia jibberdingensis dispersal_appendage Acacia jibberdingensis aril TRUE
 7 ANBG_2018  Acacia leiophylla      NA        ANBG_2018_0087 dispersal_appendage aril  NA    expert_mean NA         Acacia leiophylla      dispersal_appendage Acacia leiophylla aril      FALSE
 8 Manea_2011 Acacia leiophylla      NA        Manea_2011_049 dispersal_appendage aril  NA    expert_mean NA         Acacia leiophylla      dispersal_appendage Acacia leiophylla aril      TRUE
 9 ANBG_2018  Acacia mangium         NA        ANBG_2018_0098 dispersal_appendage aril  NA    expert_mean NA         Acacia mangium         dispersal_appendage Acacia mangium aril         FALSE
10 Manea_2011 Acacia mangium         NA        Manea_2011_055 dispersal_appendage aril  NA    expert_mean NA         Acacia mangium         dispersal_appendage Acacia mangium aril         TRUE
# ... with 80,622 more rows

In total, it looks like there are roughly 45,000 duplicate records:

> sum(z$dup)
[1] 44895

Here are the traits with most overlap:

> table(z$trait_name) %>% sort(decreasing=TRUE)

                          plant_growth_form                                life_history                                plant_height                                  leaf_width
                                      16219                                       11259                                       11007                                        8650
                                leaf_length                           leaf_compoundness                              leaf_phenology                                 seed_length
                                       7816                                        7090                                        2916                                        2850
                                 seed_width                              flowering_time                                 leaf_margin                                wood_density
                                       2156                                        1760                                        1697                                         911
                              flower_colour                          specific_leaf_area                          dispersal_syndrome                                   seed_mass
                                        889                                         708                                         653                                         426
                            nitrogen_fixing                                seed_breadth                                   leaf_area                      photosynthetic_pathway
                                        415                                         408                                         329                                         317
                                  woodiness                                  leaf_shape                               leaf_dry_mass                              regen_strategy
                                        316                                         188                                         147                                         146
                           leaf_arrangement                             leaf_P_per_area                         leaf_N_per_dry_mass                                   leaf_type
                                        143                                         127                                         119                                         108
                                 seed_shape                                  fruit_type                         leaf_P_per_dry_mass                             leaf_N_per_area
                                         94                                          92                                          92                                          48
                        fruit_type_function                         dispersal_appendage                         leaf_K_per_dry_mass                                growth_habit
                                         43                                          42                                          38                                          37
                            leaf_area_ratio                     leaf_dry_matter_content                             leaf_K_per_area                          leaf_mass_fraction
                                         36                                          36                                          36                                          36
                leaf_water_content_per_area                photosynthetic_rate_per_area            photosynthetic_rate_per_dry_mass leaf_photosynthetic_nitrogen_use_efficiency
                                         36                                          30                                          28                                          24
   leaf_photosynthetic_water_use_efficiency               stomatal_conductance_per_area                           fruit_type_botany                              leaf_thickness
                                         24                                          24                                          23                                           8
                              fruiting_time                               fire_response                         leaf_C_per_dry_mass                           seed_mass_reserve
                                          7                                           4                                           4                                           4
                                   serotiny                     leaf_cell_wall_fraction                            leaf_cell_wall_N                   leaf_cell_wall_N_fraction
                                          4                                           2                                           2                                           2
                              leaf_delta15N                           root_wood_density                               storage_organ
                                          2                                           2                                           2

Here are the studies with some overlap with another study (or within themselves):

> table(z$dataset_id) %>% sort(decreasing=TRUE)

     RBGSYD_0000      Barlow_1981     Kooyman_2011         WAH_1998     Wheeler_2002         SAH_2014         NTH_2014      Maslin_2012       Brock_1993     Fonseca_2000    AusGrass_2014
           16025            14016            10180             8835             4883             2827             2728             1706             1548             1421             1243
       TMAG_2009      Hughes_1992      Chen_2015_1    Chinnock_2007    Leishman_1992    Richards_2008     Schmidt_2003        CPRR_2002        Ilic_2000    Leishman_1995      Jurado_1993
             969              942              931              912              801              707              679              617              600              554              462
   Morgan_2011_1     Catford_2014    Osbourne_2014       NHNSW_2016    Blackman_2014   Tomlinson_2013       Lawes_2012      Craven_1987     Gleason_2012     Clayton_2006        Rice_1991
             456              413              399              389              349              289              276              275              240              228              191
     Wright_2008        Lord_1997       Pekin_2011   Soliveres_2012       Prior_2003     Schulze_0000     Westoby_2004        Venn_2011    Thompson_2001      Butler_2011     Angevin_2010
             180              171              166              136              135              132              131              130              117              115              105
   Chandler_2002        ANBG_2018      Morgan_2005         Rye_2015       Cross_2011        Lunt_2012 Gallagher_2011_3    Leishman_2011     Toelken_1996     Edwards_2000   Niinemets_2009
             105              102               99               79               72               68               65               65               64               63               61
      Eamus_1999      Specht_1958        Venn_2014         Rye_2006        RBGK_2014         Rye_2002      Chen_2015_2      Wright_2000      Duncan_1998         Kew_2010     Peeters_2002
              52               49               49               47               45               42               41               39               37               37               36
      Rye_2009_2        Bean_1997      Wright_2002      Wright_2004    Morgan_2011_2  Cunningham_1999     Trudgen_2014       Crisp_1984      Jordan_2007       Rye_2013_1         Tng_2013
              36               32               32               32               30               29               28               27               27               27               26
  Falster_2005_2       Eamus_1998      Lamont_2002      Morgan_2014        Knox_2011     Islam_1999_2      Laxton_2005       Scott_2010       BRAIN_2007       Rye_2013_2     Westoby_2016
              25               24               24               23               22               19               19               17               16               14               14
     Wright_2006     Westoby_2003       Bolza_1975      Hyland_2003       Manea_2011      RBGSYD_2014       Rye_2009_1     Trudgen_2005      Craven_2010       ICRAF_2009    Harrison_2009
              13               12               11               11               10               10               10               10                8                8                7
     Henery_2001        Vesk_2004    Richards_2003      Wright_2001    Keighery_2004      Wilson_2008 Gallagher_2011_2        Hong_1999      Hughes_2005    Leishman_1993      Cooper_2004
               7                7                6                5                4                4                3                3                3                3                2
        Kew_2012   Falster_2005_1    Prospect_2009        Seng_1951       Wells_2009
               2                1                1                1                1

dfalster commented 5 years ago

Possible solutions are:

1. Try to de-duplicate study by study.
2. Add a function that allows for some level of de-duplication on the finished product. By default, keep the record from the oldest source.

@ehwenk and I prefer the 2nd option because:

- it will be an ongoing solution
- all data from each study is included by default
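
A minimal sketch of what the second option could look like, in base R. The function name `remove_duplicates` is hypothetical and not part of austraits.build. It assumes that the first four-digit run in `dataset_id` is the publication year, which will not hold for placeholder ids such as `RBGSYD_0000`, so a real implementation would probably need the year from the study metadata instead.

```r
# Hypothetical helper, not an existing austraits.build function.
remove_duplicates <- function(traits) {
  # Same matching key as in the check above: trait, species and value.
  key <- paste(traits$trait_name, traits$species_name, traits$value)

  # ASSUMPTION: the first four-digit run after an underscore in dataset_id
  # is the publication year (e.g. "Manea_2011" -> 2011, "Chen_2015_1" -> 2015).
  year <- as.integer(sub("^.*?_(\\d{4}).*$", "\\1", traits$dataset_id, perl = TRUE))

  # Sort by key, then by year, so the oldest source comes first within each key.
  ord <- order(key, year)
  sorted <- traits[ord, , drop = FALSE]

  # Keep only the first (oldest) record for each key.
  sorted[!duplicated(key[ord]), , drop = FALSE]
}

# Toy data shaped like x$traits (only the columns the sketch needs).
traits <- data.frame(
  dataset_id   = c("ANBG_2018", "Manea_2011", "Wright_2004"),
  species_name = c("Acacia cupularis", "Acacia cupularis", "Acacia mangium"),
  trait_name   = c("dispersal_appendage", "dispersal_appendage", "leaf_area"),
  value        = c("aril", "aril", "12"),
  stringsAsFactors = FALSE
)

deduped <- remove_duplicates(traits)
```

With the toy data, the `Manea_2011` record is kept over the matching `ANBG_2018` record because it is older. Note that the result is re-ordered by the matching key, not left in the original row order.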

rachaelgallagher commented 5 years ago

OK - I agree about the de-duplication step as a good option. As far as I can see, a lot of these are happening because they are categorical (e.g. growth form, life history) and this is a fairly fixed trait across the published floras of Australia. I am not sure that these really qualify as duplicates in the true sense; I think they are replicates across sources. The 'true' duplicates are much more likely to be the SLA measures, and these will really benefit from a de-duplication step.
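
One rough way to separate the two cases is to count matches by whether the value parses as a number. This is only a sketch: the function name `summarise_duplicates` is hypothetical, and the rule "numeric value means measurement, anything else means categorical" is an assumption that will misclassify some traits (numeric values copied from floras, for example).

```r
# Hypothetical helper, not an existing austraits.build function.
summarise_duplicates <- function(traits) {
  # Same matching key as in the check above.
  key <- paste(traits$trait_name, traits$species_name, traits$value)
  dup <- duplicated(key)

  # ASSUMPTION: a value that parses as a number is a measurement;
  # everything else is treated as a categorical trait value.
  is_number <- !is.na(suppressWarnings(as.numeric(traits$value)))
  value_kind <- ifelse(is_number, "numeric", "categorical")

  # Count flagged records in each group.
  table(value_kind = value_kind[dup])
}

# Toy data: one categorical match, one numeric match, one unique record.
traits <- data.frame(
  species_name = c("Acacia cyclops", "Acacia cyclops",
                   "Acacia mangium", "Acacia mangium", "Acacia mangium"),
  trait_name   = c("plant_growth_form", "plant_growth_form",
                   "specific_leaf_area", "specific_leaf_area", "leaf_area"),
  value        = c("shrub", "shrub", "8.5", "8.5", "12"),
  stringsAsFactors = FALSE
)

counts <- summarise_duplicates(traits)
```

The numeric group is the one more likely to contain the same measurement entered twice, so it is the natural place to start checking by hand.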

dfalster commented 5 years ago

Ok. I suspect with the floras there's a certain amount of repeating going on. But in any case, agreed on SLA. Some quick stats suggest maybe 381 / 4273 records are duplicates

> x$traits %>%
+   mutate(check = paste(trait_name, species_name, value), dup = duplicated(check)) %>%
+   arrange(check) %>%
+   filter(check %in% .$check[.$dup]) -> z
>
> z %>% filter(trait_name == "specific_leaf_area") %>% pull(dup) %>% sum()
[1] 381
>
> austraits$traits %>% filter(trait_name == "specific_leaf_area") %>% nrow()
[1] 4273

rachaelgallagher commented 5 years ago

10% is not too bad, especially if we can implement an easy way to flag these. The issue will then be how users decide to attribute the observations, I guess.
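
A minimal sketch of flagging, as opposed to removing, so that users can decide for themselves how to attribute observations. The function name `flag_duplicates` and the added columns `is_duplicate` and `duplicate_of` are hypothetical. Here "first" simply means first in row order; choosing the oldest source instead would need a reliable year for each study.

```r
# Hypothetical helper, not an existing austraits.build function.
flag_duplicates <- function(traits) {
  # Same matching key as in the check above.
  key <- paste(traits$trait_name, traits$species_name, traits$value)

  # For every row, the index of the first row that has the same key.
  first <- match(key, key)

  # A row is flagged when an earlier row already has the same key.
  traits$is_duplicate <- first != seq_along(key)

  # Point each flagged row back at the observation it matches.
  traits$duplicate_of <- ifelse(traits$is_duplicate,
                                traits$observation_id[first],
                                NA_character_)
  traits
}

# Toy data shaped like x$traits (only the columns the sketch needs).
traits <- data.frame(
  observation_id = c("ANBG_2018_0048", "Manea_2011_021", "ANBG_2018_0098"),
  species_name   = c("Acacia cupularis", "Acacia cupularis", "Acacia mangium"),
  trait_name     = rep("dispersal_appendage", 3),
  value          = rep("aril", 3),
  stringsAsFactors = FALSE
)

flagged <- flag_duplicates(traits)
```

Because no rows are dropped, a user can filter on `is_duplicate` or follow `duplicate_of` back to the matching observation and cite whichever source they prefer.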

traitecoevo / austraits.build

Function to remove duplicates #153