Closed teixeirak closed 3 years ago
Good!
Mostly as a note to myself, some useful code was in `allo_order()`. I removed it recently via this commit. I can easily revert it and build on top of that.
Agreed. I think we'll want that functionality, but the user should be able to enter the command for how to prioritize equations just once.
@teixeirak,
... For site, put something like "Any" or "NA". For species, fill in the specificity for which the equation is designed (e.g., "Picea spp.").
It seems that generic equations are already associated with sites and species. It seems like a contradiction (how can an equation be generic and at the same time site- and species-specific?), but it is what I need to match each species and site with an equation, so I won't complain. I just note this as something that I need to understand better.
library(tidyverse)
#> Warning: package 'purrr' was built under R version 3.5.3
allodb::master_tidy() %>%
select(
site,
equation_group,
species,
equation_id,
dependent_variable_biomass_component
) %>%
filter(equation_group == "Generic") %>%
unique()
#> Joining `equations` and `sitespecies` by 'equation_id'; then `sites_info` by 'site'.
#> # A tibble: 232 x 5
#> site equation_group species equation_id dependent_variable_bio~
#> <chr> <chr> <chr> <chr> <chr>
#> 1 lilly d~ Generic Asimina tri~ ae65ed Total aboveground biom~
#> 2 lilly d~ Generic Carpinus ca~ ae65ed Total aboveground biom~
#> 3 lilly d~ Generic Celtis occi~ ae65ed Total aboveground biom~
#> 4 lilly d~ Generic Paulownia t~ ae65ed Total aboveground biom~
#> 5 lilly d~ Generic Rhus typhina ae65ed Total aboveground biom~
#> 6 scbi Generic Ailanthus a~ ae65ed Total aboveground biom~
#> 7 scbi Generic Asimina tri~ ae65ed Total aboveground biom~
#> 8 scbi Generic Berberis th~ ae65ed Total aboveground biom~
#> 9 scbi Generic Carpinus ca~ ae65ed Total aboveground biom~
#> 10 scbi Generic Celtis occi~ ae65ed Total aboveground biom~
#> # ... with 222 more rows
Created on 2019-03-21 by the reprex package (v0.2.1)
Here's the plan that we've worked out:
Erika has already reviewed the current species lists for temperate sites and assigned the best available equations, which in some cases are generic. There is value to keeping all of these links to generic equations in the sitespecies table because it indicates (1) that the species is present at the site and (2) the species has been reviewed, and it was determined that the generic equation was the best available option. Thus, what we currently have stays as is.
Generic equations can also be applied to any site within a specified region (e.g., temperate North America), including for stems at ForestGEO sites where the DBH is greater than the upper DBH limit of the expert-selected equation. For these, the site species table will contain records with: site = "any temperate NA" (or such) species = e.g., "Quercus sp." (any Quercus)
Note that for species that are specifically assigned a generic equation, the record in the sitespecies table is superfluous from a coding perspective. However, it's important data from the perspective of someone who may want to pull up the list of species and associated allometries for a given site, and to disambiguate between species that have never been reviewed (e.g., if a new species shows up at a site) and those that have been reviewed but found to have no specifically appropriate allometry.
Does this make sense? If it would be useful, I could add an example to the sitespecies table.
Just to make sure we are on the same page: I assume that the information that distinguishes "Expert" from "Generic" equations is already encoded in the column `sitespecies$equation_group`.
library(tidyverse)
#> Warning: package 'purrr' was built under R version 3.5.3
library(allodb)
sitespecies %>%
filter(equation_group == "Expert") %>%
select(equation_group, everything())
#> # A tibble: 540 x 11
#> equation_group site family species species_code life_form equation_id
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Expert Lill~ Sapin~ Acer r~ 316 Tree 7c72ed
#> 2 Expert Lill~ Sapin~ Acer r~ 316 Tree 2060ea
#> 3 Expert Lill~ Sapin~ Acer s~ 318 Tree a4d879
#> 4 Expert Lill~ Rosac~ Amelan~ 356 Tree c59e03
#> 5 Expert Lill~ Rosac~ Amelan~ 356 Tree 96c0af
#> 6 Expert Lill~ Rosac~ Amelan~ 356 Tree 529234
#> 7 Expert Lill~ Jugla~ Carya ~ 409 Tree 9c4cc9
#> 8 Expert Lill~ Jugla~ Carya ~ 402 Tree 9c4cc9
#> 9 Expert Lill~ Jugla~ Carya ~ 403 Tree 9c4cc9
#> 10 Expert Lill~ Jugla~ Carya ~ 407 Tree 9c4cc9
#> # ... with 530 more rows, and 4 more variables: equation_taxa <chr>,
#> # notes_on_species <chr>, wsg_id <chr>, wsg_specificity <chr>
sitespecies %>%
filter(equation_group == "Generic") %>%
select(equation_group, everything())
#> # A tibble: 232 x 11
#> equation_group site family species species_code life_form equation_id
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Generic Lill~ Annon~ Asimin~ 367 Tree ae65ed
#> 2 Generic Lill~ Betul~ Carpin~ 391 Tree ae65ed
#> 3 Generic Lill~ Jugla~ Carya ~ <NA> Tree 1c1ac8
#> 4 Generic Lill~ Canna~ Celtis~ 462 Tree ae65ed
#> 5 Generic Lill~ Fabac~ Cercis~ 471 Tree 7f7777
#> 6 Generic Lill~ Rosac~ Cratae~ 500 Shrub f08fff
#> 7 Generic Lill~ Laura~ Linder~ 609 Shrub f08fff
#> 8 Generic Lill~ Paulo~ Paulow~ 712 Tree ae65ed
#> 9 Generic Lill~ Rosac~ Prunus~ 762 Tree f08fff
#> 10 Generic Lill~ Anaca~ Rhus t~ 899 Shrub ae65ed
#> # ... with 222 more rows, and 4 more variables: equation_taxa <chr>,
#> # notes_on_species <chr>, wsg_id <chr>, wsg_specificity <chr>
.
Erika has already reviewed the current species ...
Generic equations can also be applied to any site within a specified region (e.g., temperate North America), including for stems at ForestGEO sites where the DBH is greater than the upper DBH limit of the expert-selected equation.
Just noticing that I don't yet see a column encoding region.
library(tidyverse)
#> Warning: package 'purrr' was built under R version 3.5.3
library(allodb)
master() %>%
select(matches("region"))
#> Joining `equations` and `sitespecies` by 'equation_id'; then `sites_info` by 'site'.
#> # A tibble: 769 x 0
.
For these, the site species table will contain records with: site = "any temperate NA" (or such) species = e.g., "Quercus sp." (any Quercus)
Okay, I'll have to work around mismatches. Remember that this is how the code allocates an equation from allodb to each `dbh` value in the user data: the user's data is matched exactly to allodb data via the keys `species`, `site`, and `equation_id`. I can massage your tables, but just be aware that the more mismatches there are, the more complex the logic becomes (and the greater the chance of bugs). For example:
I can convert `site = "any"` in allodb on the fly to become `site = <the site in the census data>`; that way, say, `site = "scbi"` in allodb will match `site = "scbi"` in the user's data.
"Genus species" in the user's data won't match "Genus sp." in allodb. I can transform the string from allodb on the fly to match "genus" in allodb with "genus" in the user's data.
If on top of this you expect to use different kinds of generic equations, say "any-temperate" and "any-tropical", then the code needs to somehow know which species are temperate and which are tropical. That could take quite some time to do.
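A minimal sketch of the first two workarounds, assuming made-up rows and equation ids (`aaa111`, `bbb222`) and hypothetical helper columns (`genus`, `is_genus_level`) that are not in allodb. It is only meant to show the shape of the on-the-fly conversion, not how fgeo.biomass actually does it:

```r
library(dplyr)

# Made-up rows standing in for allodb's sitespecies table.
allodb_rows <- tibble(
  site        = c("scbi", "any temperate NA"),
  species     = c("Acer rubrum", "Quercus sp."),
  equation_id = c("aaa111", "bbb222")
)

# Made-up census data from a single site.
census <- tibble(
  rowid   = 1:2,
  site    = "scbi",
  species = c("Acer rubrum", "Quercus alba")
)

current_site <- unique(census$site)

allodb_ready <- allodb_rows %>%
  mutate(
    # Replace the "any ..." placeholder with the site of the census data.
    site = if_else(grepl("^any", site), current_site, site),
    # Keep the lowercase genus so "Quercus sp." can match "Quercus alba".
    genus = tolower(sub(" .*$", "", species)),
    # Flag rows that stand for a whole genus ("sp", "sp.", "spp", "spp.").
    is_genus_level = grepl(" spp?\\.?$", species)
  )

# Exact match on site and species.
by_species <- inner_join(
  census,
  filter(allodb_ready, !is_genus_level),
  by = c("site", "species")
)

# Genus-level match on site and genus.
by_genus <- census %>%
  mutate(genus = tolower(sub(" .*$", "", species))) %>%
  inner_join(
    allodb_ready %>%
      filter(is_genus_level) %>%
      select(site, genus, equation_id),
    by = c("site", "genus")
  )
```

Each extra kind of mismatch adds another branch like `by_genus`, which is the growth in complexity described above.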
Note that for species that are specifically assigned a generic equation, the record in the sitespecies table is superfluous from a coding perspective.
Can you say this again? I don't understand. There seem to be two kinds of generic equations. If this is the case, consider whether you can design the table in a way that is straightforward to code. The more ambiguity there is in a single variable (one column of a table in allodb encoding more than one thing), the harder it is to write readable and reliable code.
However, its important data from the perspective of someone who may want to pull up the list of species and associated allometries for a given site, and to disambiguate between species that have never been reviewed (e.g., if a new species shows up at a site) and those that have been reviewed but found to have no specifically appropriate allometry.
How about you encode whether an equation id should not be used by the code? All I need is a 1-column table of the equations to skip.
equations_to_skip
"abc123"
"opq321"
...
.
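As a sketch of how such a one-column table could be used, assuming a hypothetical `matched` table of census rows already paired with equation ids (all ids here are placeholders):

```r
library(dplyr)

# The one-column table of equations the code should skip.
equations_to_skip <- tibble(equation_id = c("abc123", "opq321"))

# Hypothetical result of matching census rows to equations.
matched <- tibble(
  rowid       = 1:3,
  equation_id = c("abc123", "zzz999", "opq321")
)

# Keep only the rows whose equation is not listed in equations_to_skip.
kept <- anti_join(matched, equations_to_skip, by = "equation_id")
```

The allodb tables stay untouched; the exclusion happens only in the calculation.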
Does this make sense? If it would be useful, I could add an example to the sitespecies table.
Thanks, an example will certainly help. Best is a minimal, fake example that captures the essence of what you want to convey.
I'll let this conversation develop a bit to better understand the next actions. For now, I'll drop the generic equations that shouldn't get mixed with expert equations (see https://github.com/forestgeo/fgeo.biomass/issues/28), which should immediately improve the accuracy of the `biomass` estimates we get.
Please don't drop anything we currently have; it's all wanted (i.e., Erika has determined those to be the best options). I'll clarify more when I get a chance.
By drop I mean exclude from the calculation. I won't touch anything in allodb.
Right now, the `biomass` values are incorrectly summing biomass that comes from expert AND from generic equations. Only one of them should be used. And we don't yet support generic equations, so I'll exclude them until we do, meaning that the results we get in the meantime are more correct.
# Now (incorrect)
rowid site species dbh equation equation_group biomass
1 "scbi" "Aaa aaa" 10 dbh * 10 "Generic" 100
1 "scbi" "Aaa aaa" 10 dbh * 10 "expert" 100
---
biomass result = 200
# Soon (correct)
rowid site species dbh equation equation_group biomass
1 "scbi" "Aaa aaa" 10 dbh * 10 "expert" 100
---
biomass result = 100
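The same fake example as a small dplyr sketch. The `matched` table and its values are invented to mirror the rows above; the real output of fgeo.biomass has more columns:

```r
library(dplyr)

# One tree (rowid 1) matched to both a generic and an expert equation.
matched <- tibble(
  rowid          = c(1L, 1L),
  site           = "scbi",
  species        = "Aaa aaa",
  dbh            = 10,
  equation_group = c("Generic", "Expert"),
  biomass        = c(100, 100)
)

# Now (incorrect): both rows are summed into a single biomass per rowid.
now <- matched %>%
  group_by(rowid) %>%
  summarise(biomass = sum(biomass))

# Soon (correct): generic rows are excluded before summing.
soon <- matched %>%
  filter(equation_group != "Generic") %>%
  group_by(rowid) %>%
  summarise(biomass = sum(biomass))
```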
No, please don't exclude. The "Generic" equations are still expert-selected (i.e., identified by Erika as the best available). Do you have any real examples where a "Generic" and "Expert" equation are given for the same species at the same site, and at the same DBH? (I suppose there are some where max dbh of a small tree equation overlaps with min dbh of a large tree equation. Any others?) Perhaps rename these "generic" and "specific"? (But wait to see what Erika thinks.)
Here's an example of what we want (using a real example):
site | family | species | equation ID | equation_group | max DBH |
---|---|---|---|---|---|
scbi | Aceraceae | Acer rubrum | 7c72ed | Expert / Specific | 55 |
scbi | Aceraceae | Acer negundo | d6be5c | Generic | 66 |
any.temperate.NA | Aceraceae | any | d6be5c | Generic | 66 |
Here's how we want it to work in several cases:
1- 30 cm Acer rubrum at SCBI: use equation 7c72ed (row 1).
2- 30 cm Acer negundo at SCBI: use equation d6be5c (row 2), with knowledge that expert review has determined this to be the best available.
3- 60 cm Acer rubrum at SCBI: use equation d6be5c (row 2), because there's no record of an appropriate equation for Acer rubrum >55 cm at SCBI (but because there's a record for Acer rubrum at SCBI, we know it has been expert-reviewed).
4- 30 cm Acer newspecies (a species new to the plot in the 2023 census) or Acer sp. (unidentified Acer) at SCBI: use equation d6be5c (row 3), because there are no records for the species in the sitespecies table.
5- 30 cm Acer rubrum at [hypothetical future ForestGEO site in NA]: use equation d6be5c (row 3), because there are no records for the site in the sitespecies table.
(6- 70 cm Acer at any site: TBD. It's dangerous to extrapolate beyond the limit of an equation, but in some cases we'll have to.)
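A rough sketch of the lookup these cases imply, using the three rows from the table above. `pick_equation()` and the column name `dbh_max_cm` are hypothetical, and the two-step rule (site and species first, then the regional family row) is my reading of the cases rather than agreed behaviour. Case 3 reaches d6be5c through the regional row here, and case 6 simply returns `NA` because it is still TBD:

```r
library(dplyr)

sitespecies_example <- tribble(
  ~site,              ~family,     ~species,       ~equation_id, ~dbh_max_cm,
  "scbi",             "Aceraceae", "Acer rubrum",  "7c72ed",     55,
  "scbi",             "Aceraceae", "Acer negundo", "d6be5c",     66,
  "any.temperate.NA", "Aceraceae", "any",          "d6be5c",     66
)

pick_equation <- function(data, target_site, target_family,
                          target_species, dbh_cm) {
  # Only consider equations whose upper DBH limit covers this stem.
  in_range <- filter(data, dbh_max_cm >= dbh_cm)

  # First choice: a record for this exact site and species.
  specific <- filter(in_range, site == target_site, species == target_species)
  if (nrow(specific) > 0) {
    return(specific$equation_id[[1]])
  }

  # Fallback: the regional record for any species in the family.
  regional <- filter(
    in_range,
    site == "any.temperate.NA", family == target_family, species == "any"
  )
  if (nrow(regional) > 0) {
    return(regional$equation_id[[1]])
  }

  # Nothing covers this DBH (case 6).
  NA_character_
}

# Case 1: a 30 cm Acer rubrum at SCBI.
pick_equation(sitespecies_example, "scbi", "Aceraceae", "Acer rubrum", 30)
```

The family has to come from somewhere other than the stem itself, for example the species table supplied with the census.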
Do you have any real examples where a "Generic" and "Expert" equation are given for the same species at the same site, and at the same DBH?
Here is one example:
Row 236 of the user's data has a single tree of `dbh = 143` that matches two equations in allodb, `7f7777` and `333c34`. Right now, the code calculates the biomass for each row independently and then sums them together to produce the single `biomass` result for `rowid` 236. This is appropriate when the two equations are for different parts of the same tree. But that is not the case here; the resulting biomass (which sums biomass by `rowid`) will overestimate the real biomass.
The temporary approach I suggest is to forget about generic equations until we can handle them correctly. Here, the code would drop on the fly the row where `is_generic` is `TRUE`, and we are left with only the Expert equations.
What do you think?
library(tidyverse)
#> Warning: package 'purrr' was built under R version 3.5.3
library(allodb)
library(fgeo.biomass)
set.seed(1)
census <- fgeo.biomass::scbi_tree1 %>% dplyr::sample_n(1000)
species <- fgeo.biomass::scbi_species
census_species <- census %>% add_species(species, site = "scbi")
#> Adding `site`.
#> Overwriting `sp`; it now stores Latin species names.
#> Adding `rowid`.
bad <- allo_find(census_species)
#> Assuming `dbh` in [mm] (required to find dbh-specific equations).
#> * Searching equations according to site and species.
#> Warning: Can't find equations matching these species:
#> carya sp, quercus prinus, ulmus sp, unidentified unk
#> * Refining equations according to dbh.
#> Warning: Can't find equations for 664 rows (inserting `NA`).
bad %>%
select(
rowid, equation_id,
site,
sp,
dbh,
matches("dbh.*mm$"),
is_generic,
anatomic_relevance
) %>%
add_count(rowid) %>%
filter(n > 1 & rowid %in% c("236", "811", "336")) %>%
select(-n)
#> # A tibble: 6 x 9
#> rowid equation_id site sp dbh dbh_min_mm dbh_max_mm is_generic
#> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <lgl>
#> 1 236 7f7777 scbi robi~ 143. 40 420 TRUE
#> 2 236 333c34 scbi robi~ 143. 142. 259. FALSE
#> 3 336 f08fff scbi prun~ 37.6 30 640 TRUE
#> 4 336 8aecca scbi prun~ 37.6 3.7 68.3 FALSE
#> 5 811 f08fff scbi sass~ 37.3 30 640 TRUE
#> 6 811 2c092b scbi sass~ 37.3 4 84.9 FALSE
#> # ... with 1 more variable: anatomic_relevance <chr>
Created on 2019-03-21 by the reprex package (v0.2.1)
This is a case, and there will be others like it, where there are two equations for different size classes that overlap in dbh range. This relates to issue #17, and ultimately I'd prefer to use the approach described there (i.e., switch equations at the point where they cross). Until that is done, please give precedence to the expert-selected equation.
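One way to read "give precedence to the expert-selected equation", sketched with the two equations matched to rowid 236 plus an invented rowid 500 that has only a generic match. This is an assumption about the rule, not the implemented behaviour:

```r
library(dplyr)

matched <- tibble(
  rowid       = c(236L, 236L, 500L),
  equation_id = c("7f7777", "333c34", "f08fff"),
  is_generic  = c(TRUE, FALSE, TRUE)
)

# Within each rowid, keep the non-generic rows; keep generic rows only
# when the stem has no non-generic alternative.
preferred <- matched %>%
  group_by(rowid) %>%
  filter(!is_generic | all(is_generic)) %>%
  ungroup()
```

Unlike dropping every generic row, this keeps the generic equation for stems that would otherwise get no biomass estimate.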
Note that, in the long run, it won't be a stable solution to sum equations under the assumption that they describe different biomass components. Rather, we will need to create equations describing how the `dependent_variable_biomass_component`s relate to one another and use that as the basis for summing.
... in the long run ... we will need to create equations describing how the dependent_variable_biomass_components relate to one another and use that as the basis for summing.
Sorry, I don't understand this, but you say it's not urgent so I'll let it sit for now.
I've made an issue (#82) to remind us of this later.
RE your https://github.com/forestgeo/allodb/issues/72#issuecomment-475373890
Awesome! Thanks for taking the time to develop an example. It's a great reminder of the basic logic you expect.
It should be clear by now, but I highlight that your comment (https://github.com/forestgeo/allodb/issues/72#issuecomment-475373890) describes decisions about different trees, whereas my example (https://github.com/forestgeo/allodb/issues/72#issuecomment-475378972) describes decisions about a single tree.
Following https://github.com/forestgeo/allodb/issues/73#issuecomment-476686185, here are the newly added generic-equations, and some comments and questions.
a. (https://github.com/forestgeo/fgeo.biomass/issues/31) `site = "any temperate NA"` will be converted on the fly to `<the current site>`, e.g. if the data comes from SCBI, all values of `site` will be "scbi". This allows matching the equations by site.
b. Rows 3-10 require no action. They will be handled correctly once (https://github.com/forestgeo/fgeo.biomass/issues/31) is implemented. That is, the code already knows how to find equations for each row in a census dataset by matching allodb tables by `site` (see a.) and `species`.
c. Row 11 is also no problem. That equation will be used for every row of the census dataset containing "Abies sp." (i.e. when the user's ForestGEO census table has a code in `sp` that points to Abies sp. in the ForestGEO species table).
d. Rows 1-2 are problematic. How will the code decide which of the two to use? @teixeirak, is your idea to match by family? i.e. if a tree belongs to Fabaceae match 1, and if it belongs to Juglandaceae match 2?
library(allodb)
library(tidyverse)
sitespecies %>%
filter(str_detect(site, "any")) %>%
select(site, equation_group, family, species, equation_id, equation_group)
#> # A tibble: 11 x 5
#> site equation_group family species equation_id
#> <chr> <chr> <chr> <chr> <chr>
#> 1 any temperate NA Generic Fabaceae <NA> 7f7777
#> 2 any temperate NA Generic Juglandaceae <NA> 1c1ac8
#> 3 any temperate NA Generic Pinaceae Abies balsamea 4872ed
#> 4 any temperate NA Generic Pinaceae Abies fraseri 4872ed
#> 5 any temperate NA Generic Pinaceae Abies lasiocar~ 4872ed
#> 6 any temperate NA Generic Pinaceae Abies amabilis 74dd65
#> 7 any temperate NA Generic Pinaceae Abies concolor 74dd65
#> 8 any temperate NA Generic Pinaceae Abies grandis 74dd65
#> 9 any temperate NA Generic Pinaceae Abies magnifica 74dd65
#> 10 any temperate NA Generic Pinaceae Abies procera 74dd65
#> 11 any temperate NA Generic Pinaceae Abies sp. 74dd65
Created on 2019-03-26 by the reprex package (v0.2.1)
Sorry for the delayed response. Somehow this landed in my junk folder.
Regarding (d), yes, the idea is to match by family. But I suppose you need a list of the genera in each family (filled in as in row 11)?
Thanks! Yes, you are right. The code can't know which species belong to which family. If this means too much duplicated data, you may pull `Family` out of `sitespecies` into a table with columns `species` and `Family`, which I can join on the fly with `sitespecies` via the key column `species`.
library(tidyverse)
family <- tribble(
~family, ~species,
"Aea", "A a",
"Aea", "A b",
"Aea", "A c",
)
sitespecies <- tribble(
~species, ~more_columns,
"A a", "whatever",
"A c", "whatever",
)
left_join(sitespecies, family)
#> Joining, by = "species"
#> # A tibble: 2 x 3
#> species more_columns family
#> <chr> <chr> <chr>
#> 1 A a whatever Aea
#> 2 A c whatever Aea
Created on 2019-03-27 by the reprex package (v0.2.1)
I'll leave this up to @gonzalezeb. There will be multiple instances with these generic equations where we'll need to list the genera in a family. I bet there's some existing resource that we could source.
Oh, sure, you make me realize that I can find all species in a family from the species table, which the code requires anyway to match codes with species names. You can disregard my previous message.
fgeo.biomass::scbi_species
#> # A tibble: 73 x 10
#> sp Latin Genus Species Family SpeciesID Authority IDLevel syn subsp
#> <chr> <chr> <chr> <chr> <chr> <int> <chr> <chr> <lgl> <lgl>
#> 1 acne Acer~ Acer negundo Sapin~ 1 "" species NA NA
#> 2 acpl Acer~ Acer platan~ Sapin~ 2 "" species NA NA
#> 3 acru Acer~ Acer rubrum Sapin~ 3 "" species NA NA
#> 4 acsp Acer~ Acer sp Sapin~ 4 "" Multip~ NA NA
#> 5 aial Aila~ Aila~ altiss~ Simar~ 5 "" species NA NA
#> 6 amar Amel~ Amel~ arborea Rosac~ 6 "" species NA NA
#> 7 astr Asim~ Asim~ triloba Annon~ 7 "" species NA NA
#> 8 beth Berb~ Berb~ thunbe~ Berbe~ 8 "" species NA NA
#> 9 caca Carp~ Carp~ caroli~ Betul~ 9 "" species NA NA
#> 10 caco Cary~ Carya cordif~ Jugla~ 10 "" species NA NA
#> # ... with 63 more rows
Created on 2019-03-27 by the reprex package (v0.2.1)
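A sketch of that idea, assuming a cut-down species table with the same column names as `scbi_species` (`sp`, `Latin`, `Family`) and the two family-level rows from `sitespecies`. The census rows and the species codes are made up for illustration:

```r
library(dplyr)

# Cut-down stand-in for a ForestGEO species table.
species_table <- tibble(
  sp     = c("rops", "cagl"),
  Latin  = c("Robinia pseudoacacia", "Carya glabra"),
  Family = c("Fabaceae", "Juglandaceae")
)

# The two family-level generic equations (rows 1-2 above).
family_equations <- tibble(
  family      = c("Fabaceae", "Juglandaceae"),
  equation_id = c("7f7777", "1c1ac8")
)

# Made-up census rows identified only by species code.
census <- tibble(rowid = 1:3, sp = c("rops", "cagl", "rops"))

# Look up the family of each stem, then match equations by family.
matched <- census %>%
  left_join(species_table, by = "sp") %>%
  left_join(family_equations, by = c("Family" = "family"))
```

This way no separate species-to-family table is needed in allodb.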
@teixeirak, is `NA` the most appropriate for rows 1-2 of the `species` column?
I'm afraid it's too unspecific. The process of cleaning data often involves removing missing values and duplicates. Cases where you mean "any species in family X" might be confused with other more general cases.
library(tidyverse)
library(allodb)
sitespecies %>%
filter(str_detect(site, "any")) %>%
select(site, family, species, equation_id)
#> # A tibble: 11 x 4
#> site family species equation_id
#> <chr> <chr> <chr> <chr>
#> 1 any temperate NA Fabaceae <NA> 7f7777
#> 2 any temperate NA Juglandaceae <NA> 1c1ac8
#> 3 any temperate NA Pinaceae Abies balsamea 4872ed
#> 4 any temperate NA Pinaceae Abies fraseri 4872ed
#> 5 any temperate NA Pinaceae Abies lasiocarpa 4872ed
#> 6 any temperate NA Pinaceae Abies amabilis 74dd65
#> 7 any temperate NA Pinaceae Abies concolor 74dd65
#> 8 any temperate NA Pinaceae Abies grandis 74dd65
#> 9 any temperate NA Pinaceae Abies magnifica 74dd65
#> 10 any temperate NA Pinaceae Abies procera 74dd65
#> 11 any temperate NA Pinaceae Abies sp. 74dd65
Created on 2019-03-27 by the reprex package (v0.2.1)
Assuming I understand your question right, we could replace "NA" with "any".
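To illustrate the concern with a made-up two-row table: a routine cleaning step that drops missing species also drops the family-level equation, whereas an explicit placeholder such as "any" survives it.

```r
library(dplyr)

generic <- tibble(
  family      = c("Fabaceae", "Pinaceae"),
  species     = c(NA, "Abies sp."),
  equation_id = c("7f7777", "74dd65")
)

# A typical cleaning step silently removes the Fabaceae row.
cleaned_with_na <- filter(generic, !is.na(species))

# With an explicit placeholder the same step keeps both rows.
explicit <- mutate(generic, species = coalesce(species, "any"))
cleaned_with_any <- filter(explicit, !is.na(species))
```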
Wait. I could get a little bit more specific in the 'species' column for some generic equations. I will need to look for more details in another table in the Chojnacky publication.
That's what I did for Abies (to differentiate by wood gravity). I don't think it matters for Fabaceae/Juglandaceae, but I was a bit confused by their categories on that one and would appreciate your review.
I just closed forestgeo/fgeo.biomass#31 (Convert `site = any temperate NA` to `<current site>`). This means that from https://github.com/forestgeo/allodb/issues/72#issuecomment-476770717 the only case that is still on the TODO list is that of rows 1-2 (and you are helping with that, so great!).
From https://github.com/forestgeo/allodb/issues/72#issuecomment-476770717:
Note that we may run into some other issues as we continue entering Chojnacky equations. In particular, I'm not sure how to deal with the "woodland" category (that will make sense to @gonzalezeb, not @maurolepore).
Chojnacky 2014 noted that the woodland equations seemed to predict low biomass values, based likely on errors in the old data they used. As long as we specify this in our publication we can include them. These equations use drc (diameter at root collar) so I will include the diameter within the allodb equation.
Closing because this issue was solved by the new weighting system.
We need to add the Chojnacky et al. 2014 equations to the `sitespecies` table, both to give users the option to go with the generic option and to handle issue #69.
Necessary steps:
1- @gonzalezeb, add the equations to the table. For site, put something like "Any" or "NA". For species, fill in the specificity for which the equation is designed (e.g., "Picea spp.").
2- @maurolepore, we'll then need a mechanism in the code to identify and assign these equations. They should be selected when (1) they are the only equation available for the species at a given DBH or (2) the user selects the generic equation option.