hurlbertlab / core-transient

Data and code for NSF funded research on core vs transient species

re-cleaning d207, species names #89

Closed (ahhurlbert closed 8 years ago)

ahhurlbert commented 8 years ago

@ssnell6 Every quadrat has the exact same number of records. This is not a problem per se, but it reveals how the data are structured, which has ramifications for us. There are exactly 828 rows per quadrat because every quadrat has rows for the exact same set of species and tissue types, regardless of whether those species/tissues were observed. This should immediately raise suspicion in a core-transient context because it implies that only a fixed set of species was being monitored/recorded. It turns out that less commonly observed species are recorded under Species categories like "Other Evergreen (Note species in comments)", "Other sedges", or "Forbs spp.". When the species name is actually provided in the Species.Comments field, it should probably be pulled out and assigned to the species column (there are only 4 unique comments in the Species.Comments field, so this should be easy).
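To illustrate the comment-to-species step, here is a rough sketch in Python (not the project's own cleaning code). Only the column name `Species.Comments` comes from the dataset description above; the rows, the comment text, and the `vetted_comments` lookup are invented for illustration. It assumes the 4 unique comments would be inspected by hand and only those that really name a species would go in the lookup.

```python
# Hypothetical rows: the comment values below are made up for illustration.
rows = [
    {"Species": "Other sedges", "Species.Comments": "Example species A"},
    {"Species": "Other sedges", "Species.Comments": ""},
    {"Species": "Forbs spp.", "Species.Comments": "see field notes"},
]

# Hand-vetted lookup: only comments that actually name a species go here.
# (Hypothetical content; with 4 unique comments this is quick to build by eye.)
vetted_comments = {"Example species A": "Example species A"}

def promote_comment(row, lookup):
    """Return a copy of row whose Species is taken from a vetted comment."""
    comment = (row.get("Species.Comments") or "").strip()
    out = dict(row)
    if comment in lookup:
        out["Species"] = lookup[comment]
    return out

cleaned = [promote_comment(r, vetted_comments) for r in rows]
```

Rows with an empty or non-species comment keep their original "Other" category, so nothing is silently dropped.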

Otherwise, I would actually treat all of the "Other" categories as valid species rather than removing them with bad_sp. (You'll still need to merge, e.g., "Other Evergreen" with "Other Evergreen (Note species in comments)", etc., using typo_name and good_name.) If these categories occurred commonly, we would run the risk of assigning multiple transient species to a single "core" species name. However, I am ok doing this because they seem to occur infrequently enough (where biomass > 0) that this is unlikely.
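A minimal Python sketch of the merge, mirroring the paired typo_name / good_name vectors mentioned above. Which of the two labels should be kept is not stated here, so using the shorter "Other Evergreen" as the canonical name is an assumption, and the `species` list is invented sample data.

```python
# Paired lists, mirroring the typo_name / good_name convention.
# Assumption: the shorter label is kept as the canonical name.
typo_name = ["Other Evergreen (Note species in comments)"]
good_name = ["Other Evergreen"]

# Build a lookup from each variant label to its canonical label.
name_fix = dict(zip(typo_name, good_name))

# Hypothetical species column values.
species = [
    "Other Evergreen",
    "Other Evergreen (Note species in comments)",
    "Other sedges",
]

# Replace variants; leave every other name untouched.
merged = [name_fix.get(s, s) for s in species]
```

Names without an entry in the lookup pass through unchanged, so the "Other" categories stay in the data as valid species.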

Note that 2 species names actually have a space at the end: "Forbs spp. " and "Other forbs (Note species in comments) ".
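One quick way to catch and fix names like these, sketched in Python with a made-up `species` list that includes the two padded names quoted above:

```python
# Hypothetical species column values, including the two padded names.
species = [
    "Forbs spp. ",
    "Forbs spp.",
    "Other forbs (Note species in comments) ",
]

# Names that change when surrounding whitespace is removed.
padded = sorted({s for s in species if s != s.strip()})

# Stripped column: "Forbs spp. " and "Forbs spp." now collapse to one name.
stripped = [s.strip() for s in species]
```

Stripping before any name matching avoids treating the padded and unpadded spellings as different species.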

ssnell6 commented 8 years ago

Fixed the species issue with a for loop. Removed bad species and checked spacing.