jtmiller28 / desert-modeling

A repo for modeling North American deserts' plants & bees
Creative Commons Zero v1.0 Universal

Check Taxonomic Composition of samples for skews #1

Open jtmiller28 opened 11 months ago

jtmiller28 commented 11 months ago

Identify any taxonomic biases in the data. The current filter only requires that each species have at least 100 specimen records within the Sonoran; however, we should check that the resulting taxonomic composition isn't heavily biased.

Conduct an Ordinal Analysis on taxonomic composition related to specimen sampling.

jtmiller28 commented 11 months ago

Conducted an analysis at the family level. Species were associated with their corresponding families using the World Flora Online backbone, and the number of specimens was then summed per family. The filtered (proposed threshold of 100 records) and unfiltered datasets were then compared, with the filtered dataset assigned zero records for families that were completely filtered out.
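The family-level aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the code used in this repo: the `records` list, the `family_totals` helper, and all specimen counts are invented for the example, and the family assignments are hard-coded rather than looked up in the World Flora Online backbone.

```python
from collections import Counter

# Hypothetical input: (species, family, specimen_count).
# The counts are made up purely for illustration.
records = [
    ("Larrea tridentata", "Zygophyllaceae", 450),
    ("Carnegiea gigantea", "Cactaceae", 300),
    ("Opuntia engelmannii", "Cactaceae", 120),
    ("Encelia farinosa", "Asteraceae", 210),
    ("Ambrosia deltoidea", "Asteraceae", 80),
    ("Fouquieria splendens", "Fouquieriaceae", 60),
]


def family_totals(rows, min_records=0):
    """Sum specimen counts per family, keeping species with >= min_records."""
    totals = Counter()
    for _species, family, n in rows:
        if n >= min_records:
            totals[family] += n
    return totals


unfiltered = family_totals(records)
filtered = family_totals(records, min_records=100)

# Align both datasets on the full family list; families that were
# completely filtered out get zero records in the filtered column.
families = sorted(unfiltered)
comparison = {fam: (filtered.get(fam, 0), unfiltered[fam]) for fam in families}

for fam, (n_filtered, n_unfiltered) in comparison.items():
    print(f"{fam:16s} filtered={n_filtered:4d} unfiltered={n_unfiltered:4d}")
```

Keying the comparison on the unfiltered family list is what keeps fully filtered-out families in the table as zeros instead of dropping them silently.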

[Figures: filtered-taxa-comparison, unfiltered-taxa-comparison]

A chi-squared test was also conducted on the per-family specimen totals for the filtered (Obs) vs. unfiltered (Exp) datasets.

H0: There is no significant difference in family composition between the filtered and unfiltered datasets.

H1: There is a significant difference in family composition between the filtered and unfiltered datasets.

The initial test produced a warning that the chi-squared approximation may be incorrect, likely because some families have very small counts. To address this, the simulate p-value argument was turned on so that the p-value is computed by Monte Carlo simulation. Results: X-squared = 7490.7, df = NA, p-value = 0.0004998

This suggests that the family compositions of our filtered and unfiltered datasets are not similar.
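For readers unfamiliar with the simulated test, here is a rough sketch of a Monte Carlo chi-squared comparison. The thread does not show the actual call, so the goodness-of-fit framing (filtered totals as observed, unfiltered proportions as the null) is an assumption, as are the function names and the example counts. The p-value estimator, (1 + number of simulated statistics at least as large as the observed one) / (replicates + 1), is the one R's `chisq.test` uses when simulation is enabled, with 2000 replicates by default.

```python
import random


def chi_squared(observed, expected):
    """Pearson chi-squared statistic for matched count lists."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def monte_carlo_gof(observed, reference, replicates=2000, seed=0):
    """Goodness-of-fit test of `observed` against `reference` proportions.

    `reference` counts must all be positive. The p-value is estimated by
    resampling len(observed) categories under the null hypothesis.
    """
    n = sum(observed)
    total_ref = sum(reference)
    probs = [r / total_ref for r in reference]
    expected = [n * p for p in probs]
    stat = chi_squared(observed, expected)

    rng = random.Random(seed)
    categories = range(len(probs))
    at_least_as_extreme = 0
    for _ in range(replicates):
        # Draw n specimens under the null and tabulate them per category.
        draws = rng.choices(categories, weights=probs, k=n)
        simulated = [0] * len(probs)
        for idx in draws:
            simulated[idx] += 1
        if chi_squared(simulated, expected) >= stat:
            at_least_as_extreme += 1

    p_value = (1 + at_least_as_extreme) / (replicates + 1)
    return stat, p_value


# Invented per-family totals (same family order in both lists).
filtered_totals = [210, 420, 0, 450]
unfiltered_totals = [290, 420, 60, 450]

stat, p_value = monte_carlo_gof(filtered_totals, unfiltered_totals)
print(f"X-squared = {stat:.2f}, p-value = {p_value:.7f}")
```

With 2000 replicates, a p-value of 1/2001 (about 0.0004998) means that none of the simulated statistics reached the observed one. It is the smallest value this estimator can report.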

jtmiller28 commented 11 months ago

Attempted to adjust the filtering threshold to n >= 100, 75, 50, 20, 10, and 5. All thresholds except n >= 5 returned identical p-values for the chi-squared test, indicating that n >= 5 is around the threshold needed to make the datasets similar. The likely reason is that some families are represented only by a few sparsely sampled species.
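A threshold sweep like the one described can be sketched as below. The `records` data and the `sweep` helper are hypothetical, with counts invented for illustration. Rather than re-running the test, the sketch reports which families disappear at each threshold and what share of specimens survives.

```python
from collections import Counter

# Hypothetical input: (species, family, specimen_count); counts are made up.
records = [
    ("Larrea tridentata", "Zygophyllaceae", 450),
    ("Carnegiea gigantea", "Cactaceae", 300),
    ("Encelia farinosa", "Asteraceae", 210),
    ("Fouquieria splendens", "Fouquieriaceae", 60),
    ("Krameria bicolor", "Krameriaceae", 8),
    ("Simmondsia chinensis", "Simmondsiaceae", 4),
]


def sweep(rows, thresholds):
    """For each minimum-record threshold, summarize what the filter removes."""
    all_families = {family for _species, family, _n in rows}
    total = sum(n for _species, _family, n in rows)
    summary = []
    for t in thresholds:
        kept = Counter()
        for _species, family, n in rows:
            if n >= t:
                kept[family] += n
        summary.append({
            "threshold": t,
            "families_lost": sorted(all_families - set(kept)),
            "share_of_specimens_kept": sum(kept.values()) / total,
        })
    return summary


summary = sweep(records, [100, 75, 50, 20, 10, 5])
for row in summary:
    print(row["threshold"], row["families_lost"],
          round(row["share_of_specimens_kept"], 3))
```

A simulated p-value from B replicates cannot be smaller than 1/(B + 1), so quite different thresholds can report the same p-value. Comparing the test statistic or the list of lost families across thresholds may therefore be more informative than comparing the p-values alone.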

My assumption is that including these species would do more harm than good to the models (5 records is not informative for our modeling). We will proceed with the caveat that our filtered sample is not fully taxonomically representative of Sonoran angiosperms as a whole, but is representative of the better-sampled taxa.