traitecoevo / austraits

R package for accessing the AusTraits Plant database and working with traits.build databases
https://traitecoevo.github.io/austraits/
MIT License

Clarification about low diversity of taxa across most frequently recorded traits. #28

Closed edanweis closed 2 years ago

edanweis commented 3 years ago

Disclaimer: I have no background in botany or related fields, and I am not a data scientist.

According to my analysis, relatively few species (taxa/names) appear among the studies observing some of the most frequently recorded traits, i.e. roughly 29 species out of 28,640. Why is this so?

Could it be because most research surveys fewer species but examines a greater number of traits?

I realise this may not be the best forum for this question, but any help / referral would be appreciated, thanks.

Google Colab Notebook (Python)

dfalster commented 3 years ago

Thanks for the insight @edanweis.

As background, have you read the paper about the dataset? https://www.biorxiv.org/content/10.1101/2021.01.04.425314v1 An updated version will be published on Friday in Scientific Data at http://doi.org/10.1038/s41597-021-01006-6 (the link won't work until Friday). Figure 2 in the new version shows that around 5-10 traits have data for more than 10,000 taxa.

Most studies collect data on few traits and species. The power of the dataset comes from putting all this together. A few sources have data on many taxa. There's no fixed list of traits to collect, so you won't find many studies that collect a common set of traits.
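
As a rough illustration of the point above, the sketch below counts taxa and traits per source and then for the pooled records. The records, the source names (`study_A`, `study_B`, `flora_C`) and the tuple layout are all invented for this example; they are not AusTraits data and not necessarily how the database is structured.

```python
from collections import defaultdict

# Toy long-format records: (dataset_id, taxon_name, trait_name).
# All values are invented for illustration; they are not AusTraits data.
records = [
    ("study_A", "Acacia aneura", "leaf_length"),
    ("study_A", "Acacia aneura", "leaf_width"),
    ("study_A", "Banksia serrata", "leaf_length"),
    ("study_B", "Banksia serrata", "plant_height"),
    ("study_B", "Eucalyptus regnans", "plant_height"),
    ("flora_C", "Acacia aneura", "plant_growth_form"),
    ("flora_C", "Banksia serrata", "plant_growth_form"),
    ("flora_C", "Eucalyptus regnans", "plant_growth_form"),
    ("flora_C", "Hakea sericea", "plant_growth_form"),
]

# Collect the distinct taxa and distinct traits seen in each source.
taxa_by_dataset = defaultdict(set)
traits_by_dataset = defaultdict(set)
for dataset_id, taxon, trait in records:
    taxa_by_dataset[dataset_id].add(taxon)
    traits_by_dataset[dataset_id].add(trait)

# (number of taxa, number of traits) for each individual source.
per_dataset = {
    d: (len(taxa_by_dataset[d]), len(traits_by_dataset[d]))
    for d in taxa_by_dataset
}

# The same counts once every source is pooled together.
pooled_taxa = set().union(*taxa_by_dataset.values())
pooled_traits = set().union(*traits_by_dataset.values())

for d, (n_taxa, n_traits) in sorted(per_dataset.items()):
    print(f"{d}: {n_taxa} taxa, {n_traits} traits")
print(f"pooled: {len(pooled_taxa)} taxa, {len(pooled_traits)} traits")
```

In this toy case no single source covers more than two traits or all four traits at once, yet the pooled records cover four taxa and four traits. It also shows why any one trait can still have data for only a subset of taxa.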

Can you tell us more about your interests in the data and what you're aiming to achieve?

Ping @ehwenk

edanweis commented 3 years ago

No worries.

> around 5-10 traits have data for more than 10,000 taxa.

Yes, I have reproduced that in the data:

| trait_name | distinct_taxa | %_of_all_taxa |
| --- | --- | --- |
| plant_growth_form | 25355 | 88.5% |
| life_history | 23078 | 80.6% |
| fruit_type | 22411 | 78.3% |
| sex_type | 21205 | 74.0% |
| plant_height | 17521 | 61.2% |
| flowering_time | 17276 | 60.3% |
| leaf_length | 14506 | 50.6% |
| leaf_width | 14105 | 49.2% |
| leaf_compoundness | 13719 | 47.9% |
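
For reference, a minimal sketch of how a table like the one above can be computed from long-format records, assuming each record is a (taxon name, trait name) pair. The function name `taxa_coverage` and the toy records are hypothetical, and the percentage is taken against all distinct taxa in the input.

```python
from collections import defaultdict


def taxa_coverage(records):
    """Return [(trait_name, distinct_taxa, percent_of_all_taxa)], largest first.

    `records` is an iterable of (taxon_name, trait_name) pairs; repeated
    observations of the same taxon for the same trait are counted once.
    """
    taxa_by_trait = defaultdict(set)
    all_taxa = set()
    for taxon, trait in records:
        taxa_by_trait[trait].add(taxon)
        all_taxa.add(taxon)
    rows = [
        (trait, len(taxa), round(100 * len(taxa) / len(all_taxa), 1))
        for trait, taxa in taxa_by_trait.items()
    ]
    # Sort by taxon count (descending), then trait name for stable ties.
    return sorted(rows, key=lambda row: (-row[1], row[0]))


# Toy records, invented for illustration (not AusTraits data).
toy_records = [
    ("Acacia aneura", "plant_growth_form"),
    ("Acacia aneura", "plant_growth_form"),  # duplicate observation
    ("Banksia serrata", "plant_growth_form"),
    ("Eucalyptus regnans", "plant_growth_form"),
    ("Hakea sericea", "plant_growth_form"),
    ("Acacia aneura", "plant_height"),
    ("Banksia serrata", "plant_height"),
    ("Acacia aneura", "wood_density"),
]

coverage = taxa_coverage(toy_records)
for trait, n_taxa, pct in coverage:
    print(f"{trait}\t{n_taxa}\t{pct}%")
```

Using a set per trait means repeated measurements of one taxon do not inflate the count, which matters when a trait is measured many times on a handful of species.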

Beyond those traits, unfortunately, there is a long tail of traits recorded for far fewer species, shown here on a log scale:

image

The paper mentions standardising terms for categorical variables as part of data harmonisation, but does this include the trait names themselves? I.e., are there trait synonyms or ambiguities? An ontology was used, but I assume that doesn't include trait name values (Madin, J. et al. A)?

I'm interested in applying ML to predict horticultural characteristics from morphological traits and natural language descriptions (word embeddings), as independent and dependent variables respectively, using decision tree classifiers, and then summarising their relative importance with Shapley value decomposition. I was hoping there would be a sufficient number of taxa and traits for this analysis to prove effective without much feature engineering that depends on expert knowledge.
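
To make the Shapley value decomposition mentioned above concrete, here is a minimal sketch of the exact calculation for two features. It assumes a hypothetical table `scores` giving a model score (for example, a classifier's accuracy) for every subset of features; the numbers and the helper name `shapley_values` are invented for illustration and do not come from any fitted model.

```python
from itertools import permutations


def shapley_values(features, value):
    """Exact Shapley values: mean marginal contribution over all orderings.

    `value` maps a frozenset of feature names to a score. The cost grows
    factorially with the number of features, so this only suits tiny sets.
    """
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        seen = frozenset()
        for f in order:
            with_f = seen | {f}
            # Gain from adding f to the features already present.
            totals[f] += value(with_f) - value(seen)
            seen = with_f
    return {f: totals[f] / len(orderings) for f in features}


# Hypothetical scores for each feature subset (invented numbers).
scores = {
    frozenset(): 0.50,
    frozenset({"plant_height"}): 0.60,
    frozenset({"leaf_length"}): 0.70,
    frozenset({"plant_height", "leaf_length"}): 0.90,
}

phi = shapley_values(["plant_height", "leaf_length"], scores.__getitem__)
for name, contribution in phi.items():
    print(f"{name}: {contribution:.2f}")
```

The contributions sum to the gap between the full-model score and the empty-model score, which is what makes the decomposition useful for ranking features. With many traits, exact enumeration is infeasible and an approximate method would be needed instead.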

ehwenk commented 3 years ago

The trait names are harmonised: each trait name represents a unique trait. Our definitions file includes a list of all the trait names and their definitions (http://traitecoevo.github.io/austraits.build/articles/Trait_definitions.html). Some of the terms allowed for categorical traits (trait values, in our terminology) are synonymous, and we're in the process of reviewing those over the next 18 months.

fontikar commented 2 years ago

@dfalster would it be ok to close this issue?