Closed edanweis closed 2 years ago
Thanks for the insight @edanweis.
As background, have you read the paper about the dataset? https://www.biorxiv.org/content/10.1101/2021.01.04.425314v1 An updated version will be published on Friday in Scientific Data at http://doi.org/10.1038/s41597-021-01006-6 (link won't work until Friday). Figure 2 in the new version shows that around 5-10 traits have data for more than 10,000 taxa.
Most studies collect data on few traits and species. The power of the dataset comes from putting all this together. A few sources have data on many taxa. There's no fixed list of traits to collect, so you won't find many studies that collect a common set of traits.
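To make that pattern concrete, here is a minimal sketch of how one might summarise the breadth of each study: how many distinct taxa and how many distinct traits it covers. It assumes the trait records can be reduced to `(dataset_id, taxon_name, trait_name)` tuples; those field names, the helper `study_breadth`, and the toy records are illustrative assumptions, not the actual AusTraits schema or data.

```python
from collections import defaultdict


def study_breadth(records):
    """Return {dataset_id: (n_distinct_taxa, n_distinct_traits)}.

    `records` is an iterable of (dataset_id, taxon_name, trait_name)
    tuples. Field names are assumptions for illustration only.
    """
    taxa = defaultdict(set)
    traits = defaultdict(set)
    for dataset_id, taxon_name, trait_name in records:
        taxa[dataset_id].add(taxon_name)
        traits[dataset_id].add(trait_name)
    return {d: (len(taxa[d]), len(traits[d])) for d in taxa}


# Hypothetical toy records: one narrow field study, one broad flora.
toy_records = [
    ("study_1", "Taxon A", "leaf_length"),
    ("study_1", "Taxon A", "leaf_width"),
    ("study_1", "Taxon B", "leaf_length"),
    ("flora_1", "Taxon A", "plant_growth_form"),
    ("flora_1", "Taxon B", "plant_growth_form"),
    ("flora_1", "Taxon C", "plant_growth_form"),
    ("flora_1", "Taxon D", "plant_growth_form"),
]

breadth = study_breadth(toy_records)
for dataset_id, (n_taxa, n_traits) in sorted(breadth.items()):
    print(dataset_id, n_taxa, n_traits)
```

In the toy data the "flora" source covers many taxa for a single trait, while the field study covers few taxa for several traits. That is the kind of structure described above.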
Can you tell us more about your interests in the data and what you're aiming to achieve?
Ping @ehwenk
No worries.
> around 5-10 traits have data for more than 10000 taxa.
Yes, I have reproduced that in the data:
trait_name | distinct_taxa | %_of_all_taxa |
---|---|---|
plant_growth_form | 25355 | 88.5% |
life_history | 23078 | 80.6% |
fruit_type | 22411 | 78.3% |
sex_type | 21205 | 74.0% |
plant_height | 17521 | 61.2% |
flowering_time | 17276 | 60.3% |
leaf_length | 14506 | 50.6% |
leaf_width | 14105 | 49.2% |
leaf_compoundness | 13719 | 47.9% |
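A table like the one above can be built by counting distinct taxa per trait and dividing by the total number of taxa. The following is a minimal sketch of that calculation, not the code from the notebook. It assumes the records are available as `(taxon_name, trait_name)` pairs; the function name `trait_coverage` and the toy records are hypothetical.

```python
from collections import defaultdict


def trait_coverage(records, total_taxa=None):
    """Return rows of (trait_name, distinct_taxa, percent_of_all_taxa).

    `records` is an iterable of (taxon_name, trait_name) pairs.
    If `total_taxa` is not given, the number of distinct taxa seen
    in `records` is used as the denominator.
    """
    taxa_by_trait = defaultdict(set)
    all_taxa = set()
    for taxon_name, trait_name in records:
        taxa_by_trait[trait_name].add(taxon_name)
        all_taxa.add(taxon_name)
    n_total = total_taxa if total_taxa is not None else len(all_taxa)
    rows = [
        (trait, len(taxa), 100.0 * len(taxa) / n_total)
        for trait, taxa in taxa_by_trait.items()
    ]
    # Most widely recorded traits first; ties broken by trait name.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


# Hypothetical toy records; repeated measurements of the same
# taxon/trait pair count only once.
toy_records = [
    ("Taxon A", "plant_growth_form"),
    ("Taxon B", "plant_growth_form"),
    ("Taxon C", "plant_growth_form"),
    ("Taxon A", "plant_height"),
    ("Taxon A", "plant_height"),
    ("Taxon B", "plant_height"),
    ("Taxon D", "leaf_length"),
]

coverage = trait_coverage(toy_records)
for trait, n_taxa, pct in coverage:
    print(f"{trait} | {n_taxa} | {pct:.1f}% |")
```

The percentages in the table are consistent with a denominator of 28,640 taxa (for example, 25355 / 28640 ≈ 88.5%), which could be passed as `total_taxa`.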
Beyond those traits, unfortunately, there is a long tail of traits recorded for far fewer species, shown here on a log scale:
[Figure: number of distinct taxa per trait, log scale]
The paper mentions standardising terms for categorical variables as part of data harmonisation, but does this include the trait names themselves? I.e., are there trait synonyms or ambiguities? An ontology was used, but I assume that doesn't include trait name values (Madin, J. et al. A)?
I'm interested in applying ML to predict horticultural characteristics (the dependent variables) from morphological traits and natural language descriptions (word embeddings) as the independent variables, using decision tree classifiers, and then summarising the relative importance of the predictors with Shapley value decomposition. I was hoping there would be a sufficient number of taxa and traits for this analysis to be effective without much feature engineering dependent on expert knowledge.
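To illustrate what the Shapley step involves, here is a minimal sketch that computes exact Shapley values for a single prediction by enumerating all feature coalitions. Features left out of a coalition are replaced by a baseline value. Everything in it is an assumption for illustration: `toy_tree` is a hand-written stand-in for a fitted decision tree, and the feature names and values are made up. Exact enumeration only scales to a handful of features, so a real analysis would more likely use a dedicated tree-based approximation.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of each feature for predict(x).

    Features outside a coalition take their value from `baseline`.
    Cost grows as 2**len(x), so this is only for small examples.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_tree(z):
    """Hypothetical decision tree over (height_m, leaf_length_cm, is_woody)."""
    height_m, _leaf_length_cm, is_woody = z
    if is_woody >= 0.5:
        return 1.0 if height_m > 2.0 else 0.5
    return 0.0


x = [3.0, 10.0, 1.0]        # made-up taxon to explain
baseline = [1.0, 5.0, 0.0]  # made-up reference taxon

phi = shapley_values(toy_tree, x, baseline)
print(phi)
```

The values sum to the difference between the prediction for `x` and the prediction for `baseline`, and a feature the tree never uses (leaf length here) gets a value of zero.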
The trait names are harmonised: each trait name represents a unique trait. Our definitions file includes a list of all the trait names and their definitions (http://traitecoevo.github.io/austraits.build/articles/Trait_definitions.html). Some of the terms allowed for categorical traits (trait values, per our terminology) are synonymous, and we're in the process of reviewing those over the next 18 months.
@dfalster would it be ok to close this issue?
Disclaimer: I have no background in botany, etc., and I am not a data scientist.
According to my analysis, there appear to be relatively few species (taxa/names) among the studies observing some of the most frequently recorded traits, i.e. roughly 29 species out of 28,640. Why is this so?
Could it be because most research surveys fewer species but examines a greater number of traits?
I realise this may not be the best forum for this question, but any help / referral would be appreciated, thanks.
Google Colab Notebook (Python)