Clarification about low diversity of taxa across most frequently recorded traits.

edanweis commented 3 years ago

Disclaimer: I have no background in botany or related fields, and I am not a data scientist.

According to my analysis, relatively few species (taxa/names) appear among the studies observing some of the most frequently recorded traits, i.e. roughly 29 species out of 28,640. Why is this so?

Could it be because most research surveys fewer species but examines a greater number of traits?

I realise this may not be the best forum for this question, but any help / referral would be appreciated, thanks.

Google Colab Notebook (Python)

dfalster commented 3 years ago

Thanks for the insight @edanweis.

As background, have you read the paper about the dataset? https://www.biorxiv.org/content/10.1101/2021.01.04.425314v1 An updated version will be published on Friday in Scientific Data at http://doi.org/10.1038/s41597-021-01006-6 (link won't work until Friday). Figure 2 in the new version shows that around 5-10 traits have data for more than 10,000 taxa.

Most studies collect data on few traits and species. The power of the dataset comes from putting all this together. A few sources have data on many taxa. There's no fixed list of traits to collect, so you won't find many studies that collect a common set of traits.
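To make the pooling point concrete, here is a minimal sketch with made-up records. The field layout `(dataset_id, taxon_name, trait_name)`, the study names, and the helper functions are assumptions for illustration only, not the actual AusTraits schema or values. It counts the taxa that have a complete set of wanted traits within each study and again after pooling all studies.

```python
from collections import defaultdict

# Hypothetical toy records: (dataset_id, taxon_name, trait_name).
records = [
    ("study_A", "Acacia dealbata", "plant_height"),
    ("study_A", "Acacia dealbata", "leaf_length"),
    ("study_A", "Banksia serrata", "plant_height"),
    ("study_B", "Banksia serrata", "leaf_length"),
    ("study_B", "Corymbia maculata", "leaf_length"),
    ("study_C", "Corymbia maculata", "plant_height"),
]


def traits_by_taxon(rows):
    """Map each taxon to the set of traits recorded for it."""
    out = defaultdict(set)
    for _, taxon, trait in rows:
        out[taxon].add(trait)
    return out


def complete_taxa(rows, wanted):
    """Return the taxa that have every trait in `wanted` among `rows`."""
    return {t for t, traits in traits_by_taxon(rows).items() if wanted <= traits}


wanted = {"plant_height", "leaf_length"}

# Within a single study, few taxa have both traits.
per_study = {}
for study in sorted({r[0] for r in records}):
    rows = [r for r in records if r[0] == study]
    per_study[study] = complete_taxa(rows, wanted)

# Pooling all studies fills the gaps left by each individual study.
pooled = complete_taxa(records, wanted)

for study, taxa in per_study.items():
    print(study, len(taxa))
print("pooled", len(pooled))
```

In this toy data only one study covers a taxon completely, while the pooled records cover all three taxa.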

Can you tell us more about your interests in the data and what you're aiming to achieve?

Ping @ehwenk

edanweis commented 3 years ago

No worries.

around 5-10 traits have data for more than 10,000 taxa.

Yes, I have reproduced that in the data:

trait_name	distinct_taxa	%_of_all_taxa
plant_growth_form	25355	88.5%
life_history	23078	80.6%
fruit_type	22411	78.3%
sex_type	21205	74.0%
plant_height	17521	61.2%
flowering_time	17276	60.3%
leaf_length	14506	50.6%
leaf_width	14105	49.2%
leaf_compoundness	13719	47.9%
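For reference, a table of this shape can be computed roughly as follows. This is a sketch under assumptions: the records are toy `(taxon_name, trait_name)` pairs, not the real data, and `trait_coverage` is a hypothetical helper, not a function from the notebook or from any AusTraits package.

```python
from collections import defaultdict

# Hypothetical toy records: (taxon_name, trait_name). Duplicates are allowed
# because a taxon can have several measurements of the same trait.
records = [
    ("Acacia dealbata", "plant_growth_form"),
    ("Acacia dealbata", "plant_height"),
    ("Acacia dealbata", "plant_height"),
    ("Banksia serrata", "plant_growth_form"),
    ("Banksia serrata", "plant_height"),
    ("Corymbia maculata", "plant_growth_form"),
    ("Corymbia maculata", "leaf_length"),
    ("Dodonaea viscosa", "plant_growth_form"),
]


def trait_coverage(rows):
    """Return (trait_name, distinct_taxa, percent_of_all_taxa), widest coverage first."""
    taxa_by_trait = defaultdict(set)
    all_taxa = set()
    for taxon, trait in rows:
        taxa_by_trait[trait].add(taxon)
        all_taxa.add(taxon)
    table = [
        (trait, len(taxa), 100.0 * len(taxa) / len(all_taxa))
        for trait, taxa in taxa_by_trait.items()
    ]
    # Sort by descending taxon count, then by trait name for a stable order.
    table.sort(key=lambda row: (-row[1], row[0]))
    return table


coverage = trait_coverage(records)
for trait, n_taxa, pct in coverage:
    print(f"{trait}\t{n_taxa}\t{pct:.1f}%")
```

Using sets means repeated measurements of a trait on the same taxon are counted once, which is what "distinct taxa" requires.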

Beyond those traits, unfortunately, there is a long tail of traits recorded for far fewer species, which is clear when the counts are plotted on a log scale.

The paper mentions standardising terms for categorical variables as part of data harmonisation, but does this include the trait names themselves? I.e. are there trait synonyms or ambiguities? An ontology was used, but I assume that doesn't include trait name values (Madin, J. et al. A)?

I'm interested in applying ML to predict horticultural characteristics from morphological traits and natural language descriptions (word embeddings), as independent and dependent variables respectively, using decision tree classifiers, and then to summarise the relative importance of the predictors with Shapley value decomposition. I was hoping there would be a sufficient number of taxa and traits for this analysis to prove effective without much feature engineering dependent on expert knowledge.
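To illustrate what I mean by Shapley value decomposition, here is a minimal sketch that computes exact Shapley values by averaging each feature's marginal contribution over all feature orderings. The feature names are real trait names from the table above, but the scores are invented numbers standing in for the accuracy of a hypothetical classifier trained on each feature subset; no model is actually fitted here.

```python
from itertools import permutations


def shapley_values(features, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        seen = frozenset()
        for f in order:
            with_f = seen | {f}
            totals[f] += value(with_f) - value(seen)
            seen = with_f
    return {f: total / len(orderings) for f, total in totals.items()}


features = ["plant_height", "leaf_length", "plant_growth_form"]

# Invented scores (assumption): accuracy of a hypothetical classifier
# trained on each subset of the three features.
scores = {
    frozenset(): 0.50,
    frozenset({"plant_height"}): 0.60,
    frozenset({"leaf_length"}): 0.55,
    frozenset({"plant_growth_form"}): 0.70,
    frozenset({"plant_height", "leaf_length"}): 0.62,
    frozenset({"plant_height", "plant_growth_form"}): 0.80,
    frozenset({"leaf_length", "plant_growth_form"}): 0.75,
    frozenset(features): 0.84,
}

phi = shapley_values(features, scores.__getitem__)
for name, contribution in sorted(phi.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {contribution:.4f}")
```

The contributions sum to the gap between the full-model score and the empty-model score. Enumerating orderings grows factorially with the number of features, so for a real trait set an approximating library would be needed instead of this brute-force version.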

ehwenk commented 3 years ago

The trait names are harmonised: each trait name represents a unique trait. Our definitions file includes a list of all the trait names and their definitions (http://traitecoevo.github.io/austraits.build/articles/Trait_definitions.html). Some of the terms allowed for categorical traits (trait values, in our terminology) are synonymous, and we're in the process of reviewing those over the next 18 months.

fontikar commented 2 years ago

@dfalster would it be ok to close this issue?

traitecoevo / austraits

Clarification about low diversity of taxa across most frequently recorded traits. #28