jesstytam / honours


list of datasets and sources #2

Closed wcornwell closed 4 years ago

wcornwell commented 4 years ago

i will do GBIF

itchyshin commented 4 years ago

Will get something for body size - assigned to @itchyshin - but @jessicatytam please try to find the latest paper with the biggest dataset and put it here

mlagisz commented 4 years ago

I will do phylogeny on rotl


wcornwell commented 4 years ago

metadata for gbif download:

(screenshot: GBIF download metadata, 2020-09-16)
wcornwell commented 4 years ago

processing steps:

library(data.table)
library(dplyr)
library(readr)

# fast read of the raw GBIF export
z <- fread("data/0062362-200613084148143.csv")

# keep only the columns we need, then free the full table
z2 <- select(z, order, family, genus, species, scientificName,
             decimalLatitude, decimalLongitude, year, issue)
rm(z)
gc()

# drop records flagged with coordinate/country problems
z3 <- filter(z2, !grepl("COUNTRY_COORDINATE_MISMATCH", issue) &
               !grepl("ZERO_COORDINATE", issue) &
               !grepl("COORDINATE_INVALID", issue) &
               !grepl("COUNTRY_MISMATCH", issue) &
               !grepl("COORDINATE_OUT_OF_RANGE", issue))
#excludes 300,000 records

z4 <- select(z3, scientificName, decimalLatitude, decimalLongitude, year)

write_csv(z4, "data/gbif_processed.csv")

file is too big to add to github but you can download it here: https://www.dropbox.com/s/6lv44ap17v5amy1/gbif_processed.csv.zip?dl=0

@jessicatytam please check if this is too big to read in on your computer--unzipped it's about 970MB. if it is, then i can split into pieces.

be very interesting to see if h-index corresponds to number of records in gbif. that would be a result already
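A quick sketch of how to get per-species record counts from the processed file for that comparison (file path as above; the h-index join would come later):

library(readr)
library(dplyr)

# read the processed GBIF export produced above
gbif <- read_csv("data/gbif_processed.csv")

# one row per species with its record count, largest first;
# this is the quantity to correlate with h-index
record_counts <- gbif %>%
  count(scientificName, sort = TRUE)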

wcornwell commented 4 years ago

there are still extinct things in that dataset...not sure how to exclude them at this point....
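One option would be to anti-filter against a curated vector of extinct species names. A minimal sketch (the two names below are real extinct taxa but purely illustrative; a fuller list could come from IUCN Red List "EX" status):

library(dplyr)

# hand-curated extinct taxa to drop (illustrative only)
extinct_species <- c("Thylacinus cynocephalus",
                     "Dusicyon australis")

# note: GBIF scientificName often includes authorship, so a
# grepl()-based match on the binomial may be safer than %in%
z5 <- filter(z4, !scientificName %in% extinct_species)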

jesstytam commented 4 years ago

Body mass databases

PanTHERIA (http://esapubs.org/archive/ecol/E090/184/#data)

AnAge database (https://genomics.senescence.info/species/browser.php?type=2&name=Mammalia)

Smith et al. (https://knb.ecoinformatics.org/view/doi:10.5063/AA/nceas.196.3)

Quaardvark (https://animaldiversity.ummz.umich.edu/quaardvark/search/)

itchyshin commented 4 years ago

@jessicatytam maybe this one too - can you check this out too?

https://animaldiversity.org/

jesstytam commented 4 years ago

@itchyshin looks like it only has generic descriptions for the whole group instead of the actual numbers

itchyshin commented 4 years ago

They do have specific info - for example, koala - and you can scrape info from an underlying database for this website

https://animaldiversity.org/accounts/Phascolarctos_cinereus/

Range mass: 5.1 to 11.8 kg (11.23 to 25.99 lb)

jesstytam commented 4 years ago

ohh ok i see, i'll find out if there is a way to download that in bulk, thanks!
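If there is no bulk download, one fallback is scraping individual account pages with rvest. A sketch only; the selector below is a guess and would need checking against the actual page structure:

library(rvest)

# koala account page from the comment above
page <- read_html("https://animaldiversity.org/accounts/Phascolarctos_cinereus/")

# "li" is an assumed selector for the quick-facts entries
facts <- page %>%
  html_elements("li") %>%
  html_text2()

# keep the line(s) mentioning mass, e.g. "Range mass 5.1 to 11.8 kg ..."
facts[grepl("mass", facts, ignore.case = TRUE)]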

mlagisz commented 4 years ago

library(rotl)
library(ape)

taxa <- tnrs_match_names("Mammalia") #find the Open Tree of Life record for Mammalia
res <- tol_subtree(ott_id = taxa$ott_id, label_format = "name") #extract subtree of mammals
str(res)

9351 tip labels correspond to species, e.g.:

res$tip.label[1000:2000]
res$tip.label <- gsub("_", " ", res$tip.label) #get rid of the underscores
res$tip.label[1000:2000]

Note that some further cleaning will be required, esp. for lineages and subspecies:

hist(lengths(gregexpr("\\W+", res$tip.label)) + 1, xlab = "number of words in species names", main = "mammalian tree tip labels") #histogram of how many words are in the species names from the tree
table(lengths(gregexpr("\\W+", res$tip.label)) + 1) #table for the above

We have 6952 "clean binomial" names and approx. 2000 non-binomial ones, many of the latter could potentially be collapsed to a higher taxonomic level?

Plotting the tree (it's too large, takes a while!):

plot(res, show.tip.label = FALSE)

Once we have a cleaned list of species on the tree, we can clean the tree and also use tnrs_match_names() to get their ott_id, and then get other ID types (NCBI, WoRMS, GBIF, IRMNG) using the taxon_external_IDs() function.
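A minimal sketch of that last step, using the koala as an example (the exact argument form taxon_external_IDs() expects may differ; worth checking the rotl docs):

library(rotl)

# match one cleaned binomial name to an ott_id
taxa <- tnrs_match_names("Phascolarctos cinereus")

# fetch external identifiers (NCBI, GBIF, IRMNG, ...) for that ott_id
ids <- taxon_external_IDs(taxa$ott_id)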

wcornwell commented 4 years ago

closing in favor of #3 and #4