jhpoelen / hmw

(experimental) Machine readable version of Handbook of the Mammals of the World
https://jhpoelen.nl/hmw/
Creative Commons Zero v1.0 Universal

create a method to automatically align HMW with batnames #4

Open jhpoelen opened 2 years ago

jhpoelen commented 2 years ago

@ajacsherman to provide specific examples.

jhpoelen commented 2 years ago

see related https://github.com/globalbioticinteractions/nomer/issues/90 .

jhpoelen commented 2 years ago

@ajacsherman

With a first pass at Nomer's support for batnames (see https://github.com/globalbioticinteractions/nomer/issues/91), you can now align HMW with batnames in less than 2 seconds:

$ time cat hmw.json | grep Chiroptera | jq --raw-output .name | sed 's/^/\t/g' | nomer append --include-header batnames | mlr --itsvlite --ocsv cat > hmw-batbase.csv
[main] INFO org.globalbioticinteractions.nomer.match.TermMatcherRegistry - using matcher [batnames]

real    0m1.595s
user    0m4.082s
sys     0m0.228s

See attached hmw-batnames.csv
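A note on the `sed 's/^/\t/g'` step in the pipeline above: it prefixes each name with a tab, which suggests that the matcher reads tab-separated rows with an identifier in the first column and the name in the second, the identifier being left empty here. A minimal sketch of that input-shaping step on its own, without calling nomer; the helper name `to_nomer_input` is hypothetical, and awk is used instead of sed only so that the tab does not depend on GNU sed's `\t` handling:

```shell
# Hypothetical helper: turn one-name-per-line input into two-column TSV
# with an empty first (identifier) column, mirroring sed 's/^/\t/g' above.
to_nomer_input() {
  awk '{ printf "\t%s\n", $0 }'
}

# Example with two names taken from this thread; each output line
# starts with a tab, followed by the name.
printf 'Myotis keenii\nChironax tumulus\n' | to_nomer_input
```

Piping the result through `cut -f2` gives back the original names, which is how the unmatched names are extracted later in this thread.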

jhpoelen commented 2 years ago

There appear to be 131 names in HMW that are not included in batnames. Note, however, that the current batnames support does not include synonym matching, so some of these may be synonyms rather than names that are truly missing.

$ cat hmw-batnames.tsv | grep NONE | cut -f2

yields

Nycteris madagascariensis
Carollia brevicaudum
Anoura aequatoris
Anoura peruana
Lophostoma silvicola
Tonatia saurophila
Leptonyctenis curasoae
Mucronycteris schmidtorum
Lonchorhina fernandez
Artibeus gnomus
Platyriunus aquilus
Platyrrhinus chocoensis
Artibeus toltecus
Artibeus phaeotis
Artibeus watsoni
Artibeus rosenbergi
Artibeus glaucus
Artibeus bogotensis
Artibeus aztecus
Artibeus ravus
Artibeus cinereus
Artibeus anderseni
Chiroderma vizottoi
Platyrrhainus brachycephalus
Chironax tumulus
Rousettus celebensis
Lissonycteris angolensis
Doryrhina semoni
Doryrhina muscinus
Doryrhina stenotis
Doryrhina wollastoni
Doryrhina edwardshilli
Doryrhina corynophyllus
Hipposideros pratti
Anthops omatus
Emballonura seni
Emballonura beccarli
Tadarida latouchet
Mops trevor
Chaerephon jobensis
Chaerephon nigeriae
Chaerephon major
Chaerephon tomensis
Chaerephon pumilus
Chaerephon russatus
Chaerephon pusillus
Chaerephon atsinanana
Chaerephon jobimena
Chaerephon bregullae
Chaerephon plicatus
Chaerephon johorensis
Chaerephon solomonis
Chaerephon bemmeleni
Chaerephon ansorgei
Chaerephon aloysiisabaudiae
Chaerephon gallagheri
Chaerephon chapini
Chaerephon bivittatus
Chaerephon leucogaster
Natalus lanatus
Chilonatalus tumaidifrons
Myotis browni
Myotis peninsularis
Myotis keenii
Myotis melanorhinus
Murina huttonii
Kerivoula malpasi
Kerivoula crypta
Plecotus christii
Rhogeessa gracilis
Rhogeessa alleni
Histiotus diaphanopterus
Histiotus humboldti
Histiotus velatus
Histiotus magellanicus
Histiotus laephotis
Histiotus alienus
Histiotus montanus
Histiotus macrotus
Glauconycteris poensts
Neoromicia helios
Neoromicia grandidieri
Neoromicia roseveari
Neoromicia isabella
Neoromicia rendalli
Neoromicia brunnea
Nycticeinops schlieffenii
Hypsugo bemainty
Hypsugo crassulus
Hypsugo anchieta
Neoromicia nanus
Neoromicia stanleyi
Neoromicia robertsi
Neoromicia matroka
Neoromicia capensis
Neoromicia malagasyensis
Hypsugo joffrei
Nyctophilus macrotis
Pipustrellus adamsi
Pipustrellus wattsi
Pipustrellus papuanus
Pipustrellus abramus
Puipustrellus raceyi
Pipustrellus aero
Pipustrellus rusticus
Pipustrellus hesperidus
Pipustrellus paterculus
Pipustrellus minahassae
Pipustrellus ceylonicus
Pipustrellus javanicus
Pipustrellus endoi
Pipustrellus sturdeei
Pipustrellus hanaki
Pipustrellus pipistrellus
Pipustrellus pygmaeus
Rhinolophus deckend
Rhinolophus alticolus
Rhinolophus gorongosae
Rhinolophus odami
Rhinolophus comutus
Rhinolophus achilles
Rhinolophus kahuzi
Rhinolophusarcuatus
Rhinolophus cognotus
Rhinolophus monticolus
Rhinolophus perniger
Miniopterus eschscholtzii
Miniopterus blepotis
Muniopterus mossambicus
Miniopterus arenarius
Genetta genetta
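To triage a list like this, it may help to tally the unmatched names by genus, so that clusters such as the Chaerephon or Pipustrellus entries stand out. The sketch below assumes the column layout implied by the `cut -f2` and `grep NONE` usage above (name in column 2, match relation in column 3); the helper name `unmatched_by_genus` is hypothetical, the sample rows are made up for illustration, and the non-NONE relation value in the sample is only a placeholder. Unlike `grep NONE`, it checks the relation column only.

```shell
# Hypothetical helper: count unmatched names per genus.
# Reads tab-separated rows on stdin; assumes column 2 = name and
# column 3 = match relation (NONE when nothing matched).
unmatched_by_genus() {
  awk -F '\t' '$3 == "NONE" { split($2, parts, " "); print parts[1] }' \
    | sort | uniq -c | sort -rn
}

# Made-up sample rows in the assumed layout (empty id, name, relation, ...).
{
  printf '\tChaerephon major\tNONE\t\tChaerephon major\n'
  printf '\tChaerephon pumilus\tNONE\t\tChaerephon pumilus\n'
  printf '\tMyotis keenii\tNONE\t\tMyotis keenii\n'
  printf '\tMyotis myotis\tHAS_ACCEPTED_NAME\t\tMyotis myotis\n'  # illustrative matched row
} | unmatched_by_genus
# Lists Chaerephon with a count of 2, then Myotis with a count of 1.
```

With the real file, something like `unmatched_by_genus < hmw-batnames.tsv` would be the equivalent call, assuming the file has that layout.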

@ajacsherman are these results expected?

jhpoelen commented 2 years ago

With this, you can also do alignments with other name lists, like @n8upham's Mammal Diversity Database (MDD, https://mammaldiversity.org). @n8upham, please note that you can now also make quick comparisons between MDD and batnames:

$ curl "https://raw.githubusercontent.com/mammaldiversity/mammaldiversity.github.io/master/_data/mdd.csv" | mlr --csv filter '$order == "CHIROPTERA"' | mlr --csv cut -f sciName | sed 's/_/ /g' | sed 's/^/\t/g' | nomer append batnames | grep NONE | head 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 5670k  100 5670k    0     0  17.8M      0 --:--:-- --:--:-- --:--:-- 17.8M
[main] INFO org.globalbioticinteractions.nomer.match.TermMatcherRegistry - using matcher [batnames]
    sciName                     NONE        sciName
    Chironax tumulus            NONE        Chironax tumulus
    Lissonycteris angolensis    NONE        Lissonycteris angolensis
    Coelops hirsutus            NONE        Coelops hirsutus
    Doryrhina corynophyllus     NONE        Doryrhina corynophyllus
    Doryrhina edwardshilli      NONE        Doryrhina edwardshilli
    Doryrhina muscinus          NONE        Doryrhina muscinus
    Doryrhina semoni            NONE        Doryrhina semoni
    Doryrhina stenotis          NONE        Doryrhina stenotis
    Doryrhina wollastoni        NONE        Doryrhina wollastoni

According to the count below, 61 Chiroptera names defined in MDD are not accepted in batnames:

$ curl "https://raw.githubusercontent.com/mammaldiversity/mammaldiversity.github.io/master/_data/mdd.csv" | mlr --csv filter '$order == "CHIROPTERA"' | mlr --csv cut -f sciName | sed 's/_/ /g' | sed 's/^/\t/g' | tail -n+2 | nomer append batnames | grep NONE | wc -l
61
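Since several names (for example Chironax tumulus and the Doryrhina species) show up as unmatched for both HMW and MDD, a quick way to separate shared gaps from source-specific ones could be to intersect the two unmatched lists. This is a minimal sketch, assuming each list has been saved one name per line; the helper name `unmatched_in_both` is hypothetical, and the sample files hold only short excerpts from this thread, so the result is illustrative.

```shell
# Hypothetical helper: print names present in both files (one name per line).
# comm needs sorted input, so both files are sorted and de-duplicated first.
unmatched_in_both() {
  a_sorted="$(mktemp)"
  b_sorted="$(mktemp)"
  sort -u "$1" > "$a_sorted"
  sort -u "$2" > "$b_sorted"
  comm -12 "$a_sorted" "$b_sorted"   # -12 keeps only lines common to both
  rm -f "$a_sorted" "$b_sorted"
}

# Short excerpts of the two unmatched lists shown in this thread.
hmw_none="$(mktemp)"
mdd_none="$(mktemp)"
printf 'Myotis keenii\nDoryrhina semoni\nChironax tumulus\n' > "$hmw_none"
printf 'Chironax tumulus\nCoelops hirsutus\nDoryrhina semoni\n' > "$mdd_none"

unmatched_in_both "$hmw_none" "$mdd_none"
rm -f "$hmw_none" "$mdd_none"
```

Names left over on only one side (`comm -23` or `comm -13` on the same sorted inputs) would then be candidates for source-specific spelling or taxonomy differences.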