globalbioticinteractions / nomer

maps identifiers and names to other identifiers and names
GNU General Public License v3.0

How can I use nomer to translate verbatim names to normalized names? #84

Closed zedomel closed 2 years ago

zedomel commented 2 years ago

Hi @jhpoelen

I have a two-column file with original/verbatim names and normalized names, and I would like to use nomer to map the verbatim names to the normalized names.

 pseudosericea × Potentilla jepsonii    Potentilla jepsonii
" Acourtia thurberi Acourtia thurberi
" Fraxinus velutina Fraxinus velutina
" Mandeville brachysiphon   Mandeville brachysiphon
" Nolina microcarpa Nolina microcarpa
" Rhus aromatica    Rhus aromatica
" Sarcomphalus obtusifolius Sarcomphalus obtusifolius
" Schizachyrium sp. Schizachyrium
" and (probably planted) Enterolobium cyclocarpum   Enterolobium cyclocarpum
"(occasional) Hydrodiction utriculatum" Hydrodiction utriculatum

I have used grep for that, but it is very slow. You talked about using the translate-names matcher, and I'm wondering if it can be extended to provide the whole classification (taxonomic ranks). For example, if I have this file:

Hilaria j,Hilaria,Plantae|Tracheophyta|Liliopsida|Poales|Poaceae|Hilaria,kingdom|phylum|class|order|family|genus

where the first column is the verbatim name, how can I use translate-names to get all the data from the second column onward when a match is found?

The solution that I found was to provide a two-column mapping file for nomer.taxon.name.correction.url, with the corrected name and the full hierarchy packed into the second column, separated by a dummy delimiter (e.g. #). After running nomer replace translate-names, I replaced this dummy delimiter with an actual field delimiter (\t):

nomer.taxon.name.correction.url file:

Hilaria jamesii,Hilaria jamesii#Plantae|Tracheophyta|Liliopsida|Poales|Poaceae|Hilaria|Hilaria jamesii#kingdom|phylum|class|order|family|genus|species

command: cat names.tsv | nomer replace translate-names | sed 's/#/\t/g' > names-translated.csv
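
The packing and unpacking steps of this workaround can be sketched as two small shell helpers. This is only a minimal sketch, assuming a comma-separated, four-column input like the Hilaria example (verbatim name, normalized name, path, ranks) and assuming that # never occurs in the data. The function names pack_hierarchy and unpack_hierarchy are hypothetical and are not part of nomer.

```shell
# Hypothetical helper (not part of nomer): turn rows of the form
#   verbatim,name,path,ranks
# into the two-column form used as a workaround in this thread:
#   verbatim,name#path#ranks
pack_hierarchy() {
  awk -F',' 'NF >= 4 { print $1 "," $2 "#" $3 "#" $4 }'
}

# Hypothetical helper: after translation, turn every dummy "#" delimiter
# back into a tab so name, path and ranks become separate columns.
unpack_hierarchy() {
  awk '{ gsub(/#/, "\t"); print }'
}
```

Under these assumptions, pack_hierarchy would prepare the mapping file, and unpack_hierarchy would take the place of the sed step at the end of the pipeline. Rows with fewer than four comma-separated fields are skipped by the sketch rather than passed through.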

thanks.

jhpoelen commented 2 years ago

hey @zedomel -

you can use the translate-names matcher as you suggested.

if you'd like to include a taxonomic hierarchy in the translation, you might benefit from using the globi matcher after pointing the nomer properties for that matcher to your own taxonMap and taxonCache.

For instance, you can say:

echo -e "\tDonald duckus" | nomer append --properties my.properties globi

with my.properties containing something like:

nomer.term.cache.url=https://zenodo.org/record/6394935/files/taxonCacheFirst10.tsv
nomer.term.map.url=https://zenodo.org/record/6394935/files/taxonMapFirst10.tsv

the taxonMap makes a naive map of provided id/name -> resolved id/name

and the taxonCache includes additional information for the resolved id/names.

for the schema, see the provided example files.
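
As a rough illustration of the setup described above, the properties file could be generated with a small shell helper. This is a minimal sketch: the function name write_nomer_properties is hypothetical and not part of nomer, and only the two property names and the example URLs come from this thread.

```shell
# Hypothetical helper (not part of nomer): print a properties file that sets
# the two property names mentioned in this thread.
#   $1 = URL of the taxonCache file
#   $2 = URL of the taxonMap file
write_nomer_properties() {
  printf 'nomer.term.cache.url=%s\n' "$1"
  printf 'nomer.term.map.url=%s\n' "$2"
}

# Write my.properties using the example files referenced above.
write_nomer_properties \
  "https://zenodo.org/record/6394935/files/taxonCacheFirst10.tsv" \
  "https://zenodo.org/record/6394935/files/taxonMapFirst10.tsv" \
  > my.properties
```

The resulting my.properties can then be passed to the nomer append command shown earlier via --properties. To use your own data, substitute URLs pointing at your own taxonCache and taxonMap files, which are assumed to follow the schema of the example files.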

Let me know if you need more help to get started, or whether you have any suggestions.

jhpoelen commented 2 years ago

@zedomel I am assuming I answered your question on how to translate verbatim names to normalized names using Nomer.

If not, please comment and share your thoughts on how to better support the name translation.