globalbioticinteractions / nomer

maps identifiers and names to other identifiers and names
GNU General Public License v3.0

Repackaged Taxonomic Backbone of Global Biodiversity Information Facility (GBIF) #86

Closed zedomel closed 8 months ago

zedomel commented 2 years ago

Hi @jhpoelen

I'm trying to create a new version of the GBIF backbone to use in nomer. I downloaded the script from the repository https://doi.org/10.15468/39omei and fixed some minor errors:

I ran the script and it produced the expected files, which I put in a new repository: https://zenodo.org/record/6707049.

Then I updated the nomer.properties file:

...
nomer.gbif.ids=gz:https://zenodo.org/record/6707049/files/gbif-backbone-by-id.tsv.gz!/gbif-backbone-by-id.tsv
nomer.gbif.names=gz:https://zenodo.org/record/6707049/files/gbif-backbone-by-name.tsv.gz!/gbif-backbone-by-name.tsv

But when I run: echo -e "\tDunderbergia granulosa" | nomer append gbif -p nomer.properties

The following error is produced:

[main] INFO org.globalbioticinteractions.nomer.match.TermMatcherRegistry - using matcher [gbif-taxon]
[main] INFO org.globalbioticinteractions.nomer.match.GBIFTaxonService - indexing GBIF taxonomy ids...
[main] INFO org.globalbioticinteractions.nomer.match.ResourceServiceContentBased - using local Preston data dir: [/home/jose_asalim_gmail_com/test/./.nomer/data]
[main] INFO org.globalbioticinteractions.nomer.match.ResourceServiceContentBased - caching [gz:https://zenodo.org/record/6707049/files/gbif-backbone-by-id.tsv.gz!/gbif-backbone-by-id.tsv] at [/home/jose_asalim_gmail_com/test/./.nomer/tmp/nomer109046510492550888.gz]...
java.io.IOException: problem retrieving [gz:https://zenodo.org/record/6707049/files/gbif-backbone-by-id.tsv.gz!/gbif-backbone-by-id.tsv]
    at bio.guoda.preston.cmd.CmdGet.handleContentQuery(CmdGet.java:76)
    at bio.guoda.preston.cmd.CmdGet.run(CmdGet.java:52)
    at bio.guoda.preston.cmd.CmdGet.run(CmdGet.java:39)
    at bio.guoda.preston.cmd.CmdGet.run(CmdGet.java:35)
    at org.globalbioticinteractions.nomer.match.ResourceServiceContentBased.retrieve(ResourceServiceContentBased.java:83)
    at org.globalbioticinteractions.nomer.match.ResourceServiceFactoryImpl$1.retrieve(ResourceServiceFactoryImpl.java:40)
    at org.globalbioticinteractions.nomer.match.TermMatcherContextCaching.retrieve(TermMatcherContextCaching.java:16)
    at org.globalbioticinteractions.nomer.match.GBIFTaxonService$3.hasNext(GBIFTaxonService.java:349)
    at org.mapdb.DB.createTreeMap(DB.java:872)
    at org.mapdb.DB$BTreeMapMaker.make(DB.java:661)
    at org.globalbioticinteractions.nomer.match.GBIFTaxonService.indexIds(GBIFTaxonService.java:379)
    at org.globalbioticinteractions.nomer.match.GBIFTaxonService.indexIfNeeded(GBIFTaxonService.java:199)
    at org.globalbioticinteractions.nomer.match.GBIFTaxonService.lazyInit(GBIFTaxonService.java:176)
    at org.globalbioticinteractions.nomer.match.GBIFTaxonService.lazyInitIfNeeded(GBIFTaxonService.java:161)
    at org.globalbioticinteractions.nomer.match.GBIFTaxonService.match(GBIFTaxonService.java:62)
    at org.eol.globi.service.TermMatcherHierarchical.match(TermMatcherHierarchical.java:57)
    at org.globalbioticinteractions.nomer.util.AppendingRowHandler.onRow(AppendingRowHandler.java:35)
    at org.globalbioticinteractions.nomer.match.MatchUtil.apply(MatchUtil.java:87)
    at org.globalbioticinteractions.nomer.match.MatchUtil.match(MatchUtil.java:39)
    at org.globalbioticinteractions.nomer.cmd.CmdAppend.run(CmdAppend.java:20)
    at picocli.CommandLine.executeUserObject(CommandLine.java:1939)
    at picocli.CommandLine.access$1300(CommandLine.java:145)
    at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2358)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2352)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2314)
    at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
    at picocli.CommandLine$RunLast.execute(CommandLine.java:2316)
    at picocli.CommandLine.execute(CommandLine.java:2078)
    at org.globalbioticinteractions.nomer.Nomer.run(Nomer.java:57)
    at org.globalbioticinteractions.nomer.Nomer.main(Nomer.java:46)
Caused by: java.io.IOException: [gz:https://zenodo.org/record/6707049/files/gbif-backbone-by-id.tsv.gz!/gbif-backbone-by-id.tsv] not found.
    at bio.guoda.preston.cmd.CmdGet.handleContentQuery(CmdGet.java:72)
    ... 29 more
java.lang.RuntimeException: java.io.IOException: problem retrieving [gz:https://zenodo.org/record/6707049/files/gbif-backbone-by-id.tsv.gz!/gbif-backbone-by-id.tsv]
    at bio.guoda.preston.cmd.CmdGet.run(CmdGet.java:57)
    at bio.guoda.preston.cmd.CmdGet.run(CmdGet.java:39)
    at bio.guoda.preston.cmd.CmdGet.run(CmdGet.java:35)
    at org.globalbioticinteractions.nomer.match.ResourceServiceContentBased.retrieve(ResourceServiceContentBased.java:83)
    at org.globalbioticinteractions.nomer.match.ResourceServiceFactoryImpl$1.retrieve(ResourceServiceFactoryImpl.java:40)
    at org.globalbioticinteractions.nomer.match.TermMatcherContextCaching.retrieve(TermMatcherContextCaching.java:16)
    at org.globalbioticinteractions.nomer.match.GBIFTaxonService$3.hasNext(GBIFTaxonService.java:349)
    at org.mapdb.DB.createTreeMap(DB.java:872)
    at org.mapdb.DB$BTreeMapMaker.make(DB.java:661)
    at org.globalbioticinteractions.nomer.match.GBIFTaxonService.indexIds(GBIFTaxonService.java:379)
    at org.globalbioticinteractions.nomer.match.GBIFTaxonService.indexIfNeeded(GBIFTaxonService.java:199)
    at org.globalbioticinteractions.nomer.match.GBIFTaxonService.lazyInit(GBIFTaxonService.java:176)
    at org.globalbioticinteractions.nomer.match.GBIFTaxonService.lazyInitIfNeeded(GBIFTaxonService.java:161)
    at org.globalbioticinteractions.nomer.match.GBIFTaxonService.match(GBIFTaxonService.java:62)
    at org.eol.globi.service.TermMatcherHierarchical.match(TermMatcherHierarchical.java:57)
    at org.globalbioticinteractions.nomer.util.AppendingRowHandler.onRow(AppendingRowHandler.java:35)
    at org.globalbioticinteractions.nomer.match.MatchUtil.apply(MatchUtil.java:87)
    at org.globalbioticinteractions.nomer.match.MatchUtil.match(MatchUtil.java:39)
    at org.globalbioticinteractions.nomer.cmd.CmdAppend.run(CmdAppend.java:20)
    at picocli.CommandLine.executeUserObject(CommandLine.java:1939)
    at picocli.CommandLine.access$1300(CommandLine.java:145)
    at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2358)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2352)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2314)
    at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
    at picocli.CommandLine$RunLast.execute(CommandLine.java:2316)
    at picocli.CommandLine.execute(CommandLine.java:2078)
    at org.globalbioticinteractions.nomer.Nomer.run(Nomer.java:57)
    at org.globalbioticinteractions.nomer.Nomer.main(Nomer.java:46)
Caused by: java.io.IOException: problem retrieving [gz:https://zenodo.org/record/6707049/files/gbif-backbone-by-id.tsv.gz!/gbif-backbone-by-id.tsv]
    at bio.guoda.preston.cmd.CmdGet.handleContentQuery(CmdGet.java:76)
    at bio.guoda.preston.cmd.CmdGet.run(CmdGet.java:52)
    ... 28 more
Caused by: java.io.IOException: [gz:https://zenodo.org/record/6707049/files/gbif-backbone-by-id.tsv.gz!/gbif-backbone-by-id.tsv] not found.
    at bio.guoda.preston.cmd.CmdGet.handleContentQuery(CmdGet.java:72)
    ... 29 more

It looks like the file gz:https://zenodo.org/record/6707049/files/gbif-backbone-by-id.tsv.gz!/gbif-backbone-by-id.tsv was not found, but it exists and I can download it using wget.

How can I change the nomer configuration to use this new version of the GBIF backbone?

thanks josé.

jhpoelen commented 2 years ago

Hey José @zedomel - Thanks for your detailed message.

By default, Nomer uses a versioned copy of taxonomic resources as captured by Preston in Nomer's Corpus of Taxonomic Resources. So, instead of using the (dynamic and often changing) internet, Nomer relies on a well-defined, versioned slice of it. Your newer copy of GBIF's backbone is not defined in that slice (see e.g. the list of tracked URLs/aliases in https://zenodo.org/record/6473194 ).

To disable Nomer's reliance on its versioned corpus, you can blank out the preston properties by changing -

$ nomer properties | grep preston
nomer.preston.dir=
nomer.preston.remotes=https://zenodo.org/record/6473194/files
nomer.preston.version=hash://sha256/d58ab1acf350f056a75bde7f4175d14c5e4dfaf0bf20e2eedbb2fb585bdf0822

to

$ nomer properties | grep preston
nomer.preston.dir=
nomer.preston.remotes=
nomer.preston.version=

After that reconfiguration, you'd sample the internet directly, and Nomer should download directly from the internet location.
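
A minimal sketch of that reconfiguration, assuming the overrides are kept in a properties file passed with -p, as in the command from the original report. The file name append.properties is only an illustration; the property keys and URLs are the ones quoted in this thread. Whether this alone is enough may depend on the Nomer version.

```shell
#!/usr/bin/env sh
# Write a properties file that points at the repackaged backbone and
# blanks out the Preston settings, so the URLs are fetched directly.
# The file name is illustrative; keys and URLs come from this thread.
props="${TMPDIR:-/tmp}/append.properties"

cat > "$props" <<'EOF'
nomer.gbif.ids=gz:https://zenodo.org/record/6707049/files/gbif-backbone-by-id.tsv.gz!/gbif-backbone-by-id.tsv
nomer.gbif.names=gz:https://zenodo.org/record/6707049/files/gbif-backbone-by-name.tsv.gz!/gbif-backbone-by-name.tsv
nomer.preston.dir=
nomer.preston.remotes=
nomer.preston.version=
EOF

# Only try the lookup when nomer is actually installed.
if command -v nomer >/dev/null 2>&1; then
  printf '\tDunderbergia granulosa\n' | nomer append gbif -p "$props"
fi
```

Keeping the overrides in a separate file leaves the built-in defaults untouched, so dropping the -p option returns Nomer to its versioned corpus.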

Apologies for the confusion.

jhpoelen commented 2 years ago

@zedomel alternatively, we could build a new version of Nomer's Corpus of Taxonomic Resources that points to your more recent copy of the GBIF backbone in addition to updating the default Nomer property config.

Perhaps easier?

Let me know.

zedomel commented 2 years ago

Thanks for the quick answer @jhpoelen.

Building a new version sounds interesting, but for now I will blank out the preston properties. I'm testing whether a new version of the GBIF taxonomy provides more matches than the one currently used in nomer.

thanks.

jhpoelen commented 2 years ago

@zedomel sounds good! Curious to hear the outcome, and eager to include your work in a future version of Nomer's Corpus of Taxonomic Resources.

jhpoelen commented 2 years ago

@zedomel I just updated Nomer's defaults to point to your recent repackaged GBIF backbone taxonomy. Hoping to include it in the next release of Nomer's Taxonomic Corpus.

jhpoelen commented 2 years ago

@zedomel

Salim, JA. (2022). A Repackaged Taxonomic Backbone of Global Biodiversity Information Facility (GBIF) - 2021-11-26 (0.1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6707049

has been included in:

Poelen, Jorrit H. (2022). Nomer Corpus of Taxonomic Resources hash://sha256/f4e2b9806440d0605f60b81feb9782655291aac2d000c74e4e8fdeb937e29b1d (0.6) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.7065661

zedomel commented 8 months ago

@jhpoelen Can you update nomer to use a new version of GBIF Backbone: https://zenodo.org/doi/10.5281/zenodo.10810437.

I'm using 0.5.6 and, after blanking out the preston options:

nomer.gbif.ids=gz:https://zenodo.org/record/10810438/files/gbif-backbone-by-id.tsv.gz!/gbif-backbone-by-id.tsv
nomer.gbif.names=gz:https://zenodo.org/record/10810438/files/gbif-backbone-by-name.tsv.gz!/gbif-backbone-by-name.tsv
nomer.preston.dir=
nomer.preston.remotes=
nomer.preston.version=

and executing: echo -e "\tAchnanthes hauckiana" | nomer append gbif -p /tmp/append.properties

the following exception is returned:

[main] INFO org.globalbioticinteractions.nomer.match.GBIFTaxonService - [GBIF] indexing taxonomy...
[main] INFO org.globalbioticinteractions.nomer.match.GBIFTaxonService - [GBIF] indexing ids...
[main] INFO org.globalbioticinteractions.nomer.match.ResourceServiceReadOnly - using cached [gz:https://zenodo.org/record/10810438/files/gbif-backbone-by-id.tsv.gz!/gbif-backbone-by-id.tsv] at [/home/jose/.cache/nomer/f12008cf88c998ee4b765068a8a4c98b1601a0276f3d91f39087f75dbcd7f54b.gz]
java.lang.IllegalArgumentException: Name already used: nodes
    at org.mapdb.DB.checkNameNotExists(DB.java:1592)
    at org.mapdb.DB.createTreeMap(DB.java:834)
    at org.mapdb.DB$BTreeMapMaker.make(DB.java:661)
    at org.globalbioticinteractions.nomer.match.GBIFTaxonService.buildTaxonIndex(GBIFTaxonService.java:221)
    at org.globalbioticinteractions.nomer.match.GBIFTaxonService.lazyInit(GBIFTaxonService.java:82)
    at org.globalbioticinteractions.nomer.match.CommonTaxonService.checkInit(CommonTaxonService.java:369)
    at org.globalbioticinteractions.nomer.match.CommonTaxonService.enrichNameMatches(CommonTaxonService.java:307)
    at org.globalbioticinteractions.nomer.match.CommonTaxonService.match(CommonTaxonService.java:100)
    at org.eol.globi.service.TermMatcherHierarchical.match(TermMatcherHierarchical.java:57)
    at org.globalbioticinteractions.nomer.util.AppendingRowHandler.onRow(AppendingRowHandler.java:42)
    at org.globalbioticinteractions.nomer.match.MatchUtil.apply(MatchUtil.java:85)
    at org.globalbioticinteractions.nomer.match.MatchUtil.match(MatchUtil.java:37)
    at org.globalbioticinteractions.nomer.cmd.CmdAppend.run(CmdAppend.java:20)
    at picocli.CommandLine.executeUserObject(CommandLine.java:1939)
    at picocli.CommandLine.access$1300(CommandLine.java:145)
    at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2358)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2352)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2314)
    at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
    at picocli.CommandLine$RunLast.execute(CommandLine.java:2316)
    at picocli.CommandLine.execute(CommandLine.java:2078)
    at org.globalbioticinteractions.nomer.Nomer.run(Nomer.java:57)
    at org.globalbioticinteractions.nomer.Nomer.main(Nomer.java:46)

Am I doing something wrong?

thanks.

jhpoelen commented 8 months ago

You did nothing wrong . . . however, the zenodo folks have changed their url syntax, so you'd have to update the endpoints accordingly.
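
One possible reading of this, offered as an assumption rather than something confirmed in this thread: Zenodo now serves records under /records/<id> instead of /record/<id>. Under that assumption, updating the endpoints in a properties file could be sketched as follows; the file name is hypothetical and the URLs are the ones from the report above.

```shell
#!/usr/bin/env sh
# ASSUMPTION: the URL change is the move from /record/<id> to /records/<id>.
# Demo file; in practice this would be the file passed to nomer via -p.
props="${TMPDIR:-/tmp}/zenodo-endpoints.properties"

cat > "$props" <<'EOF'
nomer.gbif.ids=gz:https://zenodo.org/record/10810438/files/gbif-backbone-by-id.tsv.gz!/gbif-backbone-by-id.tsv
nomer.gbif.names=gz:https://zenodo.org/record/10810438/files/gbif-backbone-by-name.tsv.gz!/gbif-backbone-by-name.tsv
EOF

# Rewrite only the path segment right after the host. A temp file is used
# because in-place sed flags differ between implementations.
sed 's#zenodo\.org/record/#zenodo.org/records/#g' "$props" > "$props.new" &&
  mv "$props.new" "$props"

cat "$props"
```

The pattern includes the trailing slash, so URLs that already use /records/ are left alone and the rewrite can be run more than once.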

jhpoelen commented 8 months ago

actually . . . did you run a nomer clean first? Also, what version are you using?

zedomel commented 8 months ago

Yes I did nomer clean

nomer version
0.5.6

jhpoelen commented 8 months ago

what version of nomer are you using?

jhpoelen commented 8 months ago

0.5.6 right?

jhpoelen commented 8 months ago

Ok, I'll try and reproduce. Just a minute.

zedomel commented 8 months ago

I'm going home right now. When I reach there, I will try again...

thanks

jhpoelen commented 8 months ago

Ok, am working on it.

jhpoelen commented 8 months ago

I've created a new Nomer v0.5.7 with your updated GBIF backbone taxonomy.

Please confirm that you can now use the packaged GBIF version ok by closing the issue.

jhpoelen commented 8 months ago

Would you be interested in learning how to build your own Nomer corpus and make Nomer releases? I think it may be wise to spread the work a little.

zedomel commented 8 months ago

Thank you @jhpoelen . I will test it and let you know. The gbif catalogue was built using your code at: https://github.com/jhpoelen/repackage-gbif-backbone

thanks

zedomel commented 8 months ago

@jhpoelen

There is something wrong when installing the new version of nomer:

sudo sh -c '(curl -L https://github.com/globalbioticinteractions/nomer/releases/download/0.5.7/nomer.jar) > /usr/local/bin/nomer && chmod +x /usr/local/bin/nomer && nomer install-manpage' && nomer clean && nomer version

 % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 88.1M  100 88.1M    0     0  61.6M      0  0:00:01  0:00:01 --:--:-- 74.6M
sh: 1: nomer: Exec format error

The executable /usr/local/bin/nomer is missing the header:

#!/usr/bin/env sh
#
@ 2>/dev/null # 2>nul & echo off & goto BOF
:
exec java -Xmx4G -XX:+UseG1GC $JAVA_OPTS -cp "$0" org.globalbioticinteractions.nomer.Nomer "$@"
exit

:BOF
@echo off
java -Xmx4G -XX:+UseG1GC %JAVA_OPTS% -cp "%~dpnx0" org.globalbioticinteractions.nomer.Nomer %*
exit /B %errorlevel%

Why?

jhpoelen commented 8 months ago

@zedomel thanks for your message. Apologies for the nomer.jar . . . I omitted to prepend the .travis.jar.magic file to the nomer.jar using

cat .travis.jar.magic nomer/target/nomer-0.5.7-jar-with-dependencies.jar > nomer.jar

I've updated the artifact - please try again.
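
For context, a jar is a zip archive, and zip readers locate the archive index from the end of the file, so a shell launcher can be prepended without breaking it for java. Without the launcher, the file starts with the zip signature "PK" and sh cannot execute it, which matches the "Exec format error" above. A minimal sketch of a pre-install check follows; the function name and the stand-in files are illustrative, not part of Nomer.

```shell
#!/usr/bin/env sh
# Returns 0 when the file starts with "#!", i.e. a launcher script was
# prepended to the jar; returns 1 for a bare jar (which starts with "PK").
has_launcher_header() {
  [ "$(head -c 2 "$1")" = '#!' ]
}

# Demo on stand-in files, not a real nomer.jar.
demo_dir="$(mktemp -d)"
printf 'PK\003\004fake-zip-bytes' > "$demo_dir/bare.jar"
printf '#!/usr/bin/env sh\nexit\n' > "$demo_dir/launcher.txt"
cat "$demo_dir/launcher.txt" "$demo_dir/bare.jar" > "$demo_dir/nomer.jar"

for f in bare.jar nomer.jar; do
  if has_launcher_header "$demo_dir/$f"; then
    echo "$f: launcher header present"
  else
    echo "$f: launcher header missing"
  fi
done
```

Running a check like this on the downloaded file before chmod +x would catch a release artifact that was uploaded without the header.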

Hmm. Perhaps a good reason to automate this distribution process. . . . what do you think?

jhpoelen commented 8 months ago

@zedomel also, please holler if you were able to use the updated nomer.jar v0.5.7 with the upgraded GBIF backbone you published.

Meanwhile, I'm figuring out how to allow for overrides - like the one you tried by blanking out the preston configuration.

Thanks for being patient.

zedomel commented 8 months ago

@jhpoelen it worked. Thank you.