globalbioticinteractions / nomer

maps identifiers and names to other identifiers and names
GNU General Public License v3.0

add support for http://hesperomys.com/ #144

Open jhpoelen opened 1 year ago

jhpoelen commented 1 year ago

suggest to add support for http://hesperomys.com/ to Nomer

related to discussions in https://github.com/mammaldiversity/mammaldiversity.github.io/issues/22 and https://github.com/mammaldiversity/mammaldiversity.github.io/issues/23 involving @JelleZijlstra and @n8upham

JelleZijlstra commented 1 year ago

That would be great! Let me know if you need any help.

Hesperomys makes a distinction between taxa and names. Taxon URLs are of the form http://hesperomys.com/t/32496, with a redirect from the currently valid name (http://hesperomys.com/t/Agathaeromys). Name URLs use /n/, like http://hesperomys.com/n/59009, with a redirect for the original name (e.g. https://hesperomys.com/n/Agathaeromys).
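
For illustration, the URL scheme described above can be sketched in Python. This is a minimal sketch, not part of Hesperomys or Nomer: the helper names `taxon_url`, `name_url`, and `parse_hesperomys_url` are hypothetical, and only the `/t/` and `/n/` path prefixes come from the description above.

```python
from urllib.parse import urlparse

BASE_URL = "http://hesperomys.com"

# Path prefix -> kind of record, as described in the comment above.
KINDS = {"t": "taxon", "n": "name"}


def taxon_url(taxon_id_or_name):
    """Build a taxon URL from a numeric id or a currently valid name."""
    return f"{BASE_URL}/t/{taxon_id_or_name}"


def name_url(name_id_or_name):
    """Build a name URL from a numeric id or an original name."""
    return f"{BASE_URL}/n/{name_id_or_name}"


def parse_hesperomys_url(url):
    """Return (kind, identifier) for a Hesperomys URL, or None if it does not match."""
    parsed = urlparse(url)
    if parsed.netloc != "hesperomys.com":
        return None
    parts = parsed.path.strip("/").split("/")
    if len(parts) != 2 or parts[0] not in KINDS:
        return None
    return KINDS[parts[0]], parts[1]


print(taxon_url(32496))
print(parse_hesperomys_url("https://hesperomys.com/n/Agathaeromys"))
```

Note that the parser does not follow redirects, so a name-based URL is returned with the name as its identifier rather than the numeric id.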

jhpoelen commented 1 year ago

@JelleZijlstra glad to hear that you are excited and eager to help.

I do have a question:

Would you happen to publish your database as a whole dataset?

I am trying to figure out how to access and index your databases for fast (offline) access via Nomer.

JelleZijlstra commented 1 year ago

I currently don't. The database is not in a practical format (a 194 MB sqlite database) and some of the internal data I'd rather not publish. However, I can generate files in some more usable format (e.g., a CSV with all names or all taxa). One of my learnings from talking to @n8upham was that versioning is important, so I am now putting together a plan to introduce a versioning scheme where every time I update the public website I increment the version and save a copy of the database. I could then also generate a data file in CSV format and publish it.

What kind of format would be useful for you?

jhpoelen commented 1 year ago

Thanks for your prompt reply.

> What kind of format would be useful for you?

Any digital format that is easy for you to generate.

jhpoelen commented 1 year ago

a SQLite data dump would do just fine :smile:

jhpoelen commented 1 year ago

and you might make biologists happy by producing tabular text files like CSV, ideally in denormalized form, so that no joins are needed.
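
To make "denormalized, no joins needed" concrete, here is a minimal Python sketch that performs the join once at export time and writes one self-contained row per name. The table and column names (`taxon`, `name`, `status`, `taxon_id`, and so on) are assumptions for illustration only; the actual Hesperomys schema is not described in this thread. The sample row reuses the platypus ids mentioned elsewhere in this discussion.

```python
import csv
import io
import sqlite3

# Hypothetical schema: the real Hesperomys tables are not described in this thread.
SCHEMA = """
CREATE TABLE taxon (id INTEGER PRIMARY KEY, valid_name TEXT);
CREATE TABLE name (
    id INTEGER PRIMARY KEY,
    original_name TEXT,
    status TEXT,
    taxon_id INTEGER REFERENCES taxon(id)
);
"""


def export_denormalized(conn, out):
    """Write one CSV row per name, with its taxon already joined in."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["name_id", "original_name", "status", "taxon_id", "taxon_name"])
    rows = conn.execute(
        """
        SELECT n.id, n.original_name, n.status, t.id, t.valid_name
        FROM name AS n
        JOIN taxon AS t ON t.id = n.taxon_id
        ORDER BY n.id
        """
    )
    writer.writerows(rows)


# Build a tiny in-memory database with a single illustrative record.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO taxon VALUES (1958, 'Ornithorhynchus anatinus')")
conn.execute("INSERT INTO name VALUES (2756, 'Platypus Anatinus', 'valid', 1958)")

buffer = io.StringIO()
export_denormalized(conn, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Each row repeats the taxon fields, which costs some file size but lets a spreadsheet user read a name relation without consulting a second table.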

JelleZijlstra commented 1 year ago

> and you might make biologists happy by producing tabular text files like CSV, ideally in denormalized form, so that no joins are needed.

I wrote a quick export format for the MDD people already, if you email me (email is on my profile) I can send it to you too as a sample. It's not the whole database, but would give a sense of what the data would look like. I can adjust the export script to add additional information and then publish those regularly.

jhpoelen commented 1 year ago

> I wrote a quick export format for the MDD people already,

Great!

Your proposed sample would help me get started with integration, especially if you share the example publicly or are OK with it being made public.

jhpoelen commented 1 year ago

I've temporarily added your snapshot of mammalia.csv to https://github.com/jhpoelen/hesperomys/ to help prototype an integration with your dataset. Happy to make changes if needed. I attempted to credit your work, but I am sure more can be done.

Also, I have another question.

Is there a way to infer the linked taxon and the type of name-taxon relation from mammals.csv?

I was able to find the id for the name (e.g., http://hesperomys.com/n/2756 Platypus Anatinus Shaw, 1799), but wasn't able to locate the information needed to generate text like "Valid name for Ornithorhynchus anatinus" with a link to taxon http://hesperomys.com/t/1958. I do see the taxonomic information related to the taxon (e.g., https://github.com/globalbioticinteractions/nomer/blob/893ed3f8b604b30a2b346260b54db340c84acf23/nomer-taxon-resolver/src/test/resources/org/globalbioticinteractions/nomer/match/hesperomys/mammals-short.csv#L2), but wasn't able to find the taxon id.

@JelleZijlstra Do you have any suggestions on how to resolve related taxon ids for a name id?

JelleZijlstra commented 1 year ago

Great, thanks!

As for the name/taxon link, that information isn't included in the current export, sorry. I'll add a column for the taxon link to the Name exporter, and also a column for the status ("valid", "synonym" and a few other options).
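
As a rough sketch of how a consumer could use such columns once they exist, the Python below reads a CSV row and produces the relation text and links asked about earlier in this thread. The column names (`name_id`, `original_name`, `status`, `taxon_id`, `taxon_name`) and the wording for non-valid statuses are assumptions; the updated export may differ.

```python
import csv
import io

# Hypothetical column names; the updated export may name them differently.
SAMPLE = """name_id,original_name,status,taxon_id,taxon_name
2756,Platypus Anatinus,valid,1958,Ornithorhynchus anatinus
"""


def describe_relation(row):
    """Turn one denormalized row into a name-taxon relation with links."""
    taxon = row["taxon_name"]
    status = row["status"]
    if status == "valid":
        text = f"Valid name for {taxon}"
    elif status == "synonym":
        # Assumed wording; not taken from the Hesperomys website.
        text = f"Synonym of {taxon}"
    else:
        text = f"{status} ({taxon})"
    return {
        "name_url": f"http://hesperomys.com/n/{row['name_id']}",
        "taxon_url": f"http://hesperomys.com/t/{row['taxon_id']}",
        "text": text,
    }


relations = [describe_relation(row) for row in csv.DictReader(io.StringIO(SAMPLE))]
for relation in relations:
    print(relation["name_url"], relation["text"], relation["taxon_url"])
```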

jhpoelen commented 1 year ago

Excellent! Happy to continue the integration work once you share the updated export. Hope it isn't too much work to create a new mammalia.csv. In fact, please feel free to create a pull request for https://github.com/jhpoelen/hesperomys if you feel comfortable doing that.

JelleZijlstra commented 1 year ago

Sounds good! Let me know if you have any other feedback about ways to make the format more useful to you.

Should I include fossils as well as extant species?

jhpoelen commented 1 year ago

> Should I include fossils as well as extant species?

Yes please!

JelleZijlstra commented 1 year ago

And should I include higher-rank names as well as species-group names? (The current format was for comparing to species in the MDD, which is why I only included extant species within Mammalia.)

jhpoelen commented 1 year ago

@JelleZijlstra Thanks for the suggestions.

Ideally, all information would be included in the export, with each row being a denormalized, independent representation of a name relation.

And, I also try to be pragmatic, so I'd rather have an updated export with some minor items missing sooner rather than the "perfect" export many months from now.

Curious to see what you come up with.

jhpoelen commented 1 year ago

btw - I've had some success exporting hierarchical data using line-delimited JSON, one JSON object per line. But JSON may understandably alienate some folks who are more comfortable in table/spreadsheet land.
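
A minimal Python sketch of the one-JSON-object-per-line idea, assuming an illustrative record layout (the field names `name`, `status`, `taxon`, and `path` are hypothetical and not an agreed format):

```python
import io
import json


def write_json_lines(records, out):
    """Write each record as a single line of JSON."""
    for record in records:
        out.write(json.dumps(record, sort_keys=True) + "\n")


def read_json_lines(source):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in source if line.strip()]


# Illustrative record; nested fields carry the hierarchy a flat CSV row cannot.
records = [
    {
        "name": {"id": 2756, "text": "Platypus Anatinus"},
        "status": "valid",
        "taxon": {
            "id": 1958,
            "name": "Ornithorhynchus anatinus",
            "path": ["Mammalia", "Ornithorhynchus", "Ornithorhynchus anatinus"],
        },
    }
]

buffer = io.StringIO()
write_json_lines(records, buffer)
buffer.seek(0)
round_tripped = read_json_lines(buffer)
print(round_tripped == records)
```

Because every line is independent, the file can be streamed, split, or filtered with line-oriented tools without parsing the whole dataset.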

JelleZijlstra commented 1 year ago

> Ideally, all information would be included in the export, with each row being a denormalized, independent representation of a name relation.

Thanks, that's a good guideline to work with.

> And, I also try to be pragmatic, so I'd rather have an updated export with some minor items missing sooner rather than the "perfect" export many months from now.

For the most part these are very easy changes: https://github.com/JelleZijlstra/taxonomy/commit/5e8b7ca4d415b859c46c02b6664107c34c50bbdc. But definitely agree that working now is better than perfect a long time in the future.