pyinat / pyinaturalist

Python client for iNaturalist
https://pyinaturalist.readthedocs.io
MIT License

Convert observation JSON to Darwin Core format #143

Closed FelipeSBarros closed 3 years ago

FelipeSBarros commented 3 years ago

First of all: congratulations and thanks for this amazing API! I'm not sure if what I want makes sense, and I don't think this is really a feature request, but I am asking here anyway:

Is your feature request related to a problem? Please describe. I would like to download observations related to a project (get_observations(project_id=XXXX) from node_api), but I am a bit lost on mapping the results to DWC fields. I have seen that with rest_api it is possible to use response_format='dwc', but it is not possible to use the project_id parameter.

Describe the solution you'd like It doesn't need to be an implementation or an enhancement in the API. I am just wondering if there is any material on field matching between the results of get_observations from node_api and DWC, or any suggestions along those lines.

Thanks in advance, Felipe

JWCook commented 3 years ago

Interesting, I think you're the first person besides myself to ask about the DwC format. Yeah, what you want makes sense and I think this is doable. Let me do some more digging and I'll get back to you on that.

Mind if I ask what you're working on? Just curious!

Some related threads:

Also, to give credit where it's due, Nico and I didn't make the iNaturalist API, we just made a python client for it. 😃 The API itself was made by the iNat developers: https://github.com/inaturalist/iNaturalistAPI.

JWCook commented 3 years ago

Well, here was my first attempt, but it doesn't work quite like I thought it would. Unfortunately this is painfully slow and requires too many requests. I thought that rest_api.get_observations() could query multiple IDs at once (like get_observations(id=[111,222,333,...]), which you can do with node_api.get_observations()), but I just found out it can only query a single ID at a time.

This example requires the latest pre-release build, which you can get with pip install -U --pre pyinaturalist:

from pyinaturalist.rest_api import get_observations as get_observations_v0
from pyinaturalist.node_api import get_observations as get_observations_v1

# First pass: get observation IDs for a project
response = get_observations_v1(project_id=1234, only_id=True, page='all')
obs_ids = [result['id'] for result in response['results']]

# Second pass: get observations in DWC format
responses = []
for obs_id in obs_ids:
    responses.append(get_observations_v0(id=obs_id, response_format='dwc'))

So the approach you mentioned of mapping iNat response fields to DWC is probably what you'll have to do. Here's the iNaturalist code responsible for doing the mapping (in Ruby):

FelipeSBarros commented 3 years ago

Hey, @JWCook ! Nice to meet you!

Mind if I ask what you're working on? Just curious!

Of course not! I work at a biodiversity institution in Misiones, Argentina, called Instituto Misionero de Biodiversidad. We have an educational project and we are using the "bioblitz" concept and the iNaturalist app in some schools. The challenge I have is to access the project data and all observations to get some statistics (pretty much the same as shown on the project page), but we want to have them on our own webpage.

For now, DwC won't be a necessity. But I am pretty sure it will be interesting to have the observations in DwC soon, so we can join them with our official database (which uses DwC).

Well, here was my first attempt, but it doesn't work quite like I thought it would. Unfortunately this is painfully slow and requires too many requests. I thought that rest_api.get_observations() could query multiple IDs at once (like get_observations(id=[111,222,333,...]), which you can do with node_api.get_observations()), but I just found out it can only query a single ID at a time.

Your approach is really interesting. I will try to map from JSON (node_api) to DwC. If I get something usable, I will share it here. Perhaps someone else can use it.

By the way, thanks for the links. I found this publication you wrote about visualization, which will be of great help. Best regards

FelipeSBarros commented 3 years ago

@JWCook, in the issue "Feature request: Add additional observation formats provided by Rails API", you mention:

My current use case for this is writing it to XMP image metadata using the dwc namespace.

Could you comment on that? I have no idea what XMP is or what you meant by that, but it makes me believe that you are using another service to transform what you get into DwC. Did I understand correctly? How did you do it? Using Python? Could you share your solution? By the way, if you didn't find a solution but have a better idea than the rest_api approach you suggested before, let me know. Perhaps I could help with the implementation.

Hope I am not bothering you :)

Best regards

Felipe

JWCook commented 3 years ago

We have an educational project and we are using the "bioblitz" concept and the iNaturalist app in some schools. The challenge I have is to access the project data and all observations to get some statistics (pretty much the same as shown on the project page), but we want to have them on our own webpage.

That sounds like a fun project! That's definitely doable. If you can tell me what stats you want to get, I may be able to help.

Could you comment on that? I have no idea what XMP is

I was working on something related to photography. XMP is one of the 3 main formats of image metadata, along with EXIF and IPTC. Here's a quick summary: https://expertphotography.com/metadata-exif-iptc-xmp/

Basically what I wanted to do was take DwC metadata (which is XML-based) and embed it inside images using XMP (which is also XML-based), so you could take all your data on iNaturalist and sync it with your local photo collection. That's the goal of this project: https://github.com/JWCook/naturtag. It's not finished yet but has a CLI with a few basic image tagging features. Example output here: example_45524803.xmp.

I don't think that will help much for your case, though, since it's just using rest_api.get_observations() to get that info.

By the way, if you didn't find a solution but have a better idea than the rest_api approach you suggested before, let me know. Perhaps I could help with the implementation.

Yeah, the best solution is going to be converting JSON to DwC. I have a couple ideas to make this easier, and I would be interested in adding at least part of that to pyinaturalist, if you'd like to help. I'll post some more info later today.

FelipeSBarros commented 3 years ago

If you can tell me what stats you want to get, I may be able to help.

Yeah, I will organize the ideas I have and share them with you. But I will probably need: total observations; total species; % of observations by "group"; % of species by "group"; and a ranking of users by number of observations and number of species.
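
A rough sketch of how statistics like these could be computed from observation JSON is shown here. The sample records are hypothetical and trimmed to the fields used; it assumes each observation has a 'user' dict with a 'login' and a 'taxon' dict with 'id', 'rank', and 'iconic_taxon_name' (used here as the "group"), so those field names should be checked against real API results. The project_stats() helper is made up for illustration and is not part of pyinaturalist.

```python
from collections import Counter

# Hypothetical sample shaped like the 'results' list of an observation query;
# only the fields used below are included.
observations = [
    {'user': {'login': 'ana'}, 'taxon': {'id': 1, 'rank': 'species', 'iconic_taxon_name': 'Aves'}},
    {'user': {'login': 'ana'}, 'taxon': {'id': 2, 'rank': 'species', 'iconic_taxon_name': 'Insecta'}},
    {'user': {'login': 'luis'}, 'taxon': {'id': 1, 'rank': 'species', 'iconic_taxon_name': 'Aves'}},
    {'user': {'login': 'luis'}, 'taxon': None},  # observation without an identification
]


def percentages(counter):
    """Convert counts to percentages of the total, rounded to one decimal."""
    total = sum(counter.values())
    return {key: round(100 * count / total, 1) for key, count in counter.items()} if total else {}


def project_stats(observations):
    """Summarize observations: totals, share per group, and user rankings."""
    identified = [o for o in observations if o.get('taxon')]
    to_species = [o for o in identified if o['taxon'].get('rank') == 'species']
    species = {o['taxon']['id']: o['taxon'] for o in to_species}

    # Distinct species per user, for the species-based ranking
    species_per_user = {}
    for o in to_species:
        species_per_user.setdefault(o['user']['login'], set()).add(o['taxon']['id'])

    return {
        'total_observations': len(observations),
        'total_species': len(species),
        'pct_observations_by_group': percentages(
            Counter(o['taxon'].get('iconic_taxon_name') or 'Unknown' for o in identified)
        ),
        'pct_species_by_group': percentages(
            Counter(t.get('iconic_taxon_name') or 'Unknown' for t in species.values())
        ),
        'users_by_observations': Counter(o['user']['login'] for o in observations).most_common(),
        'users_by_species': sorted(
            ((user, len(ids)) for user, ids in species_per_user.items()),
            key=lambda item: -item[1],
        ),
    }


stats = project_stats(observations)
```

Observations without a taxon still count toward the total but are left out of the per-group percentages; whether that matches the numbers on the project page is an open question.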

Yeah, the best solution is going to be converting JSON to DwC. I have a couple ideas to make this easier, and I would be interested in adding at least part of that to pyinaturalist, if you'd like to help. I'll post some more info later today.

That sounds great. I am not an experienced Python developer, but I would be glad to help, if you don't mind guiding me through the development/implementation. Cheers

JWCook commented 3 years ago

First of all, XML is a pain to deal with. Fortunately there's xmltodict, which makes this much easier by letting you convert between python dicts and XML.

So we will need a few things:

I started a branch for this with an outline here: pyinaturalist/dwc.py

Example files:

So if you want, you can fork this repo and start from that branch. All you need to do is figure out how the rest of the fields should be mapped, and fill in the missing values. Then it should produce something close to the obs_45524803.dwc linked above.
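
To illustrate the dict-to-XML idea, here is a minimal sketch. Note that it uses the standard library's xml.etree.ElementTree as a stand-in rather than xmltodict, only so that it runs without extra dependencies. The dict_to_element() helper is hypothetical, and the output has prefixed tags but no namespace declarations, so it is not yet a complete DwC document.

```python
import xml.etree.ElementTree as ET


def dict_to_element(tag, value):
    """Recursively build an Element: nested dicts become child elements,
    anything else becomes the element's text."""
    element = ET.Element(tag)
    if isinstance(value, dict):
        for child_tag, child_value in value.items():
            element.append(dict_to_element(child_tag, child_value))
    elif value is not None:
        element.text = str(value)
    return element


# A tiny record set in the nested-dict shape discussed in this thread
record_set = {
    'dwr:SimpleDarwinRecord': {
        'dwc:catalogNumber': 45524803,
        'dwc:scientificName': 'Dirona picta',
    }
}

root = dict_to_element('dwr:SimpleDarwinRecordSet', record_set)
xml_text = ET.tostring(root, encoding='unicode')
print(xml_text)
```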

FelipeSBarros commented 3 years ago

@JWCook this is a great help! I will try to work on that. I have already forked the repo and was looking for the requirements.txt. It seems that you are using poetry, right? What do you mean by

A dict of DwC fields that will be constant values

Also, I am not a DwC expert. But taking a look at their website I realized that there is a Simple Darwin Core. I should base my implementation on this Simple Darwin Core, right? Strangely, I couldn't find anything like a "complete Darwin Core" on the web page mentioned, which makes me believe that there is only one implementation: the simple one.

JWCook commented 3 years ago

I have already forked the repo and was looking for the requirements.txt. It seems that you are using poetry, right?

Yes, so you can install with poetry install. There are more details in the Contributing Guide.

What do you mean by A dict of DwC fields that will be constant values

By 'constant values' I just mean things that will be the same for every observation, for example:

<dwc:basisOfRecord>HumanObservation</dwc:basisOfRecord>
<dwc:institutionCode>iNaturalist</dwc:institutionCode>

You can probably figure out the rest of those based on the example .dwc file and some of the iNaturalist code (occurrence.rb and taxon.rb)
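
As a small sketch of the constant-values idea, assuming each DwC record is held as a flat dict keyed by term name (the DWC_CONSTANTS name and add_constants() helper are made up for illustration):

```python
# The two constants shown above; every record gets the same values.
DWC_CONSTANTS = {
    'dwc:basisOfRecord': 'HumanObservation',
    'dwc:institutionCode': 'iNaturalist',
}


def add_constants(record):
    """Return a copy of a DwC record (a flat dict) with the constant fields added."""
    return {**record, **DWC_CONSTANTS}


record = add_constants({'dwc:catalogNumber': 45524803})
print(record)
```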

Also, I am not a DwC expert

That makes two of us! I think List of Darwin Core terms shows all the information you can potentially include in a DwC record, and "Simple Darwin Core" is the recommended subset of those that covers the majority of use cases (and that's also what iNaturalist uses).

@niconoe Do you have experience with Darwin Core or any other input on this?

FelipeSBarros commented 3 years ago

Yes, so you can install with poetry install. There are more details in the Contributing Guide.

OK. I am not used to poetry, but the installation process was smooth.

You can probably figure out the rest of those based on the example .dwc file and some of the iNaturalist code (occurrence.rb and taxon.rb)

Great. I will take a look at those files.

About the Simple Darwin Core, I will check whether there is any term used in iNaturalist that is not covered by the Simple Darwin Core.

niconoe commented 3 years ago

Nice discussions!

Yeah, I do have some experience with Darwin Core (mainly: mapping custom databases/text files to Darwin Core so they can get published to GBIF using a tool such as the IPT). I'm also the developer of Python-dwca-reader (dwca being a standardised way to store data using the Darwin Core terms using a bunch of CSV files, XML metadata, ...).

I don't know exactly how I can help here, so here are a few random thoughts:

Don't hesitate to tell me if you have more specific questions!

Cheers,

FelipeSBarros commented 3 years ago

Nice, @niconoe. Thanks for your thoughts. As I have this challenge for this week, I will work on it by downloading the data in a Darwin Core structure without using the API. So, @JWCook, it will take me a while to get to this task, but I will definitely be working on it.

Cheers

FelipeSBarros commented 3 years ago

@JWCook I have been working but not sure if I am doing it right.

So... Considering your point:

  • A dict of iNat observation fields and equivalent DwC occurrence fields

I did a dict, like this:

inat_observation = {
    'quality_grade': 'dwc:datasetName',
    'time_observed_at': 'dwc:eventDate',
    'id': 'dwc:catalogNumber',
    'positional_accuracy': 'dwc:coordinateUncertaintyInMeters',
    'license_code': 'dcterms:license',  # not exactly, but seems derived from it
    'public_positional_accuracy': 'dwc:coordinateUncertaintyInMeters',
    'created_at': 'dwc:eventTime',  # not matching, but with close values
    'description': 'dcterms:description',
    'updated_at': 'dcterms:modified',
    'uri': 'dcterms:references',  # or 'dwc:occurrenceDetails'
    # 'geojson',  # can be derived from 'dwc:decimalLatitude' and 'dwc:decimalLongitude'
    'location': ['dwc:decimalLongitude', 'dwc:decimalLatitude'],  # can be derived from 'dwc:decimalLatitude' and 'dwc:decimalLongitude'
    'place_guess': 'dwc:verbatimLocality',
    'observed_on': 'dwc:verbatimEventDate',  # but with a different standard: YYYY-MM-DD HH:MM:SS-UTC
}

  • A dict of iNat taxon fields and equivalent DwC taxon fields

inat_taxon = {
    'iconic_taxon_id': 'dwc:phylum', # but not sure
    'min_species_taxon_id': 'dwc:taxonID', # but not sure
    'name': 'dwc:scientificName',
    'rank': 'dwc:taxonRank',
    'id': 'dwc:taxonID'
}
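
Here is a sketch of how mapping dicts like these might be applied to a single observation. It uses only a subset of the fields above, and the to_dwc() helper and the trimmed sample observation are hypothetical; list-valued mappings such as 'location' are left out because they would need their own handling.

```python
# A subset of the mappings proposed above (iNat field -> DwC term)
OBSERVATION_FIELDS = {
    'id': 'dwc:catalogNumber',
    'time_observed_at': 'dwc:eventDate',
    'place_guess': 'dwc:verbatimLocality',
}
TAXON_FIELDS = {
    'id': 'dwc:taxonID',
    'name': 'dwc:scientificName',
    'rank': 'dwc:taxonRank',
}


def to_dwc(observation):
    """Rename mapped iNat fields to DwC terms, skipping missing or empty fields."""
    record = {}
    for inat_field, dwc_term in OBSERVATION_FIELDS.items():
        if observation.get(inat_field) is not None:
            record[dwc_term] = observation[inat_field]

    # Taxon fields live in a nested dict, which may be absent
    taxon = observation.get('taxon') or {}
    for inat_field, dwc_term in TAXON_FIELDS.items():
        if taxon.get(inat_field) is not None:
            record[dwc_term] = taxon[inat_field]
    return record


# Hypothetical, heavily trimmed observation
observation = {
    'id': 45524803,
    'time_observed_at': '2020-05-09T06:01:00-07:00',
    'place_guess': 'San Diego County, CA, USA',
    'faves_count': 0,  # unmapped fields are ignored
    'taxon': {'id': 48978, 'name': 'Dirona picta', 'rank': 'species'},
}
record = to_dwc(observation)
print(record)
```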

I could not find DwC equivalents for a lot of iNat terms, so I am putting all of them in a spare dict. I also realized that there are some terms whose values are pretty similar but not exactly equal, e.g.: 'observed_on': 'dwc:verbatimEventDate' # but with a different standard: YYYY-MM-DD HH:MM:SS-UTC

Let me know if there is a better way to approach the tasks you proposed, so I can apply it to the remaining tasks.

Cheers.

JWCook commented 3 years ago

That looks great so far! Thanks for working on this.

For the values that are in a slightly different format (like license codes and datetimes), I can help add converter functions later, if needed. If you find any more, just keep adding comments like you already have.
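
Converter functions of this kind might look roughly like the following sketch. Both function names are hypothetical, and the license URL pattern (Creative Commons version 4.0, or the 1.0 public domain dedication for CC0) is an assumption that should be checked against what iNaturalist's own DwC export produces.

```python
from datetime import datetime


def format_license(license_code):
    """Convert an iNat license code such as 'cc-by-nc' to a Creative Commons URL.

    Assumption: version 4.0 licenses (1.0 for CC0). An empty code is returned
    as None, meaning no license was specified.
    """
    if not license_code:
        return None
    code = license_code.lower()
    if code == 'cc0':
        return 'http://creativecommons.org/publicdomain/zero/1.0/'
    # Strip the leading 'cc-' so that 'cc-by-nc' becomes 'by-nc'
    return f"http://creativecommons.org/licenses/{code.replace('cc-', '', 1)}/4.0/"


def format_datetime(value):
    """Normalize a timestamp like '2020-05-10 13:42:12-07:00' to ISO 8601
    with a 'T' separator, keeping the UTC offset."""
    return datetime.fromisoformat(value).isoformat()


print(format_license('cc-by-nc'))
print(format_datetime('2020-05-10 13:42:12-07:00'))
```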

A couple notes:

JWCook commented 3 years ago

When does your bioblitz start, by the way? Or is that already complete? I haven't participated in one before, but I want to. They sound like fun!

FelipeSBarros commented 3 years ago

For the values that are in a slightly different format (like license codes and datetimes), I can help add converter functions later, if needed. If you find any more, just keep adding comments like you already have.

OK. I think I will review what I have done, leaving the terms I didn't find commented out in their respective dicts. Later we can remove them, but I think that this way we can keep track of all terms.

dwc:eventTime probably comes from the time observed, not the time created

Ok. I will confirm and fix it.

The dwc:kingdom, dwc:phylum, and other taxon ancestors aren't available in the observation JSON, so we'll need to do a second query for the full taxon info. Then we just need to map all the ranks, like 'kingdom': 'dwc:kingdom', etc. There's a function to get taxon ancestors in the dwc.py outline I started:

That's great!
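
The rank mapping described in the quoted note could be sketched like this, assuming the full taxon record provides a list of ancestor dicts that each have a 'rank' and a 'name'. The ancestors_to_dwc() helper and the sample list are hypothetical, and the second query that would fetch the ancestors is not shown.

```python
# Ranks that get their own DwC term in this sketch
DWC_RANKS = ('kingdom', 'phylum', 'class', 'order', 'family', 'genus')


def ancestors_to_dwc(ancestors):
    """Map ancestor taxa (dicts with 'rank' and 'name') to terms like 'dwc:kingdom'."""
    return {
        f"dwc:{taxon['rank']}": taxon['name']
        for taxon in ancestors
        if taxon.get('rank') in DWC_RANKS
    }


# Hypothetical ancestor list for the nudibranch Dirona picta
ancestors = [
    {'rank': 'kingdom', 'name': 'Animalia'},
    {'rank': 'phylum', 'name': 'Mollusca'},
    {'rank': 'class', 'name': 'Gastropoda'},
    {'rank': 'subclass', 'name': 'Heterobranchia'},  # not in DWC_RANKS, so skipped
    {'rank': 'order', 'name': 'Nudibranchia'},
    {'rank': 'family', 'name': 'Dironidae'},
    {'rank': 'genus', 'name': 'Dirona'},
]
taxonomy = ancestors_to_dwc(ancestors)
print(taxonomy)
```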

When does your bioblitz start, by the way? Or is that already complete? I haven't participated in one before, but I want to. They sound like fun!

The bioblitz we organized finished on 22nd May (Biodiversity Day here in Argentina... perhaps it is an international day, but I'm not sure). I didn't participate directly, but we got some interesting results. We got some schools involved and the idea is to get bigger next time.

Although we still need to work on the species identifications, I have been developing a system to host the data so we can access it with a few graphs. As soon as I have time, I am thinking of writing a tutorial on how to do it with Django, just like you have done with the Jupyter notebook.

FelipeSBarros commented 3 years ago

@JWCook I think I am almost done.

I started by commenting "not found" on iNat terms that I couldn't find, keeping each one in the dict it belongs to. Later I stopped writing "not found", but all of those terms are commented out...

inat_unknown = {
    # 'tags': [],
    # 'quality_metrics': [],
    # 'project_ids_with_curator_id': [], # this might come from matching user_id and projects
    # 'sounds': [],
    # 'place_ids': [], # list of place ids
    # 'ident_taxon_ids': [],
    # 'outlinks': [{'source': 'GBIF', 'url': 'http://www.gbif.org/occurrence/2626669957'}],
    # 'faves_count': 0,
    # 'cached_votes_total',
    # 'comments_count',
    # 'reviewed_by',
    # 'oauth_application_id',
    # 'captive',
    # 'ofvs',
    # 'map_scale',
    # 'obscured',
    # 'votes',
    # 'mappable',
    # 'project_ids_without_curator_id',
}

inat_observation = {
    'quality_grade': 'dwc:datasetName',
    'time_observed_at': 'dwc:eventDate',
    'annotations': [], # not found
    'id': 'dwc:catalogNumber',
    'positional_accuracy': 'dwc:coordinateUncertaintyInMeters',
    'license_code': 'dcterms:license', # not exactly but seems derived from,
    'public_positional_accuracy': 'dwc:coordinateUncertaintyInMeters',
    'created_at': 'xap:CreateDate', # not matching but probably due to UTC
    'description': 'dcterms:description',
    'updated_at': 'dcterms:modified',
    'uri': 'dcterms:references', # or 'dwc:occurrenceDetails',
    'geojson': {'type': 'Point', 'coordinates': ['dwc:decimalLongitude', 'dwc:decimalLatitude']}, # can be derived from 'dwc:decimalLatitude' and 'dwc:decimalLongitude'
    'location': ['dwc:decimalLongitude', 'dwc:decimalLatitude'], # can be derived from 'dwc:decimalLatitude' and 'dwc:decimalLongitude'
    'place_guess': 'dwc:verbatimLocality',
    'observed_on': 'dwc:verbatimEventDate',  # but with a different standard: YYYY-MM-DD HH:MM:SS-UTC
    # 'geoprivacy', not found
}

inat_taxon = {
    'iconic_taxon_id': 'dwc:phylum', # but not sure
    'min_species_taxon_id': 'dwc:taxonID', # but not sure
    'name': 'dwc:scientificName',
    'rank': 'dwc:taxonRank',
    'id': 'dwc:taxonID',
    'species_guess': 'dwc:scientificName', # similar term not found
    'community_taxon_id': 'dwc:taxonID',
    # 'is_active': bool, # not found
    # 'ancestry': str, # not found
    # 'min_species_ancestry': str, # not found
    # 'identifications_most_agree': True, # similar term not found
    # 'identifications_most_disagree': False # similar term not found
    # 'num_identification_agreements': int, # similar term not found
    # 'identifications_some_agree': bool,
    # 'owners_identification_from_vision': bool, # similar term not found
    # 'identifications_count': int, # similar term not found
    # 'taxon_geoprivacy': None, # not found
    # 'num_identification_disagreements': int,
    # 'endemic': bool,
    # 'threatened': bool,
    # 'rank_level': int,
    # 'introduced': bool,
    # 'native': bool,
    # 'parent_id': int,
    # 'extinct': bool,
    # 'ancestor_ids': list,
    # 'current_synonymous_taxon_ids': None,
    # 'taxon_changes_count': int,
    # 'complete_species_count': None,
    # 'photos_locked': bool,
    # 'taxon_schemes_count': int,
    # 'wikipedia_url': str,
    # 'created_at': datetime,
    # 'universal_search_rank': int,
    # 'observations_count': int,
    # 'flag_counts': dict,
    # 'atlas_id': None,
    # 'default_photo': dict,
    # 'preferred_common_name'
}

inat_identifications = {
    'created_at': 'xap:CreateDate',
    'taxon_id': 'dwc:taxonID',
    'id': 'dwc:identificationID',
    'previous_observation_taxon_id': 'dwc:taxonID',
    # 'hidden': bool,
    # 'disagreement',
    # 'flags': list,
    # 'body': None,
    # 'own_observation': bool,
    # 'uuid',
    # 'taxon_change': None,
    # 'moderator_actions': list,
    # 'vision': bool,
    # 'current': bool,
    # 'created_at_details': dict, # derived from xap:CreateDate,
    # 'category': str,
    # 'spam': bool,
    # 'user': dict, # user informations
    # 'taxon': dict,
}

inat_photos = {
    'id': 'dcterms:identifier', # or ac:accessURI, media:thumbnailURL, ac:furtherInformationURL, ac:derivedFrom, ac:derivedFrom
    'license_code': 'xap:UsageTerms',
    'url': 'dcterms:identifier', # or ac:accessURI, media:thumbnailURL, ac:furtherInformationURL, ac:derivedFrom, ac:derivedFrom # change the host to amazon
    'attribution': 'dcterms:rights',
    # 'original_dimensions': {'width': 2048, 'height': 1368}, 'flags': [], # not found
}

inat_user = {
    'login': 'dwc:inaturalistLogin',
    'login_autocomplete': 'dwc:inaturalistLogin',
    'login_exact': 'dwc:inaturalistLogin',
    'name': 'xap:Owner', # or dcterms:creator dwc:recordedBy
    'name_autocomplete': 'xap:Owner', # or dcterms:creator dwc:recordedBy,
    # 'id': 317669, # not found
    # 'spam': False, # not found
    # 'suspended': False,
    # 'created_at': '2016-08-31T02:14:00+00:00',
    # 'site_id': 1,
    # 'preferences': {'prefers_community_taxa': True},
    # 'orcid': None, # not found
    # 'icon': 'https://static.inaturalist.org/attachments/users/icons/317669/thumb.jpg?1504400797', # not found
    # 'observations_count': 16017, # not found
    # 'identifications_count': 9398, # not found
    # 'journal_posts_count': 0, # not found
    # 'activity_count': 25415, # not found
    # 'species_count': 5182, # not found
    # 'universal_search_rank': 16017, # not found
    # 'roles': [], # not found
    # 'icon_url': 'https://static.inaturalist.org/attachments/users/icons/317669/medium.jpg?1504400797' # not found
}

observation = {
    'dwr:SimpleDarwinRecordSet': {
        'dwr:SimpleDarwinRecord': inat_observation,
    },
    'taxon': inat_taxon,
    'identifications': inat_identifications,
    'default_photo': inat_photos,  # or key 'photos' with a list of the following information
    'user': inat_user,
}

What do you think?

JWCook commented 3 years ago

Looks good! Would you like to go ahead and submit a pull request for that? It's okay if it's not 100% finished yet. I'll probably have some time next week to start adding onto it.

I started a separate package for conversion tools for observation data: https://github.com/JWCook/pyinaturalist-convert (resulting from a separate issue, #118). I decided to keep that separate from the main pyinaturalist package due to the number of dependencies it adds, and that seems like a good place for Darwin Core as well. Later I'll add pyinaturalist-convert as an optional dependency for the main pyinaturalist package, so it can be installed with pip install pyinaturalist[converters] (or something like that).

I moved the module outline I sent you earlier, so you can add your information here: pyinaturalist_convert/dwc.py. That was just a draft, so feel free to make any other changes you want. And let me know if you need any help with git, starting a PR, etc.

FelipeSBarros commented 3 years ago

That's great, @JWCook! So, if I understand correctly: I should send a PR to pyinaturalist_convert, and the dicts I have created should go in dwc.py. I will be working on that right now. Thanks

JWCook commented 3 years ago

Okay, that's merged in now. I added taxon ancestry onto that, and here's an example of the output so far:

<?xml version="1.0" encoding="utf-8"?>
<dwr:SimpleDarwinRecordSet>
    <dwr:SimpleDarwinRecord>
        <dwc:catalogNumber>45524803</dwc:catalogNumber>
        <dwc:eventDate>2020-05-09T06:01:00-07:00</dwc:eventDate>
        <dwc:datasetName>research</dwc:datasetName>
        <dwc:coordinateUncertaintyInMeters>4</dwc:coordinateUncertaintyInMeters>
        <dcterms:license>cc-by-nc</dcterms:license>
        <xap:CreateDate>2020-05-10 13:42:12-07:00</xap:CreateDate>
        <dcterms:description>x13 seen this morning </dcterms:description>
        <dcterms:modified>2020-08-16 18:09:44-07:00</dcterms:modified>
        <dcterms:references>https://www.inaturalist.org/observations/45524803</dcterms:references>
        <dwc:verbatimLocality>San Diego County, CA, USA</dwc:verbatimLocality>
        <dwc:taxonID>48978</dwc:taxonID>
        <dwc:taxonRank>species</dwc:taxonRank>
        <dwc:scientificName>Dirona picta</dwc:scientificName>
        <dwc:kingdom>Animalia</dwc:kingdom>
        <dwc:phylum>Mollusca</dwc:phylum>
        <dwc:class>Gastropoda</dwc:class>
        <dwc:order>Nudibranchia</dwc:order>
        <dwc:family>Dironidae</dwc:family>
        <dwc:genus>Dirona</dwc:genus>
        <eol:dataObject>
            <dcterms:identifier>72181173</dcterms:identifier>
            <xap:UsageTerms>cc-by-nc</xap:UsageTerms>
            <dcterms:rights>(c) Alex Bairstow, some rights reserved (CC BY-NC)</dcterms:rights>
        </eol:dataObject>
        <dwc:basisOfRecord>HumanObservation</dwc:basisOfRecord>
        <dwc:institutionCode>iNaturalist</dwc:institutionCode>
    </dwr:SimpleDarwinRecord>
</dwr:SimpleDarwinRecordSet>

So the next steps are adding XML namespaces and formatting some of the fields that differ slightly. I will work on that soon.
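
One possible way to add the namespace declarations is sketched below with the standard library's xml.etree.ElementTree, which is not necessarily how pyinaturalist-convert does it. The URIs are the usual ones for Simple Darwin Core, Darwin Core terms, and Dublin Core terms, but treat them as assumptions to verify; the xap and eol prefixes from the output above are left out. The qualified() and build_record_set() helpers are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs; verify against the Darwin Core documentation
NAMESPACES = {
    'dwr': 'http://rs.tdwg.org/dwc/xsd/simpledarwincore/',
    'dwc': 'http://rs.tdwg.org/dwc/terms/',
    'dcterms': 'http://purl.org/dc/terms/',
}
for prefix, uri in NAMESPACES.items():
    # Make ElementTree serialize with these prefixes instead of ns0, ns1, ...
    ET.register_namespace(prefix, uri)


def qualified(term):
    """Turn 'dwc:taxonID' into ElementTree's '{namespace-uri}taxonID' form."""
    prefix, local_name = term.split(':', 1)
    return f'{{{NAMESPACES[prefix]}}}{local_name}'


def build_record_set(record):
    """Wrap one flat DwC record (term -> value) in a namespaced record set."""
    root = ET.Element(qualified('dwr:SimpleDarwinRecordSet'))
    element = ET.SubElement(root, qualified('dwr:SimpleDarwinRecord'))
    for term, value in record.items():
        ET.SubElement(element, qualified(term)).text = str(value)
    return ET.tostring(root, encoding='unicode')


xml_text = build_record_set({
    'dwc:catalogNumber': 45524803,
    'dcterms:license': 'cc-by-nc',
})
print(xml_text)
```

ElementTree only declares the namespaces that are actually used in the tree, and it puts the xmlns attributes on the root element.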

FelipeSBarros commented 3 years ago

Great, @JWCook! Let me know how I can support you in the next steps. Should we close this issue and continue working in the pyinaturalist_convert repository? Cheers

JWCook commented 3 years ago

Sounds good, here's an issue for the remaining things to do: https://github.com/JWCook/pyinaturalist-convert/issues/7