Closed FelipeSBarros closed 3 years ago
Interesting, I think you're the first person besides myself to ask about the DwC format. Yeah, what you want makes sense and I think this is doable. Let me do some more digging and I'll get back to you on that.
Mind if I ask what you're working on? Just curious!
Some related threads:
Also, to give credit where it's due, Nico and I didn't make the iNaturalist API, we just made a python client for it. 😃 The API itself was made by the iNat developers: https://github.com/inaturalist/iNaturalistAPI.
Well, here was my first attempt, but it doesn't work quite like I thought it would. Unfortunately this is painfully slow and requires too many requests. I thought that rest_api.get_observations() could query multiple IDs at once (like get_observations(id=[111,222,333,...]), which you can do with node_api.get_observations()), but I just found out it can only query a single ID at a time.
This example requires the latest pre-release build, which you can get with pip install -U --pre pyinaturalist:
from pyinaturalist.rest_api import get_observations as get_observations_v0
from pyinaturalist.node_api import get_observations as get_observations_v1

# First pass: get observation IDs for a project
response = get_observations_v1(project_id=1234, only_id=True, page='all')
obs_ids = [result['id'] for result in response['results']]

# Second pass: get observations in DwC format, one request per ID
responses = []
for obs_id in obs_ids:
    responses.append(get_observations_v0(id=obs_id, response_format='dwc'))
So the approach you mentioned of mapping iNat response fields to DWC is probably what you'll have to do. Here's the iNaturalist code responsible for doing the mapping (in Ruby):
Hey, @JWCook ! Nice to meet you!
Mind if I ask what you're working on? Just curious!
Of course not! I am working at a biodiversity institution in Misiones, Argentina, called Instituto Misionero de Biodiversidad. We have an educational project in which we are using the "bioblitz" concept and the iNaturalist app in some schools. The challenge I have is to access the project data and all observations to get some statistics (pretty much the same as those shown on the project page), but we want to show them on our own webpage.
For now, DwC isn't a necessity. But I am pretty sure it will be useful to have the observations in DwC soon, so we can join them with our official database (which uses DwC).
Well, here was my first attempt, but it doesn't work quite like I thought it would. Unfortunately this is painfully slow and requires too many requests. I thought that rest_api.get_observations() could query multiple IDs at once (like get_observations(id=[111,222,333,...], which you can do with node_api.get_observations()), but I just found out it can only query a single ID at a time.
Your approach is really interesting. I will try to map from JSON (node_rest) to DwC. If I get something usable I will share it here. Perhaps someone else can use it.
By the way, thanks for the links. I found the publication you wrote about visualization, which will be of great help. Best regards
@JWCook, in the issue "Feature request: Add additional observation formats provided by Rails API", you mention:
My current use case for this is writing it to XMP image metadata using the dwc namespace.
Could you comment on that? I have no idea what XMP is or what you meant by that. But it makes me believe that you are using another service to transform what you get into DwC. Did I understand that right? How did you do it? Using Python? Could you share your solution? By the way, if you didn't find a solution but have a better idea than the rest_api approach you suggested before, let me know. Perhaps I could help with the implementation.
Hope I am not bothering :)
Best regards
Felipe
We have an educational project in which we are using the "bioblitz" concept and the iNaturalist app in some schools. The challenge I have is to access the project data and all observations to get some statistics (pretty much the same as those shown on the project page), but we want to show them on our own webpage.
That sounds like a fun project! That's definitely doable. If you can tell me what stats you want to get, I may be able to help.
Could you comment on that? I have no idea what XMP is
I was working on something related to photography. XMP is one of the 3 main formats of image metadata, along with EXIF and IPTC. Here's a quick summary: https://expertphotography.com/metadata-exif-iptc-xmp/
Basically what I wanted to do was take DwC metadata (which is XML-based) and embed it inside images using XMP (which is also XML-based), so you could take all your data on iNaturalist and sync it with your local photo collection. That's the goal of this project: https://github.com/JWCook/naturtag. It's not finished yet but has a CLI with a few basic image tagging features. Example output here: example_45524803.xmp.
I don't think that will help much for your case, though, since it's just using rest_api.get_observations() to get that info.
By the way, if you didn't find a solution but have a better idea than the rest_api approach you suggested before, let me know. Perhaps I could help with the implementation.
Yeah, the best solution is going to be converting JSON to DwC. I have a couple ideas to make this easier, and I would be interested in adding at least part of that to pyinaturalist, if you'd like to help. I'll post some more info later today.
If you can tell me what stats you want to get, I may be able to help.
Yeah, I will organize the ideas I have and share them with you. But I will probably need: total observations; total species; % of observations by "group"; % of species by "group"; and user rankings by number of observations and number of species...
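As a rough illustration of the statistics listed above, here is a minimal sketch that computes them from a list of observation dicts, such as the 'results' list returned by node_api.get_observations(). The function name summarize() is hypothetical, and the field names it reads (taxon.id, taxon.rank, taxon.iconic_taxon_name, user.login) are assumptions about the observation JSON, so check them against a real response first.

```python
from collections import Counter


def summarize(observations):
    """Compute basic project statistics from a list of observation dicts.

    Assumes each observation may have a 'taxon' dict with 'id', 'rank' and
    'iconic_taxon_name' keys, and a 'user' dict with a 'login' key.
    """
    total = len(observations)
    species_group = {}      # species taxon ID -> iconic group name
    species_by_user = {}    # login -> set of species taxon IDs
    obs_by_group = Counter()
    obs_by_user = Counter()

    for obs in observations:
        taxon = obs.get('taxon') or {}
        group = taxon.get('iconic_taxon_name') or 'Unknown'
        login = (obs.get('user') or {}).get('login', 'unknown')
        obs_by_group[group] += 1
        obs_by_user[login] += 1
        # Only count taxa identified to species level as "species"
        if taxon.get('rank') == 'species':
            species_group[taxon['id']] = group
            species_by_user.setdefault(login, set()).add(taxon['id'])

    n_species = len(species_group)
    species_by_group = Counter(species_group.values())
    return {
        'total_observations': total,
        'total_species': n_species,
        'pct_observations_by_group': {
            g: 100 * n / total for g, n in obs_by_group.items()
        },
        'pct_species_by_group': {
            g: 100 * n / n_species for g, n in species_by_group.items()
        },
        'users_by_observations': obs_by_user.most_common(),
        'users_by_species': sorted(
            ((u, len(ids)) for u, ids in species_by_user.items()),
            key=lambda item: item[1],
            reverse=True,
        ),
    }
```

Counting only observations with rank 'species' is a design choice; a project page may count leaf taxa differently, so the totals may not match exactly.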
Yeah, the best solution is going to be converting JSON to DwC. I have a couple ideas to make this easier, and I would be interested in adding at least part of that to pyinaturalist, if you'd like to help. I'll post some more info later today.
That sounds great. I am not an experienced Python developer, but I would be glad to help, if you don't mind guiding me through the development/implementation. Cheers
First of all, XML is a pain to deal with. Fortunately there's xmltodict, which makes this much easier by letting you convert between python dicts and XML.
So we will need a few things:
license_code and xap:UsageTerms).
I started a branch for this with an outline here: pyinaturalist/dwc.py
Example files:
node_api.get_observations()): obs_45524803.json
rest_api.get_observations()): obs_45524803.dwc
So if you want, you can fork this repo and start from that branch. All you need to do is figure out how the rest of the fields should be mapped, and fill in the missing values. Then it should produce something close to the obs_45524803.dwc linked above.
@JWCook it is of great help!
I will try to work on that.
I have already forked the repo and was looking for the requirements.txt. It seems that you are using poetry, right?
What do you mean by
A dict of DwC fields that will be constant values
Also, I am not a DwC expert. But taking a look at their website, I noticed that there is a Simple Darwin Core. I should base my implementation on this Simple Darwin Core, right?
Strangely, I couldn't find anything like a "complete Darwin Core" on that web page, which makes me believe that there is only one implementation: the simple one.
I already have done the fork and was looking for the requirements.txt. Seems that you are using poetry, right?
Yes, so you can install with poetry install. There are more details in the Contributing Guide.
What do you mean with A dict of DwC fields that will be constant values
By 'constant values' I just mean things that will be the same for every observation, for example:
<dwc:basisOfRecord>HumanObservation</dwc:basisOfRecord>
<dwc:institutionCode>iNaturalist</dwc:institutionCode>
You can probably figure out the rest of those based on the example .dwc file and some of the iNaturalist code (occurrence.rb and taxon.rb)
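To make the idea of constant values concrete, here is a minimal sketch. Only the two terms shown above come from this thread; the names DWC_CONSTANTS and add_constants() are hypothetical, and any further constants would need to be checked against the example .dwc file.

```python
# DwC terms that have the same value for every iNaturalist observation.
# Only these two are taken from the example above; others are left out
# deliberately rather than guessed.
DWC_CONSTANTS = {
    'dwc:basisOfRecord': 'HumanObservation',
    'dwc:institutionCode': 'iNaturalist',
}


def add_constants(record):
    """Return a copy of a DwC record dict with the constant terms added."""
    return {**record, **DWC_CONSTANTS}
```

Returning a new dict instead of updating in place keeps the per-observation mapping and the constants independent of each other.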
Also, I am not a DwC expert
That makes two of us! I think List of Darwin Core terms shows all the information you can potentially include in a DwC record, and "Simple Darwin Core" is the recommended subset of those that covers the majority of use cases (and that's also what iNaturalist uses).
@niconoe Do you have experience with Darwin Core or any other input on this?
Yes, so you can install with poetry install. There are more details in the Contributing Guide.
OK. I am not used to poetry, but the installation process was smooth.
You can probably figure out the rest of those based on the example .dwc file and some of the iNaturalist code (occurrence.rb and taxon.rb)
Great. I will take a look at those files.
As for Simple Darwin Core, I will check whether any term used in iNaturalist is not covered by it.
Nice discussions!
Yeah, I do have some experience with Darwin Core (mainly: mapping custom databases/text files to Darwin Core so they can get published to GBIF using a tool such as the IPT). I'm also the developer of Python-dwca-reader (dwca being a standardised way to store data using the Darwin Core terms, using a bunch of CSV files, XML metadata, ...).
I don't know exactly how I can help here, so here are already a few random thoughts:
what, where, when), some necessary metadata (basisOfRecord, license information, ...) is already a great step towards interoperability.
Don't hesitate to tell me if you have more specific questions!
Cheers,
Nice, @niconoe. Thanks for your thoughts. As I have this challenge for this week, I will work on it by downloading the data in a Darwin Core structure without using the API. So, @JWCook, I will take a while working on this task, but I will definitely be working on it.
Cheers
@JWCook I have been working on this, but I'm not sure if I am doing it right.
So... Considering your point:
- A dict of iNat observation fields and equivalent DwC occurrence fields
I made a dict, like this:
inat_observation = {
'quality_grade': 'dwc:datasetName',
'time_observed_at': 'dwc:eventDate',
'id': 'dwc:catalogNumber',
'positional_accuracy': 'dwc:coordinateUncertaintyInMeters',
'license_code': 'dcterms:license', # not exactly but seems derived from,
'public_positional_accuracy': 'dwc:coordinateUncertaintyInMeters',
'created_at': 'dwc:eventTime', # not matching but with close values
'description': 'dcterms:description',
'updated_at': 'dcterms:modified',
'uri': 'dcterms:references', # or 'dwc:occurrenceDetails',
# 'geojson', # can be derived from 'dwc:decimalLatitude' and 'dwc:decimalLongitude'
'location': ['dwc:decimalLongitude', 'dwc:decimalLatitude'], # can be derived from 'dwc:decimalLatitude' and 'dwc:decimalLongitude'
'place_guess': 'dwc:verbatimLocality',
'observed_on': 'dwc:verbatimEventDate' # but with a different standard: YYYY-MM-DD HH:MM:SS-UTC
}
- A dict of iNat taxon fields and equivalent DwC taxon fields
inat_taxon = {
'iconic_taxon_id': 'dwc:phylum', # but not sure
'min_species_taxon_id': 'dwc:taxonID', # but not sure
'name': 'dwc:scientificName',
'rank': 'dwc:taxonRank',
'id': 'dwc:taxonID'
}
I could not find a lot of iNat terms in DwC, so I am putting all of them in a spare dict.
I also realized that there are some terms whose values are pretty similar but not exactly equal:
E.g.:
'observed_on': 'dwc:verbatimEventDate' # but with a different standard: YYYY-MM-DD HH:MM:SS-UTC
Let me know if there is a better way to approach the tasks you proposed, so I can apply it to all the remaining tasks.
Cheers.
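For illustration, mapping dicts like the ones above could be applied to an observation roughly as follows. This is only a sketch under some assumptions: it uses a small subset of the mappings, the function name observation_to_dwc() is hypothetical, and it assumes 'location' arrives as a [latitude, longitude] pair, which should be verified against a real response.

```python
# A subset of the iNat -> DwC field mappings discussed above
OBSERVATION_FIELDS = {
    'id': 'dwc:catalogNumber',
    'time_observed_at': 'dwc:eventDate',
    'place_guess': 'dwc:verbatimLocality',
    'positional_accuracy': 'dwc:coordinateUncertaintyInMeters',
}
TAXON_FIELDS = {
    'id': 'dwc:taxonID',
    'name': 'dwc:scientificName',
    'rank': 'dwc:taxonRank',
}


def observation_to_dwc(obs):
    """Translate one observation dict into a flat dict of DwC terms."""
    record = {}
    for inat_field, dwc_term in OBSERVATION_FIELDS.items():
        if obs.get(inat_field) is not None:
            record[dwc_term] = obs[inat_field]

    taxon = obs.get('taxon') or {}
    for inat_field, dwc_term in TAXON_FIELDS.items():
        if taxon.get(inat_field) is not None:
            record[dwc_term] = taxon[inat_field]

    # Assumption: 'location' is a [latitude, longitude] pair
    location = obs.get('location')
    if isinstance(location, (list, tuple)) and len(location) == 2:
        record['dwc:decimalLatitude'], record['dwc:decimalLongitude'] = location
    return record
```

Fields that are missing or None are skipped, so the output only contains terms that actually have a value.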
That looks great so far! Thanks for working on this.
For the values that are in a slightly different format (like license codes and datetimes), I can help add converter functions later, if needed. If you find any more, just keep adding comments like you already have.
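Converter functions of that kind might look something like the sketch below. The function names are hypothetical, and the Creative Commons URL pattern with version 4.0 is an assumption about the desired output, so compare it with the values in the example .dwc file before relying on it.

```python
from datetime import datetime


def license_code_to_url(code):
    """Convert a code like 'cc-by-nc' to a Creative Commons license URL.

    Assumption: version 4.0 licenses are wanted. Returns None for missing
    or unrecognized codes instead of guessing.
    """
    if not code:
        return None
    code = code.lower()
    if code == 'cc0':
        return 'http://creativecommons.org/publicdomain/zero/1.0/'
    if code.startswith('cc-'):
        return f'http://creativecommons.org/licenses/{code[len("cc-"):]}/4.0/'
    return None


def to_iso8601(value):
    """Normalize a 'YYYY-MM-DD HH:MM:SS+HH:MM' string to ISO 8601 with a 'T'."""
    return datetime.fromisoformat(value).isoformat()
```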
A couple notes:
- dwc:eventTime probably comes from the time observed, not the time created
- iconic_taxon_id is specific to iNaturalist; those are the general "categories" with the little animal icons that you can use to filter observations
- The dwc:kingdom, dwc:phylum, and other taxon ancestors aren't available in the observation JSON, so we'll need to do a second query for the full taxon info. Then we just need to map all the ranks, like 'kingdom': 'dwc:kingdom', etc. There's a function to get taxon ancestors in the dwc.py outline I started: https://github.com/niconoe/pyinaturalist/blob/e4f525baedd36fe1562717a6cd4f60c7cb961391/pyinaturalist/dwc.py#L62-L73
When does your bioblitz start, by the way? Or is that already complete? I haven't participated in one before, but I want to. They sound like fun!
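As a sketch of the rank mapping mentioned in the notes above: once the full taxon info is available, the ancestors can be turned into DwC terms with a small helper. The name ancestors_to_dwc() is hypothetical, and it assumes each ancestor is a dict with 'rank' and 'name' keys. This is not the function from the linked dwc.py outline.

```python
# Ranks that have a directly corresponding DwC term
DWC_RANKS = ('kingdom', 'phylum', 'class', 'order', 'family', 'genus')


def ancestors_to_dwc(ancestors):
    """Map a list of ancestor taxa to DwC terms, e.g. 'kingdom' -> 'dwc:kingdom'.

    Ancestors at ranks without a DwC equivalent (subclass, superfamily, ...)
    are ignored.
    """
    return {
        f"dwc:{taxon['rank']}": taxon['name']
        for taxon in ancestors
        if taxon.get('rank') in DWC_RANKS
    }
```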
For the values that are in a slightly different format (like license codes and datetimes), I can help add converter functions later, if needed. If you find any more, just keep adding comments like you already have.
OK. I think I will review what I have done, but leave the terms I didn't find commented out in their respective dicts. Later we can remove them, but I think this way we can keep track of all the terms.
dwc:eventTime probably comes from the time observed, not the time created
Ok. I will confirm and fix it.
The dwc:kingdom, dwc:phylum, and other taxon ancestors aren't available in the observation JSON, so we'll need to do a second query for the full taxon info. Then we just need to map all the ranks, like 'kingdom': 'dwc:kingdom', etc. There's a function to get taxon ancestors in the dwc.py outline I started:
That's great!
When does your bioblitz start, by the way? Or is that already complete? I haven't participated in one before, but I want to. They sound like fun!
The bioblitz we organized finished on 22nd May (Biodiversity Day here in Argentina... perhaps it is an international day, but I'm not sure). I didn't participate directly, but we got some interesting results. We got some schools involved and the idea is to make it bigger next time.
Although we still need to work on the species identifications, I have been developing a system to host the data so we can access it with a few graphs. I was thinking that, as soon as I have time, I would write a tutorial on how to do it with Django, just like you have done with the Jupyter notebook.
@JWCook I think I am almost done.
I started commenting "not found" on iNat terms that I couldn't find, keeping each one in the dict it belongs to. Later I stopped writing "not found", but all of those terms are commented out...
inat_unknown = {
# 'tags': [],
# 'quality_metrics': [],
# 'project_ids_with_curator_id': [], # this might come from matching user_id and projects
# 'sounds': [],
# 'place_ids': [], # list of place ids
# 'ident_taxon_ids': [],
# 'outlinks': [{'source': 'GBIF', 'url': 'http://www.gbif.org/occurrence/2626669957'}],
# 'faves_count': 0,
# 'cached_votes_total',
# 'comments_count',
# 'reviewed_by',
# 'oauth_application_id',
# 'captive',
# 'ofvs',
# 'map_scale',
# 'obscured',
# 'votes',
# 'mappable',
# 'project_ids_without_curator_id',
}
inat_observation = {
'quality_grade': 'dwc:datasetName',
'time_observed_at': 'dwc:eventDate',
'annotations': [], # not found
'id': 'dwc:catalogNumber',
'positional_accuracy': 'dwc:coordinateUncertaintyInMeters',
'license_code': 'dcterms:license', # not exactly but seems derived from,
'public_positional_accuracy': 'dwc:coordinateUncertaintyInMeters',
'created_at': 'xap:CreateDate', # not matching but probably due to UTC
'description': 'dcterms:description',
'updated_at': 'dcterms:modified',
'uri': 'dcterms:references', # or 'dwc:occurrenceDetails',
'geojson': {'type': 'Point', 'coordinates': ['dwc:decimalLongitude', 'dwc:decimalLatitude']}, # can be derived from 'dwc:decimalLatitude' and 'dwc:decimalLongitude'
'location': ['dwc:decimalLongitude', 'dwc:decimalLatitude'], # can be derived from 'dwc:decimalLatitude' and 'dwc:decimalLongitude'
'place_guess': 'dwc:verbatimLocality',
'observed_on': 'dwc:verbatimEventDate' # but with a different standard: YYYY-MM-DD HH:MM:SS-UTC
# 'geoprivacy', not found
}
inat_taxon = {
'iconic_taxon_id': 'dwc:phylum', # but not sure
'min_species_taxon_id': 'dwc:taxonID', # but not sure
'name': 'dwc:scientificName',
'rank': 'dwc:taxonRank',
'id': 'dwc:taxonID',
'species_guess': 'dwc:scientificName', # similar term not found
'community_taxon_id': 'dwc:taxonID',
# 'is_active': bool, # not found
# 'ancestry': str, # not found
# 'min_species_ancestry': str, # not found
# 'identifications_most_agree': True, # similar term not found
# 'identifications_most_disagree': False # similar term not found
# 'num_identification_agreements': int, # similar term not found
# 'identifications_some_agree': bool,
# 'owners_identification_from_vision': bool, # similar term not found
# 'identifications_count': int, # similar term not found
# 'taxon_geoprivacy': None, # not found
# 'num_identification_disagreements': int,
# 'endemic': bool,
# 'threatened': bool,
# 'rank_level': int,
# 'introduced': bool,
# 'native': bool,
# 'parent_id': int,
# 'extinct': bool,
# 'ancestor_ids': list,
# 'current_synonymous_taxon_ids': None,
# 'taxon_changes_count': int,
# 'complete_species_count': None,
# 'photos_locked': bool,
# 'taxon_schemes_count': int,
# 'wikipedia_url': str,
# 'created_at': datetime,
# 'universal_search_rank': int,
# 'observations_count': int,
# 'flag_counts': dict,
# 'atlas_id': None,
# 'default_photo': dict,
# 'preferred_common_name'
}
inat_identifications = {
'created_at': 'xap:CreateDate',
'taxon_id': 'dwc:taxonID',
'id': 'dwc:identificationID',
'previous_observation_taxon_id': 'dwc:taxonID',
# 'hidden': bool,
# 'disagreement',
# 'flags': list,
# 'body': None,
# 'own_observation': bool,
# 'uuid',
# 'taxon_change': None,
# 'moderator_actions': list,
# 'vision': bool,
# 'current': bool,
# 'created_at_details': dict, # derived from xap:CreateDate,
# 'category': str,
# 'spam': bool,
# 'user': dict, # user informations
# 'taxon': dict,
}
inat_photos = {
'id': 'dcterms:identifier', # or ac:accessURI, media:thumbnailURL, ac:furtherInformationURL, ac:derivedFrom, ac:derivedFrom
'license_code': 'xap:UsageTerms',
'url': 'dcterms:identifier', # or ac:accessURI, media:thumbnailURL, ac:furtherInformationURL, ac:derivedFrom, ac:derivedFrom # change the host to amazon
'attribution': 'dcterms:rights',
# 'original_dimensions': {'width': 2048, 'height': 1368}, 'flags': [], # not found
}
inat_user = {
'login': 'dwc:inaturalistLogin',
'login_autocomplete': 'dwc:inaturalistLogin',
'login_exact': 'dwc:inaturalistLogin',
'name': 'xap:Owner', # or dcterms:creator dwc:recordedBy
'name_autocomplete': 'xap:Owner', # or dcterms:creator dwc:recordedBy,
# 'id': 317669, # not found
# 'spam': False, # not found
# 'suspended': False,
# 'created_at': '2016-08-31T02:14:00+00:00',
# 'site_id': 1,
# 'preferences': {'prefers_community_taxa': True},
# 'orcid': None, # not found
# 'icon': 'https://static.inaturalist.org/attachments/users/icons/317669/thumb.jpg?1504400797', # not found
# 'observations_count': 16017, # not found
# 'identifications_count': 9398, # not found
# 'journal_posts_count': 0, # not found
# 'activity_count': 25415, # not found
# 'species_count': 5182, # not found
# 'universal_search_rank': 16017, # not found
# 'roles': [], # not found
# 'icon_url': 'https://static.inaturalist.org/attachments/users/icons/317669/medium.jpg?1504400797' # not found
}
observation = {
'dwr:SimpleDarwinRecordSet': {
'dwr:SimpleDarwinRecord':inat_observation},
'taxon': inat_taxon,
'identifications': inat_identifications,
'default_photo':inat_photos, # or key 'photos' with a list of following information
'user':inat_user,
}
What do you think?
Looks good! Would you like to go ahead and submit a pull request for that? It's okay if it's not 100% finished yet. I'll probably have some time next week to start adding onto it.
I started a separate package for conversion tools for observation data: https://github.com/JWCook/pyinaturalist-convert (resulting from a separate issue, #118). I decided to keep that separate from the main pyinaturalist package due to the number of dependencies it adds, and that seems like a good place for Darwin Core as well. Later I'll add pyinaturalist-convert as an optional dependency for the main pyinaturalist package, so it can be installed with pip install pyinaturalist[converters] (or something like that).
I moved the module outline I sent you earlier, so you can add your information here: pyinaturalist_convert/dwc.py. That was just a draft, so feel free to make any other changes you want. And let me know if you need any help with git, starting a PR, etc.
That's great, @JWCook! So, if I understand right: I should send a PR to pyinaturalist_convert, and the dicts I have created should go in dwc.py. I will start working on that right now. Thanks
Okay, that's merged in now. I added taxon ancestry onto that, and here's an example of the output so far:
<?xml version="1.0" encoding="utf-8"?>
<dwr:SimpleDarwinRecordSet>
<dwr:SimpleDarwinRecord>
<dwc:catalogNumber>45524803</dwc:catalogNumber>
<dwc:eventDate>2020-05-09T06:01:00-07:00</dwc:eventDate>
<dwc:datasetName>research</dwc:datasetName>
<dwc:coordinateUncertaintyInMeters>4</dwc:coordinateUncertaintyInMeters>
<dcterms:license>cc-by-nc</dcterms:license>
<xap:CreateDate>2020-05-10 13:42:12-07:00</xap:CreateDate>
<dcterms:description>x13 seen this morning </dcterms:description>
<dcterms:modified>2020-08-16 18:09:44-07:00</dcterms:modified>
<dcterms:references>https://www.inaturalist.org/observations/45524803</dcterms:references>
<dwc:verbatimLocality>San Diego County, CA, USA</dwc:verbatimLocality>
<dwc:taxonID>48978</dwc:taxonID>
<dwc:taxonRank>species</dwc:taxonRank>
<dwc:scientificName>Dirona picta</dwc:scientificName>
<dwc:kingdom>Animalia</dwc:kingdom>
<dwc:phylum>Mollusca</dwc:phylum>
<dwc:class>Gastropoda</dwc:class>
<dwc:order>Nudibranchia</dwc:order>
<dwc:family>Dironidae</dwc:family>
<dwc:genus>Dirona</dwc:genus>
<eol:dataObject>
<dcterms:identifier>72181173</dcterms:identifier>
<xap:UsageTerms>cc-by-nc</xap:UsageTerms>
<dcterms:rights>(c) Alex Bairstow, some rights reserved (CC BY-NC)</dcterms:rights>
</eol:dataObject>
<dwc:basisOfRecord>HumanObservation</dwc:basisOfRecord>
<dwc:institutionCode>iNaturalist</dwc:institutionCode>
</dwr:SimpleDarwinRecord>
</dwr:SimpleDarwinRecordSet>
So the next steps are adding XML namespaces and formatting some of the fields that differ slightly. I will work on that soon.
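As a rough idea of what adding namespaces could involve, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree rather than xmltodict, so it is not how the package itself does it. The function names are hypothetical, and the namespace URIs are the ones I believe TDWG and Dublin Core publish for these prefixes; verify them against the Simple Darwin Core schema before use.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs for the prefixes used in the output above
NAMESPACES = {
    'dwr': 'http://rs.tdwg.org/dwc/xsd/simpledarwincore/',
    'dwc': 'http://rs.tdwg.org/dwc/terms/',
    'dcterms': 'http://purl.org/dc/terms/',
}


def qualify(term):
    """Convert 'prefix:name' to ElementTree's '{uri}name' notation."""
    prefix, name = term.split(':', 1)
    return f'{{{NAMESPACES[prefix]}}}{name}'


def record_to_xml(record):
    """Serialize a flat dict of DwC terms to namespaced XML."""
    # Register prefixes so the output uses dwc:, dcterms:, ... not ns0:, ns1:
    for prefix, uri in NAMESPACES.items():
        ET.register_namespace(prefix, uri)

    root = ET.Element(qualify('dwr:SimpleDarwinRecordSet'))
    rec = ET.SubElement(root, qualify('dwr:SimpleDarwinRecord'))
    for term, value in record.items():
        ET.SubElement(rec, qualify(term)).text = str(value)
    return ET.tostring(root, encoding='unicode')
```

ElementTree only declares the namespaces that are actually used, so a record with no dcterms fields will not get an xmlns:dcterms attribute.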
Great, @JWCook! Let me know how I can support you in the next steps. Should we close this issue and start working in the pyinaturalist_convert repository? Cheers
Sounds good, here's an issue for the remaining things to do: https://github.com/JWCook/pyinaturalist-convert/issues/7
First of all: congratulations and thanks for this amazing API! I'm not sure if what I want makes sense, and I don't think it is really a feature request, but I am asking here anyway:
Is your feature request related to a problem? Please describe.
I would like to download observations related to a project (get_observations(project_id=XXXX) from node_api), but I am a bit lost on mapping the results to DwC fields. I have seen that with rest_api it is possible to use response_format='dwc', but it is not possible to use the project_id parameter.
Describe the solution you'd like
It doesn't need to be an implementation or an enhancement to the API. I am just wondering if there is any material on field matching between the results of get_observations from node_api and DwC, or any suggestions along those lines.
Thanks in advance
Felipe