Observations to/from Pandas DataFrame

arky commented 7 months ago

Please provide helper functions to quickly turn the output of get_observations into a pandas DataFrame. A similar function could let users quickly feed the values of a DataFrame into create_observation.

Rationale: Pandas is the Swiss Army knife of data processing, and it would enable experienced users to analyze data extracted from iNaturalist. This feature would help users who find the current helper functions for creating histograms, etc. limiting for their use cases.

Use case

Problem: Extract your observations and sort them based on observed_at and created_at to find any duplicates.


from pyinaturalist import get_observations

# Replace with your own username
USERNAME = 'jkcook'

response = get_observations(user_id=USERNAME, page='all')
# Observation_df is the proposed helper; it does not exist yet
my_observations = Observation_df.from_json_list(response)
my_observations  # This is now a pandas DataFrame
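Once the observations are in a DataFrame, the duplicate check described above could look roughly like the sketch below. The sample rows, the column names (`observed_at`, `created_at`, `taxon_name`), and the definition of a duplicate (same observation time and same taxon) are all assumptions made for illustration; real column names depend on how the DataFrame was built.

```python
import pandas as pd

# Hypothetical sample data standing in for real observations
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "observed_at": ["2021-05-01 10:00", "2021-05-01 10:00",
                    "2021-05-02 09:30", "2021-05-03 12:00"],
    "created_at": ["2021-05-01 18:00", "2021-05-01 18:05",
                   "2021-05-02 20:00", "2021-05-03 13:00"],
    "taxon_name": ["Danaus plexippus", "Danaus plexippus",
                   "Apis mellifera", "Apis mellifera"],
})

# Parse the timestamps so that sorting is chronological, not alphabetical
for col in ("observed_at", "created_at"):
    df[col] = pd.to_datetime(df[col])

# Sort by observation time, then by upload time
df = df.sort_values(["observed_at", "created_at"])

# Assumption: rows sharing observation time and taxon are likely duplicates.
# keep=False marks every member of a duplicate group, not just the later ones.
dupes = df[df.duplicated(subset=["observed_at", "taxon_name"], keep=False)]
print(dupes)
```

Sorting first keeps each group of suspected duplicates together in upload order, which makes it easier to decide which copy to keep.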

Another use case would be to quickly create a large number of observations from a CSV or XLS file.


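import pandas
from pyinaturalist import create_observation

df = pandas.read_csv('myfile.csv')
create_observation(
    access_token=token,
    from_df=df,  # proposed parameter; not part of the current API
)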

I think pandas DataFrames are ideal for bulk uploads, since you can track which observations have been uploaded and restart from the index number of the last row processed.
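The tracking-and-restart idea can be sketched as follows. This is only an illustration under stated assumptions: `upload_pending`, the `uploaded` status column, and the sample column names are hypothetical, and the upload step is passed in as a plain callable so the sketch runs without network access or an access token.

```python
import pandas as pd


def upload_pending(df, upload, status_col="uploaded"):
    """Upload rows not yet marked as done; stop at the first failure."""
    if status_col not in df.columns:
        df[status_col] = False

    # Only visit rows that have not been uploaded yet
    pending = df.loc[~df[status_col].astype(bool)]
    for idx, row in pending.iterrows():
        try:
            # Pass the row as a dict, without the bookkeeping column
            upload(row.drop(status_col).to_dict())
        except Exception as exc:
            print(f"Row {idx} failed ({exc}); rerun to resume from here")
            break
        df.at[idx, status_col] = True
    return df


# Stand-in uploader for demonstration; a real one would call the API
def fake_upload(record):
    print("uploading", record)


# Hypothetical columns; a real CSV would define its own
observations = pd.DataFrame({
    "species_guess": ["Monarch", "Honey bee"],
    "observed_on": ["2021-05-01", "2021-05-02"],
})
observations = upload_pending(observations, fake_upload)
```

In practice the `upload` callable would presumably be a thin wrapper around `create_observation`, and saving the DataFrame back to disk after each run would let the status column survive between sessions.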

JWCook commented 7 months ago

I have some data conversion utilities over here, which will do part of what you want: https://github.com/pyinat/pyinaturalist-convert (full docs here). It's in a separate library because it has quite a few extra dependencies.

Example of loading observations into a dataframe:

from pyinaturalist import get_observations
from pyinaturalist_convert import to_dataframe

response = get_observations(user_id='jkcook', page='all')
df = to_dataframe(response)

P.S., there's a "secret" namespace alias pyinat that includes modules from both libraries (if installed):

from pyinat import get_observations, to_dataframe

to_dataframe flattens out several pieces of nested data (photos, identifications, taxon, etc.) to make it a bit easier to work with and save in a tabular format. From there, you can export and re-read it in whatever data format you prefer. Personally, I've found Parquet to be the most useful for observation data (in terms of performance and disk usage for larger datasets). Example:

import pandas as pd

df.to_parquet('observations.parquet')
df = pd.read_parquet('observations.parquet')
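For readers unfamiliar with flattening, here is a minimal sketch of the general idea. It is not how pyinaturalist-convert is implemented; the `flatten` helper, the dotted key naming, and the sample record (with made-up values) are assumptions for illustration only.

```python
def flatten(record, parent_key="", sep="."):
    """Turn nested dicts into a single-level dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dicts, carrying the key prefix along
            flat.update(flatten(value, full_key, sep))
        else:
            # Scalars and lists are kept as-is in this simple sketch
            flat[full_key] = value
    return flat


# Made-up record loosely shaped like an observation
observation = {
    "id": 1,
    "taxon": {"id": 123, "name": "Danaus plexippus"},
    "photos": [{"url": "https://example.org/1.jpg"}],
}
row = flatten(observation)
print(row)
```

Each flattened dict corresponds to one row of a table, which is what makes tabular formats such as CSV or Parquet workable for this kind of data.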

I don't yet have any features for turning that back into a format for creating/updating observations, but that's something I could add. I may not have time to work on that this week, but I do have some ideas, so I'll get back to you on that.

arky commented 7 months ago

@JWCook Whoa, thanks! I'm going to explore more.

Perhaps just documenting the availability of these pandas features would be enough for now.

JWCook commented 7 months ago

Let me know if that dataframe format isn't exactly what you need. So far that library has mainly been tailored for my own usage, but I'm definitely willing to make changes there to accommodate other use cases.

As for docs, I was thinking of putting together a tutorial notebook that uses both of these libraries... but it seems like every time I start on that, I find something else I want to fix or polish first before showing it off!

arky commented 7 months ago

@JWCook Perhaps the https://github.com/pyinat/pyinaturalist-convert project could be renamed pyinaturalist-utils, and all extraneous special-case features such as this one could be added to it.

What do you reckon?

arky commented 6 months ago

@JWCook Recommend closing this issue as well.

pyinat / pyinaturalist

Observations to/from Pandas DataFrame #542
