attach_all_resource_ids doesn't fully fix resource-ID linking?

flixha commented 3 years ago

Description Thanks a lot for developing obsplus, I'm kind of just getting started with the very useful functionality that it offers. @d-chambers helped me with an issue regarding obspy's ´ResourceIdentifier´ previously (https://github.com/obspy/obspy/issues/2842), for which obsplus contained the function ´attach_all_resource_ids´ to solve the issue. That worked, but I'm running into some limitation now that I'm trying to understand. After reading in events in parallel, I end up with broken resource-ID links that I fix with the function. While I can usually create a working link from the resource-ID string, it doesn't work in the case shown below. Is this just a limitation of ResourceIdentifier that I have to accept, or is there something that attach_all_resource_ids could still help with here?

To Reproduce

import glob, os
from obspy.io.nordic.core import read_nordic
from obspy.io.nordic import tests
from obspy.core.event import Catalog
from obspy.core.event.resourceid import ResourceIdentifier
from joblib import Parallel, delayed
from obsplus import events_to_df
from obsplus.events.validate import attach_all_resource_ids

npath = os.path.dirname(tests.__file__)
sfiles = glob.glob(os.path.join(npath, 'data', '*L.S*'))
sfiles = sfiles + sfiles  # make it at least 2 events
cats = Parallel(n_jobs=2)(delayed(read_nordic)(sfile) for sfile in sfiles)
cat = Catalog([cat[0] for cat in cats])
for event in cat:
    attach_all_resource_ids(event)

cat_df = events_to_df(cat)
event = ResourceIdentifier(cat_df.event_id.iloc[0]).get_referred_object()

Why would I want to use the above code After I filter the cat_df with pandas functionality, i thought the most robust way to retrieve the event that is represented by a line in the cat_df would be to use the cat_df.event_id.

Expected behavior If I set n_jobs=1 in above example, I will be able to retrieve the event. With n_jobs=2, event becomes None. As i thought that I had fixed the resource-ID linking, I would expect to retrieve the event also in the case with the originally broken resource-ID links. I am able to retrieve the event in both cases with the following line:

[event for event in cat if event.resource_id == cat_df.event_id.iloc[0]][0]

Versions (please complete the following information):

OS: RHEL 7.8
ObsPlus Version - current master 0.2.0+11.g8483b67
Python Version 3.8

d-chambers commented 3 years ago

Hi @flixha,

I am glad you are finding ObsPlus useful.

I ran your example code and was able to reproduce the issue. There is actually something wrong going on with obspy.ResourceIdentifier's machinery. If you try:

cat[0].resource_id.get_referred_object() it returns the Event as expected. I am not sure exactly what is causing the issue.

Is there a reason you need to use the string in the dataframe rather than the ResourceIdentifier object in the catalog?

flixha commented 3 years ago

Hi, thanks for checking the example!

I started using Obsplus partly because it makes working with larger catalogs a lot quicker and more user-friendly, e.g., when selecting a subset from a catalog, sorting the catalog by some properties of the events, or when trying to find one specific event ( compared to Obspy catalog functions). So when I for example find a few events in the obsplus-dataframe which I want to remove from the obspy-catalog, then I thought the simplest way to refer to the events is by their ResourceIdentifier; which in this case I construct from the string because that's available in the default dataframe. I guess there's two workarounds:

tmp_cat_df = cat_df.loc[cat_df.time > UTCDateTime(2013,9,1,0,0,0)]
event = cat[tmp_cat_df.index[0]]
cat2 = Catalog([ev for ev in cat if not ev == event])

but addressing by the row number like this is of course not stable once the catalog is being changed (e.g., events removed).

As a more robust alternative I was thinking whether it could be possible to store the ResourceIdentifier directly in the dataframe, but I guess that's a bit too experimental yet (TypeError: dtype '<class 'obspy.core.event.resourceid.ResourceIdentifier'>' not understood):

from collections import OrderedDict
from obspy.core.event.resourceid import ResourceIdentifier
EVENT_DTYPES = OrderedDict(
    time="datetime64[ns]",
    event_id=str,
    event_rid=ResourceIdentifier, )

EVENT_COLUMNS = list(EVENT_DTYPES)

import obspy.core.event as ev
from obsplus.structures.dfextractor import (
    DataFrameExtractor,
    standard_column_transforms, )

events_to_df = DataFrameExtractor(
    ev.Event, required_columns=EVENT_COLUMNS,
    column_funcs=standard_column_transforms, dtypes=EVENT_DTYPES,)

cat_df = events_to_df(cat)

d-chambers commented 3 years ago

Ah, I see. I would really not like to include the ResourceIdentifier in the dataframe representations for two reasons. First, right now all the columns are basic, easily serializable types. Adding custom class instances can greatly complicate the serialization/deserialization process. Second, after fighting with ResourceIdentifier for several years, I am of the opinion it just doesn't (and cant) work very well so I personally try to avoid using it.

However, I think there are at least two ways of doing what you want in a more efficient way.

The first, for simpler queries you can just use the catalog.get_events method. This should behave almost identically to the fdsn client's get_events method.

import obspy
import obsplus

cat = obspy.read_events()
cat_filtered = cat.get_events(min_magnitude=3)

For more complicated queries/manipulations, why not just store the events in a column of the dataframe?

import obspy
import obsplus

# make sure the order of the catalog doesn't change until this block finishes
cat = obspy.read_events()  
df = obsplus.events_to_df(cat)
df['events'] = cat.events

# now you can sort, query, filter, do whatever you want to the dataframe!
df = (
    df.sort_values('time')
    .query('depth > 2_000')
    .query('magnitude_type == "ML"')
)

# reassemble catalog from the processed dataframe
cat_out = obspy.Catalog(events=list(df['events']))

Would that work for you?

d-chambers commented 3 years ago

Hey @flixha,

Did the above suggestions work for you? If so I will close the issue.

flixha commented 3 years ago

@d-chambers Sorry for not replying earlier - I had started to test a few solutions that you suggested and they indeed work very well, so thanks a lot for pointing these out! I like the solution of adding the event itself to the dataframe, hadn't thought about it being that easy actually ;-) .

I understand that from your tests and usage you have a general "mistrust" for the ResourceIdentifier-things in obspy, so that's also very valuable to know that it may be best to avoid using these when not required - so I'll keep that advice !

niosh-mining / obsplus

attach_all_resource_ids doesn't fully fix resource-ID linking? #246