mspass-team / mspass

Massive Parallel Analysis System for Seismologists
https://mspass.org
BSD 3-Clause "New" or "Revised" License
30 stars 12 forks source link

Residual problem with graphics module #343

Closed pavlis closed 1 year ago

pavlis commented 1 year ago

I'm afraid issue #332 was closed prematurely. Well, the main issue was resolved but the original reason I reported the issue was not resolved. Our graphics tutorial is broken for the section discussing SeismogramEnsemble plots.

The problem was we didn't close on an inconsistency between what I had locally to resolve the issue and what was actually implemented. In our C++ code base is the a function called EnsembleComponent. It behaves exactly the same as a related function called ExtractComponent in ccore. Both extract a single component of data from 3C data. The difference is one is for ensembles and the other atomic Seismogram objects. #332 resolved this problem by creating a python wrapper called ExtractComponent that handles type mistmatches the pythonic way. It uses a set of isinstance calls to deal with strong typing inputs from ccore.

The first order problem that happened is we didn't fix graphics.py to match this change. The current graphics.py module has the following import lines:

from mspasspy.ccore.algorithms.basic import ExtractComponent, EnsembleComponent
from mspasspy.algorithms.window import scale as alg_scale

Those lines resolve but the first line should be this:

from mspasspy.algorithms.basic import ExtractComponent

That, however, leaves unresolved references to EnsembleComponent. There are six in the graphics.py and they can be fixed in 30 s with a global change all EnsembleComponent to ExtractComponent.

We need apply that patch asap as the graphics tutorial is broken until we fix it. The way the fix can be tested is simple: just run the graphics tutorial and see if it works.

Had I known how to do so I'd have just edited that file on github to make that fix, but think preserving this for the record is important anyway. The reason is that I think there is some additional cleanup we should do with this.

  1. Having a ccore function and a python function with the exact same signature but different behavior is a bad bad idea. Step 1 is to change the pybind11 code to define the ccore version to be _ExtractComponent.
  2. A tougher decision is what to do about the function now called EnsembleComponent in ccore? Right now the wrapper code for ExtractEnsemble to handle this is a little weird and hence confusing:

    
    if isinstance(data, Seismogram):
        try:
            d = bsc.ExtractComponent(data, component)
            return d
        except Exception as err:
            data.elog.log_error("ExtractComponent", str(err), ErrorSeverity.Invalid)
            empty = TimeSeries()
            empty.load_history(data)
            empty.kill()
            return empty
    elif isinstance(data, SeismogramEnsemble):
        if data.dead():
            empty = TimeSeriesEnsemble()
            empty.elog = data.elog
            empty.kill()
            return empty
        try:
            d = TimeSeriesEnsemble(bsc.EnsembleComponent(data, component))
            return d
        except Exception as err:
            logging_helper.ensemble_error(
                data, "ExtractComponent", err, ErrorSeverity.Invalid
            )
            empty = TimeSeriesEnsemble()
            empty.elog = data.elog
            empty.kill()
            return empty

    The confusing of using the bsc qualifier for both Seismogram and SeismogramEnsemble sections could be eliminated if we first fix the pybind 11 wrappers as noted above. The weirder construct is:

    d = TimeSeriesEnsemble(bsc.EnsembleComponent(data, component))

    That line, at least, screams for a comment to explain why the secondary copy is needed. That should probably be done, by the way, for the initial patch for this problem. The BROADER PROBLEM this illustrates, however, is something discussed previously in #332. We need to clean up the handling of Ensembles. The fundamental issue is during the evolution of ensembles we extended the original ccore Ensemble templates to define a a set of LoggingEnsemble templates. The later added elog. For consistency with our atomic data objects we really should go through the C++ code base and change a occurences of Ensemble to CoreEnsemble. That change could be isolated in the C++ code by having the pybind11 code handle CoreEnsemble in the same way we now handle CoreTimeSeries and CoreSeismogram. That is, those symbols are treated as private with appropriate use of "_" to clarify the namespace (i.e. in python the a C++ CoreTimeSeries would only be visible as _CoreTimeSeries and a Seismogram would only be visible as _CoreSeismogram).

The second part of this proposed fix is not critical, BUT it is better to do it now than later. The longer we delay the more confusion it will create. It will definitely create a long-term maintenance issue if we don't deal with it.

pavlis commented 1 year ago

A not so small postscript to the previous. In trying to get graphics to work I came to realize we had some additional unresolved issues with graphics.py. There is a sizing problem that happens somehow with the imageplot version. Whoever gets assigned the task of fixing the above might want to look for other issues in graphcs.py. There are two I know of:

  1. As noted the imageplot scaling is not working right with plots of small number so things. I think there is a bug in the aspect ratio calculation I'm not finding.
  2. The much more difficult issue to fix is an inconsistency in how the class handles pyplot handles. I particularly note this method that always returns an array of pyplot handles:
    def _wtva_SeismogramEnsemble(self, d, fill):
        # implement by call to ExtractComponent and calling TimeSeriesEnsemble method 3 times
        # should return a list of 3 gcf handles
        figure_handles = []
        for k in range(3):
            dcomp = ExtractComponent(d, k)
            # figure_title='Component %d' % k
            # pyplot.figure(figure_title)
            pyplot.figure(k)
            handle = self._wtva(dcomp)
            figure_handles.append(handle)
        # pyplot.show()
        return figure_handles

    When that private method is called internally, however, it is handled this way:

    if isinstance(d, SeismogramEnsemble):
            if len(d.member) <= 0:
                raise IndexError(base_error + "ensemble container is empty")
            self._wtva_SeismogramEnsemble(d, fill)
            self._add_3C_titles()

    which means the handles are just dropped.

The fundamental issue is when I wrote this code I had planned ahead by trying to have the code return the plot handles. That could and should be a simple way to write an extension class of SeismicPlotter that had enhanced features. I obviously never tested that I did all that right. The basic message is we need to have both the classes in this file updated to consistently handle returns of pyplot handles from any plot.

pavlis commented 1 year ago

Let me add to the comment above about changing the name Ensemble in the C++ code base everywhere to CoreEnsemble. There is a bit more to this than just one name change.

  1. For consistency with atomic data objects we should ALSO change LoggingEnsemble to Ensemble. Then an Ensemble relative to CoreEnsemble is like TimeSeries is to CoreTimeSeries.
  2. The current layout has a mismatch with the atomic data we should fix. Specifically, the live/dead concept is currently implemented in LoggingEnsemble. Implementation of the live/dead concept should be in the base class for consistency with TimeSeries and Seismogram.

There is a residual, much thornier issue that would benefit from discussion before we proceed: ProcessingHistory. The TimeSeries and Seismogram objects inherit a ProcessingHistory object (the assumption is a TimeSeries, for example, "is a" ProcessingHistory" instead "has a" ProcessingHistory. That is particularly problematic because most people would say "has a" is a better answer than "is a". I take responsibility for that design decision because back about 2 years ago I had some issue I couldn't resolve doing that and decided to define it as "is a". In retrospect that was a bad idea, but in my view it would be way way to hard to back that out now as it would break a lot of things.

What this raises is if we should add a ProcessingHistory to the data formerly known as LoggingEnsemble for consistency with the atomic data? If the answer is yes we'd have to address the "is a" or "has a" question again, but I assert we should not include PrecessingHistory in the revised Ensemble==LoggingEnsemble. There is a fundamental reason: An ensemble will break the constraints in the design of ProcessingHistory. The reason is deeply in the weeds of the implementation. A simple summary is that the ProcessingHistory container assumes a tree structure that is limited to map-reduce. That is, the nodes of the tree are defined by a processing step and the tree is inverted meaning it begins with the leaves and terminates at the root. If an ensemble appeared in the middle of a chain the tree could grow both ways and I am pretty sure the implementation will not handle that situation. Hence, I think we should continue to take the position that only Seismogram and TimeSeries are atomic and ensembles are a different thing. Note, this issue will carry over to our under construction ArrayEnsemble (or whatever we call it too. I personally don't think the interest or need for processing history as a concept is worth the investment it would take to retrofit the entire code base to handle ensembles. @wangyinz please rule on this question for the record.

If you accept my plan above I suggest we address this issue in two steps:

  1. We continue, as decided in yesterday, to first apply the patch to graphics.py I described in the first box of this issue page.
  2. When the patch is done the job of fixing the C++ code base is likely best done by me (@pavlis). Fortunately, we can insulate the python code base from this fix as the pybind11 code can and already does alias the C++ names. i.e. I am 100% sure we can make a TimeSeriesEnsemble and SeismogramEnsemble 100% backward compatible with existing python code.
Aristoeu commented 1 year ago

This secondary copy is to convert the return type from mspasspy.ccore.seismic.CoreTimeSeriesEnsemble to TimeSeriesEnsemble. If there is no copy, _wtva() function in graphics.py (line 826) will raise RuntimeError because the type is not correct. _wtva() function does not support mspasspy.ccore.seismic.CoreTimeSeriesEnsemble type, it only supports TimeSeriesEnsemble type.

d = TimeSeriesEnsemble(bsc.EnsembleComponent(data, component))