mantidproject / mantid

Main repository for Mantid code
https://www.mantidproject.org
GNU General Public License v3.0
210 stars 121 forks source link

Filtering logs re-design #34794

Closed peterfpeterson closed 1 year ago

peterfpeterson commented 1 year ago

Is your feature request related to a problem? Please describe.

Example of the problem (assume orange parts are the parts to keep) image image The current filtering of logs inserts new values for the logs at the left and right edges of the times. This changes the simple mean, and expands the amount of memory needed for a single log incredibly.

The solution will give the correct results for (being python-centric) the mantid.api.Run to give back correct results for getProtonCharge() (which should be a sum of the proton_charge log) and getPropertyAsSingleValueWithTimeAveragedMean() and getTimeAveragedStd(). It will also simplify the code for filtering the events themselves so it is easier to understand the intent and refactor for better performance.

High level changes

  1. filtered logs will be copies of the original: no extra or removed values
  2. TimeROI will be added to the Run object to allow for calculating statistics and showing logs in the log viewer
  3. The code will no longer support splitting a single event into multiple output workspaces. This can be achieved by splitting the input a second time
  4. TimeROI will be used directly in the event filtering code
  5. TimeSplitter will be used directly in the event splitting code

Describe the solution you'd like

TimeROI object

The solution is to introduce a new object, TimeROI which contains information about when the workspace was active. This is in contrast to the current implementation which copies parts of the logs and adds artificial log values at the boundaries of when the workspace is active and not active.

The new object, TimeROI will use TimeSeriesPropertyBool as a private member for much of its implementation details. false will mean that the time period is to not be included and true means that the time period should be included. Events or log values outside of the time-range of the TimeROI are not included, but may need to be inspected when calculating simple mean values for things that log on change.

Since a user may filter/split a workspace multiple times, the TimeROI object needs to know how to be merged with another TimeROI. At least internally, it will have union and intersection methods.

Changes to the Run object

The Run object will get a TimeROI added to it so it can return the the correctly integrated time average mean and stddev. The TimeROI can be set or updated which will clear the LogManager.m_singleValueCache. Run will also get its own getStatistics(str) to facilitate returning the correct values there as well which should also update LogManager.m_singleValueCache. Setting the TimeROI on a Run will require recalculating the gd_prtn_chrg via Run.integrateProtonCharge(). The Run object will have to take on some amount of responsibility for TimeSeriesProperty.filteredValuesAsVector so the traditional mean, stddev, median, min, and max get the correct values. The main point is that often the last value before the TimeROI includes data will need to be returned additionally.

LogManager::getPropertyAsSingleValue (resolves as Run's method) eventually calls LogManager::convertTimeSeriesToDouble needs to be modified to calculate the time average value with the TimeROI taken into account.

TimeSeriesProperty::timeAverageValueAndStdDev should move into the Run or LogManager and account for the TimeROI.

TimeSeriesProperty.getStatistics() and related methods will get an optional TimeROI argument for allowing for getting a "correct" calculation rather than the default which will calculate as it currently does.

PropertyManager and TimeSeriesProperty will lose their methods filterByTime, splitByTime and filterByProperty

The FilteredTimeSeriesProperty will go away.

It appears that TimeROI is a variation on the concept of TimeSplitterType which is actually std::vector<SplittingInterval>. SplittingInterval is a start and stop type with an index of the destination workspace. If TimeROI is a tweaked version of TimeSeriesProperty<int> this behavior could be preserved with the exception of not allowing for a single event to be filtered into multiple outputs. That behavior would require creating a second TimeROI.

The "log viewer" in mantid will plot the full log of a workspace, with an extra option (i.e. checkbox) to show the regions where the TimeROI is not selecting the data with shading.

Changes to the Statistics object

Kernel::Statistics should get extra items for time-average-mean and time-average-stdard-deviation. The values for these if they are not calculated (like all the others) is nan.

Goiniometer object

Goniometer uses logs values

Run::calculateAverageGoniometerMatrix calls Run::getLogAsSingleValue several times and should be modified to call the new Run::getStatistics. Otherwise this is already using the Kernel::Math::TimeAveragedMean which is correct.

Instrument object

Instrument uses log values

The instrument uses log values via InstrumentDefinitionParser. The log is used in ParameterValue::operator double() which calls XMLInstrumentParameter.createParameterValue(Kernel::Property *) and has its own implementation of determining the value. This should be changed to XMLInstrumentParameter.createParameterValue(const API::Run &, const std:string &) which would use Run::getLogAsSingleValue(const string &, const StatisticType) to get the value then use the rest of the structure to use it appropriately.

TimeSplitter object

While TimeROI is used in filtering events, TimeSplitter is a new object that uses a TimeSeriesPropertyInt for much of its implementation and is used for splitting workspaces. A value of -1 is a special value for not including data into any output. It will have methods for converting itself into a TimeROI object for a given output workspace index. The TimeSplitter will also have a field(s) that captures the string from the InformationWorkspace that is passed between GenerateEventsFilter and FilterEvents. This is to simplify logic in FilterEvents. The TimeSplitter will generate an error if, at the end of construction, it has any times that are intended to go to multiple output workspaces.

Changes to EventList

EventList does much of the heavy lifting of filtering and splitting via its "helper" functions:

Many of these functions can be combined into three functions:

Changes to algorithms

This does not affect all algorithms in the events category, only those in event filtering category.

FilterEvents algorithm

This is the heart of where the other changes come together. link to docs

All 3 types of inputs (MatrixWorkspace, SplittersWorkspace, and TableWorkspace) will be converted into a TimeSplitter. This will also consume the information from InformationWorkspace (sets the title and comment of outputs), FilterByPulseTime (which will require using the proton charge log), OutputWorkspaceIndexedFrom1, OutputWorkspaceNames, RelativeTime, and FilterStartTime.

The events themselves will be split using the TimeSplitter and the new functions in EventList.

The Run object will be copied into all of the output workspaces, then the TimeSplitter will be asked to create a TimeROI for each output workspace index which will be added to each of the associated output workspace's Run object.

Implementation notes

The implementation should be in the following order

  1. Create TimeROI object
  2. Create functionality to take various was of specifying filtering and convert to TimeROI.
  3. Modify Statistics object
  4. Modify Run object to allow for TimeROI to be attached
  5. Move functionality from TimeSeriesProperty to Run. At this stage previous implementation can stay in-place with deprecation annotations
  6. Write new methods to filter events using TimeROI and add deprecation annotations to old methods (copy if, and erase)
  7. Create TimeSplitter object
  8. Create functionality to take various existing ways of specifying splitting and convert to TimeSplitter. This will throw an exception if a single time period can go to more than one output workspace index.
  9. Write new methods to split using TimeSplitter and add deprecation annotations to old methods
  10. Modify algorithms to move from old to new functionality
  11. Track down all calls to deprecated functions and replace with new API calls
  12. Remove old deprecated methods

The functionality to add TimeROI and add TimeSplitter can happen in parallel, but must be before modifying the algorithms.

Things that can be done out of order

Additional context

Since the changes will be made to algorithms incrementally the deprecated annotation can be used with -Wdeprecated-declarations to track down places that are using the "old" functionality.

The issue discovered in #27556 should be fixed as part of this work

The equations for calculating the time-average mean with a TimeROI are in section IV of this paper with M(t) representing the TimeROI with appendix A containing an example. The challenge is to discretize the equations.

This is associated with:

jmborr commented 1 year ago

@peterfpeterson It's unclear to me why the TimeMask data structure is a TimeSeriesPropertyBool instead of a TimeSeriesPropertyFloat. In the picture below I drew a mask that requires six floats (the time boundaries).

Capture

Nevermind, I was ignorant that the TimeSeriesPropertyBool is a list of time points plus a value associated to each time point

peterfpeterson commented 1 year ago

I appreciate that you're looking into how the TimeMask would actually work. A lot of these cases should be turned into unit tests for Run::timeAverageMean(const string &)

AndreiSavici commented 1 year ago

In addition, some unit, docs, or system tests might fail. We should keep track of these, and see if the changes make sense (don't just use new numbers for the output).

peterfpeterson commented 1 year ago

Review Notes:

The name TimeMask is being changed to TimeROI to be more consistent with other mantid language.

martyngigg commented 1 year ago

I generally like the concept here. The current filtering is quite confusing to understand. A question about the high-level changes:

High level changes

filtered logs will be copies of the original: no extra or removed values

It feels like the filtered log should just be a view onto the original data. Do we need to copy or can we just reference to save even more memory?

peterfpeterson commented 1 year ago

That is a good idea. It would be facilitated by the handles on the logs being copy-on-write pointers.

mantid-builder commented 1 year ago

Fixed by #34938