mspass-team / mspass

Massive Parallel Analysis System for Seismologists
https://mspass.org
BSD 3-Clause "New" or "Revised" License

How to tag data saved at different stages of processing? #43

Open pavlis opened 4 years ago

pavlis commented 4 years ago

We need to come up with some ideas on how to tag objects to define what has been done to them to get from some initial stage to their current state. "Provenance aware" was a key phrase in your original proposal, and this is where the rubber meets the road on that goal.

Now that I have worked with MongoDB for a while I have some ideas on how we might approach this. I'm still a rookie, however, so I'd like to hear your ideas before I start casting about for fish in waters I don't know that well. I'm thinking you may have thought about this some, since it was a key goal in your funded proposal, so I'll let you go first.

pavlis commented 4 years ago

Ok, here is what I can guess for memory use of copying compared to shared_ptr.

In the current implementation a ProcessingHistoryRecord has four main attributes that are required for each algorithmic step:

  ProcessingStatus status;
  std::string algorithm;
  std::string instance;
  std::string id;

The trouble comes from inputs which really should be:

vector<ProcessingHistory> inputs;

or

vector<list<ProcessingHistoryRecord>> inputs;

instead of what I had above. Either of those would work for the copy version.

For shared_ptr that vector would need to be:

vector<shared_ptr<list<ProcessingHistoryRecord>>> inputs;

Both models use the vector and linked list containers, which carry some intrinsic overhead, but that overhead is equal for both.

Each copy of a ProcessingHistoryRecord holds the four attributes above, totaling about 60 bytes plus overhead.

The shared_ptr class is not just a raw pointer. The documentation says it requires storage for at least two pointers, so 2*8 = 16 bytes. That is about a factor of 4 difference.
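
(For a quick sanity check of these numbers on any given platform, a toy program like the one below can be compiled and run. It is only a rough stand-in for ProcessingHistoryRecord; actual sizes are implementation dependent, and std::string alone is typically 24-32 bytes before any heap allocation, so the copy number may come out larger than my 60-byte estimate.)

#include <iostream>
#include <list>
#include <memory>
#include <string>

/* Rough stand-in for ProcessingHistoryRecord: an enum plus three strings.
   Not the real class - just enough to compare footprints. */
enum class Status {RAW, ORIGIN, VOLATILE, SAVED};
struct RecordApprox
{
  Status status;
  std::string algorithm;
  std::string instance;
  std::string id;
};

int main()
{
  std::cout << "bytes per copied record:      " << sizeof(RecordApprox) << std::endl;
  std::cout << "bytes per shared_ptr to list: "
            << sizeof(std::shared_ptr<std::list<RecordApprox>>) << std::endl;
  return 0;
}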

What is the largest history chain we can imagine? I worked on this a bit and think the history mechanism will scale by this approximate scaling law:

  M ~ prod_{k=1..p} (h * n_k * s * r) + b * sum_{k=1..p} n_k

where M is total memory use. "prod" spells out the pi symbol used in math to denote a product sequence - like sum, but the components are multiplied instead of added. p is the number of processing steps in a workflow, so there is one term per processing step; n_k is the number of seismograms at step k, h is the number of bytes per history record, s is the number of sources combined in a step, r is the number of receivers combined in a step, and b is the base memory use for the containers. The last term is an upper bound, as during a run the container size is actually only b*n_k at any given step; I couldn't figure out how to easily state that with an equation in symbols I knew. It doesn't matter much, as the product term will normally dominate anyway.

First, evaluate my mathematical model skeptically. If it is correct, however, it illustrates a couple of things:

  1. The problem is less h, the number of bytes per history record, than the product relationship. The worst case I can think of (other than an iterative algorithm) comes from something like a variant of the pwmig workflow I've thought of where you have three levels of stack: (a) source-side stacks, (b) deconvolution in the way Scott Neal originally developed it (decon of a pseudostation stack with no moveout), (c) plane wave decomposition to create more seismograms than input data, and (d) plane wave migration. I could try some real numbers if we need to at some point (e.g. a publication discussing our final design), but for now I think the key point is that the product relationship is the issue, not the size per record.
  2. Because memory use could get out of control, using shared_ptr as the mechanism is the right choice even though the difference seems to be only about a factor of 4. The model shows why h is a key factor - as a common factor in the product it scales as a power law with the number of processing steps.
  3. A solution to the memory bloat problem we MUST implement is a save-to-db-and-clear mechanism. That is, we'll need to develop a python module that dumps the current history data to MongoDB and resets the history chain, providing a mechanism to link the newly created chain to the stored history data. Users will need to be warned to use that tool if they are doing large stacks or iterative algorithms.

My next step is clear - I'll continue to try to finalize the new api revisions using the shared_ptr model. Give some thought to how we need to save the history data in mongo to allow the essential clear mechanism noted immediately above.

wangyinz commented 4 years ago

That is indeed a very good point. Well, we all knew there would be overhead from the history mechanism, but what you wrote above explicitly spells out how much that could be. I can't agree more that a clear mechanism will be a must to avoid potential memory problems.

Because we have already separated database-related functions from the C++ code, I think a sensible way that follows our design will be implementing a dump method and a clear method on the C++ side. Then, we will call those methods from Python when saving the history records. More importantly, on the C++ side a reset method will be needed. To my understanding, a reset is actually not too different from a clear method: all we need to do is drop the inputs vector so that the current data object will appear as the parent for subsequent processing steps. This will only work after the dropped records (as well as the current data object) are saved to the db.
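
Just to make sure we are on the same page, here is a rough, untested sketch of what I mean. The member and method names are placeholders (not a proposed api), and ProcessingHistoryRecord is reduced to a minimal stand-in:

#include <list>
#include <memory>
#include <string>
#include <vector>

/* Minimal stand-in with the four attributes discussed above. */
struct ProcessingHistoryRecord
{
  int status;   // ProcessingStatus enum in the real code
  std::string algorithm, instance, id;
};

class HistorySketch
{
public:
  /* dump: return a copy of the chains so the Python side can push them to MongoDB. */
  std::vector<std::list<ProcessingHistoryRecord>> dump() const
  {
    std::vector<std::list<ProcessingHistoryRecord>> result;
    for (const auto& chain : inputs)
      if (chain) result.push_back(*chain);
    return result;
  }
  /* clear/reset: drop the inputs so this object becomes the parent (an origin)
     for subsequent processing steps.  Should only be called after dump's
     output has been saved to the db. */
  void clear()
  {
    inputs.clear();
  }
private:
  std::vector<std::shared_ptr<std::list<ProcessingHistoryRecord>>> inputs;
};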

The above will only work for a workflow executed from the Python side. The memory bloat problem still exists for something like a large iterative algorithm running on the C++ side. I don't think there is a way to resolve that without implementing database operations on the C++ side. But we don't necessarily have to solve all the problems under all scenarios - we can do it down the road if that turns out to be what users would like to see. For now, I think we can just make the history mechanism optional for such an algorithm.

p.s. based on the equation, if we use the shorter objectid instead of uuid, we can potentially further reduce memory use significantly.

pavlis commented 4 years ago

Making progress on this. I got the main classes to at least compile with the api changes, but when I got to converting existing algorithms to the new api I ran into an issue I would like your input on before I proceed.

The issue is that to make ProcessingHistory lightweight, the data stored and the api to access what I've called a ProcessingHistoryRecord in the C++ code are obscure at best. Furthermore, when I looked at the old code for the 3C agc, for example, I realized it demanded an api change to run. That is, if agc is to be set up to create a valid history record it needs to set three attributes that define that record: algorithm, instance, and id (status is internal). algorithm is not a problem since any algorithm should be able to define its own unique name. id is supposed to be set by the algorithm with a call to new_id so it isn't an issue - recall id is the uuid assigned to this generation of processing of data in a chain, so each algorithm will need to generate a new one to assign to its output. The issue is instance. instance is set as an std::string, so it could either be a simple thing like "0" or "1" or it could be a mongodb objectid - it is not clear yet which we should use, for reasons I'll expand on in a moment.

I see two ways to solve this issue:

  1. Just add instance as an argument to any algorithm and assume the python script will handle the issue. That is the easy path for me, but it would create a major point of confusion for any user using this feature.
  2. Enhance the existing concept I have called ProcessManager and pass a reference to ProcessManager to provide a mechanism to handle creating a correct history record in a standardized way.

Either way an algorithm using the history mechanism will need some global method to sort out what "instance" means; in the first case we put it on the python side and in the second we put it on the C++ side. I tend to prefer 1 because doing this on the C++ side causes problems unless we want to tackle the C++ mongodb api. In particular, if we build this in python it would be clean to have "instance" be an objectid string. With C++ that would require a lot of work. Do you concur?

If we accept that as the right design, I suggest that means we will need to design a ProcessManager class in python. The C++ definition above provides a starting point. In this model a C++ implementation like agc would need a python wrapper to get instance. I think a further advantage this might provide is a cleaner way to add a history mechanism to an obspy Trace object - that is pure conjecture.

If you concur with this suggestion we will do two things:

  1. All the algorithms implemented in C++ that we currently have in mspass would need to add an instance argument and code to handle managing history, OR be associated with a python wrapper that handles the history and instance. I tend to prefer the latter model.
  2. No matter how we do 1, we will need a decision about how we should do ProcessManager. If we do it in python, which I think is the right model, I would suggest you design and write it, building on the dialogue above.

wangyinz commented 4 years ago

Yes, I concur with all the suggestions above. We will handle the global-level history in Python. I am not a hundred percent sure what exactly you mean by associating with a python wrapper, but this is how I would design it: first of all, the global-level history component (i.e. instance) should be optional in the C++ api so that users may presumably use it to build something simple without getting into the weeds. Actually, in such a case the object-level history may also be optional. Then, we only expose the instance as an argument with pybind11 so the global-level history mechanism will work on the Python side.
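
For example (just a toy sketch, not the actual MsPASS wrappers), making instance optional with pybind11 could look like this:

#include <pybind11/pybind11.h>
#include <string>
namespace py = pybind11;

/* Toy stand-in for a wrapped C++ algorithm; the real functions live in ccore. */
std::string toy_algorithm(const double twin, const std::string& instance)
{
  /* real code would process data here and, if instance is non-empty,
     create a global-level history entry tagged with it */
  return instance.empty() ? "no global history" : "tagged with " + instance;
}

PYBIND11_MODULE(history_sketch, m)
{
  /* Defaulting instance to "" keeps the global-level history optional:
     plain callers can ignore it while a ProcessManager wrapper passes a
     real ObjectID string. */
  m.def("toy_algorithm", &toy_algorithm,
        py::arg("twin"), py::arg("instance") = "");
}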

In terms of a process manager in Python, I tend to first explore existing packages that do similar things, and I found an interesting project here. Note that it appears to be an inactive project, so I do not expect to use it directly. What's interesting is that it shows a way to do data provenance with Spark that is independent of the underlying applications it executes. This is what we want in our global-level history management: to handle not only native algorithms but also external ones that users might run under the Spark & MongoDB framework.

pavlis commented 4 years ago

Ok, I have a partially completed implementation of the C++ code with the new API revisions. I did not issue a pull request because what is there is totally untested - it compiles but I know it has some issues. No test code has been run, and I had to disable the pickle wrappers for Seismogram and TimeSeries because I wasn't sure how to serialize the shared_ptr code (which seems intrinsically dangerous, by the way).

The main purpose of pushing the new branch (ccore_APIchanges) is to allow you to see what I've done and look for anything you can anticipate as a problem before we start fine tuning this.

I suggest you spend some time looking through this to get a quick idea of what I've done looking for any obvious flaws. Before you spend too much time on it, however, it might make more sense for us to have a zoom call today so I can lead you through this. There is a lot here and the code and my comments will not be sufficient to educate you on the ideas. Once we get that finalized I'll need to update the documentation for sure because this has some major changes.

One last thing - you should largely look at ProcessingManager as a prototype API for a python implementation of the concept it tried to implement. I think we decided that would become a python thing in our exchanges above on this thread.

pavlis commented 4 years ago

In our call yesterday we discussed an idea for avoiding memory bloat when preserving processing history: always dump history to mongodb when the data are serialized. Looking at the implementation while considering that idea, I realized it would not work without some global mechanism giving the ProcessingHistory class a hook into MongoDB. The issue is that if we use pickle's save and load, the class would not normally have access to MongoDB unless there were a hook inside ProcessingHistory itself. I do not think we want that.

I am also very concerned we are going into a dark alley if we continue to try to implement the ProcessingHistory object with a smart pointer (shared_ptr) instead of a copy operation. Unless you strongly object, I'm going to replace all the shared_ptr references with a copy mechanism. i.e. instead of the history data being referenced in C++ like this:

list<shared_ptr<ProcessingHistoryRecord>> 

they will be referenced like this:

list<ProcessingHistoryRecord> 

We will then want to implement an automatic save/clear of ProcessingHistory by writers. That is, a writer should have an optional save_history=True argument. When the default of True is set, history will be dumped and cleared automatically; if the history chain is empty it would do nothing with the history data. Only if the user sets this to False would the history data be retained in memory on a save (again, if the history chain were empty it would do nothing). I think with this approach we just need to warn users that if they have an algorithm generating massive history chains they will need to use a clear mechanism to avoid memory bloat.

We will probably also want a writer function that doesn't actually save data but does save the history data, clear the chains, and set them up as an origin. If you agree to that, we probably need to define that as a different state than ORIGIN or RAW. ORIGIN and RAW imply the top of the chain is linked to a waveform that was saved, but if we only save the history data that would be misleading at best. If you concur I will add that option to ProcessingHistory and add a new method that handles that new case. In any case, it is imperative that the id in a history record defined by a writer (status==SAVED) be the objectid of the saved history data, NOT the saved data.
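
To be concrete, the kind of enum change I have in mind looks roughly like this (the actual ProcessingStatus enum in ccore may differ, and the new value name here is just a placeholder):

/* Roughly what I have in mind; names other than the four existing states
   are hypothetical. */
enum class ProcessingStatus
{
  RAW,          // top of chain is raw data read from storage
  ORIGIN,       // top of chain is linked to a saved waveform
  VOLATILE,     // intermediate result, not saved
  SAVED,        // written by a writer; id must be the objectid of the saved
                // history document, NOT the saved data
  SAVED_HISTORY_ONLY  // proposed: history was saved and cleared, but the
                      // waveform data themselves were not saved
};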

I just pushed an update to the ccore_APIchanges branch that fixes the more trivial problems we discovered yesterday. I want to clean up this shared_ptr thing before I start to construct any kind of test programs.

wangyinz commented 4 years ago

Well, I am OK with the copy mechanism. It is straightforward but just less efficient in local memory usage especially for iterative algorithms applied to large data sets.

Actually, the root of the problem is not that the smart pointers cannot be serialized/restored easily. It is that serialization/deserialization will be very expensive for a large and complex history chain, because each ProcessingHistoryRecord itself contains a vector<shared_ptr<list<ProcessingHistoryRecord>>> inputs. By expensive, I think you might be thinking that the vector of shared_ptr implies a large number of dereferencing operations when serializing and smart-pointer construction overheads when deserializing, and those could be addressed by the copy mechanism at the cost of using more local memory. But what I am still worried about is not such time overheads but really the space overheads from the potentially bloated history records. This is more of a concern because the serialized object will likely be passed to a different node through the network (which is what Spark does with pickle), and we know from experience the overheads (in both time and space) in communication will likely be the bottleneck that dominates. That is what the clear mechanism will address. Thus, I don't think having the smart pointers makes any difference for the overheads in the distributed framework; it is only a trade-off between local time overhead and local space overhead.

To address the communication overhead, I think the save history mechanism you described is a good one. We won't know how well it works until we can test it out with a real workflow, but I think as long as we are following the principle of separating object level history records from the global level, we should have no problems anyway.

pavlis commented 4 years ago

As noted in an offline email exchange, I realized my initial design for the new version of history had some fundamental flaws. I thought this through some more and suggest the following C++ data structures be managed in a protected data area of the ProcessingHistory class designed earlier. There may be a few minor changes to the api of that class, but I don't think they will be huge. This just implements the concepts differently. Let's just start with the code:

/* This map defines connections of each data object to others.  Key is the
uuid of a given object and the leaves are the inputs used to create it. */
multimap<string,string> nodes;

/* This map defines the process used to create a given uuid (the key). */
map<string,ProcessDefinition> process_data;
/* map defining the status of each uuid */
map<string,ProcessingStatus>  status;
/* This one may not be necessary, but it defines the number of processing
stages required to create each data object keyed by uuid */
map<string,int> stage;

The nodes multimap is like inputs was before, but centers on uuids. Any time an algorithm creates a new set of data it would get itself a new uuid and then create an entry in the nodes multimap for each parent uuid. What is really good about that in an implementation is that the multimap can be appended to as inputs are worked through by an algorithm. That will make any process easier to handle because if a data object is dropped for any reason it just doesn't get added to nodes. This has much better scaling than the previous design - the multimap scales only slightly faster than the total number of inputs used to create the current data.
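
In code the append pattern would look something like this (untested sketch; the uuid strings are placeholders and in practice the new uuid would come from new_id):

#include <map>
#include <string>
#include <utility>
#include <vector>

int main()
{
  std::multimap<std::string, std::string> nodes;   // new uuid -> parent uuid
  std::string new_uuid = "uuid-of-new-output";     // from new_id() in practice
  std::vector<std::string> parents = {"uuid-A", "uuid-B", "uuid-C"};
  for (const auto& parent : parents)
  {
    /* an input dropped by the algorithm simply never gets inserted */
    nodes.insert(std::make_pair(new_uuid, parent));
  }
  return 0;
}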

process_data is keyed by uuid and defines the algorithm instance used to create those data. We only need an objectid string to define an instance. The wrapper for an algorithm should take the uuid of newly created data and create an entry for each object it creates. This map will then scale with the total number of inputs used to create a given data object.

status is a better way to define this property than mixing it into the ProcessingHistoryRecord as was done in the previous design.

stage is optional. It is not clear if it is necessary, but it might be a good sanity check. It would certainly make it easier to ask how many stages of processing the data have been through. Without it, a method to answer that question would have to reconstruct the processing tree.

wangyinz commented 4 years ago

This design of ProcessingHistory is definitely cleaner, but I am confused about how ProcessingHistoryRecord will be linked in this new design. It seems the two are now only weakly associated through uuids, so we need to think about how a ProcessingHistoryRecord can be referenced when we need to save the history records. Also, I just realized this new design is conceptually similar to the previous one with shared_ptr, where the pointer is replaced by a uuid.

pavlis commented 4 years ago

I see your confusion. I didn't define ProcessDefinition - a cut-and-paste error. This is it without any decoration (i.e. just a struct):

class ProcessDefinition
{
  string algorithm;
  string oidstr;   // ObjectID string of the parameters used in that instance
};

So a method would look up a ProcessDefinition record using a uuid. This all just exploits the idea that a uuid is unique and it simplifies the data structures, although definitely not all operations.
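
A lookup would then be a trivial map find, something like this sketch (types simplified to the bare struct above, and the uuid and oid strings are placeholders):

#include <iostream>
#include <map>
#include <string>

struct ProcessDefinition
{
  std::string algorithm;
  std::string oidstr;   // ObjectID string of the parameters used
};

int main()
{
  std::map<std::string, ProcessDefinition> process_data;
  process_data["some-uuid"] = ProcessDefinition{"agc", "5f1e..."};
  auto it = process_data.find("some-uuid");
  if (it != process_data.end())
    std::cout << it->second.algorithm << ":" << it->second.oidstr << std::endl;
  return 0;
}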

I haven't written any of this down yet, but in mulling this over I realized we will eventually need (not necessarily immediately) a pair of children of ProcessingHistory. A HistoryPrinter would have methods to make some text dump of the history data; I may write a crude version of that one right away just for testing. A tougher problem would be a HistoryVisualizer that would use some package for drawing tree structures to graphically show the history data. There appear to be lots of packages out there for drawing tree structures because that is such a common visualization need. If you have any experience or could ask someone for advice on that, it might be useful for debugging this stuff.

wangyinz commented 4 years ago

hmmm... I am not sure I get it without looking at the whole thing. What confuses me is that I don't see anything with the type of ProcessingHistoryRecord, so I assume when trying to access the history we first find the relevant uuids in ProcessingHistory and then retrieve the ProcessingHistoryRecord object with that uuid. So, how can we identify the object with a uuid? I guess some kind of search method will be used, but then we will need a container for all the ProcessingHistoryRecord objects. Is that right?

wangyinz commented 4 years ago

Or, maybe we don't need ProcessingHistoryRecord object anymore? Everything about the structure of the history tree is now in ProcessingHistory and we use ProcessDefinition to link them to the algorithm and parameters.

pavlis commented 4 years ago

Ouch, I think I may have just uncovered a fundamental flaw in the new design for the history mechanism. I was using a std::multimap as the data structure to maintain the processing history tree. The "ouch" is that I just realized I do not see support in pybind11 for std::multimap. A reason is likely that there is no clean mapping of a multimap to any standard python object. Have you ever used a multimap? It is just an associative array (map) that allows multiple entries with the same key instead of exactly one. Before I do a big web search or start over, perhaps you have some guidance here. Note that in the current implementation I have these protected multimaps:

protected:
  /* This map defines connections of each data object to others.  Key is the
  uuid of a given object and the leaves define the inputs used to create it. */
  std::multimap<std::string,mspass::NodeData> nodes;
  /* This map connects each algorithm to uuids it created linked to the
  current data.  Note the indexing is reversed from nodes.  This
  is used to find uuids that a given algorithm produced.  Note the
  comparison function required for weak ordering is defined above. */
  std::multimap<mspass::ProcessDefinition,std::string,AlgorithmCompare> process_data;

I am at the point where I am pretty confident this provides a lightweight solution for our history mechanism. I should point out that NodeData and ProcessDefinition split up the same data I previously kept in ProcessingHistoryRecord. I can post more details if you want, but I think right now that would only confuse this issue.

I hit this uh-oh moment when I went to implement methods that would return a representation of these data structures that could be understood by python. When I assumed there were wrappers for a multimap, I planned to just get a deep copy of them and write a python function to translate them and post the data to MongoDB. That is, I thought I could just do:

multimap<string,NodeData> get_nodedata()
{
  return nodes;
};

But unless pybind11 has something undocumented for a std::multimap, that won't work.

In writing this I think a solution may be to return the nodes and process_data contents as strings that can be passed directly from python to MongoDB. That might be prudent anyway.

In any case, I could use some input on this before I proceed.

wangyinz commented 4 years ago

Yeah, I did a quick search and I think you are right: there is nothing about multimap to be found in pybind11, and I don't even see any discussions about it. I guess the only thing we can do is manually implement a type_caster<multimap> similar to what we have for boost::any. Based on this discussion, we could use a defaultdict on the Python side.
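
Alternatively (just an untested sketch, and NodeData here is a reduced stand-in), we could regroup the multimap on the C++ side into containers pybind11/stl.h already handles. Assuming NodeData itself gets a pybind11 binding, the result would arrive in Python as a dict of lists, which is essentially the defaultdict idea:

#include <map>
#include <string>
#include <vector>

struct NodeData   // reduced stand-in; the real class has more members
{
  std::string uuid;
  int stage;
};

/* Regroup the multimap into a map of vectors, which pybind11/stl.h converts
   to a Python dict of lists without any custom type_caster. */
std::map<std::string, std::vector<NodeData>>
group_nodes(const std::multimap<std::string, NodeData>& nodes)
{
  std::map<std::string, std::vector<NodeData>> grouped;
  for (const auto& kv : nodes)
    grouped[kv.first].push_back(kv.second);
  return grouped;
}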

pavlis commented 4 years ago

I suspected as much from the brief search I did - i.e. that python does not have a multimap equivalent.

I thought of a way to keep the C++ layer pure and move the whole problem to a single procedure. This idea is an extension of what I mentioned above.

  1. Implement getters in ProcessingHistory that do what I said above. i.e. they return a copy of the nodes and process_data multimaps.
  2. Implement a C++ procedure (function) that takes a reference to a ProcessingHistory data object, calls the getters, and converts the contents to a form that can be passed to python for saving to MongoDB. Presumably a clean way to do that is to translate the multimap to a JSON structure, which I think can be dumped efficiently to MongoDB. The actual format is an implementation detail. The key idea is to keep the C++ objects pure and use a new layer of C++ and python code to handle moving the actual history data to MongoDB.

This would have an advantage of making the python layer less obscure (if we do the C++ layer right) and not having it depend on a complicated pybind11 construct that could prove fragile down the road.

No matter how we actually do this, we will need a schema for storing the multimap concept. Mongo will handle the multiple key issue with no problem through the object id if we just create records like this:

{
  wf_uuid : XXXXXXX,
  status : RAW,
  stage : 0,
  type : Seismogram,
  uuid : YYYYYYYY
}
{
  wf_uuid : XXXXXXX,
  status : VOLATILE,
  stage : 1,
  type : Seismogram,
  uuid : ZZZZZ
}

Incomplete, but the point I am making is that I think you could save two records like that with the same value of wf_uuid and Mongo would store them as two documents with different ObjectIDs. A query for a given wf_uuid should then return multiple documents that share that wf_uuid.

Do I have that right?

I need to post the details of NodeData and ProcessDefinition, but I need to go to a different computer to do that.

pavlis commented 4 years ago

Here is the current definition of NodeData:

/*! \brief Holds properties of data used as input to algorithm that created this object.

The implementation here uses a multimap to define parents of each uuid
in a history chain.   This class is used mainly internally for ProcessingHistory
to maintain that data.   It will be visible to C++ programs but will not
be visible in python.  One of these entries is created for each parent
data used to create the current data.
*/
class NodeData
{
public:
  /*! status definition of the parent. */
  mspass::ProcessingStatus status;
  /*! uuid of the parent. */
  std::string uuid;
  /*! This enum can be used to track changes in data type.  */
  mspass::AtomicType type;
  /*! Integer count of the number of processing steps applied to create this parent.*/
  int stage;
  /* These standard elements could be defaulted, but we implement them
  explicitly for clarity - implemented in the cc file. */
  NodeData();
  NodeData(const NodeData& parent);
  NodeData& operator=(const NodeData& parent);
};

and this is the current version of ProcessDefinition:

class ProcessDefinition
{
public:

  /*! \brief Name of this algorithm.

  We use the concept that every processing algorithm has a name keyword
  that togther with an id and/or instance defines a unique definition of
  the algorithm and a set of input parameters that define the algorithm's
  behavior.
  */
  std::string algorithm;
  /*! id string to identify this instance of algorithm.

  Only assumption is that the combination of algorithm and id provide
  a unique specification of a particular instance of an algorithm.  That
  means some algorithm and a particular set of control parameters that
  control the outcome of the algorithm.  In MsPASS this is usually the
  ObjectID string of saved parameters in MongoDB, but users can use any
  method they wish to describe a unique combination of parameters and an
  algorithm implementation. */
  std::string id;
  ProcessDefinition()
  {
    algorithm="UNDEFINED";
    id="UNDEFINED";
  }
  ProcessDefinition(const std::string alg,const std::string algid)
  {
    algorithm=alg;
    id=algid;
  };
  /* These could probably be defaulted but best to be explicit*/
  /*! Construct from a ProcessingHistoryRecord, copying its algorithm and id.*/
  ProcessDefinition(const ProcessingHistoryRecord& parent)
  {
    algorithm=parent.algorithm;
    id=parent.id;
  };
  /*! Assign from a ProcessingHistoryRecord, copying its algorithm and id.

  A self-assignment test is unnecessary here because the argument is a
  different type. */
  ProcessDefinition& operator=(const ProcessingHistoryRecord& parent)
  {
    algorithm=parent.algorithm;
    id=parent.id;
    return *this;
  };
private:
  friend class boost::serialization::access;
    template<class Archive>
       void serialize(Archive& ar,const unsigned int version)
    {
      ar & algorithm;
      ar & id;
    }
};

Note this little gem, needed to define a compare functor for ProcessDefinition:

/*! \brief Compare operator for ProcessDefinition.

We use a multimap to produce a backward reference of all uuids that
were the output of a given algorithm instance.   This operator assumes
the id of the ProcessDefinition alone defines a unique key.
*/
class AlgorithmCompare
{
public:
  bool operator()(const ProcessDefinition p1, const ProcessDefinition p2) const
  {
    return(p1.id < p2.id);
  };
};

Note it basically uses the id as the key, carrying the algorithm along as baggage.
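
To make the intended use concrete, here is an untested sketch of how a backward lookup could work with this index. It assumes the ProcessDefinition and AlgorithmCompare definitions above, and the function name is just a placeholder:

#include <list>
#include <map>
#include <string>

/* All uuids produced by a given algorithm instance, found via equal_range
   on the reversed index. */
std::list<std::string> uuids_produced_by(
    const std::multimap<ProcessDefinition, std::string, AlgorithmCompare>& process_data,
    const std::string alg, const std::string algid)
{
  std::list<std::string> result;
  ProcessDefinition key(alg, algid);          // id is the effective key
  auto range = process_data.equal_range(key); // all entries with matching id
  for (auto it = range.first; it != range.second; ++it)
    result.push_back(it->second);
  return result;   // empty if that algorithm instance touched none of these data
}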

pavlis commented 4 years ago

Update on this development. I got the above to the point it would compile and link the .so file for python wrappers. I then again started to write a C++ test program to test ProcessingHistory without the added complication of python. As with version 1 this again revealed problems. I'm not going back to the drawing board again like I did with v1, but I have to backtrack and make a new set of major revisions. For the record, here is my thinking about what worked, didn't work, and what needs to be fixed.

  1. What I am 100% sure of now is that the multimap container keyed by uuid is an extremely fast and efficient way to store history data. It is doubly valuable because it will serialize easily with boost for use with spark. Users won't need to know that detail, and that is good.
  2. The trick you may not yet realize is that it is now 100% clear to me that a general tree structure is "relatively straightforward" to construct from the multimap. I put that in quotes because it is a matter of perspective: it is a clean implementation if you understand pointers and recursion. In fact, here is the class I started to implement before I realized the issue noted below that is causing the next round of rewrites:
    class HisNode
    {
    public:
    string uuid;
    vector<pair<NodeData*,ProcessDefinition*>> links;
    };
    /* Crude, prototype implementation of a general tree structure
    interface created from processing history */
    class GTree
    {
    public:
    GTree(const ProcessingHistory& h);
    ~GTree();
    void print_tree();
    private:
    multimap<string,NodeData> nodes;
    /* Used in constructor - this is the recursive function that creates the tree.*/
    void grow(const string rootuuid);
    /* This could be a variety of containers, but I use a simple list -
    all search start from the head of the list (root node)*/
    list<HisNode> tree;
    };

    All the hard part of this algorithm is in the grow method, which needs to be recursive. Basically the constructor copies the nodes data and grow creates the chain of links that defines the GTree object.

  3. The revision I found necessary at this stage is that the current implementation has not just the multimap of NodeData but this ugly construct:

    std::multimap<mspass::ProcessDefinition,std::string,AlgorithmCompare> process_data;

    I created the monstrosity because I originally was thinking it was important to implement the following two methods defined in the current implementation:

    /* These are a set of low level getters for minimal find operations
    on history.  The intent is that more complex finds would use
    combinations of calls to these getters to build more complete
    reconstructions. */
    /*! \brief Return a list of algorithms applied to produce this data.
    
    The list of algorithms returned format is implementation dependent.
    Here the unique id for an algorithm is an id string, but id strings
    are not useful to humans.  Hence, the return here is a list of names
    in the format algorithm:id where : is that character inserted
    to produce a single string (e.g. "agc:1234").   Warning:  the
    list is NOT in process order, but weak order defined for the multimap
    container.   We don't include more complex algorithms to reduce the
    complexity of this low level object.  */
    list<std::string> algorithm_history() const;
    /*! \brief Return uuids of all data handled by a given processing algorithm that
    are parents of this object.
    
    This method is an extended version of algorithm_history.   It returns a list
    of uuids matching the algorithm id passed as an argument.  Note for
    interactive data exploration a typical usage would be to call algorithm_history
    to get ids and then call this method to get the uuids with which it is
    associated.  For linear workflows the return will be equivalent to all
    inputs passed through that algorithm.  For iterative algorithms the list
    can be much longer as each pass will post new uuids for the same
    algorithm.
    
    \param alg is the algorithm name to search for. (Note:  ignored in this
    implementation but will make any application more readable.)
    \param algid is the id string used to uniquely define a algorithm instance.
    
    \return list of uuids handled by that instance of that algorithm.  Silently
    returns an empty list if there is no match
    */
    list<std::string> data_processed_by(const std::string alg,
      const std::string algid) const;

    Those methods were more efficient if the indexing was over ProcessDefinition. What I realized was that this created excess baggage in the interface that is much more easily handled outside of ProcessingHistory. In fact, what revision 2 will do is make those two methods become C++ functions. We can just wrap them with pybind11 as functions called "algorithm_history" and "data_processed_by". The only change would be the need to add an argument for the ProcessingHistory object to which the function is applied.

  4. The final thing this brings me to is that the data held in ProcessDefinition can and should just be added to NodeData. Then the internal structure of ProcessingHistory is really simple: a single multimap keyed by uuid with NodeData entries. The model would then be that reconstructing the general tree defining the processing history becomes the job of a separate class that is a fancier version of the GTree prototype above. What I plan is to make that change and then actually implement the crude GTree class for testing to make sure this really does work - my 100% should maybe be stated as 99% sure this will work, since I don't actually have a working prototype; it is just that I see no reason right now it shouldn't work.
  5. I ran across the Graphviz package, which has a "dot" language that might be a quick and easy solution for building a graphical representation of the history chain from a more robust GTree implementation. Graphviz is heavily used, for example, by doxygen. One simple model would be a graph method in GTree that uses Graphviz to build a figure to visualize the processing chain.
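
As a rough illustration of item 5, here is an untested sketch of dumping the nodes data in the dot language (NodeData is reduced to just the uuid member for brevity; the output can be rendered with, e.g., dot -Tpng):

#include <iostream>
#include <map>
#include <string>

struct NodeData { std::string uuid; };   // reduced stand-in

void print_dot(const std::multimap<std::string, NodeData>& nodes)
{
  std::cout << "digraph history {" << std::endl;
  for (const auto& kv : nodes)
    /* edge from each parent (input) uuid to the uuid it helped create */
    std::cout << "  \"" << kv.second.uuid << "\" -> \"" << kv.first << "\";" << std::endl;
  std::cout << "}" << std::endl;
}
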
pavlis commented 4 years ago

Thought I was home free on this implementation of ProcessingHistory, but accidentally uncovered a fundamental problem I'm not totally sure yet how to solve. I hope I don't have to go back to the drawing board, but I do have a possible solution I'll get to after I explain the problem I uncovered.

I was working on a test_history C++ program that would take ProcessingHistory data and build a general tree class, specialized for this problem, from the contents. I got to the point that I created a proper tree I could examine with graphviz. I created trees that properly demonstrated the relations for three key methods: set_as_origin, new_map, and new_reduction. The problem surfaced when I naively tried to use clones in a subsequent step. That is, I ran a different case of new_reduction where some of the inputs had components in common because copy constructors had been used to create the set of inputs for new_reduction. The whoops this uncovered is that the copies rendered the multimap ambiguous: key-value pairs got duplicated because the clones carry a common chain of node data. That created duplicate, ambiguous links to common records from the different copies.

The solution I think will work requires a bit more effort in the new_reduction method. Since the problem is completely limited to new_reduction, I believe the fix is to have it simply check for a duplicate before it adds a node to the nodes multimap. If that works as I think it will, we will be fine and I'll be almost done with this. Hopefully I will not uncover something else in the process.

This will also require a bit more complexity in the add_inputs method; I will have to do the same checks there.
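
For concreteness, here is the kind of duplicate check I have in mind for both methods (untested sketch; NodeData is a reduced stand-in and the function name is a placeholder):

#include <map>
#include <string>
#include <utility>

struct NodeData { std::string uuid; int stage; };   // reduced stand-in

/* Return true if nodes already holds an equivalent entry for this key. */
bool node_is_duplicate(const std::multimap<std::string, NodeData>& nodes,
                       const std::string& key, const NodeData& nd)
{
  auto range = nodes.equal_range(key);
  for (auto it = range.first; it != range.second; ++it)
    if (it->second.uuid == nd.uuid && it->second.stage == nd.stage)
      return true;
  return false;
}

/* In new_reduction (and add_inputs) the insert would become:
     if (!node_is_duplicate(nodes, current_uuid, nd))
       nodes.insert(std::make_pair(current_uuid, nd));  */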

So, I'm relatively confident I can still make this work, but I wanted to put this update here for the record.

pavlis commented 4 years ago

For the record I think the idea noted in the last comment worked. That is, I added a small amount of code to new_reduction that eliminates duplicate node entries and at least with the tests I have running right now it seems to work.

I'm writing this to start a discussion on how we should plan to expose this to python. Writers will need to dump the history multimap somehow, and we will need readers capable of retrieving and reconstructing a full history chain that (optionally) includes intermediate saves. I think we concluded it was probably best done with some custom C++ code that interacts with pybind11 to translate the multimap into something more pythonic that can be written to and retrieved from Mongo. Do you have a specific idea in mind you could describe here?

The thing that has bogged me down in this testing exercise, much more than fixing ProcessingHistory bugs, is the design of the tests. I have really struggled with the implementation of a "general tree" structure as described, for example, here. I wish I could find a source with a better approach than what I'm doing here with a recursive function. I read one source that echoed my experience with this: it said recursion is dangerous because it is almost always challenging to impossible to understand how recursive functions work and to fix bugs when they don't. I can certainly echo that right now. I have double jeopardy trying to understand how to make a graphical representation of the tree with graphviz, as I think I'm getting artifacts purely from not representing the structure correctly.

Do you have any experience that could help me out on this? Don't look at what I sent you during our last call, as it isn't even close to right. I don't want to put this messy thing I have in the test suite, as it could prove to be a maintenance headache. I found this simple algorithm that uses a std::queue that might be less confusing, but it seems kind of limited. That is, it seems to build the tree structure manually, while what I need is a way to build it from the multimap - which is why I have this ugly recursive function.
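
For the record, here is the rough direction I'm contemplating as an alternative: an untested sketch that walks the chain iteratively with a std::queue of uuids, expanding each with equal_range on the nodes multimap. NodeData is reduced to two members, and it assumes the history graph is acyclic, which it should be:

#include <iostream>
#include <map>
#include <queue>
#include <string>

struct NodeData { std::string uuid; int stage; };   // reduced stand-in

/* Breadth-first walk of the nodes multimap starting from the current (root) uuid. */
void print_chain(const std::multimap<std::string, NodeData>& nodes,
                 const std::string& root_uuid)
{
  std::queue<std::string> to_visit;
  to_visit.push(root_uuid);
  while (!to_visit.empty())
  {
    std::string current = to_visit.front();
    to_visit.pop();
    auto range = nodes.equal_range(current);
    for (auto it = range.first; it != range.second; ++it)
    {
      std::cout << current << " <- " << it->second.uuid
                << " (stage " << it->second.stage << ")" << std::endl;
      to_visit.push(it->second.uuid);   // expand this parent next
    }
  }
}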

Suggestions? I'm kind of drowning in this.