Stream validation - Githubissues

Implemented validation of required Stream contents for all Process subclasses
Fixed a few subtle bugs uncovered by Stream validation
Added a new Model attribute to override stream validation to keep test XML model files simple
Moved some Process updates in gas_lifting_field into proper Aggregators so XML merging works correctly
Added some debugging methods
Minor code cleanup (better comments, declaring instance vars)
Fixed search for Aggregators surrounding Process classes specified in csv2xml

I think my biggest concern is that the validation requires specific definitions inside <Contains> elements in the XML. One could create an XML that is valid from a structural perspective, but because a stream contains "storage oil" instead of "oil for storage" the validation would fail.

The XML must correspond to the Python code in 2 ways: the names / classes of Process subclasses, and the generalized contents of Streams. You could have "structurally correct" XML that fails in one or both of these ways. I don't see an easy way to decouple these. The validation subsystem is the next best thing: automate detection of these failures.

This also raised some confusion about the name attribute on Streams and the Contains elements. It seems like the names are used only for the StreamRef, but I could be mistaken.

I believe that's correct. And for reporting / GUI display.

I can also see that there are repeated strings in the Contains elements, which I can see being used to aggregate across multiple Stream instances. But it also implies there are specific types of streams that must carry specific contents and that certain processes must be connected to specific Stream types.

As I've probably mentioned, I've long advocated for avoiding stream content names that reference specific processes, as this somewhat defeats the purpose of the <Contains> element. Process-specific names aren't much better than traversing the Stream to identify the Process on the opposite side, which would be terribly brittle.

I think in the short-term this validation seems like a solid approach, but I think it's worth considering an update to Stream handling where stream contents are narrowed to a limited set of substances (i.e. gas, oil, CH4, etc...) and processes have more knowledge of input/output streams.

I agree that a more generalized (and perhaps a known, limited) set of stream contents makes sense. We could define the set of these in the code and allow extensions via the configuration file. This would improve validation, and probably simplify some of the run methods.

In this way, a process could declare that it requires a certain number or range if input/output streams with certain components/contents. VFPartition is maybe a good example where the input specification appears to be a single stream, while the opgee.xml template has multiple streams ending at VFPartition both containing gas for partition. If streams were a very generic connection between processes e.g.:
<Stream src="BitumenMining" dst="VFPartition" />
<Stream src="Separation" dst="VFPartition" />
Then BitumenMining and Separation could set the contents/components on their output streams to then be validated by VFPartition. Furthermore, VFPartition could specify that it requires 1 to 3 input streams containing gas while BitumenMining could specify it has 1 or 2 oil output streams and 0 or 1 gas output stream.

I see a few issues with this. For one, the run method of the source process needs to be able to "find" it's output streams to know which one to emit substances to. If the contents are set by, and thus vary with, the destination process, this becomes unknowable.

Also, I'm not sure it's necessary to identify the number of input streams containing a substance. If a process can handle multiple, it can call combine_streams, regardless of the number, so there's no benefit I see to requiring a specific number. In the case of VFPartition, I believe it's only important that at least one such stream is found.

This admittedly would lose the ability for the field/model to quickly grab all the gas for partition streams in the same way, but I think that could be solved with something similar to the Aggregator or another kind of categorization mechanism.

Where is this "grab all the X streams" used? It's not clear to me that this is useful.

The main benefits I see are simplifying the XML and decoupling its details from the underlying model and allowing for class level declarations in processes where we could leverage existing validation/schema libraries (e.g. Pydantic) for runtime checks.

the run method of the source process needs to be able to "find" it's output streams to know which one to emit substances to. If the contents are set by, and thus vary with, the destination process, this becomes unknowable.

I may not have been clear on this point, but I was thinking that contents would be set by the source processes. The only thing that would vary is how many, so if a process could have 0-1 "oil" or 2-Inf "gas" output streams, it could handle the different configurations explicitly.

One drawback is if it needs to handle 2 streams with the same contents differently (e.g. CrudeOilDewatering has 2 oil outputs for dilution and upgrading). In the current configuration, this feels like a proxy for one process behaving differently based on its adjacent processes. If there are constraints on source and destination processes for a given stream, it seems reasonable to include that in an input/output stream specification and have special handling based on the process at the other end of the stream.

Where is this "grab all the X streams" used? It's not clear to me that this is useful.

I don't see this used currently, it was more in response to the documentation for the <Contains> tag.

allows processes to find different input streams without reference to any specific process name.

It also allows any object to find streams based on contents, so I wasn't sure if that was an intended use case. Even so, it seems like there are constraints on which processes can connect (and in what direction, perhaps?). This is currently reflected in the <Contains> strings, and I'd advocate for defining those constraints at the process level either by valid neighbor process classes or (if we really want maximum flexibility) by interface (e.g. if a given process is the src the dst process must have a certain input specification).

Yeah, but "any object" doesn't have a need to find streams. I think it's relevant only to other Process subclasses.

I'd advocate for defining those constraints at the process level either by valid neighbor process classes or (if we really want maximum flexibility) by interface (e.g. if a given process is the src the dst process must have a certain input specification).

I've tried to avoid any such constraints. Think about a user defining a novel Process subclass and wanting to wire it into the Field definition. The constraints above prevent this without code modifications. The whole point of is to abstract this in a way that avoids code modifications to accommodate extensions.

On Oct 29, 2024, at 12:15 PM, Michael Barlow @.***> wrote:

the run method of the source process needs to be able to "find" it's output streams to know which one to emit substances to. If the contents are set by, and thus vary with, the destination process, this becomes unknowable. I may not have been clear on this point, but I was thinking that contents would be set by the source processes. The only thing that would vary is how many, so if a process could have 0-1 "oil" or 2-Inf "gas" output streams, it could handle the different configurations explicitly. One drawback is if it needs to handle 2 streams with the same contents differently (e.g. CrudeOilDewatering has 2 oil outputs for dilution and upgrading). In the current configuration, this feels like a proxy for one process behaving differently based on its adjacent processes. If there are constraints on source and destination processes for a given stream, it seems reasonable to include that in an input/output stream specification and have special handling based on the process at the other end of the stream.

— Reply to this email directly, view it on GitHub https://github.com/msmasnadi/OPGEEv4/pull/27#issuecomment-2445125408, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA25CHSN4QXUMKNSKEDAQ5LZ57NEPAVCNFSM6AAAAABPTD7XF6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDINBVGEZDKNBQHA. You are receiving this because you authored the thread.

msmasnadi / OPGEEv4

Stream validation #27