Database does not utilize path to user provided YAML file

wigging commented 1 year ago

Description

The dbpath parameter for the Database class is supposed to let the user define the path to a YAML file. But this path is ignored and watertap uses the database YAML files provided in watertap which are located at watertap/data/techno_economic/.

from pyomo.environ import ConcreteModel
from watertap.core.wt_database import Database

model = ConcreteModel()
model.db = Database(dbpath=".")

Expected Behavior

When the user provides the path to a YAML file with Database(dbpath="path/to/myfile.yaml"), then the model should use the parameters defined in that file.

Steps to Reproduce

Parameters for a zero-order model are defined in the YAML file shown below.

default:
  energy_electric_flow_vol_inlet:
    value: 0.050718512
    units: kWh/m^3
  capital_cost:
    basis: flow_vol
    cost_factor: None
    reference_state:
      value: 4732.0
      units: m^3/hr
    capital_a_parameter:
      value: 12.17829669e6
      units: USD_2014
    capital_b_parameter:
      value: 0.5862
      units: dimensionless
  recovery_frac_mass_H2O:
    value: 0.99
    units: dimensionless
    reference:
  default_removal_frac_mass_comp:
    value: 0
    units: dimensionless
  removal_frac_mass_comp:
    nonvolatile_toc:
      value: 0.2
      units: dimensionless
      constituent_longform: Nonvolatile TOC
    toc:
      value: 0.2
      units: dimensionless
      constituent_longform: Total Organic Carbon (TOC)
    tss:
      value: 0.97
      units: dimensionless
      constituent_longform: Total Suspended Solids (TSS)

The YAML file is located in the same directory as the example shown below.

from pyomo.environ import ConcreteModel
from idaes.core import FlowsheetBlock
from idaes.core.solvers import get_solver
from watertap.core.wt_database import Database
from watertap.core.zero_order_properties import WaterParameterBlock
from watertap.unit_models.zero_order import DualMediaFiltrationZO

def main():

    model = ConcreteModel()
    model.db = Database(dbpath=".")

    model.fs = FlowsheetBlock(dynamic=False)
    model.fs.params = WaterParameterBlock(solute_list=["nonvolatile_toc", "toc", "tss"])

    model.fs.unit = DualMediaFiltrationZO(property_package=model.fs.params, database=model.db)
    model.fs.unit.inlet.flow_mass_comp[0, "H2O"].fix(10)
    model.fs.unit.inlet.flow_mass_comp[0, "nonvolatile_toc"].fix(1)
    model.fs.unit.inlet.flow_mass_comp[0, "toc"].fix(1)
    model.fs.unit.inlet.flow_mass_comp[0, "tss"].fix(1)

    _ = model.db.get_unit_operation_parameters("dual_media_filter")
    model.fs.unit.load_parameters_from_database()

    solver = get_solver()
    solver.solve(model)

    model.fs.unit.report()

if __name__ == "__main__":
    main()

Running this example gives the error shown here.

$ python dual_media_filter.py

Traceback (most recent call last):
  File "/Users/g72/Desktop/ORNL/WaterTAP/watertap/watertap/core/wt_database.py", line 201, in _get_technology
    with open(os.path.join(self._dbpath, technology + ".yaml"), "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: './dual_media_filtration.yaml'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "dual_media_filter.py", line 49, in <module>
    main()
  File "dual_media_filter.py", line 40, in main
    model.fs.unit.load_parameters_from_database()
  File "/Users/g72/Desktop/ORNL/WaterTAP/watertap/watertap/core/zero_order_base.py", line 192, in load_parameters_from_database
    pdict = self.config.database.get_unit_operation_parameters(
  File "/Users/g72/Desktop/ORNL/WaterTAP/watertap/watertap/core/wt_database.py", line 141, in get_unit_operation_parameters
    params = self._get_technology(technology)
  File "/Users/g72/Desktop/ORNL/WaterTAP/watertap/watertap/core/wt_database.py", line 205, in _get_technology
    raise KeyError(f"Could not find entry for {technology} in database.")
KeyError: 'Could not find entry for dual_media_filtration in database.'

To fix the example, you must set the private variable _tech_type to the name of the YAML file. This is shown in the edited example below.

from pyomo.environ import ConcreteModel
from idaes.core import FlowsheetBlock
from idaes.core.solvers import get_solver
from watertap.core.wt_database import Database
from watertap.core.zero_order_properties import WaterParameterBlock
from watertap.unit_models.zero_order import DualMediaFiltrationZO

def main():

    model = ConcreteModel()
    model.db = Database(dbpath=".")

    model.fs = FlowsheetBlock(dynamic=False)
    model.fs.params = WaterParameterBlock(solute_list=["nonvolatile_toc", "toc", "tss"])

    model.fs.unit = DualMediaFiltrationZO(property_package=model.fs.params, database=model.db)
    model.fs.unit.inlet.flow_mass_comp[0, "H2O"].fix(10)
    model.fs.unit.inlet.flow_mass_comp[0, "nonvolatile_toc"].fix(1)
    model.fs.unit.inlet.flow_mass_comp[0, "toc"].fix(1)
    model.fs.unit.inlet.flow_mass_comp[0, "tss"].fix(1)

    # This example only works if the following line is uncommented
    model.fs.unit._tech_type = "dual_media_filter"

    _ = model.db.get_unit_operation_parameters("dual_media_filter")
    model.fs.unit.load_parameters_from_database()

    solver = get_solver()
    solver.solve(model)

    model.fs.unit.report()

if __name__ == "__main__":
    main()

Output from the edited example is here.

$ python dual_media_filter.py

====================================================================================
Unit : fs.unit                                                             Time: 0.0
------------------------------------------------------------------------------------
    Unit Performance

    Variables: 

    Key                              : Value      : Units         : Fixed : Bounds
                  Electricity Demand :     2373.6 :          watt : False : (0, None)
               Electricity Intensity : 1.8259e+05 :        pascal :  True : (None, None)
    Solute Removal [nonvolatile_toc] :    0.20000 : dimensionless :  True : (0, None)
                Solute Removal [toc] :    0.20000 : dimensionless :  True : (0, None)
                Solute Removal [tss] :    0.97000 : dimensionless :  True : (0, None)
                      Water Recovery :    0.99000 : dimensionless :  True : (0.0, 1.0000001)

------------------------------------------------------------------------------------
    Stream Table
                                               Units            Inlet   Treated  Byproduct
    Volumetric Flowrate                   meter ** 3 / second 0.013000 0.011530 0.0014700 
    Mass Concentration H2O              kilogram / meter ** 3   769.23   858.63    68.027 
    Mass Concentration nonvolatile_toc  kilogram / meter ** 3   76.923   69.384    136.05 
    Mass Concentration toc              kilogram / meter ** 3   76.923   69.384    136.05 
    Mass Concentration tss              kilogram / meter ** 3   76.923   2.6019    659.86 
====================================================================================

Environment

OS: macOS 12.6.2
Python: v3.8.15
WaterTAP: v0.8.dev0

Anything Else?

I think the issue is caused by _tech_type. It seems to be hard coded to use the files located in watertap/data/techno_economic/ directory. It is not being defined properly when dbpath is given for the Database().

TimBartholomew commented 1 year ago

@adam-a-a @bknueven it seems that the zero order models cannot use a user provided yaml and it always has to be the one in the database, can you confirm that?

wigging commented 1 year ago

Looking at this some more. From my example, only one of these lines are needed:

 _ = model.db.get_unit_operation_parameters("dual_media_filter")
 model.fs.unit.load_parameters_from_database()

The model.db.get_unit_operation_parameters() is working correctly and returns a dictionary containing the parameters in the local YAML file. But in my example, I don't use that dictionary. How would I pass that dictionary of parameters to the model or solver?

When model.fs.unit.load_parameters_from_database() is called, it wants to use a YAML file in watertap/data/technoeconomic/ which doesn't work because there is no such file with that technology name. So this method only applies to models using the default database in watertap.

wigging commented 1 year ago

I looked more at the model.fs.unit.load_parameters_from_database() method. Which is linked to below. The problem with that method is that when dbpath is defined, it doesn't change the name of the self._tech_type. If this method was changed to something like load_parameters_from_database(techtype="dual_media_filter") then inside the function do self._tech_type = techtype. This allows the parameters from dual_media_filter.yaml in my example to be loaded for the model. Basically, changing the path with model.db = Database(dbpath=".") just changes the path (directory where the file is located). But the name of the YAML file needs to be provided too.

https://github.com/watertap-org/watertap/blob/d7393b178456dc6e7af16fe918490768c1d8200d/watertap/core/zero_order_base.py#L171

bknueven commented 1 year ago

So, the issue is not that the dbpath is not picked up, it's that each zero-order unit model is associated with a particular name -- in this case DualMediaFiltrationZO is associated with its _tech_type which is dual_media_filtration.

If you re-name your yaml file dual_media_filtration.yaml you can use your first non-working example. So you can specify your own yaml file (or set of yaml files) in a separate directory, but it has to have a specific name for each unit model.

andrewlee94 commented 1 year ago

I believe that was how it was intended to work, and that was the purpose of _tech_type.

wigging commented 1 year ago

So pretend I'm a user of watertap (not a developer), and I have no knowledge of the inner workings of watertap. I do not know what _tech_type is. I have no knowledge about the default YAML files located in watertap/data/techno_economic/. As a user, how do I provide my own YAML file for the zero-order model? Or do you not expect users to provide their own YAML file?

bknueven commented 1 year ago

For (nearly?) all of the ZO units, the file name is the snake-case equivalent of the class name, with "ZO" dropped. The stack trace above showed that it was looking for the file './dual_media_filtration.yaml' in the specified directory, which the user can provide.

So the design question is: should we allow custom names within the user-provided directory? Probably, but I'm not convinced the answer is to stand up a whole Database object for each unit model. What we should maybe do instead is allow the user to specify the name of the yaml file as option for the zero-order unit model.

But users can definitely use their own YAML file so long as they follow the naming conventions in watertap/data/techno_economic. Which I'm not saying is ideal -- we've just not had reason to revisit that particular decision.

wigging commented 1 year ago

That is definitely not ideal and not a good user experience. A user should be able to provide the path to their file where the path includes the file name. Where is it documented that the file name must be the snake-case equivalent to the class name?

bknueven commented 1 year ago

First, I just want to point out the existing usage of the Database class, here: https://github.com/watertap-org/watertap/blob/main/watertap/examples/flowsheets/case_studies/wastewater_resource_recovery/GLSD_anaerobic_digester/GLSD_anaerobic_digestion.py

It's main job is to tie together YAML files and our lightweight zero-order unit models, specifying a single Database object per flowsheet. Based on whatever directory the Database is pointed to, an associated YAML file is loaded for each unit model through the _tech_type parameter. The Database can also be used to manage the properties of the source water.

I think this issue points out two shortcomings with the existing code:

It is not easy for the user to create / access their own set of YAML files (e.g., their own copy of the WaterTAP ZO "Database")
The mapping from a ZO UnitModel to its associated YAML file is opaque to the user.

For shortcoming (1), I propose we create a function/method (maybe as a staticmethod on the Database object) which copies the files in watertap/data/techno_economic to a location specified in the first augment. For real ease-of-use we could create an associated command which gets installed with WaterTAP to wrap this function. Then a user can easily point the Database object to the new directory, modify the associated YAML files, and produce new results for a given flowsheet.

For shortcoming (2), I propose we create a new config argument for the ZO base class called yaml_file or something similar, where the user can specify the location of the YAML file for a specific unit. Instead of the database inferring the name of the YAML file from the _tech_type, the user could manually specify it. This has the added advantage that the user could create different specifications for the same ZO unit in different parts of the water treatment train. The Database object would still be responsible for loading the YAML file and putting its parameters into the unit model. For consistency we should probably also allow user to specify the location and name of water_sources.yaml, e.g.:

model = ConcreteModel()
model.db = Database(
    dbpath = "my/dbpath", # this is the default location where YAML files get looked up based on `_tech_type`
                          # if the user does not specify a `yaml_file` for the unit, and where the `Database` looks
                          # by  default for `water_sources.yaml`.
    # if this is specified, overwrites reading water sources from `dbpath / "water_sources.yaml"`
    water_sources_yaml = "my/other/path/my_water_sources.yaml", 
)

model.fs.params = WaterParameterBlock(solute_list=["nonvolatile_toc", "toc", "tss"])
model.fs.dmf = DualMediaFiltrationZO(
    property_package=model.fs.params,
    database=model.db,
    # overwrite the default of "my/dbpath/dual_media_filtration.yml"
    yaml_file="./myparams/dual_media_filter.yaml", 
)
 m.fs.coag_and_floc = CoagulationFlocculationZO(
        property_package=m.fs.prop,
        database=m.db,
        # look up using the default location "my/dbpath/coag_and_floc.yaml"
    )

Finally, it is obvious that this part of the code needs better documentation, in addition to these new features. Not sure where priorities lie. @wigging @TimBartholomew @kurbansitterley @andrewlee94 thoughts?

wigging commented 1 year ago

If a user provides their own YAML file, then why would they need to create a Database() instance too? For example, why would a user need to provide a database for this if the user provides the yaml file:

model.fs.dmf = DualMediaFiltrationZO(
    property_package=model.fs.params,
    database=model.db,
    yaml_file="./myparams/dual_media_filter.yaml", 
)

Seems like something like the following is all that is needed if a user provides the yaml file:

model.fs.dmf = DualMediaFiltrationZO(
    property_package=model.fs.params,
    yaml_file="./myparams/dual_media_filter.yaml", 
)

bknueven commented 1 year ago

If a user provides their own YAML file, then why would they need to create a Database() instance too? For example, why would a user need to provide a database for this if the user provides the yaml file:
model.fs.dmf = DualMediaFiltrationZO(
    property_package=model.fs.params,
    database=model.db,
    yaml_file="./myparams/dual_media_filter.yaml", 
)
Seems like something like the following is all that is needed if a user provides the yaml file:
model.fs.dmf = DualMediaFiltrationZO(
    property_package=model.fs.params,
    yaml_file="./myparams/dual_media_filter.yaml", 
)

I don't necessarily disagree -- but this would require moving methods out of the database and into the zero order base class and would be a larger reorganization effort. I imagine most of the time a user will want to create a Database class so they don't need to specify a custom YAML file for every single unit in the flowsheet and/or to manage the source water definition.

kurbansitterley commented 1 year ago

First, I just want to point out the existing usage of the Database class, here: https://github.com/watertap-org/watertap/blob/main/watertap/examples/flowsheets/case_studies/wastewater_resource_recovery/GLSD_anaerobic_digester/GLSD_anaerobic_digestion.py

It's main job is to tie together YAML files and our lightweight zero-order unit models, specifying a single Database object per flowsheet. Based on whatever directory the Database is pointed to, an associated YAML file is loaded for each unit model through the _tech_type parameter. The Database can also be used to manage the properties of the source water.

I think this issue points out two shortcomings with the existing code:

It is not easy for the user to create / access their own set of YAML files (e.g., their own copy of the WaterTAP ZO "Database")

The mapping from a ZO UnitModel to its associated YAML file is opaque to the user.

For shortcoming (1), I propose we create a function/method (maybe as a staticmethod on the Database object) which copies the files in watertap/data/techno_economic to a location specified in the first augment. For real ease-of-use we could create an associated command which gets installed with WaterTAP to wrap this function. Then a user can easily point the Database object to the new directory, modify the associated YAML files, and produce new results for a given flowsheet.

For shortcoming (2), I propose we create a new config argument for the ZO base class called yaml_file or something similar, where the user can specify the location of the YAML file for a specific unit. Instead of the database inferring the name of the YAML file from the _tech_type, the user could manually specify it. This has the added advantage that the user could create different specifications for the same ZO unit in different parts of the water treatment train. The Database object would still be responsible for loading the YAML file and putting its parameters into the unit model. For consistency we should probably also allow user to specify the location and name of water_sources.yaml, e.g.:

I created a separateDatabase class for the SETO project (thinking we might need different functionality from WaterTAP's- right now it is identical) and had a lot of similar thoughts @wigging expressed in this thread- essentially that requiring Database for ZO models makes using the ZO models more cumbersome than is necessary.

I like the idea of making the yaml a config argument as @bknueven suggested. I am not really a SW developer though, so take with a grain of salt.

adam-a-a commented 1 year ago

There's another aspect here. The only reason we had those YAML files thrown together in the first place was because we thought it'd be nice to have for users who wanted to blindly load in parameters for assumed-performance models and play with those models relatively easily. Following that, our thought process then became--if someone doesn't want to use those default values that we are providing in YAML files, they could manually fix those values themselves. I would argue that if a user is advanced enough to create their own YAML file properly, they are advanced enough to fix the values themselves rather than depend on setting up custom YAML files. I would think if a user tried creating a YAML file from scratch, they'd run into trouble half the time if they didn't know what they were doing (e.g., most unit models require recovery and removal efficiency to be included in the YAML, otherwise an error occurs).

I like the idea of not having to specify a database path for each unit model, but we also should think about the repercussions of allowing more freedom. I think that for the most part, ZO unit models depend on the get_unit_operation_parameters method from a Database instance. We could do something like

if blk.config.database is not None and blk.config.yaml_file is None:
     parameter_dict = blk.unit_model.config.database.get_unit_operation_parameters(
            blk.unit_model._tech_type, subtype=blk.unit_model.config.process_subtype)
elif blk.config.database is None and blk.config.yaml_file is not None:
    # do some work to get parameter dict from custom yaml file
#handle other possibilities regarding database and yaml_file config options

Changes would need to be made in zero_order_base.py as well as zero_order_costing.py, which I just realized was modified by a recently merged PR (#911) that separates cost methods into ~ 50 files. Each file has a line with

parameter_dict = blk.unit_model.config.database.get_unit_operation_parameters(
            blk.unit_model._tech_type, subtype=blk.unit_model.config.process_subtype)

Actually, it seems like even if were to achieve the desired changes regarding inclusion of a custom yaml file path and exclusion of a Database instance for unit models, we might still run into trouble if the user wanted to use the zero-order costing package.

My hunch is that in the absence of a substantial amount of effort, deviating from our existing workflow in an attempt to allow users to bring in their own custom YAML files in a cleaner, more sensible manner will still result in more complexities that, at the end of the day, would require the user to know intricacies about WaterTAP, which is what we're trying to avoid in the first place. So, I am not completely sure that enabling this capability (in the suggested way) will get us where we really want to be--which is to make life more convenient for the user.

bknueven commented 1 year ago

Since I've already put the thought into how to implement a few of these things, I might as well put it down.

Enabling user-specified YAML with the database

Example:

model.fs.dmf = DualMediaFiltrationZO(
    property_package=model.fs.params,
    database=model.db,
    yaml_file="./myparams/dual_media_filter.yaml", 
)

I think this is straightforward. We add a config argument to the ZeroOrderBase class for yaml_file, and we modify this function in the database to use the custom YAML file if specified on the unit model. https://github.com/watertap-org/watertap/blob/d370d50775bd193c929944ac37cc8781c7f31721/watertap/core/wt_database.py#L194-L211

Enabling user-specified YAML without the database

Example:

model.fs.dmf = DualMediaFiltrationZO(
    property_package=model.fs.params,
    yaml_file="./myparams/dual_media_filter.yaml", 
)

I think we can avoid the difficult changes @adam-a-a hinted at above by moving the method get_unit_operation_parameters (and its helpers) to ZeroOrderBase. This would require a bit of reworking of the method itself, and a find-and-replace operation, but I don't think we'd need to change much other code. We might even realize after doing this that we don't need the Database object at all (which would be a much bigger change).

My two cents

I think we should just execute on my option (1) above -- make it easier for a user to set their own parameters by creating a utility function to copy the YAML files in watertap/data/techno_economic to a user-specified location. We primarily focus our efforts on documentation of the workflow.

wigging commented 1 year ago

We might even realize after doing this that we don't need the Database object at all (which would be a much bigger change).

Yes, that's what I said previously. I don't understand the need for the database. The entire watertap/data/ could be removed if the values are set in the zero-order unit model class itself.

wigging commented 1 year ago

I didn't realize I was opening a can of worms with this issue and the associated pull request. If I'm the only one that cares about loading a YAML file, then just close this issue and don't bother with a fix. I can hack my way around the watertap API to use my own YAML file without making any changes to the existing code. From previous comments, the zero-order models seem to be an afterthought for watertap and were just carried over from an earlier version of the project.

adam-a-a commented 1 year ago

We can probably vote on a resolution for this. I think your issue and PR raised a valid point that should be addressed in some manner. At a minimum, we should add documentation to explain the workflow of adding a yaml, a ZO unit model, etc.

I also think going with @bknueven 's option 1 would be a step in the right direction. Even though the inclusion of database=model.db would be unnecessary and maybe even annoying when specifying a yaml file, I think it's something that we can deal with in the interim.

So I would vote to (1) add the yaml_file argument in for unit models and for water sources and (2) add documentation that describes the whole workflow of creating a yaml file for a unit model, where to typically place it, adding subtypes to existing yaml files, specifying a path to your own yaml, building a ZO unit model from scratch, etc.

EDIT: It sounds like (2) should primarily be a how-to that links to sections of the technical reference and maybe even background.

wigging commented 1 year ago

I'm going to close this issue based on the discussion during Wednesday's meeting. Instead of restructuring the watertap code, I'll submit a pull request to improve the zero-order model documentation to make it clear how to properly use a zero-order model. We can discuss options for improving the zero-order models at a later date.

ksbeattie commented 1 year ago

Replaced by #921

watertap-org / watertap