materialsproject / emmet

Be a master builder of databases of material properties. Avoid the Kragle.
https://materialsproject.github.io/emmet/
Other
51 stars 63 forks source link

Q-Chem Bug: There should not be a check on supported basis sets and functionals #933

Closed Andrew-S-Rosen closed 3 months ago

Andrew-S-Rosen commented 5 months ago

Problem

If you try to build a Q-Chem task doc from a calculation that uses a functional or basis set beyond those that have been pre-tabulated, the task doc can't be made. This is extremely limiting. Tagging @rdguha1995.

In my instance, I was running a calculation with the def2-SVP basis set, and it crashed on parsing.

Proposed Solution

Here is the source of the issue:

    functional = [f for f in FUNCTIONALS if f.lower() == funct_lower]
    if not functional:
        raise ValueError(f"Unexpected functional {funct_lower}!")

    functional = functional[0]

    basis = [b for b in BASIS_SETS if b.lower() == basis_lower]
    if not basis:
        raise ValueError(f"Unexpected basis set {basis_lower}!")

Errors should not be raised here. There is no limit on what users might want to use for the level of theory. The list of functionals and basis sets are here. The functional and basis set should just be accepted from the user as-is and not be hard-coded.

Alternatives

No response

Andrew-S-Rosen commented 5 months ago

Note that a similar error will occur with solvation methods here.

rdguha1995 commented 5 months ago

Thanks for flagging this @Andrew-S-Rosen .

I just had a discussion with @espottesmith and not raising the ValueError might not be the best solution. As I mentioned yesterday, there are two paradigms which are merging here. @espottesmith 's foundational work which was geared towards adding Molecule TaskDocument's to MPCules and my current effort which attempts to cast a wider net. Eventually these codes will merge, and when my version of the code is used in emmet builders, we would definitely need to validate the functional/basis/solvent. Not all of them would be equal.

Having said that, we definitely need to expand the currently accepted list of functionals/basis-sets. There is no reason for def2-svp to be an invalid functional.

So I think the best option right now(as a first step) would be to agree on a list of say 25(choose your number) widely accepted/used functionals/basis/solvents which hopefully captures most QChem calculations.

Tagging in @rkingsbury and @samblau for their thoughts on this

Andrew-S-Rosen commented 5 months ago

I would strongly advise against this decision (tagging @espottesmith for discussion). If the emmet taskdoc is used outside of the Molecules App (like in atomate2), it would make Q-Chem inflexible for users. There is no hard-coded list that would be satisfactory. Who is to say what is appropriate? For instance, M06-L isn't a functional you all use, but it's one I use a lot. Or how about a random basis set? When new functionals or basis sets are added to Q-Chem, one would also have to manually update the list each time. Then there's the philosophical question of --- why does it even matter? The level of theory should just be the functional, basis set, and solvent accepted as-is. If a nonsensical value were provided, Q-Chem would not have ran in the first place.

If the goal is to only use this task document for MP purposes, then the ValueError could potentially make sense. However, if the task document is to be used elsewhere to obtain a consistent schema, which I imagine is the purpose, then it does not make sense.

espottesmith commented 5 months ago

The core problem here is that we need a solution which is reasonable both for building a standardized database for MP and for end users.

I hear you that no predefined list is going to capture all use cases of all users, but it's absolutely essential for database construction (to your "why does it matter", we need to be able to rank calculations to select the "best" one. If we encounter a functional that we haven't seen before, then how do we rank it?) And this isn't just true on the molecules side - the materials side of MP also has a predefined list of functionals, and it ranks these to select the best properties.

If we have to choose between being inflexible to users or not being able to validate/rank on the DB side, I'd heavily lean in favor of being inflexible. That said, I don't think that MP and atomate2 necessarily need to use the same schema with the same validation in place. After all, the emmet task doc, with validation, was built on the original atomate task doc, which allowed basically anything that you could throw at Q-Chem.

Andrew-S-Rosen commented 5 months ago

I will have to do some digging on the materials side because, to the best of my knowledge, the VASP task document does not do any MP-related validation in construction of the task doc itself (beyond checking to make sure the calculation actually completed). This ensures that atomate2 is fully flexible. I believe there is a separate ValdationDoc.

I do agree that there are two objectives at odds with one another here. At some point (https://github.com/materialsproject/foundation/pull/2), it was decided that task documents would start out in atomate2 and then, when they mature, they could eventually be in emmet (which atomate2 calls as a dependency). Of course, we kind of skipped that step since we went atomate1 --> emmet, but regardless, that philosophy can naturally introduce some conflicts, like that discussed here.

Is there any reason why the ranking needs to take place during construction of the task doc? To me, it seems like it would make more sense to rank the functional/basis set/solvent in whatever code is calling the task doc to curate data for MP (I imagine that would be in web)?

In any case, for me, it doesn't really matter, but this is ultimately going to be an issue for atomate2 if not appropriately addressed. Users need to be able to run whatever method that best suits their system, no matter how obscure they are to the MP team.

espottesmith commented 5 months ago

I'm also not sure at what point the validation happens, and I myself don't use atomate2 (I've never really been convinced that it should exist), but my code was heavily inspired by what was already in place in emmet.

Ranking does not take place during the construction of the task doc. It occurs in emmet (not web), starting from the MoleculesAssociationBuilder. But regardless of where the ranking takes place, if any task doc can't be ranked (at the point of building), then that raises problems. It's totally possible for us to not validate at the point of task creation and only later on, but really that's just passing the buck.

I know of several users (including @samblau and @rdguha1995) who have used builders to process collections of task documents, even if the output collections weren't intended for MP. If we don't validate the task docs but do validate when building, the outcome is the same - the user's calculations can't be used as expected.

For your particular use case, maybe this is acceptable, and maybe we let task docs be constructed with no validation and then validate only later on. But I don't think that resolves the larger issue here.

Andrew-S-Rosen commented 5 months ago

Maybe the solution in that scenario is to allow for any functional/basis set/solvent, but if it is "unknown" to emmet, it would not have an associated ranking and when that is needed in MoleculesAssociationBuilder, such calculations would have the lowest priority by default. Does this kind of scheme seem logical, or does that not address your major concern of wanting an error to be raised for MP purposes? If the latter, maybe the error should be raised somewhere else, like in the MoleculesAssociationBuilder?

espottesmith commented 5 months ago

@rdguha1995 brought up the "lowest possible priority" option. That's acceptable from the MP perspective, but not great from the user perspective. I actually think it's worse than failing validation, because it's basically failing silently. Let's say you do some calculations with some fancy functional (wB97M(2), whatever) that wasn't part of our ranking scheme. A user who isn't intimately familiar with emmet would be reasonably confused why their calculations are never chosen when building property documents, even when they're clearly the best tasks available.

Having the error raised in the association builder - or just not having tasks allowed during building - is more acceptable, but as I just said, there's still a problem where users can't use the builders because their tasks aren't part of the allowed sets. The only benefit of this is that the task document exists in one case, and doesn't in the other. That's something, but this is still a problem.

rdguha1995 commented 5 months ago

This was my proposed solution as well @Andrew-S-Rosen , but then the users would have to make peace with the fact that if they intend to add their calculations to the DB (it would not show up).

Something which might be an option (again not ideal), is just to have two separated TaskDoc's. Something like a DBTaskDoc and a TaskDoc. Basically what is available now, but instead of trying to merge it, we formalize the distinction instead.

If you want your calculations to go in MP use the DBTaskDoc. if not, TaskDoc ?

espottesmith commented 5 months ago

Having two separate classes seems like it'd add a decent amount of bloat, but I think having "validate" as a flag during task doc creation could be reasonable. If True, then you check things like level of theory; if False, you don't care. Either way, it could populate a "validated" field which the builders can check/query against.

I have a sinking suspicion that this would end up with piles of unvalidated tasks that people eventually want to add to MP, but I guess that's a bridge to burn another day.

Andrew-S-Rosen commented 5 months ago

In the short-term, I do think a validate kwarg would probably be the least amount of burden for both the code development side and the user side. It could even be set to True by default (existing behavior maintained), with the expectation that any workflow package (e.g. atomate2) would explicitly turn this off. It would be nice to know how this is all handled for VASP since I know the task doc is fully flexible for atomate2 users (while still being sufficiently validated by the time it gets its way to MP), but that's a can of worms probably not worth sifting through...

rkingsbury commented 5 months ago

Certainly a worthy discussion! My $0.02 is the following:

The decision was taken to proceed with option 1. emmet will be used to house documents that are used in the Materials Project website and core workflows. Documents that provide from_files functions will ensure that all imports are performed inside the function itself to prevent dependency bloat.

rdguha1995 commented 5 months ago

Thanks for the inputs everyone! @rkingsbury thanks for tagging the ValidationDoc class. In my opinion, that's the long term solution we should strive for. For now I will go for the low hanging fruit and add a validate kwarg. I think as soon as we make the decision to transition from the TaskDocument class designed by @espottesmith to the TaskDoc class written by your truly, we should add in a ValidationDoc class for qchem as well

rkingsbury commented 5 months ago

Sounds like a great plan to me (near term and long term) @rdguha1995 . That way the "valid" calc types can be defined by external users, either by excluding the ValidationBuilder from their pipeline or by customizing its settings.

rdguha1995 commented 3 months ago

This has now been addressed in PR #970. A validate_lot flag has been added to the TaskDoc.from_directory function. It defaults to True. But if it turned to False, the TaskDoc is created with this warning :- UserWarning: User has turned the validate flag off.This can have downstream effects if the chosen functional and basis is not in the available sets of MP employed functionals and the user wants to include the TaskDoc in the MP infrastructure.Users should ignore this warning if their objective is just to create TaskDocs

The warning was not completely necessary but I just added it for good measure.