open-reaction-database / ord-schema

Schema for the Open Reaction Database
https://open-reaction-database.org
Apache License 2.0
95 stars 27 forks source link

Proposal to de-nest VesselType #556

Closed skearnes closed 3 years ago

skearnes commented 3 years ago

Let's refactor VesselType to be subfields of the Vessel message:

message Vessel {
  // Vessel type (flask, vial, well-plate, etc.).
  enum VesselType {
    UNSPECIFIED = 0;
    CUSTOM = 1;
    ROUND_BOTTOM_FLASK = 2;
    VIAL = 3;
    WELL_PLATE = 4;
    MICROWAVE_VIAL = 5;
    TUBE = 6;
    CONTINUOUS_STIRRED_TANK_REACTOR = 7;
    PACKED_BED_REACTOR = 8;
    NMR_TUBE = 9;
    PRESSURE_FLASK = 10;  // E.g., sealed tube
    PRESSURE_REACTOR = 11;  // High-pressure (e.g., Parr bomb reactor)
    ELECTROCHEMICAL_CELL = 12;
  }
  VesselType type = 1;
  // Include details such as "multi-neck" for ROUND_BOTTOM_FLASK,
  // "J. Young" for NMR_TUBE, etc.
  string details = 2;
  // Vessel material (glass, plastic, etc.).
  VesselMaterial material = 3;
  // How the vessel was prepared for the reaction.
  repeated VesselPreparation preparations = 4;
  // Vessel attachments (septum, cap, gas adapter, etc.).
  repeated VesselAttachment attachments = 5;
  // Size (volume) of the vessel.
  Volume volume = 6;
}

The extra level of nesting that requires VesselTypeEnum isn't necessary.

skearnes commented 3 years ago

Also, we should do the same thing for IlluminationType.

skearnes commented 3 years ago

Also, we should do the same thing for FlowType.

skearnes commented 3 years ago

And we should rename TubingMaterialType -> TubingType.

skearnes commented 3 years ago

Also, we should do the same thing for ElectrochemistryType.

michaelmaser commented 3 years ago

Should the same be done for similar nested subfields like VesselMaterial, VesselPreparation, and VesselAttachment? The final structure would look like this:

message Vessel {
  // Vessel type (flask, vial, well-plate, etc.).
  enum VesselType {
    UNSPECIFIED = 0;
    CUSTOM = 1;
    ROUND_BOTTOM_FLASK = 2;
    VIAL = 3;
    WELL_PLATE = 4;
    MICROWAVE_VIAL = 5;
    TUBE = 6;
    CONTINUOUS_STIRRED_TANK_REACTOR = 7;
    PACKED_BED_REACTOR = 8;
    NMR_TUBE = 9;
    PRESSURE_FLASK = 10;  // E.g., sealed tube
    PRESSURE_REACTOR = 11;  // High-pressure (e.g., Parr bomb reactor)
    ELECTROCHEMICAL_CELL = 12;
  }
  VesselType type = 1;
  // Vessel material (glass, plastic, etc.).
  // Include details such as "multi-neck" for ROUND_BOTTOM_FLASK,
  // "J. Young" for NMR_TUBE, etc.
  string details = 2;
  enum VesselMaterial {
    UNSPECIFIED = 0;
    CUSTOM = 1;
    GLASS = 2;
    POLYPROPYLENE = 3;
    PLASTIC = 4;
    METAL = 5;
    QUARTZ = 6;
  }
  VesselMaterial material = 3;
  // How the vessel was prepared for the reaction.
  enum VesselPreparation {
    UNSPECIFIED = 0;
    CUSTOM = 1;
    NONE = 2;
    OVEN_DRIED = 3;
    FLAME_DRIED = 4;
    EVACUATED_BACKFILLED = 5;
    PURGED = 6;
  }
  repeated VesselPreparation preparations = 4;
  // Vessel attachments (septum, cap, gas adapter, etc.).
  // Include attachment specifications in details, e.g., "rubber" for SEPTUM,
  // "Teflon" for CAP, "water jacket" for REFLUX_CONDENSER, etc.
  // Also include sealing details, e.g., "electrical tape", "parafilm", etc.
  enum VesselAttachment {
    UNSPECIFIED = 0;
    NONE = 1;
    CUSTOM = 2;
    SEPTUM = 3;
    CAP = 4;
    MAT = 5;  // e.g. a covering for a well plate.
    REFLUX_CONDENSER = 6;
    VENT_NEEDLE = 7;
    DEAN_STARK = 8;
    VACUUM_TUBE = 9;
    ADDITION_FUNNEL = 10;
    DRYING_TUBE = 11;
    ALUMINUM_FOIL = 12;
    THERMOCOUPLE = 13;
    BALLOON = 14;
    GAS_ADAPTER = 15;  // E.g. a ground-glass adapter with a gas line.
    PRESSURE_REGULATOR = 16;
    RELEASE_VALVE = 17;
  }
  repeated VesselAttachment attachments = 5;
  // Size (volume) of the vessel.
  Volume volume = 6;
}

Is this what you have in mind? Also, does anything need to be done to preserve the subfield numbering?

connorcoley commented 3 years ago

It's not possible to have multiple enums inside the same message with the same names AFAIK. The reason to keep the vessel information nested is because the VesselType, VesselMaterial, VesselAttachment would have conflicting UNSPECIFIED, NONE, and CUSTOMs

skearnes commented 3 years ago

I'm mostly reacting here to "{}TypeEnum", which I think is an anti-pattern. I also think most messages of this kind should have a top-level details field to capture general things, and refactoring e.g. VesselType to top-level type/details fields with a VesselType enum provides this. So I don't think we should do this for everything; just these anti-patterns. Connor is right that it wouldn't even be possible to de-nest everything with the non-prefixed naming scheme we've chosen to use.

michaelmaser commented 3 years ago

I see, understand now. Fixed that, should be in format Steve suggested. Implementing for other "{}TypeEnum"s

michaelmaser commented 3 years ago

What about, e.g., StirringMethodType?

skearnes commented 3 years ago

What about, e.g., StirringMethodType?

Yes, I think that one too; that will provide a top-level details field for StirringConditions.

@connorcoley hasn't said yet whether he likes this idea at all, though :)

michaelmaser commented 3 years ago

What about, e.g., StirringMethodType?

Yes, I think that one too; that will provide a top-level details field for StirringConditions.

The naming here can become slightly ambiguous/vague if treated similarly to the renaming of TubingMaterialType -> TubingType. E.g., that one and StirringMethodType -> StirringType aren't so bad, we sort of know what "tubing type" and "stirring type" mean. The analogous PressureControlType -> PressureType and TemperatureControlType -> TemperatureType are a little confusing as to what "pressure type" and "temperature type" mean, though. For now, I instead left as StirringMethod, PressureControl, and TemperatureControl enums, since their higher level message fields are method, control, and control, respectively. Does this make sense, or should they be converted to {}Type with renamed fields {}Type type?

skearnes commented 3 years ago

Yeah, I think those ones have names with justified complexity; TubingMaterialType seemed like an outlier to me, but I don't feel strongly if we want to keep "Material" in there.

michaelmaser commented 3 years ago

SGTM, I used TubingType since the field name is type anyway

michaelmaser commented 3 years ago

Is there a reason electrochemistry_type and flow_type are named such in their messages and not just type? Realized I converted them to type when de-nesting; should they be reverted to avoid breaking anything?

skearnes commented 3 years ago

I think it's fine to rename these to type for consistency with the other messages. As before, we'll need to save a copy of the original proto as reaction_old.proto and write a script to convert existing datasets---so it's ok to update the tag numbers here. Once we hit v1.0 we'll have to deprecate/reserve tag numbers instead.