openmsi / openmsimodel

OpenMSIModel uses the GEMD (Graphical Expression of Material Data) format to interact with generalized laboratory, analysis, and computational materials data.

Generating GEMD Model backbones #5

Open arachid1 opened 4 months ago

arachid1 commented 4 months ago

Problem Statement:

Materials science workflows often involve complex sequences of ingredients, processes, and measurements. Capturing these workflows in a standardized format such as GEMD can facilitate collaboration and reproducibility. However, creating GEMD models manually is time-consuming and error-prone. Manual modeling also struggles to capture a file or folder structure, often a scientist's most critical asset, because naming conventions or structures are inadequate.

From another perspective, we could ask:

Could we build a GEMD model backbone if we had answers to these questions:

1. What is your sample name (PID)? Provide one that encodes a, b, c... about the basics (this is how we build the folders or other structure).
2. What is your workflow for making the sample? (Ingredients and processing details; sample creation for modeled data; purchased-sample info.)
3. What is your workflow from making or purchasing a sample to the characterizations you perform? (This is how sample creation links to the beginning of a characterization history.)
4. What happens to your sample when you are done? Do you hand it off to someone else? (This is how we know what to link across multiple investigators to fill out the whole GEMD.)
5. Can you do something simple to link new data to the GEMD backbone created by the above questions? (This is how the built backbone receives later additions that were not known when the workflow was first defined.)

arachid1 commented 4 months ago

We propose a text-based solution using XML to generate GEMD graph models from user-defined workflows.

Proposed Solution:

XML Schema Design: Define an XML schema that represents the structure of GEMD graph models, including elements for ingredients, processes, materials, and measurements. Use XML attributes and elements to capture details such as IDs, names, and properties.

Workflow Specification: Allow users to define their workflows in XML format, specifying the sequence of ingredients, processes, materials, and measurements. Use XML elements to represent each step in the workflow.

Handling Many-to-One Connections: To handle cases where many ingredients go into a process, use XML elements to define relationships between ingredients and processes. For example, group several <ingredient> child elements under one parent element, as the examples below do with <ingredients>.

Handling One-to-Many Connections: To handle cases where one material goes into many measurements, use XML elements to define relationships between materials and measurements. For example, list several <measurement> child elements under one parent element in the material's step, as the examples below do with <measurements>.

Generation of GEMD Models: Use XSLT (Extensible Stylesheet Language Transformations) to transform user-defined workflows into GEMD graph models. XSLT templates can map XML elements to OpenMSIModel's 'entity/gemd', which maps to GEMD objects, which in turn map to GraphML nodes and edges.
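
As a rough illustration of the mapping step, the sketch below walks a workflow and emits a plain list of graph edges. It uses Python's standard xml.etree.ElementTree rather than XSLT, and it does not call OpenMSIModel or GEMD. The helper name workflow_to_edges and the edge direction (ingredient -> process -> material -> measurement) are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Same shape as the initial example in this proposal.
WORKFLOW_XML = """
<workflow>
  <steps>
    <step>
      <name>Prepare Solution</name>
      <ingredients>
        <ingredient>ID1</ingredient>
        <ingredient>ID2</ingredient>
      </ingredients>
      <process>Mix</process>
      <materials>
        <material>ID3</material>
      </materials>
      <measurements>
        <measurement>ID4</measurement>
      </measurements>
    </step>
  </steps>
</workflow>
"""


def workflow_to_edges(xml_text):
    """Return (source, target) pairs for every step (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    edges = []
    for step in root.iter("step"):
        process = step.findtext("process")
        if process is None:
            continue  # placeholder step with no content yet
        process = process.strip()
        materials = [m.text.strip() for m in step.iter("material")]
        # Many-to-one: every ingredient feeds the single process.
        for ingredient in step.iter("ingredient"):
            edges.append((ingredient.text.strip(), process))
        # The process produces each material.
        for material in materials:
            edges.append((process, material))
        # One-to-many: each material can carry several measurements.
        for measurement in step.iter("measurement"):
            for material in materials:
                edges.append((material, measurement.text.strip()))
    return edges


edges = workflow_to_edges(WORKFLOW_XML)
for source, target in edges:
    print(f"{source} -> {target}")
```

An edge list like this is only a stand-in for the real target; an XSLT stylesheet or a GEMD-aware builder would replace the tuples with actual objects.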

Coupling GEMD Model with File Links (and more): The goal of associations (see below) is to link modeling with key assets, particularly file links, by automatically generating file links based on rules and conditions related to asset features, like names. This emphasizes the importance of naming conventions, from which crucial information is extracted.

Benefits: Simplifies the creation of GEMD graph models for materials science workflows. Ensures consistency and standardization in representing workflows. Facilitates sharing and collaboration among researchers.

Initial example:

<workflow>
  <steps>
    <step>
      <name>Prepare Solution</name>
      <ingredients>
        <ingredient>ID1</ingredient>
        <ingredient>ID2</ingredient>
      </ingredients>
      <process>Mix</process>
      <materials>
        <material>ID3</material>
      </materials>
      <measurements>
        <measurement>ID4</measurement>
      </measurements>
    </step>
    <step>
      <!-- Next step in the workflow -->
    </step>
  </steps>
</workflow>
arachid1 commented 4 months ago

More advanced example:

<science_kit>
  <sequences>
    <sequence id="1">
      <name>Sample Preparation</name>
      <ingredients>
        <ingredient id="sampleA">
          <name>Sample A</name>
          <files>
            <file id="file1" link="https://example.com/sampleA_data.csv">Sample A Data</file>
          </files>
          <properties>
            <property name="temperature" value="25°C" />
            <property name="pressure" value="1 atm" />
          </properties>
        </ingredient>
        <ingredient id="sampleB">
          <name>Sample B</name>
          <files>
            <file id="file2" link="https://example.com/sampleB_data.csv">Sample B Data</file>
          </files>
          <properties>
            <property name="temperature" value="30°C" />
            <property name="pressure" value="1 atm" />
          </properties>
        </ingredient>
      </ingredients>
      <process>
        <name>Prepare Sample</name>
      </process>
      <materials>
        <material id="solution1">
          <name>Solution 1</name>
        </material>
      </materials>
      <measurements>
        <measurement id="measurement1">
          <name>Measurement 1</name>
        </measurement>
      </measurements>
    </sequence>
    <!-- 'ref' is used to reference a previously defined material -->
    <sequence id="2">
      <name>Analysis</name>
      <ingredients>
        <ingredient ref="solution1">
          <name>Solution 1</name>
        </ingredient>
      </ingredients>
      <process>
        <name>Analyze Samples</name>
        <parameters>
          <parameter name="analysis_method" value="spectroscopy" />
        </parameters>
        <conditions>
          <condition name="time" value="1 hour" />
          <condition name="temperature" value="30°C" />
        </conditions>
      </process>
      <materials>
        <material id="solution2">
          <name>Solution 2</name>
          <files>
            <file id="file3" link="https://example.com/solution2_data.csv">Solution 2 Data</file>
          </files>
        </material>
      </materials>
      <measurements>
        <measurement id="measurement2">
          <name>Measurement 2</name>
        </measurement>
        <measurement id="measurement3">
          <name>Measurement 3</name>
        </measurement>
      </measurements>
    </sequence>
  </sequences>
</science_kit>
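
The 'ref' attribute is what stitches sequences together, so a generator would need to resolve each reference against materials defined earlier. The sketch below shows one possible way to do that with Python's standard library. The function name resolve_refs, the trimmed-down XML, and the deliberately broken reference "solution9" are hypothetical and exist only to show the unresolved case.

```python
import xml.etree.ElementTree as ET

# Reduced version of the advanced example; "solution9" is an intentional dangling ref.
SCIENCE_KIT_XML = """
<science_kit>
  <sequences>
    <sequence id="1">
      <materials>
        <material id="solution1"><name>Solution 1</name></material>
      </materials>
    </sequence>
    <sequence id="2">
      <ingredients>
        <ingredient ref="solution1"><name>Solution 1</name></ingredient>
        <ingredient ref="solution9"><name>Unknown</name></ingredient>
      </ingredients>
    </sequence>
  </sequences>
</science_kit>
"""


def resolve_refs(xml_text):
    """Link each ingredient 'ref' to the earlier sequence that produced it."""
    root = ET.fromstring(xml_text)
    defined = {}  # material id -> id of the sequence that produced it
    links, unresolved = [], []
    for sequence in root.iter("sequence"):
        seq_id = sequence.get("id")
        # Ingredients are checked before this sequence's own materials are
        # registered, so a sequence can only reference earlier output.
        for ingredient in sequence.iter("ingredient"):
            ref = ingredient.get("ref")
            if ref is None:
                continue  # a newly defined ingredient, nothing to resolve
            if ref in defined:
                links.append((defined[ref], seq_id, ref))
            else:
                unresolved.append((seq_id, ref))
        for material in sequence.iter("material"):
            defined[material.get("id")] = seq_id
    return links, unresolved


links, unresolved = resolve_refs(SCIENCE_KIT_XML)
print("links:", links)
print("unresolved:", unresolved)
```

Reporting unresolved references instead of raising lets a user see every broken link in one pass; a stricter tool might prefer to fail fast.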
arachid1 commented 4 months ago

Goal of associations:

To couple modeling with the most important assets, mainly file links.
To find a way to automatically generate file links from rules and conditions on assets, such as a file or folder name.
To emphasize the importance of naming conventions, from which key information is extracted.

Example of associations:

<associations>
  <!-- For every material created, this will assign file links pointing to a real file placeholder named after the TRANSFORMED name of the material if the name matches the regex_rule -->
  <association asset="material" with="file_link" by="name" rule="regex_rule" transform="transform function">
    <generate_file>true</generate_file>
  </association>
  <!-- For every measurement created, this will assign file links pointing to a real folder placeholder named after the id of the measurement if the condition is met -->
  <association asset="measurement" with="file_link" by="id" condition="condition">
    <generate_folder>true</generate_folder>
  </association>
  <!-- For every property created, this will assign tags by applying the regex rules to the name and generate the tag identifier and value if the condition is met -->
  <association asset="property" with="tags" by="name" rule="regex_rule_for_tag_identifier:regex_rule_for_tag_value" condition="condition">
    <!-- Specify any additional parameters or actions here -->
  </association>
</associations>
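
To make the first association concrete, here is a minimal sketch of how a rule and a transform might be applied to material names to produce file-link names. Everything beyond the attribute names in the XML above is an assumption: the regex, the "snake_case" transform and its registry, the ".csv" extension, and the helper file_links_for are hypothetical. No files are created; the sketch only computes names.

```python
import re
import xml.etree.ElementTree as ET

# Raw string so the regex in the 'rule' attribute keeps its backslash.
ASSOCIATIONS_XML = r"""
<associations>
  <association asset="material" with="file_link" by="name" rule="^Solution \d+$" transform="snake_case">
    <generate_file>true</generate_file>
  </association>
</associations>
"""

# Hypothetical registry of transform functions, looked up by the 'transform' attribute.
TRANSFORMS = {
    "snake_case": lambda name: re.sub(r"\s+", "_", name.strip()).lower(),
}


def file_links_for(materials, associations_xml):
    """Map material ids to generated file names for matching associations."""
    root = ET.fromstring(associations_xml)
    links = {}
    for assoc in root.iter("association"):
        if assoc.get("asset") != "material" or assoc.get("with") != "file_link":
            continue  # this sketch only handles material -> file_link
        rule = re.compile(assoc.get("rule"))
        transform = TRANSFORMS[assoc.get("transform")]
        for material in materials:
            value = material[assoc.get("by")]  # e.g. the material's name
            if rule.search(value):
                # ".csv" is an assumed extension, not part of the proposal.
                links[material["id"]] = transform(value) + ".csv"
    return links


materials = [
    {"id": "solution1", "name": "Solution 1"},
    {"id": "buffer", "name": "Buffer"},  # does not match the rule
]
links = file_links_for(materials, ASSOCIATIONS_XML)
print(links)
```

Looking transforms up by name keeps the XML declarative: the file only names a transform, and the code decides what that name means.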