Open arachid1 opened 4 months ago
We propose a text-based solution using XML to generate GEMD graph models from user-defined workflows.
Proposed Solution:
XML Schema Design: Define an XML schema that represents the structure of GEMD graph models, including elements for ingredients, processes, materials, and measurements. Use XML attributes and elements to capture details such as IDs, names, and properties.
Workflow Specification: Allow users to define their workflows in XML format, specifying the sequence of ingredients, processes, materials, and measurements. Use XML elements to represent each step in the workflow.
Handling Many-to-One Connections: To handle cases where many ingredients go into a process, use XML elements to define relationships between ingredients and processes. For example, use a
Handling One-to-Many Connections: To handle cases where one material goes into many measurements, use XML elements to define relationships between materials and measurements. For example, use a
Generation of GEMD Models: Use XSLT (Extensible Stylesheet Language Transformations) to transform user-defined workflows into GEMD graph models. XSLT templates can map XML elements to OpenMSIModel's 'entity/gemd', which maps to GEMD objects, which in turn map to GraphML nodes and edges.
Coupling GEMD Model with File Links (and more): The goal of associations (see below) is to link modeling with key assets, particularly file links, by automatically generating file links based on rules and conditions related to asset features, like names. This emphasizes the importance of naming conventions, from which crucial information is extracted.
Benefits: Simplifies the creation of GEMD graph models for materials science workflows. Ensures consistency and standardization in representing workflows. Facilitates sharing and collaboration among researchers.
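As a rough illustration of the last hop in the generation step (graph edges to GraphML nodes and edges), here is a minimal sketch. It uses plain Python rather than XSLT, because the Python standard library ships no XSLT processor, and it skips the OpenMSIModel and GEMD object layers entirely. The function name `edges_to_graphml` and the use of bare string IDs as node identifiers are assumptions made for the example, not part of the proposal.

```python
import xml.etree.ElementTree as ET

# Namespace URI defined by the GraphML format.
GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"


def edges_to_graphml(edges):
    """Serialize (source, target) ID pairs as a directed GraphML document.

    Hypothetical helper: the real pipeline would go through GEMD objects first.
    """
    # Emit GraphML as the default namespace instead of an ns0: prefix.
    ET.register_namespace("", GRAPHML_NS)
    root = ET.Element(f"{{{GRAPHML_NS}}}graphml")
    graph = ET.SubElement(root, f"{{{GRAPHML_NS}}}graph", edgedefault="directed")

    # One <node> per distinct ID, in first-seen order.
    node_ids = list(dict.fromkeys(n for edge in edges for n in edge))
    for node_id in node_ids:
        ET.SubElement(graph, f"{{{GRAPHML_NS}}}node", id=node_id)

    # One <edge> per (source, target) pair.
    for source, target in edges:
        ET.SubElement(graph, f"{{{GRAPHML_NS}}}edge", source=source, target=target)

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(edges_to_graphml([("ID1", "Mix"), ("ID2", "Mix"), ("Mix", "ID3")]))
```

An XSLT stylesheet would express the same mapping declaratively; the Python version is only meant to make the target output concrete.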
Initial example:
<workflow>
  <steps>
    <step>
      <name>Prepare Solution</name>
      <ingredients>
        <ingredient>ID1</ingredient>
        <ingredient>ID2</ingredient>
      </ingredients>
      <process>Mix</process>
      <materials>
        <material>ID3</material>
      </materials>
      <measurements>
        <measurement>ID4</measurement>
      </measurements>
    </step>
    <step>
      <!-- Next step in the workflow -->
    </step>
  </steps>
</workflow>
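To make the many-to-one and one-to-many connections concrete, here is a minimal sketch that reads a workflow in the format above and lists the graph edges it implies. The edge directions (ingredient to process, process to material, material to measurement) and the rule that every measurement in a step applies to every material in that step are assumptions for illustration; the proposal does not fix them. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

WORKFLOW_XML = """
<workflow>
  <steps>
    <step>
      <name>Prepare Solution</name>
      <ingredients>
        <ingredient>ID1</ingredient>
        <ingredient>ID2</ingredient>
      </ingredients>
      <process>Mix</process>
      <materials>
        <material>ID3</material>
      </materials>
      <measurements>
        <measurement>ID4</measurement>
      </measurements>
    </step>
    <step>
      <!-- Next step in the workflow -->
    </step>
  </steps>
</workflow>
"""


def step_edges(step):
    """Return (source, target) pairs for one <step> element."""
    process = step.findtext("process")
    ingredients = [e.text for e in step.findall("ingredients/ingredient")]
    materials = [e.text for e in step.findall("materials/material")]
    measurements = [e.text for e in step.findall("measurements/measurement")]

    # Many-to-one: every ingredient feeds the step's single process.
    edges = [(ing, process) for ing in ingredients]
    # The process produces each listed material.
    edges += [(process, mat) for mat in materials]
    # One-to-many (assumed): each material is linked to every measurement.
    edges += [(mat, meas) for mat in materials for meas in measurements]
    return edges


def workflow_edges(xml_text):
    """Collect edges from every step that actually declares a process."""
    root = ET.fromstring(xml_text)
    edges = []
    for step in root.findall("steps/step"):
        if step.find("process") is None:
            continue  # placeholder step with no content yet
        edges.extend(step_edges(step))
    return edges


if __name__ == "__main__":
    for source, target in workflow_edges(WORKFLOW_XML):
        print(f"{source} -> {target}")
```

Because IDs and the process name share one namespace in this sketch, a real schema would likely need typed or prefixed identifiers to avoid collisions.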
More advanced example:
<science_kit>
  <sequences>
    <sequence id="1">
      <name>Sample Preparation</name>
      <ingredients>
        <ingredient id="sampleA">
          <name>Sample A</name>
          <files>
            <file id="file1" link="https://example.com/sampleA_data.csv">Sample A Data</file>
          </files>
          <properties>
            <property name="temperature" value="25°C" />
            <property name="pressure" value="1 atm" />
          </properties>
        </ingredient>
        <ingredient id="sampleB">
          <name>Sample B</name>
          <files>
            <file id="file2" link="https://example.com/sampleB_data.csv">Sample B Data</file>
          </files>
          <properties>
            <property name="temperature" value="30°C" />
            <property name="pressure" value="1 atm" />
          </properties>
        </ingredient>
      </ingredients>
      <process>
        <name>Prepare Sample</name>
      </process>
      <materials>
        <material id="solution1">
          <name>Solution 1</name>
        </material>
      </materials>
      <measurements>
        <measurement id="measurement1">
          <name>Measurement 1</name>
        </measurement>
      </measurements>
    </sequence>
    <!-- 'ref' is used to reference a previously defined material -->
    <sequence id="2">
      <name>Analysis</name>
      <ingredients>
        <ingredient ref="solution1">
          <name>Solution 1</name>
        </ingredient>
      </ingredients>
      <process>
        <name>Analyze Samples</name>
        <parameters>
          <parameter name="analysis_method" value="spectroscopy" />
        </parameters>
        <conditions>
          <condition name="time" value="1 hour" />
          <condition name="temperature" value="30°C" />
        </conditions>
      </process>
      <materials>
        <material id="solution2">
          <name>Solution 2</name>
          <files>
            <file id="file3" link="https://example.com/solution2_data.csv">Solution 2 Data</file>
          </files>
        </material>
      </materials>
      <measurements>
        <measurement id="measurement2">
          <name>Measurement 2</name>
        </measurement>
        <measurement id="measurement3">
          <name>Measurement 3</name>
        </measurement>
      </measurements>
    </sequence>
  </sequences>
</science_kit>
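The 'ref' attribute is what chains sequences together, so a generator would need to resolve each ref to the sequence that produced the material. Below is a minimal sketch of that lookup on a trimmed copy of the example above. The function name `resolve_refs`, the choice to raise `ValueError` on an unknown ref, and the requirement that a material be defined in an earlier sequence are assumptions, not decided behavior.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the advanced example: files, properties, and
# measurements are omitted because ref resolution does not need them.
KIT_XML = """
<science_kit>
  <sequences>
    <sequence id="1">
      <ingredients>
        <ingredient id="sampleA"><name>Sample A</name></ingredient>
      </ingredients>
      <process><name>Prepare Sample</name></process>
      <materials>
        <material id="solution1"><name>Solution 1</name></material>
      </materials>
    </sequence>
    <sequence id="2">
      <ingredients>
        <ingredient ref="solution1"><name>Solution 1</name></ingredient>
      </ingredients>
      <process><name>Analyze Samples</name></process>
      <materials>
        <material id="solution2"><name>Solution 2</name></material>
      </materials>
    </sequence>
  </sequences>
</science_kit>
"""


def resolve_refs(xml_text):
    """Map (consuming sequence id, ref) to the id of the producing sequence.

    Raises ValueError if a ref names a material that no earlier sequence defined.
    """
    root = ET.fromstring(xml_text)
    produced_by = {}  # material id -> id of the sequence that produced it
    links = {}
    for seq in root.findall("sequences/sequence"):
        seq_id = seq.get("id")
        # Check ingredients first so a sequence cannot reference its own output.
        for ingredient in seq.findall("ingredients/ingredient"):
            ref = ingredient.get("ref")
            if ref is None:
                continue  # a fresh ingredient declared with id=..., not a reference
            if ref not in produced_by:
                raise ValueError(f"sequence {seq_id}: unknown material ref {ref!r}")
            links[(seq_id, ref)] = produced_by[ref]
        for material in seq.findall("materials/material"):
            produced_by[material.get("id")] = seq_id
    return links


if __name__ == "__main__":
    print(resolve_refs(KIT_XML))
```

Resolving refs in document order keeps the check simple; allowing forward references would require a second pass over the sequences.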
Goal of associations:
Couple modeling with the most important assets, mainly file links.
Find a way to automatically generate file links from rules and conditions on assets, such as file or folder names.
Emphasize the importance of naming conventions, from which key information is extracted.
Example of associations:
<associations>
  <!-- For every material created, this will assign file links pointing to a real file placeholder named after the TRANSFORMED name of the material if the name matches the regex_rule -->
  <association asset="material" with="file_link" by="name" rule="regex_rule" transform="transform function">
    <generate_file>true</generate_file>
  </association>
  <!-- For every measurement created, this will assign file links pointing to a real folder placeholder named after the id of the measurement if the condition is met -->
  <association asset="measurement" with="file_link" by="id" condition="condition">
    <generate_folder>true</generate_folder>
  </association>
  <!-- For every property created, this will assign tags by applying the regex rules to the name and generate the tag identifier and value if the condition is met -->
  <association asset="property" with="tags" by="name" rule="regex_rule_for_tag_identifier:regex_rule_for_tag_value" condition="condition">
    <!-- Specify any additional parameters or actions here -->
  </association>
</associations>
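One way the first association (material, by name, with a regex rule and a transform) might be evaluated is sketched below. Everything concrete here is hypothetical: the regex `Solution \d+`, the transform that lowercases the name and appends `_data.csv`, and the dictionary representation of assets are placeholders chosen for the example. The sketch only computes file names; it does not create files on disk, so `generate_file` is not implemented.

```python
import re


def file_links_for(assets, by, rule, transform):
    """Return {asset id: generated file name} for assets whose `by` field matches `rule`.

    `assets` is a list of dicts with at least an "id" key and the `by` key.
    """
    pattern = re.compile(rule)
    links = {}
    for asset in assets:
        value = asset[by]
        # Require the whole value to match, so "Solution 1 (old)" is rejected.
        if pattern.fullmatch(value):
            links[asset["id"]] = transform(value)
    return links


def to_file_name(name):
    """Hypothetical transform: 'Solution 1' -> 'solution_1_data.csv'."""
    return name.lower().replace(" ", "_") + "_data.csv"


# Hypothetical materials; only the first two match the rule.
materials = [
    {"id": "solution1", "name": "Solution 1"},
    {"id": "solution2", "name": "Solution 2"},
    {"id": "scrap", "name": "Waste"},
]

links = file_links_for(materials, by="name", rule=r"Solution \d+", transform=to_file_name)

if __name__ == "__main__":
    print(links)
```

The measurement and property associations would follow the same shape, with `by="id"` or a pair of regexes for tag identifier and value, plus whatever form the `condition` attribute ends up taking.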
Problem Statement:
Materials science workflows often involve complex sequences of ingredients, processes, and measurements. Capturing these workflows in a standardized format such as GEMD can facilitate collaboration and reproducibility. However, creating GEMD models manually can be time-consuming and error-prone. Manual modeling also has difficulty capturing file and folder structures, which are often a scientist's most critical assets, because of inadequate naming conventions or structures.
From another perspective, we could ask:
Could we build a GEMD model backbone if we had answers to these questions: