tracel-ai / models

Models and examples built with Burn
Apache License 2.0

Feature request: chemical prediction and generation models with string representations #49

Open linjing-lab opened 2 days ago

linjing-lab commented 2 days ago

I noticed tracel-ai through the Burn framework; this software is well suited to high-performance prediction workloads, such as robotics or prediction over a data lake. Several molecular pretrained models use RoBERTa as their base model, such as ChemBERTa, ChemBERTa-2, MFBERT, SELFormer, and Semi-RoBERTa. Several protein pretrained models likewise build on RoBERTa, such as ESM-1b, ESM-2, PromptProtein, and KeAP. These are encoder-only tasks that are compatible with the models in tracel-ai from an inference-performance perspective, so I recommend that this repository provide Burn-based multi-string examples for molecules, proteins, genomics, and multi-modal datasets.

This repository includes the CRAFT model, which might be used for structure-based tasks, but its design intent is not clear enough in practice, unlike MolCRAFT, which works in a continuous parameter space for drug design. Clear chemical compatibility is constrained mainly by how far character-level string interpretation can be pushed, not only by abstract designs for machine schedules. Abstract interpretation can always export new distributed abstract operators that reflect machine memory and time, so I think tracel-ai could feature more decoding tasks and aim for low memory use when mapping strings to a continuous space. Multi-objective and chemical prediction could then happen along a single path, from explanation to a distributed stream pattern.

antimora commented 2 hours ago

@linjing-lab Thanks for the feedback. So that we understand your request, can you confirm that the following issue description is accurate? This is my interpretation of the issue you raised.


Title: Support for Chemical and Biological Sequence Models Utilizing String Representations

Description:

To enhance the repository's applicability in cheminformatics and bioinformatics, it is proposed to integrate models capable of processing chemical and biological sequences represented as strings. This includes handling molecular structures via SMILES (Simplified Molecular Input Line Entry System) and protein sequences through amino acid representations.
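To make the string-representation requirement concrete, here is a minimal character-level SMILES tokenizer sketch in plain Rust. The function name and token rules are illustrative assumptions, not an existing API in this repository; real tokenizers for the models named below are typically learned (e.g., BPE) rather than rule-based.

```rust
/// Hypothetical character-level SMILES tokenizer (illustrative only).
/// Keeps two-letter elements like "Cl" and "Br" together and treats
/// bracket atoms such as "[nH+]" as single tokens.
fn tokenize_smiles(smiles: &str) -> Vec<String> {
    let chars: Vec<char> = smiles.chars().collect();
    let mut tokens: Vec<String> = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        match chars[i] {
            // A bracket atom, e.g. "[nH+]", becomes one token.
            '[' => {
                let start = i;
                while i < chars.len() && chars[i] != ']' {
                    i += 1;
                }
                let end = i.min(chars.len() - 1); // tolerate an unclosed bracket
                tokens.push(chars[start..=end].iter().collect());
                i = end + 1;
            }
            // Two-letter halogens common in SMILES.
            'C' if chars.get(i + 1) == Some(&'l') => {
                tokens.push("Cl".to_string());
                i += 2;
            }
            'B' if chars.get(i + 1) == Some(&'r') => {
                tokens.push("Br".to_string());
                i += 2;
            }
            // Everything else (atoms, bonds, ring closures) is one character.
            c => {
                tokens.push(c.to_string());
                i += 1;
            }
        }
    }
    tokens
}

fn main() {
    // Aspirin: CC(=O)Oc1ccccc1C(=O)O
    println!("{:?}", tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"));
}
```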

Proposed Enhancements:

  1. Incorporate Molecular Models:

    • Develop and include models similar to ChemBERTa, ChemBERTa-2, MFBERT, SELFormer, and Semi-RoBERTa, which are based on the RoBERTa architecture and designed for molecular data processing (see the encoder sketch after this list).

  2. Integrate Protein Sequence Models:

    • Add models akin to ESM-1b, ESM-2, PromptProtein, and KeAP, which utilize the RoBERTa architecture for protein sequence analysis.

  3. Enhance Existing Models:

    • Refine the current CRAFT model to improve its design clarity and functionality, enabling support for continuous parameter spaces in drug design, similar to MolCRAFT.
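
As a rough starting point for items 1 and 2, the following is a minimal sketch of an encoder-only sequence model using Burn's `nn` module. The configuration names (`TransformerEncoderConfig`, `EmbeddingConfig`) follow Burn's current API as I understand it, and the hyperparameters and struct name are assumptions, not a final design for this repository.

```rust
use burn::nn::transformer::{TransformerEncoder, TransformerEncoderConfig, TransformerEncoderInput};
use burn::nn::{Embedding, EmbeddingConfig};
use burn::prelude::*;

/// Sketch of a RoBERTa-style encoder over tokenized chemical or
/// biological sequences (hypothetical, for discussion only).
#[derive(Module, Debug)]
pub struct SequenceEncoder<B: Backend> {
    embedding: Embedding<B>,
    encoder: TransformerEncoder<B>,
}

impl<B: Backend> SequenceEncoder<B> {
    pub fn new(vocab_size: usize, d_model: usize, device: &B::Device) -> Self {
        Self {
            embedding: EmbeddingConfig::new(vocab_size, d_model).init(device),
            // d_ff, n_heads, n_layers below are illustrative placeholders.
            encoder: TransformerEncoderConfig::new(d_model, 4 * d_model, 8, 6).init(device),
        }
    }

    /// Encode token ids of shape [batch, seq_len] into contextual
    /// embeddings of shape [batch, seq_len, d_model].
    pub fn forward(&self, tokens: Tensor<B, 2, Int>) -> Tensor<B, 3> {
        let embedded = self.embedding.forward(tokens);
        self.encoder.forward(TransformerEncoderInput::new(embedded))
    }
}
```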

Objective:

These enhancements aim to broaden the repository's utility in fields such as drug discovery and genomics by providing high-performance models built with the Burn framework, capable of efficient inference on molecular and protein sequence data.
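
For completeness, a hypothetical inference call continuing the encoder sketch above, assuming Burn's NdArray backend feature is enabled; the vocabulary size, model width, and token ids are placeholders standing in for a real tokenizer's output.

```rust
use burn::backend::NdArray;

fn main() {
    type B = NdArray;
    let device = Default::default();
    let model = SequenceEncoder::<B>::new(512, 256, &device);

    // Token ids for one short tokenized sequence (hypothetical vocabulary).
    let tokens = Tensor::<B, 2, Int>::from_ints([[5, 9, 23, 9, 5]], &device);
    let encoded = model.forward(tokens);
    println!("{:?}", encoded.dims()); // [1, 5, 256]
}
```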