Open linjing-lab opened 2 days ago
@linjing-lab, thanks for the feedback. To make sure we understand your request, can you confirm that the following issue description is accurate? This is my interpretation of the issue you raised.
Title: Support for Chemical and Biological Sequence Models Utilizing String Representations
Description:
To enhance the repository's applicability in cheminformatics and bioinformatics, it is proposed to integrate models capable of processing chemical and biological sequences represented as strings. This includes handling molecular structures via SMILES (Simplified Molecular Input Line Entry System) and protein sequences through amino acid representations.
Proposed Enhancements:
Incorporate Molecular Models:
Integrate Protein Sequence Models:
Enhance Existing Models:
Objective:
These enhancements aim to broaden the repository's utility in fields such as drug discovery and genomics by providing high-performance models built with the Burn framework, capable of efficient inference on molecular and protein sequence data.
Note that Burn comes from tracel-ai, and this software should be suited to high-performance prediction, for example in robotics or prediction from a data lake. Some pretrained molecular models use RoBERTa as their base model, such as ChemBERTa, ChemBERTa-2, MFBERT, SELFormer, and Semi-RoBERTa. Some pretrained protein models also use RoBERTa as their base model, such as ESM-1b, ESM-2, PromptProtein, and KeAP. These are encoder-only models, which makes them compatible with the tracel-ai models from an inference-performance perspective. I recommend that the models repository provide Burn-based multi-string examples for molecules, proteins, genomics, and multi-modal sets.
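To make the "multi-string" request concrete, here is a minimal sketch of the preprocessing step such examples would need: turning protein and SMILES strings into fixed-length token id sequences. Everything in it is an assumption for illustration. `CharTokenizer`, the special-token ids, and both alphabets are hypothetical, and real models such as ChemBERTa or ESM-2 ship their own vocabularies and tokenizers, which this does not reproduce. Character-level splitting is also a simplification for SMILES, since two-letter atoms such as `Cl` and `Br` are split into two tokens.

```rust
use std::collections::HashMap;

// Hypothetical special-token ids; real checkpoints define their own.
const PAD: u32 = 0;
const UNK: u32 = 1;
const CLS: u32 = 2;
const EOS: u32 = 3;

/// Hypothetical character-level tokenizer shared by proteins and SMILES.
struct CharTokenizer {
    vocab: HashMap<char, u32>,
}

impl CharTokenizer {
    /// Builds a vocabulary from an alphabet; ids start after the special tokens.
    fn new(alphabet: &str) -> Self {
        let mut vocab = HashMap::new();
        for c in alphabet.chars() {
            let next = vocab.len() as u32 + 4;
            // Duplicate characters keep their first id.
            vocab.entry(c).or_insert(next);
        }
        Self { vocab }
    }

    /// Encodes one string as [CLS] tokens [EOS], truncated and padded to `max_len`.
    /// Assumes `max_len >= 2` so that both special tokens fit.
    fn encode(&self, text: &str, max_len: usize) -> Vec<u32> {
        let mut ids = vec![CLS];
        ids.extend(
            text.chars()
                .take(max_len.saturating_sub(2))
                .map(|c| *self.vocab.get(&c).unwrap_or(&UNK)),
        );
        ids.push(EOS);
        ids.resize(max_len, PAD);
        ids
    }

    /// Encodes several strings to the same length, ready to stack into a batch.
    fn encode_batch(&self, texts: &[&str], max_len: usize) -> Vec<Vec<u32>> {
        texts.iter().map(|t| self.encode(t, max_len)).collect()
    }
}

fn main() {
    // The 20 standard amino acids, one letter each.
    let protein = CharTokenizer::new("ACDEFGHIKLMNPQRSTVWY");
    // A small, illustrative subset of SMILES characters.
    let smiles = CharTokenizer::new("CNOSPFIHBrclnos()[]=#+-@/\\%0123456789");

    println!("{:?}", protein.encode_batch(&["MKTAYIAK", "GSH"], 12));
    println!("{:?}", smiles.encode_batch(&["CCO", "c1ccccc1"], 12));
}
```

The id sequences could then be loaded into an integer tensor and passed to an encoder-only model built with Burn; that step is left out here because it depends on the model and backend chosen.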
This repository has a CRAFT model, which may be usable for structure-based tasks, but it is not clear how it would apply in a practical design, as MolCRAFT does with its continuous parameter space for drug design. Chemical compatibility should be judged by how well the character strings are interpreted, not only by an abstract design for machine scheduling. Abstract interpretation can always yield new distributed abstract operators that reflect machine memory and time, so I think tracel-ai could feature more decoder tasks and look for lower memory use when mapping strings to a continuous space. Multi-objective and chemical prediction now happen together, from explanation to a distributed stream pattern.