terry-r123 / RNABenchmark

BEACON: Benchmark for Comprehensive RNA Tasks and Language Models
Apache License 2.0

This is the official codebase for the paper "BEACON: Benchmark for Comprehensive RNA Tasks and Language Models".

🔥 Update

Prerequisites

Installation

Important libraries: torch==1.13.1+cu117, transformers==4.38.1

git clone https://github.com/terry-r123/RNABenchmark.git
cd RNABenchmark
conda create -n beacon python=3.8
conda activate beacon
pip install -r requirements.txt
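
To confirm that the environment matches the library versions listed above, a small check along these lines may help. It is only a sketch: the `PINS` dictionary and the `find_mismatches` helper are illustrative names that are not part of the repository, and the version strings (including the `+cu117` build tag) are compared verbatim.

```python
from importlib import metadata

# Pins copied from the library versions listed above; adjust them if
# requirements.txt says otherwise.
PINS = {"torch": "1.13.1+cu117", "transformers": "4.38.1"}


def find_mismatches(pins, get_version=metadata.version):
    """Return {package: (expected, found)} for every pin that is not satisfied.

    `found` is None when the package is not installed.
    """
    mismatches = {}
    for package, expected in pins.items():
        try:
            found = get_version(package)
        except metadata.PackageNotFoundError:
            found = None
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for package, (expected, found) in find_mismatches(PINS).items():
        print(f"{package}: expected {expected}, found {found}")
```

Running the script inside the `beacon` environment prints one line per mismatched package and nothing when both pins are satisfied.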

🔍 Tasks and Datasets

Datasets for the RNA tasks can be found in Google Drive.

Model checkpoints for the open-source RNA language models and BEACON-B can be found in Google Drive.

Data structure

RNABenchmark
├── checkpoint
│   ├── opensource
│   │   ├── rna-fm
│   │   ├── rnabert
│   │   ├── rnamsm
│   │   ├── splicebert-human510
│   │   ├── splicebert-ms510
│   │   ├── splicebert-ms1024
│   │   ├── utr-lm-mrl
│   │   ├── utr-lm-te-el
│   │   ├── utrbert-3mer
│   │   ├── utrbert-4mer
│   │   ├── utrbert-5mer
│   │   └── utrbert-6mer
│   └── baseline
│       ├── BEACON-B
│       └── BEACON-B512
├── data
│   ├── ContactMap
│   ├── CRISPROffTarget
│   ├── CRISPROnTarget
│   ├── Degradation
│   ├── DistanceMap
│   ├── Isoform
│   ├── MeanRibosomeLoading
│   ├── Modification
│   ├── NoncodingRNAFamily
│   ├── ProgrammableRNASwitches
│   ├── Secondary_structure_prediction
│   ├── SpliceAI
│   └── StructuralScoreImputation
├── downstream
│   └── structure
├── model
│   ├── rna-fm
│   ├── rnabert
│   ├── rnamsm
│   ├── splicebert
│   ├── utrlm
│   ├── utrbert
│   └── rnalm
├── tokenizer
└── scripts
    ├── BEACON-B
    └── opensource
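
After downloading the datasets and checkpoints, it can be useful to verify that the top-level folders from the tree above are in place before launching any script. The following is a minimal sketch; `EXPECTED_DIRS` and `missing_dirs` are hypothetical names introduced here for illustration and do not exist in the repository.

```python
from pathlib import Path

# Top-level folders taken from the directory tree above.
EXPECTED_DIRS = ["checkpoint", "data", "downstream", "model", "tokenizer", "scripts"]


def missing_dirs(root, expected=EXPECTED_DIRS):
    """Return the expected sub-directories that do not exist under `root`."""
    root = Path(root)
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs(".")
    print("Layout looks complete." if not missing else f"Missing: {missing}")
```

Run it from the `RNABenchmark` root; an empty result means every listed folder was found.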

The full list of current task names is:

🔍 Models

The available embedders/models that can be trained on the tasks are:

| Model | Name | Token type | Position encoding | Max length |
| --- | --- | --- | --- | --- |
| RNA-FM | rna-fm | single | ape | 1024 |
| RNABERT | rnabert | single | ape | 440 |
| RNA-MSM | rnamsm | single | ape | 1024 |
| SpliceBERT-H510 | splicebert-human510 | single | ape | 510 |
| SpliceBERT-MS510 | splicebert-ms510 | single | ape | 510 |
| SpliceBERT-MS1024 | splicebert-ms1024 | single | ape | 1024 |
| UTR-LM-MRL | utr-lm-mrl | single | rope | 1026 |
| UTR-LM-TE&EL | utr-lm-te-el | single | rope | 1026 |
| UTRBERT-3mer | utrbert-3mer | 3mer | ape | 512 |
| UTRBERT-4mer | utrbert-4mer | 4mer | ape | 512 |
| UTRBERT-5mer | utrbert-5mer | 5mer | ape | 512 |
| UTRBERT-6mer | utrbert-6mer | 6mer | ape | 512 |
| BEACON-B | rnalm | single | alibi | 1026 |
| BEACON-B512 | rnalm | single | alibi | 512 |
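
The table above distinguishes single-nucleotide tokens from k-mer tokens (3mer to 6mer). As a rough illustration of the difference, the sketch below splits a sequence both ways. It assumes overlapping k-mers with stride 1, a common convention for k-mer language models that is not confirmed here for UTRBERT. `split_tokens` is a hypothetical helper, not the repository's tokenizer, and it ignores special tokens.

```python
def split_tokens(sequence, k=1):
    """Split an RNA sequence into overlapping k-mers (stride 1).

    k=1 corresponds to the "single" entries in the table; k=3..6 corresponds
    to the "3mer".."6mer" entries, under the stride-1 assumption stated above.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    if len(sequence) < k:
        return []
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


print(split_tokens("AUUCCG"))       # ['A', 'U', 'U', 'C', 'C', 'G']
print(split_tokens("AUUCCG", k=3))  # ['AUU', 'UUC', 'UCC', 'CCG']
```

Under this assumption a sequence of n nucleotides yields n - k + 1 tokens, so the k-mer models see slightly fewer tokens than nucleotides for the same input.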

🔍 Usage

Finetuning

To evaluate on all RNA tasks, run the bash scripts in the scripts folder, for example:

cd RNABenchmark
bash ./scripts/BEACON-B/all_task.sh

Computing embeddings

Embeddings for a dummy RNA sequence can be computed as follows:

import os
import sys

# Make the repository root importable when this script lives in a subfolder.
current_path = os.path.dirname(os.path.abspath(__file__))
parent_dir = os.path.dirname(current_path)
sys.path.append(parent_dir)

from model.utrlm.modeling_utrlm import UtrLmModel
from tokenizer.tokenization_opensource import OpenRnaLMTokenizer

# Load the tokenizer and model from the local UTR-LM-MRL checkpoint.
tokenizer = OpenRnaLMTokenizer.from_pretrained(
    './checkpoint/opensource/utr-lm-mrl',
    model_max_length=1026,
    padding_side="right",
    use_fast=True,
)
model = UtrLmModel.from_pretrained('./checkpoint/opensource/utr-lm-mrl')

sequences = ["AUUCCGAUUCCGAUUCCG"]
output = tokenizer.batch_encode_plus(
    sequences, return_tensors="pt", padding="longest", max_length=1026, truncation=True
)
input_ids = output["input_ids"]
attention_mask = output["attention_mask"]

# Per-token embeddings, shape [batch_size, length, hidden_size].
embedding = model(input_ids=input_ids, attention_mask=attention_mask)[0]
print(embedding.shape)
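
The tokenizer call above truncates any input longer than `max_length`. If the tail of a long transcript matters, one option is to split the sequence into windows and embed each window separately. The sketch below is only an illustration: `window_sequence`, its `overlap` parameter, and the assumption that two positions are reserved for special tokens are all hypothetical and should be checked against the tokenizer actually in use.

```python
def window_sequence(sequence, max_length=1026, special_tokens=2, overlap=0):
    """Split `sequence` into windows that fit within `max_length` tokens.

    Assumes one token per nucleotide and `special_tokens` reserved positions
    (both are assumptions, see the note above).
    """
    size = max_length - special_tokens
    if size <= 0:
        raise ValueError("max_length must exceed special_tokens")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < window size")
    step = size - overlap
    windows = []
    for start in range(0, len(sequence), step):
        windows.append(sequence[start:start + size])
        if start + size >= len(sequence):
            break  # the current window already reaches the end
    return windows
```

Each returned window can then be passed to `tokenizer.batch_encode_plus` in the same way as `sequences` in the example above.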

License

This codebase is released under the Apache License 2.0, as stated in the LICENSE file.

Citation

If you find this repo useful for your research, please consider citing the paper:

@misc{ren2024beacon,
      title={BEACON: Benchmark for Comprehensive RNA Tasks and Language Models}, 
      author={Yuchen Ren and Zhiyuan Chen and Lifeng Qiao and Hongtai Jing and Yuchen Cai and Sheng Xu and Peng Ye and Xinzhu Ma and Siqi Sun and Hongliang Yan and Dong Yuan and Wanli Ouyang and Xihui Liu},
      year={2024},
      eprint={2406.10391},
      archivePrefix={arXiv},
      primaryClass={q-bio.QM}
}