BEACON: Benchmark for Comprehensive RNA Tasks and Language Models

This is the official codebase of the paper BEACON: Benchmark for Comprehensive RNA Tasks and Language Models

🔥 Update

[07/25]🔥 Updating models list and usage!
[06/11]🔥 BEACON is coming! We release the paper, code, data, and models for BEACON!

Prerequisites

Installation

important libs: torch==1.13.1+cu117, transformers==4.38.1

git clone https://github.com/terry-r123/RNABenchmark.git
cd RNABenchmark
conda create -n beacon python=3.8
pip install -r requirements.txt

🔍 Tasks and Datasets

Datasets of RNA tasks can be found in Google Drive

Model checkpoints of opensource RNA language models and BEACON-B can be found in Google Drive

Data structure

RNABenchmark
├── checkpoint
│   └── opensource
|       ├── rna-fm
|       ├── rnabert
|       ├── rnamsm
|       ├── splicebert-human510
|       ├── splicebert-ms510
|       ├── splicebert-ms1024
|       ├── utr-lm-mrl    
|       ├── utr-lm-te-el    
|       ├── utrbert-3mer    
|       ├── utrbert-4mer  
|       ├── utrbert-5mer  
|       └── utrbert-6mer   
│   └── baseline
|       ├── BEACON-B
|       └── BEACON-B512
├── data
│    ├── ContactMap
│    ├── CRISPROffTarget
│    ├── CRISPROnTarget
│    ├── Degradation
│    ├── DistanceMap
│    ├── Isoform
│    ├── MeanRibosomeLoading
│    ├── Modification
│    ├── NoncodingRNAFamily
│    ├── ProgrammableRNASwitches
│    ├── Secondary_structure_prediction
│    ├── SpliceAI
│    └── StructuralScoreImputation
├── downstream
│   └── structure
├── model
|   |── rna-fm
|   ├── rnabert
|   ├── rnamsm
|   ├── splicebert
|   ├── utrlm      
|   ├── utrbert   
|   └── rnalm  
├── tokenizer
└── scripts
│    ├── BEACON-B
│    └── opensource

The full list of current task names are :

Secondary_structure_prediction
ContactMap
DistanceMap
StructuralScoreImputation
SpliceAI
Isoform
NoncodingRNAFamily
Modification
MeanRibosomeLoading
Degradation
ProgrammableRNASwitches
CRISPROnTarget
CRISPROffTarget

🔍Models

And the list of available embedders/models used for training on the tasks are :

rna-fm
rnabert
rnamsm
utr-lm-mrl
utr-lm-te-el
splicebert-human510
splicebert-ms510
splicebert-ms1024
utrbert-3mer
utrbert-4mer
utrbert-5mer
utrbert-6mer
Model settings

Models	name	token	pos	length
RNA-FM	rna-fm	single	ape	1024
RNABERT	rnabert	single	ape	440
RNA-MSM	rnamsm	single	ape	1024
SpliceBERT-H510	splicebert-human510	single	ape	510
SpliceBERT-MS510	splicebert-ms510	single	ape	510
SpliceBERT-MS510	splicebert-ms510	single	ape	1024
UTR-LM-MRL	utr-lm-mrl	single	rope	1026
UTR-LM-TE&EL	utr-lm-te-el	single	rope	1026
UTRBERT-3mer	utrbert-3mer	3mer	ape	512
UTRBERT-4mer	utrbert-4mer	4mer	ape	512
UTRBERT-5mer	utrbert-5mer	5mer	ape	512
UTRBERT-6mer	utrbert-6mer	6mer	ape	512
BEACON-B	rnalm	single	alibi	1026
BEACON-B512	rnalm	single	alibi	512

🔍 Usage

Finetuning

To evalute on all RNA tasks, you can run the bash scripts in the scripts folder, for example:

cd RNABenchmark
bash ./scripts/BEACON-B/all_task.sh

Computing embeddings

Embeddings from a dummy RNA sequence can be used as follows:

import os, sys
current_path = os.path.dirname(os.path.abspath(__file__))
parent_dir = os.path.dirname(current_path)
sys.path.append(parent_dir)
from model.utrlm.modeling_utrlm import UtrLmModel
from tokenizer.tokenization_opensource import OpenRnaLMTokenizer

tokenizer = OpenRnaLMTokenizer.from_pretrained('./checkpoint/opensource/utr-lm-mrl', model_max_length=1026, padding_side="right", use_fast=True,)
model = UtrLmModel.from_pretrained('./checkpoint/opensource/utr-lm-mrl')     
sequences = ["AUUCCGAUUCCGAUUCCG"]
output = tokenizer.batch_encode_plus(sequences, return_tensors="pt", padding="longest", max_length = 1026, truncation=True)
input_ids = output["input_ids"]
attention_mask = output["attention_mask"]

embedding = model(input_ids=input_ids,attention_mask=attention_mask)[0] # shape [bz,length, hidden_size]
print(embedding.shape)

License

This codebase is released under the Apache License 2.0 as in the LICENSE file.

Citation

If you find this repo useful for your research, please consider citing the paper

@misc{ren2024beacon,
      title={BEACON: Benchmark for Comprehensive RNA Tasks and Language Models}, 
      author={Yuchen Ren and Zhiyuan Chen and Lifeng Qiao and Hongtai Jing and Yuchen Cai and Sheng Xu and Peng Ye and Xinzhu Ma and Siqi Sun and Hongliang Yan and Dong Yuan and Wanli Ouyang and Xihui Liu},
      year={2024},
      eprint={2406.10391},
      archivePrefix={arXiv},
      primaryClass={id='q-bio.QM' full_name='Quantitative Methods' is_active=True alt_name=None in_archive='q-bio' is_general=False description='All experimental, numerical, statistical and mathematical contributions of value to biology'}
}

terry-r123 / RNABenchmark

readme