ml4bio / RNA-FM

RNA foundation model
https://ml4bio.github.io/RNA-FM/
MIT License
203 stars, 22 forks

Request Access to Preprocessed Dataset RNAcentral100 #6

Closed yukang123 closed 10 months ago

yukang123 commented 10 months ago

Hi guys! I am curious whether I can access the preprocessed RNAcentral100 dataset that was used to pre-train the foundation model. If not, should I download the data directly from the RNAcentral website? https://ftp.ebi.ac.uk/pub/databases/RNAcentral/releases/19.0/

Thanks a lot!

mydkzgj commented 10 months ago

Hi, I think you'd better download it directly from the RNAcentral website. They recently published a new release with more sequences. You can then process it with cd-hit using your own settings.

yukang123 commented 10 months ago

Thanks! Sorry, I am not familiar with cd-hit. Is cd-hit the pre-processing step you used for pre-training? Where can I find the relevant scripts?

mydkzgj commented 10 months ago

You can follow their manual: https://sites.google.com/view/cd-hit

yukang123 commented 10 months ago

Got it. I will check it. It seems that I just need to replace the T with U and use cd-hit to reduce the redundancy. Am I right?
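
For readers following along, the two steps described above (replace T with U, then reduce redundancy with cd-hit) might look roughly like the sketch below. This is not the maintainers' script. The helper names `dna_to_rna_fasta` and `cd_hit_est_command`, the file names, and the identity threshold are all assumptions for illustration; the thread only says to use "your own settings". It also assumes `cd-hit-est`, the nucleotide variant of cd-hit, is the relevant program, and it does not settle whether clustering should run before or after the T-to-U conversion.

```python
from typing import Iterable, Iterator, List

# Map both upper- and lower-case thymine to uracil.
_T_TO_U = str.maketrans("Tt", "Uu")


def dna_to_rna_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines with T replaced by U in sequence lines only.

    Header lines (starting with '>') are passed through unchanged.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield line
        else:
            yield line.translate(_T_TO_U)


def cd_hit_est_command(fasta_in: str, fasta_out: str, identity: float = 1.0) -> List[str]:
    """Build (but do not run) a cd-hit-est command line.

    -i / -o are the input and output FASTA files, -c is the sequence
    identity threshold. The default of 1.0 is an assumption, not a value
    stated in this thread; check the cd-hit manual before relying on it.
    """
    return ["cd-hit-est", "-i", fasta_in, "-o", fasta_out, "-c", str(identity)]


# Small in-memory example; real input would be a FASTA file from RNAcentral.
example = [">URS_EXAMPLE hypothetical record", "ACGTTGCA", "ttga"]
converted = list(dna_to_rna_fasta(example))

# Hypothetical file names, shown only to illustrate the argument order.
command = cd_hit_est_command("rnacentral_input.fasta", "rnacentral_nr.fasta")
```

The command list can be passed to `subprocess.run(command, check=True)` once cd-hit is installed; building it separately keeps the threshold easy to change.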
