tRFdb-tsRNA数据库爬虫下载fa序列

tRFdb-tsRNA数据库爬虫下载fa序列 by 生信菜鸟团

目前大多数的tsRNA数据库基本上都没有提供数据下载的接口，比如 tRFdb：http://genome.bioch.virginia.edu/trfdb/

官网如下：

此数据库主要使用GEO与NCBI SRA数据库的small RNA high-throughput sequencing data进行tsRNA鉴定，提供了八大物种：

Rhodobacter sphaeroides (Bacteria;ATCC 17025) 类球红细菌
Schizosaccharomyces pombe (schiPomb1) 粟酒裂殖酵母
Drosophila (dm3) 果蝇
Caenorhabditis elegans (ce6)
秀丽隐杆线虫Xenopus(xenTro3)
爪蟾zebra fish (Zv9) 斑马鱼
mouse (mm9) 小鼠
human(hg19) 人

的tsRNA数据可以进行查询。目前数据库可以提供以下几种查询方式：tRF类型，tRF ID以及一段序列。

此次，我们的目的就是从这个数据库里以Human为例，把数据库中的tRF ID与Sequence提取下来。

首先，tRF Type选择tRF-5，物种选择Human，然后点击Get tRFs，就会得到以下页面：

此页面，单击鼠标右键，查看网页源码：

得到下面源码页面：

将此页面保存下来，ctrl+s，另存为：view-source_genome.bioch.virginia.edu_trfdb_search.php-Human-tRF-5.html

上面的搜索页面，选择一个Sequence，点击进去，页面如下，可以看到我们想要的信息都在这里，然后观察这个页面可以看到网址的特点：http://genome.bioch.virginia.edu/trfdb/sequence_display.php?seq_id=15011

由固定的http://genome.bioch.virginia.edu/trfdb/sequence_display.php?seq_id=加上seq_id。

在结合上面那个表格，就可以把所有数据都薅下来了。

下面是代码部分：

rm(list=ls())
# R 里面重要的一个读取网页的扩展包
library(RCurl) 
library(dplyr) 
library(rvest)
library(tidyverse)

opt <- list(html = "view-source_genome.bioch.virginia.edu_trfdb_search.php-Human-tRF-5.html",
            organism = "human",
            type = "trf-5",
            od = "./")

html <- read_html(opt$html)
html <- html_text(html)

#提取出所有匹配的内容
#以矩阵形式返回所有匹配到的内容，并将每一行元素个数统一，不够的用""空字符串表示
#此处的正则表达式有小改动，以便演示能匹配到多个的情况
type <- opt$type
trf_id <- t(str_extract_all(html,"trf_id=[0-9]+[a-z]*",simplify = T))
Organism <- t(str_extract_all(html,"organism=[a-z]+",simplify = T))
tRNA_Gene_Coordinates <- t(str_extract_all(html,"chr.{1,2}-[0-9]+-[0-9]+",simplify = T))
tRNA_Name <- t(str_extract_all(html,"chr.{1,2}\\.trna[0-9]+-.{1,6}",simplify = T))
seq_id <- t(str_extract_all(html,"seq_id=[0-9]+",simplify = T))
seq_id_1 <- gsub("seq_id=", "", seq_id)
url <- paste0("http://genome.bioch.virginia.edu/trfdb/sequence_display.php?", seq_id)

## combine the result
res <- data.frame(trf_id, Organism, type, tRNA_Gene_Coordinates, tRNA_Name, seq_id=seq_id_1, url, check.names = F)
res <- unique(res)

## get trf fastq and id from seq_id_url
tRF_ID <- NULL
Organism_1 <- NULL
Sequence <- NULL
Map_Position <- NULL

for(i in 1:nrow(res)){
  dir.create(paste0(opt$od, "seq_id/"))
  out <- paste0(opt$od, "./seq_id/", res$seq_id[i])
  if(!file.exists(out)){
    download.file(res$url[i], cacheOK = T, destfile=out, method = "curl")
  }
  
  b <- html_text(read_html(out))
  tRF_ID[i] <- str_extract(b,"tRF ID: [0-9]+[a-z]*")
  Organism_1[i] <- str_extract(b,"Organism: [a-z]{1,5}")
  Sequence[i] <- str_extract(b,"Sequence: [ATGCN]+")
  Map_Position[i] <- str_extract(b,"Map Position: [0-9]+-[0-9]+")
}

## final result
res_final <- cbind(res, tRF_ID, Organism_1, Sequence, Map_Position)

out_file <- paste0(opt$od, "/tRFdb_", opt$organism, "_", opt$type, "_infor.txt")
write.table(res_final, file = out_file, row.names = F,sep = "\t",quote = F)

结果如下：

此外，tRF-3与tRF-1的结果也可类似，愉快的下载下来了~

ixxmu / mp_duty

tRFdb-tsRNA数据库爬虫下载fa序列 #3602

tRFdb-tsRNA数据库爬虫下载fa序列 by 生信菜鸟团