tidyomics / plyranges

A grammar of genomic data transformation
https://tidyomics.github.io/plyranges/
140 stars 18 forks source link

[feature request] Import Blast tabular with comment lines #88

Open ConYel opened 3 years ago

ConYel commented 3 years ago

Hello and thank you for this impressive package! As I'm working with some genomic ranges lately I would like to include some results from BLAST. At first, I thought that it would be quite easy to include the results as granges but until now I haven't found a function that imports a blast resulting file in tabular with comment lines format.

For the implementation, I used an example (with some changes) from here: In my case I use docker to run BLAST, you can find all the corresponding code below:

Context

Import BLAST output as Genomic Ranges

Code

Run docker, download database and example, output the result to current directory

docker run --rm -ti -v $(pwd):/home/my_data  ncbi/blast

curl -O https://raw.githubusercontent.com/mhahsler/rBLAST/master/inst/examples/RNA_example.fasta

curl -O https://ftp.ncbi.nlm.nih.gov/blast/db/16S_ribosomal_RNA.tar.gz

mkdir 16SMicrobialDB

tar -zxvf 16S_ribosomal_RNA.tar.gz  -C 16SMicrobialDB

blastn -db 16SMicrobialDB/16S_ribosomal_RNA -query RNA_example.fasta -out "/home/my_data/results_blast.txt" -outfmt "7 delim=@ qseqid qlen sseqid slen qstart qend sstart send qseq sseq evalue bitscore score length pident nident mismatch positive ppos frames qframe sframe sstrand qcovs qcovhsp qcovus" -ungapped -num_threads 2

exit

Import results, check lengths of Genomic Ranges that are correct and provide a pseudo-function for later development.

suppressPackageStartupMessages({
library(tidyverse)
library(plyranges)
library(vroom)}
)

## find how many fields exist for the column names
blast_file <- file.path("results_blast.txt")

fourth_line_fields <-  blast_file %>% 
  vroom_lines() %>% 
  .[4] %>% 
  str_remove("# Fields: ") %>% 
  str_split(", ") %>% 
  unlist %>% 
  str_replace_all(" |/","_") %>% 
  str_replace("%","percent") %>% 
  str_remove("\\.")

## create basic columns for the creation of a GR object
blast_sequences <- blast_file %>% 
  read_delim(delim = "@", # here it should be made clear that the delim could be something else
             comment = "#", 
             col_names = fourth_line_fields) %>% 
  select(ends_with(c("start","end","seq","length","strand","subject_id"))) %>% 
  mutate(strand = if_else(subject_strand == "minus","-","+") %>% as_factor(),
         start = if_else(strand == "-", s_end, s_start),
         end = if_else(strand == "+", s_start, s_end),
         seqnames = subject_id)

## check if lengths are correct
blast_sequences %>% 
  mutate(qGR_len = abs(q_end-q_start )+1,
         sGR_len = abs(end-start)+1,
         qseq_len= str_length(query_seq), 
         sseq_len = str_length(subject_seq )) %>% 
  mutate(isittrue = qGR_len == sGR_len & qseq_len == sseq_len & qGR_len == sseq_len) %>% 
  count(isittrue)

#transform from tibble to Granges and then to tibble, check if lengths are correct again
blast_sequences %>% 
  mutate(qGR_len = abs(q_end-q_start )+1,
         sGR_len = abs(end-start)+1,
         qseq_len= str_length(query_seq), 
         sseq_len = str_length(subject_seq )) %>% 
  select(seqnames, start, end, strand, qGR_len:sseq_len) %>% 
  as_granges() %>% 
  as_tibble() %>% 
  mutate(isittrue = qGR_len == sGR_len & qseq_len == sseq_len & qGR_len == sseq_len & qGR_len == width ) %>% 
  count(isittrue)

# Function?
read_blast_tab_format <- function(blast_file, delim = "@"){
  fourth_line_fields <- blast_file %>% 
  vroom_lines() %>% 
  .[4] %>% 
  str_remove("# Fields: ") %>% 
  str_split(", ") %>% 
  unlist %>% 
  str_replace_all(" |/","_") %>% 
  str_replace("%","percent") %>% 
  str_remove("\\.")

  blast_sequences <- blast_file %>% 
  read_delim(delim = delim, # here it should be made clear that the delim could be something else
             comment = "#", 
             col_names = fourth_line_fields) %>% 
  mutate(strand = if_else(subject_strand == "minus","-","+") %>% as_factor(),
         start = if_else(strand == "-", s_end, s_start),
         end = if_else(strand == "+", s_start, s_end),
         seqnames = subject_id) %>% 
    as_granges()

  return(blast_sequences)
}

Small reproducible example

See above :)

R session information

Remember to include your full R session information.

Session info -------------------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 4.0.3 (2020-10-10)
 os       Windows 10 x64              
 system   x86_64, mingw32             
 ui       RStudio                     
 language (EN)                        
 collate  English_United Kingdom.1252 
 ctype    English_United Kingdom.1252 
 tz       Europe/Berlin               
 date     2021-06-05                  

- Packages -----------------------------------------------------------------------------------------------------------
 package              * version  date       lib source        
 assertthat             0.2.1    2019-03-21 [1] CRAN (R 4.0.0)
 backports              1.2.0    2020-11-02 [1] CRAN (R 4.0.3)
 Biobase                2.48.0   2020-04-27 [1] Bioconductor  
 BiocGenerics         * 0.34.0   2020-04-27 [1] Bioconductor  
 BiocParallel           1.22.0   2020-04-27 [1] Bioconductor  
 Biostrings             2.56.0   2020-04-27 [1] Bioconductor  
 bit                    4.0.4    2020-08-04 [1] CRAN (R 4.0.3)
 bit64                  4.0.5    2020-08-30 [1] CRAN (R 4.0.3)
 bitops                 1.0-6    2013-08-17 [1] CRAN (R 4.0.0)
 broom                  0.7.4    2021-01-29 [1] CRAN (R 4.0.3)
 cellranger             1.1.0    2016-07-27 [1] CRAN (R 4.0.0)
 cli                    2.3.0    2021-01-31 [1] CRAN (R 4.0.3)
 clipr                  0.7.1    2020-10-08 [1] CRAN (R 4.0.3)
 colorspace             2.0-0    2020-11-11 [1] CRAN (R 4.0.3)
 crayon                 1.4.0    2021-01-30 [1] CRAN (R 4.0.3)
 DBI                    1.1.1    2021-01-15 [1] CRAN (R 4.0.3)
 dbplyr                 2.0.0    2020-11-03 [1] CRAN (R 4.0.3)
 DelayedArray           0.14.1   2020-07-15 [1] Bioconductor  
 dplyr                * 1.0.3    2021-01-15 [1] CRAN (R 4.0.3)
 ellipsis               0.3.1    2020-05-15 [1] CRAN (R 4.0.0)
 fansi                  0.4.2    2021-01-15 [1] CRAN (R 4.0.3)
 forcats              * 0.5.1    2021-01-27 [1] CRAN (R 4.0.3)
 fs                     1.5.0    2020-07-31 [1] CRAN (R 4.0.3)
 generics               0.1.0    2020-10-31 [1] CRAN (R 4.0.3)
 GenomeInfoDb         * 1.24.2   2020-06-15 [1] Bioconductor  
 GenomeInfoDbData       1.2.3    2020-06-24 [1] Bioconductor  
 GenomicAlignments      1.24.0   2020-04-27 [1] Bioconductor  
 GenomicRanges        * 1.40.0   2020-04-27 [1] Bioconductor  
 ggplot2              * 3.3.3    2020-12-30 [1] CRAN (R 4.0.3)
 glue                   1.4.2    2020-08-27 [1] CRAN (R 4.0.3)
 gtable                 0.3.0    2019-03-25 [1] CRAN (R 4.0.0)
 haven                  2.3.1    2020-06-01 [1] CRAN (R 4.0.0)
 hms                    1.0.0    2021-01-13 [1] CRAN (R 4.0.3)
 httr                   1.4.2    2020-07-20 [1] CRAN (R 4.0.3)
 IRanges              * 2.22.2   2020-05-21 [1] Bioconductor  
 jsonlite               1.7.2    2020-12-09 [1] CRAN (R 4.0.3)
 knitr                  1.31     2021-01-27 [1] CRAN (R 4.0.3)
 lattice                0.20-41  2020-04-02 [2] CRAN (R 4.0.3)
 lifecycle              0.2.0    2020-03-06 [1] CRAN (R 4.0.0)
 lubridate              1.7.9.2  2020-11-13 [1] CRAN (R 4.0.3)
 magrittr               2.0.1    2020-11-17 [1] CRAN (R 4.0.3)
 Matrix                 1.2-18   2019-11-27 [1] CRAN (R 4.0.3)
 matrixStats            0.58.0   2021-01-29 [1] CRAN (R 4.0.3)
 modelr                 0.1.8    2020-05-19 [1] CRAN (R 4.0.0)
 munsell                0.5.0    2018-06-12 [1] CRAN (R 4.0.0)
 pillar                 1.4.7    2020-11-20 [1] CRAN (R 4.0.3)
 pkgconfig              2.0.3    2019-09-22 [1] CRAN (R 4.0.0)
 plyranges            * 1.8.0    2020-04-27 [1] Bioconductor  
 purrr                * 0.3.4    2020-04-17 [1] CRAN (R 4.0.0)
 R6                     2.5.0    2020-10-28 [1] CRAN (R 4.0.3)
 Rcpp                   1.0.6    2021-01-15 [1] CRAN (R 4.0.3)
 RCurl                  1.98-1.2 2020-04-18 [1] CRAN (R 4.0.0)
 readr                * 1.4.0    2020-10-05 [1] CRAN (R 4.0.3)
 readxl                 1.3.1    2019-03-13 [1] CRAN (R 4.0.0)
 reprex                 1.0.0    2021-01-27 [1] CRAN (R 4.0.3)
 rlang                  0.4.10   2020-12-30 [1] CRAN (R 4.0.3)
 Rsamtools              2.4.0    2020-04-27 [1] Bioconductor  
 rstudioapi             0.13     2020-11-12 [1] CRAN (R 4.0.3)
 rtracklayer            1.48.0   2020-04-27 [1] Bioconductor  
 rvest                  0.3.6    2020-07-25 [1] CRAN (R 4.0.3)
 S4Vectors            * 0.26.1   2020-05-16 [1] Bioconductor  
 scales                 1.1.1    2020-05-11 [1] CRAN (R 4.0.0)
 sessioninfo            1.1.1    2018-11-05 [1] CRAN (R 4.0.0)
 stringi                1.5.3    2020-09-09 [1] CRAN (R 4.0.3)
 stringr              * 1.4.0    2019-02-10 [1] CRAN (R 4.0.0)
 SummarizedExperiment   1.18.2   2020-07-09 [1] Bioconductor  
 tibble               * 3.0.5    2021-01-15 [1] CRAN (R 4.0.3)
 tidyr                * 1.1.2    2020-08-27 [1] CRAN (R 4.0.3)
 tidyselect             1.1.0    2020-05-11 [1] CRAN (R 4.0.0)
 tidyverse            * 1.3.0    2019-11-21 [1] CRAN (R 4.0.3)
 utf8                   1.1.4    2018-05-24 [1] CRAN (R 4.0.0)
 vctrs                  0.3.6    2020-12-17 [1] CRAN (R 4.0.3)
 vroom                * 1.3.2    2020-09-30 [1] CRAN (R 4.0.3)
 withr                  2.4.1    2021-01-26 [1] CRAN (R 4.0.3)
 xfun                   0.20     2021-01-06 [1] CRAN (R 4.0.3)
 XML                    3.99-0.5 2020-07-23 [1] CRAN (R 4.0.3)
 xml2                   1.3.2    2020-04-23 [1] CRAN (R 4.0.0)
 XVector                0.28.0   2020-04-27 [1] Bioconductor  
 zlibbioc               1.34.0   2020-04-27 [1] Bioconductor 

Cheers! :beers: