rnabioco / valr

Genome Interval Arithmetic in R
http://rnabioco.github.io/valr/
Other
88 stars 25 forks source link

Alphanumerical Sorting by Chromosome #352

Closed canholyavkin closed 4 years ago

canholyavkin commented 4 years ago

The bed_sort function doesn't seem to sort the bed file alphanumerical by chromosome column. In the example below, the chromosome 11 record located on the second row after sorting, while it should be at the end. by_chrom = T parameter also doesn't change the order.

rd <- tibble::tribble(
  ~chrom, ~start, ~end, 
  "chr1", 1, 100,
  "chr11", 1, 100,
  "chr4", 1, 100,
  "chr2", 1, 100,
)

bed_sort(rd, by_chrom = T)
# A tibble: 4 x 3
  chrom start   end
  <chr> <dbl> <dbl>
1 chr1      1   100
2 chr11     1   100
3 chr2      1   100
4 chr4      1   100
kriemo commented 4 years ago

Hi. Thanks for your interest in valr. The default sorting is by chrom, start, and end. The by_chrom argument only affects the output when by_size = TRUE or reverse = TRUE, as by default the output is sorted by chromosome first.

The sorting of the chrom column is alphanumeric. This is the default convention used by bedtools, but is in contrast to the sort order of bam files which sort by chromosome size (as defined in the bam header).

library(valr)
library(tibble)
library(dplyr)

rd <- tibble::tribble(
  ~chrom, ~start, ~end, 
  "chr1", 1, 100,
  "chr11", 1, 100,
  "chr4", 1, 100,
  "chr2", 1, 100,
)

# base R sorting 
sort(rd$chrom)
#> [1] "chr1"  "chr11" "chr2"  "chr4"

"chr11" < "chr2"
#> [1] TRUE

To define a custom sort (non-alphanumeric) order you could use a "genome"-like data.frame that defines the chromosome ordering and pipe output from valr to dplyr::left_join()

# retain sorting by chrom order defined in other data.frame
chroms <- tribble(
  ~chrom, 
  "chr1", 
  "chr2", 
  "chr4",
  "chr11"
)

bed_intersect(rd, rd, suffix = c("", "_b")) %>% 
  left_join(chroms, ., by = c("chrom"))
#> # A tibble: 4 x 6
#>   chrom start   end start_b end_b .overlap
#>   <chr> <dbl> <dbl>   <dbl> <dbl>    <int>
#> 1 chr1      1   100       1   100       99
#> 2 chr2      1   100       1   100       99
#> 3 chr4      1   100       1   100       99
#> 4 chr11     1   100       1   100       99

Created on 2020-02-19 by the reprex package (v0.3.0)

Here is the default sorting used in bedtools sort

 echo -e "chr1\t1\t100\nchr11\t1\t100\nchr2\t1\t100" | bedtools sort
chr1    1   100
chr11   1   100
chr2    1   100
canholyavkin commented 4 years ago

Thank you for clear and detailed explanation.