tidyomics / plyranges

A grammar of genomic data transformation
https://tidyomics.github.io/plyranges/
140 stars 18 forks source link

[BUG] read_bed adds NA., NA.1 ... col names that then cannot be renamed by select #103

Closed ConYel closed 1 year ago

ConYel commented 1 year ago

Dear Stuart, thanks a lot for this wonderful package. There is a bug that it came to my attention from using the read_bed function. Issue found here: https://github.com/ConYel/wind/issues/2#issue-1859848020

Context

The package when I made the workflow was : plyranges_1.11.0 and at that time I was able to perform this code without issues: Docker/ linux: wget ftp://ftp.ebi.ac.uk/pub/databases/RNAcentral/releases/15.0/genome_coordinates/bed/mus_musculus.GRCm38.bed.gz \ -O my_data/mouse_data/RNACentral/mus_musculus.GRCm38.bed.gz

library(tidyverse)
library(data.table)
library(plyranges)
library(BSgenome.Mmusculus.UCSC.mm38)
library(here)

## import RNACentral bed file
sRNA <- here("mus_musculus.GRCm38.bed.gz") %>% 
    read_bed() %>% 
    select("sRNA_id" = name, "gene_type" = NA.1, "source" = NA.2) %>% 
    mutate(type = "exon")

Currently with plyranges_1.20.0 (and the new version of the bed file from RNAcentral) the same code throws an error:

sRNA <- here("mus_musculus.GRCm39.bed.gz") %>% 
    read_bed() %>% 
    select("sRNA_id" = name, "gene_type" = NA.1, "source" = NA.2) %>% 
    mutate(type = "exon")

I checked that the bed file seems to be the same as before, and I saw that when I load it select has a problem picking the metadata columns starting with "NA" :

sRNA <- here("mus_musculus.GRCm39.bed.gz") %>% 
    read_bed() 

sRNA
GRanges object with 1023685 ranges and 8 metadata columns:
            seqnames            ranges strand |                name
               <Rle>         <IRanges>  <Rle> |         <character>
        [1]     chr1   3056358-3056384      - | URS0000253E70_10090
        [2]     chr1   3056360-3056384      - | URS000042B783_10090
        [3]     chr1   3056361-3056384      - | URS00004DC026_10090
        [4]     chr1   3056758-3056787      - | URS0000305867_10090
        [5]     chr1   3056765-3056792      - | URS00003E8F70_10090
        ...      ...               ...    ... .                 ...
  [1023681]     chrY 90855564-90855592      - | URS0000102C58_10090
  [1023682]     chrY 90855590-90855616      - | URS00002D7F5A_10090
  [1023683]     chrY 90855621-90855705      - | URS000028A7C4_10090
  [1023684]     chrY 90855633-90855705      - | URS000061C2F2_10090
  [1023685]     chrY 90855635-90855706      - | URS00001B61F7_10090
                score     itemRgb       NA.       NA..1       NA..2
            <numeric> <character> <logical> <character> <character>
        [1]         0     #3F7D97      <NA>       piRNA ENA,PirBase
        [2]         0     #3F7D97      <NA>       piRNA ENA,PirBase
        [3]         0     #3F7D97      <NA>       piRNA ENA,PirBase
        [4]         0     #3F7D97      <NA>       piRNA ENA,PirBase
        [5]         0     #3F7D97      <NA>       piRNA ENA,PirBase
        ...       ...         ...       ...         ...         ...
  [1023681]         0     #3F7D97      <NA>       piRNA ENA,PirBase
  [1023682]         0     #3F7D97      <NA>       piRNA ENA,PirBase
  [1023683]         0     #3F7D97      <NA>        tRNA         ENA
  [1023684]         0     #3F7D97      <NA>        tRNA         ENA
  [1023685]         0     #3F7D97      <NA>        tRNA         ENA
                        thick        blocks
                    <IRanges> <IRangesList>
        [1]   3056358-3056384          1-27
        [2]   3056360-3056384          1-25
        [3]   3056361-3056384          1-24
        [4]   3056758-3056787          1-30
        [5]   3056765-3056792          1-28
        ...               ...           ...
  [1023681] 90855564-90855592          1-29
  [1023682] 90855590-90855616          1-27
  [1023683] 90855621-90855705     1-5,18-85
  [1023684] 90855633-90855705          1-73
  [1023685] 90855635-90855706          1-72
  -------
  seqinfo: 59 sequences from an unspecified genome; no seqlengths

Select works fine with the other columns and it can rename them too:

sRNA %>% select(score, "sRNA_id" = name, blocks)
GRanges object with 1023685 ranges and 3 metadata columns:
            seqnames            ranges strand |     score
               <Rle>         <IRanges>  <Rle> | <numeric>
        [1]     chr1   3056358-3056384      - |         0
        [2]     chr1   3056360-3056384      - |         0
        [3]     chr1   3056361-3056384      - |         0
        [4]     chr1   3056758-3056787      - |         0
        [5]     chr1   3056765-3056792      - |         0
        ...      ...               ...    ... .       ...
  [1023681]     chrY 90855564-90855592      - |         0
  [1023682]     chrY 90855590-90855616      - |         0
  [1023683]     chrY 90855621-90855705      - |         0
  [1023684]     chrY 90855633-90855705      - |         0
  [1023685]     chrY 90855635-90855706      - |         0
                        sRNA_id        blocks
                    <character> <IRangesList>
        [1] URS0000253E70_10090          1-27
        [2] URS000042B783_10090          1-25
        [3] URS00004DC026_10090          1-24
        [4] URS0000305867_10090          1-30
        [5] URS00003E8F70_10090          1-28
        ...                 ...           ...
  [1023681] URS0000102C58_10090          1-29
  [1023682] URS00002D7F5A_10090          1-27
  [1023683] URS000028A7C4_10090     1-5,18-85
  [1023684] URS000061C2F2_10090          1-73
  [1023685] URS00001B61F7_10090          1-72
  -------
  seqinfo: 59 sequences from an unspecified genome; no seqlengths

But when I try to pick the "NA" columns it throws the error:

sRNA %>% select(NA.1)
Error in select_rng(.data, .drop_ranges, ...) : 
  Cannot select/rename the following columns: seqnames, start, end, width, strand
> sRNA %>% select("NA.1")
Error in select_rng(.data, .drop_ranges, ...) : 
  Cannot select/rename the following columns: seqnames, start, end, width, strand

I'm trying to find a work around using as_tibble and rename but it seems it doesn't solve anything.

ConYel commented 1 year ago

Just saw that the columns names are ..1 ..2 instead of the precious .1 .2, I just changed them and it works properly.