leekgroup / recount

R package for the recount2 project. Documentation website: http://leekgroup.github.io/recount/
https://jhubiostatistics.shinyapps.io/recount/

column names of colData of RangedSummarizedExperiment inconsistent? #10

Closed: mbernste closed this issue 7 years ago

mbernste commented 7 years ago

Hi, I noticed that the names of the columns in the row data table for a RangedSummarizedExperiment object seem to be inconsistent with the data in the columns of the table, unless I am misunderstanding something (I am an R novice).

I download a RangedSummarizedExperiment as follows:

library('recount')
url <- download_study('SRP009615')
load(file.path('SRP009615', 'rse_gene.Rdata'))
rowData(rse_gene)

I get a DataFrame with 21 columns. The order of the names of the columns does not seem to coincide with the data in that column. For example, the first column name is "project"; however, the first column seems to contain the run accession. Is this a bug? Or is there another way I am supposed to find the name of each column?

Thanks!

lcolladotor commented 7 years ago

Hi @mbernste,

Since you mentioned that you are a new R user, you might want to check the vignette for the SummarizedExperiment package.

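From your code, I believe that you meant to check colData() (information about the samples) instead of rowData() (information about the genes). The row names of the column data correspond to the SRA run identifier, not the SRA project identifier. When the table is printed, the row names appear to the left of the first named column, which is why the first column you see seems to contain run accessions while the first column name is "project". The Sequence Read Archive (SRA) has multiple identifiers, and the run identifier is the one that specifies a given sample.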
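
To illustrate the distinction without downloading anything, here is a minimal sketch that builds a toy SummarizedExperiment. All names and values in it (`runA`, `runB`, `projX`, `gene1`, and so on) are made up for illustration and are not part of the SRP009615 data. It only assumes that the SummarizedExperiment package is installed.

```r
library('SummarizedExperiment')

## Toy counts: 3 genes (rows) by 2 samples (columns); values are made up
counts <- matrix(
    1:6,
    nrow = 3,
    dimnames = list(paste0('gene', 1:3), c('runA', 'runB'))
)

## Sample-level information: one row per sample, row names are the run ids
sample_info <- DataFrame(
    project = c('projX', 'projX'),
    run = c('runA', 'runB'),
    row.names = c('runA', 'runB')
)

## Gene-level information: one row per gene
gene_info <- DataFrame(
    bp_length = c(100L, 200L, 300L),
    row.names = paste0('gene', 1:3)
)

se <- SummarizedExperiment(
    assays = list(counts = counts),
    rowData = gene_info,
    colData = sample_info
)

## rowData() describes the genes, colData() describes the samples
dim(rowData(se))
dim(colData(se))

## Column names and row names of the sample table are stored separately
colnames(colData(se))
rownames(colData(se))
```

Calling `colnames(colData(rse_gene))` on the real object should list the names of all the sample-level columns in the same way.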

In the future, I encourage you to use the Bioconductor support website https://support.bioconductor.org/, which has higher visibility, since other people might have the same questions you have. Remember to use tags!

Best, Leonardo

Un-evaluated code

library('recount')
library('devtools')

## Code from mbernste
url <- download_study('SRP009615')
load(file.path('SRP009615', 'rse_gene.Rdata'))
rowData(rse_gene)

## Explore the column data, not the row one
dim(colData(rse_gene))
colData(rse_gene)[, 1:4]
identical(rownames(colData(rse_gene)), colData(rse_gene)$run)

## Reproducibility info
proc.time()
message(Sys.time())
options(width = 120)
session_info()

Evaluated code

> library('recount')
> library('devtools')
> 
> ## Code from mbernste
> url <- download_study('SRP009615')
2017-05-05 12:09:09 downloading file rse_gene.Rdata to SRP009615
trying URL 'http://duffel.rail.bio/recount/SRP009615/rse_gene.Rdata'
Content type 'application/octet-stream' length 3120155 bytes (3.0 MB)
==================================================
downloaded 3.0 MB

> load(file.path('SRP009615', 'rse_gene.Rdata'))
> rowData(rse_gene)
DataFrame with 58037 rows and 3 columns
                 gene_id bp_length          symbol
             <character> <integer> <CharacterList>
1     ENSG00000000003.14      4535          TSPAN6
2      ENSG00000000005.5      1610            TNMD
3     ENSG00000000419.12      1207            DPM1
4     ENSG00000000457.13      6883           SCYL3
5     ENSG00000000460.16      5967        C1orf112
...                  ...       ...             ...
58033  ENSG00000283695.1        61              NA
58034  ENSG00000283696.1       997              NA
58035  ENSG00000283697.1      1184    LOC101928917
58036  ENSG00000283698.1       940              NA
58037  ENSG00000283699.1        60         MIR4481
> 
> ## Explore the column data, not the row one
> dim(colData(rse_gene))
[1] 12 21
> colData(rse_gene)[, 1:4]
DataFrame with 12 rows and 4 columns
              project      sample  experiment         run
          <character> <character> <character> <character>
SRR387777   SRP009615   SRS281685   SRX110461   SRR387777
SRR387778   SRP009615   SRS281686   SRX110462   SRR387778
SRR387779   SRP009615   SRS281687   SRX110463   SRR387779
SRR387780   SRP009615   SRS281688   SRX110464   SRR387780
SRR389077   SRP009615   SRS282369   SRX111299   SRR389077
...               ...         ...         ...         ...
SRR389080   SRP009615   SRS282372   SRX111302   SRR389080
SRR389081   SRP009615   SRS282373   SRX111303   SRR389081
SRR389082   SRP009615   SRS282374   SRX111304   SRR389082
SRR389083   SRP009615   SRS282375   SRX111305   SRR389083
SRR389084   SRP009615   SRS282376   SRX111306   SRR389084
> identical(rownames(colData(rse_gene)), colData(rse_gene)$run)
[1] TRUE
> 
> ## Reproducibility info
> proc.time()
   user  system elapsed 
 14.981   2.365 162.647 
> message(Sys.time())
2017-05-05 12:09:10
> options(width = 120)
> session_info()
Session info -----------------------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.4.0 (2017-04-21)
 system   x86_64, darwin15.6.0        
 ui       AQUA                        
 language (EN)                        
 collate  en_US.UTF-8                 
 tz       America/New_York            
 date     2017-05-05                  

Packages ---------------------------------------------------------------------------------------------------------------
 package              * version  date       source        
 acepack                1.4.1    2016-10-29 CRAN (R 3.4.0)
 AnnotationDbi          1.38.0   2017-04-25 Bioconductor  
 backports              1.0.5    2017-01-18 CRAN (R 3.4.0)
 base64enc              0.1-3    2015-07-28 CRAN (R 3.4.0)
 Biobase              * 2.36.2   2017-05-04 Bioconductor  
 BiocGenerics         * 0.22.0   2017-04-25 Bioconductor  
 BiocParallel           1.10.1   2017-05-03 Bioconductor  
 biomaRt                2.32.0   2017-04-26 Bioconductor  
 Biostrings             2.44.0   2017-04-25 Bioconductor  
 bitops                 1.0-6    2013-08-17 CRAN (R 3.4.0)
 BSgenome               1.44.0   2017-04-25 Bioconductor  
 bumphunter             1.16.0   2017-04-25 Bioconductor  
 checkmate              1.8.2    2016-11-02 CRAN (R 3.4.0)
 cluster                2.0.6    2017-03-10 CRAN (R 3.4.0)
 codetools              0.2-15   2016-10-05 CRAN (R 3.4.0)
 colorspace             1.3-2    2016-12-14 CRAN (R 3.4.0)
 data.table             1.10.4   2017-02-01 CRAN (R 3.4.0)
 DBI                    0.6-1    2017-04-01 CRAN (R 3.4.0)
 DelayedArray         * 0.2.0    2017-04-25 Bioconductor  
 derfinder              1.10.0   2017-04-25 Bioconductor  
 derfinderHelper        1.10.0   2017-04-25 Bioconductor  
 devtools             * 1.12.0   2016-12-05 CRAN (R 3.4.0)
 digest                 0.6.12   2017-01-27 CRAN (R 3.4.0)
 doRNG                  1.6.6    2017-04-10 CRAN (R 3.4.0)
 downloader             0.4      2015-07-09 CRAN (R 3.4.0)
 foreach                1.4.3    2015-10-13 CRAN (R 3.4.0)
 foreign                0.8-68   2017-04-24 CRAN (R 3.4.0)
 Formula                1.2-1    2015-04-07 CRAN (R 3.4.0)
 GenomeInfoDb         * 1.12.0   2017-04-25 Bioconductor  
 GenomeInfoDbData       0.99.0   2017-02-14 Bioconductor  
 GenomicAlignments      1.12.0   2017-04-25 Bioconductor  
 GenomicFeatures        1.28.0   2017-04-26 Bioconductor  
 GenomicFiles           1.12.0   2017-04-26 Bioconductor  
 GenomicRanges        * 1.28.1   2017-05-03 Bioconductor  
 GEOquery               2.42.0   2017-04-25 Bioconductor  
 ggplot2                2.2.1    2016-12-30 CRAN (R 3.4.0)
 gridExtra              2.2.1    2016-02-29 CRAN (R 3.4.0)
 gtable                 0.2.0    2016-02-26 CRAN (R 3.4.0)
 Hmisc                  4.0-3    2017-05-02 CRAN (R 3.4.0)
 htmlTable              1.9      2017-01-26 CRAN (R 3.4.0)
 htmltools              0.3.6    2017-04-28 CRAN (R 3.4.0)
 htmlwidgets            0.8      2016-11-09 CRAN (R 3.4.0)
 httr                   1.2.1    2016-07-03 CRAN (R 3.4.0)
 IRanges              * 2.10.0   2017-04-25 Bioconductor  
 iterators              1.0.8    2015-10-13 CRAN (R 3.4.0)
 jsonlite               1.4      2017-04-08 CRAN (R 3.4.0)
 knitr                  1.15.1   2016-11-22 CRAN (R 3.4.0)
 lattice                0.20-35  2017-03-25 CRAN (R 3.4.0)
 latticeExtra           0.6-28   2016-02-09 CRAN (R 3.4.0)
 lazyeval               0.2.0    2016-06-12 CRAN (R 3.4.0)
 locfit                 1.5-9.1  2013-04-20 CRAN (R 3.4.0)
 magrittr               1.5      2014-11-22 CRAN (R 3.4.0)
 Matrix                 1.2-10   2017-04-28 CRAN (R 3.4.0)
 matrixStats          * 0.52.2   2017-04-14 CRAN (R 3.4.0)
 memoise                1.1.0    2017-04-21 CRAN (R 3.4.0)
 munsell                0.4.3    2016-02-13 CRAN (R 3.4.0)
 nnet                   7.3-12   2016-02-02 CRAN (R 3.4.0)
 pkgmaker               0.22     2014-05-14 CRAN (R 3.4.0)
 plyr                   1.8.4    2016-06-08 CRAN (R 3.4.0)
 qvalue                 2.8.0    2017-04-25 Bioconductor  
 R6                     2.2.0    2016-10-05 CRAN (R 3.4.0)
 RColorBrewer           1.1-2    2014-12-07 CRAN (R 3.4.0)
 Rcpp                   0.12.10  2017-03-19 CRAN (R 3.4.0)
 RCurl                  1.95-4.8 2016-03-01 CRAN (R 3.4.0)
 recount              * 1.2.0    2017-04-25 Bioconductor  
 registry               0.3      2015-07-08 CRAN (R 3.4.0)
 rentrez                1.0.4    2016-10-26 CRAN (R 3.4.0)
 reshape2               1.4.2    2016-10-22 CRAN (R 3.4.0)
 rngtools               1.2.4    2014-03-06 CRAN (R 3.4.0)
 rpart                  4.1-11   2017-03-13 CRAN (R 3.4.0)
 Rsamtools              1.28.0   2017-04-25 Bioconductor  
 RSQLite                1.1-2    2017-01-08 CRAN (R 3.4.0)
 rtracklayer            1.36.0   2017-04-25 Bioconductor  
 S4Vectors            * 0.14.0   2017-04-25 Bioconductor  
 scales                 0.4.1    2016-11-09 CRAN (R 3.4.0)
 stringi                1.1.5    2017-04-07 CRAN (R 3.4.0)
 stringr                1.2.0    2017-02-18 CRAN (R 3.4.0)
 SummarizedExperiment * 1.6.1    2017-05-03 Bioconductor  
 survival               2.41-3   2017-04-04 CRAN (R 3.4.0)
 tibble                 1.3.0    2017-04-01 CRAN (R 3.4.0)
 VariantAnnotation      1.22.0   2017-04-25 Bioconductor  
 withr                  1.0.2    2016-06-20 CRAN (R 3.4.0)
 XML                    3.98-1.7 2017-05-03 CRAN (R 3.4.0)
 xtable                 1.8-2    2016-02-05 CRAN (R 3.4.0)
 XVector                0.16.0   2017-04-25 Bioconductor  
 zlibbioc               1.22.0   2017-04-25 Bioconductor  
>

mbernste commented 7 years ago

Hi, thanks for your fast response. In the future I will post questions like this to the Bioconductor support site. I did mean to say colData, not rowData, in my question; I apologize for the confusion and have updated the title of the issue.

lcolladotor commented 7 years ago

No problem and have a good day ^^