leekgroup / recount

R package for the recount2 project. Documentation website: http://leekgroup.github.io/recount/
https://jhubiostatistics.shinyapps.io/recount/

TCGA metadata issue in column age_at_initial_pathologic_diagnosis #8

Closed: lcolladotor closed this issue 7 years ago

lcolladotor commented 7 years ago

Hi,

@ShanEllis found a bug in the TCGA metadata: the column xml_age_at_initial_pathologic_diagnosis mixes numeric ages with text values such as 'Wall Lateral' and 'Trigone'. She also found that re-running https://github.com/leekgroup/recount-website/blob/master/metadata/tcga_prep/tcga_clinical.R fixes the problematic column. It looks like cgc_case_age_at_diagnosis holds the age for the samples whose age_at_initial_pathologic_diagnosis value is weird.

For now, this will be a known issue while I update the TCGA files in https://github.com/leekgroup/recount-website.

Best, Leo
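Until the files are updated, one possible workaround is to fall back on cgc_case_age_at_diagnosis for the affected entries. The following is only a minimal sketch under that assumption: `fix_age()` is a hypothetical helper that is not part of recount, and the input values are made up for illustration.

```r
## Hypothetical helper (not part of recount): replace non-numeric entries
## of a character age column with values from a fallback column.
fix_age <- function(age_chr, fallback) {
    ## Entries that are present and consist only of digits
    is_num <- !is.na(age_chr) & grepl('^[0-9]+$', age_chr)
    ## Entries that are present but are not ages (e.g. 'Wall Lateral')
    bad <- !is.na(age_chr) & !is_num

    age <- rep(NA_integer_, length(age_chr))
    age[is_num] <- as.integer(age_chr[is_num])
    age[bad] <- as.integer(fallback[bad])
    age
}

## Made-up example values
fixed <- fix_age(c('70', 'Wall Lateral', NA), c(70L, 77L, 55L))
fixed
```

Entries that were already missing stay missing, so the fallback column is only used where the original value is clearly not an age. With the real metadata the call would presumably look like `fix_age(md$xml_age_at_initial_pathologic_diagnosis, md$cgc_case_age_at_diagnosis)`.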

Unevaluated code

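library('recount')
library('devtools') # provides session_info(), used below

## Download the TCGA metadata and tabulate the problematic column
md <- all_metadata('TCGA')
table(md$xml_age_at_initial_pathologic_diagnosis)

## Second download, redundant with the one above
md <- recount::all_metadata('TCGA')

## Rows where the age column holds a text label instead of an age
weird <- which(md$xml_age_at_initial_pathologic_diagnosis %in% c('Trigone', 'Wall Anterior', 'Wall Lateral', 'Wall NOS', 'Wall Posterior'))

## Columns whose name contains 'age' (this also matches names such as 'stage')
md[weird, colnames(md)[grep('age', colnames(md))]]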

## Reproducibility information
print('Reproducibility information:')
Sys.time()
proc.time()
options(width = 120)
session_info()
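The code above lists the five problematic labels by hand. A more general check, sketched here on a small made-up data.frame rather than the real metadata, flags any entry of the age column that is present but not made up only of digits. The `toy` object and its values are assumptions for illustration.

```r
## Toy stand-in for the TCGA metadata (made-up values, not real samples)
toy <- data.frame(
    xml_age_at_initial_pathologic_diagnosis = c('70', 'Wall Lateral', '63', NA, 'Trigone'),
    cgc_case_age_at_diagnosis = c(70L, 77L, 63L, 55L, 58L),
    stringsAsFactors = FALSE
)

## Entries that are present but do not consist only of digits
age_chr <- toy$xml_age_at_initial_pathologic_diagnosis
not_numeric <- !is.na(age_chr) & !grepl('^[0-9]+$', age_chr)

## Which labels are affected, and how often
table(age_chr[not_numeric])

## Row indices of the affected samples
weird_toy <- which(not_numeric)
weird_toy
```

This avoids having to know the offending labels in advance, at the cost of also flagging any legitimate non-integer encoding of age, if one existed.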

Evaluated code

> library('recount')
> library('devtools')
> 
> md <- all_metadata('TCGA')
2017-02-24 13:28:09 downloading the metadata to /var/folders/cx/n9s558kx6fb7jf5z_pgszgb80000gn/T//RtmpSiDQCg/metadata_clean_tcga.Rdata
trying URL 'https://github.com/leekgroup/recount-website/blob/master/metadata/metadata_clean_tcga.Rdata?raw=true'
Content type 'application/octet-stream' length 16351229 bytes (15.6 MB)
==================================================
downloaded 15.6 MB

> table(md$xml_age_at_initial_pathologic_diagnosis)

             0             14             15             16             17             18             19             20             21             22 
            25              1              4              2              4              8             11             17             14             11 
            23             24             25             26             27             28             29             30             31             32 
            22             29             22             28             25             33             40             50             46             46 
            33             34             35             36             37             38             39             40             41             42 
            53             72             67             62             66             89             81            102            100            107 
            43             44             45             46             47             48             49             50             51             52 
           124             99            147            138            159            176            158            165            235            184 
            53             54             55             56             57             58             59             60             61             62 
           219            227            226            240            261            275            274            302            292            296 
            63             64             65             66             67             68             69             70             71             72 
           290            265            276            273            258            270            262            247            231            208 
            73             74             75             76             77             78             79             80             81             82 
           230            229            202            163            161            140            144            114             95             81 
            83             84             85             86             87             88             89             90        Trigone  Wall Anterior 
            64             74             51             28             35             24             10             49              1              6 
  Wall Lateral       Wall NOS Wall Posterior 
            11              1              9 
> md <- recount::all_metadata('TCGA')
2017-02-24 13:28:19 downloading the metadata to /var/folders/cx/n9s558kx6fb7jf5z_pgszgb80000gn/T//RtmpSiDQCg/metadata_clean_tcga.Rdata
trying URL 'https://github.com/leekgroup/recount-website/blob/master/metadata/metadata_clean_tcga.Rdata?raw=true'
Content type 'application/octet-stream' length 16351229 bytes (15.6 MB)
==================================================
downloaded 15.6 MB

> weird <- which(md$xml_age_at_initial_pathologic_diagnosis %in% c('Trigone', 'Wall Anterior', 'Wall Lateral', 'Wall NOS', 'Wall Posterior'))
> md[weird, colnames(md)[grep('age', colnames(md))]]
DataFrame with 28 rows and 39 columns
    gdc_cases.diagnoses.tumor_stage gdc_cases.diagnoses.age_at_diagnosis cgc_case_age_at_diagnosis cgc_case_clinical_stage cgc_case_pathologic_stage
                        <character>                            <numeric>                 <integer>             <character>               <character>
1                          stage ii                                25672                        70                      NA                  Stage II
2                          stage iv                                23236                        63                      NA                  Stage IV
3                          stage iv                                26893                        73                      NA                  Stage IV
4                          stage iv                                26874                        73                      NA                  Stage IV
5                          stage ii                                28204                        77                      NA                  Stage II
...                             ...                                  ...                       ...                     ...                       ...
24                         stage iv                                27963                        76                      NA                  Stage IV
25                        stage iii                                25185                        68                      NA                 Stage III
26                        stage iii                                28328                        77                      NA                 Stage III
27                         stage iv                                27816                        76                      NA                  Stage IV
28                         stage iv                                21196                        58                      NA                  Stage IV
    xml_primary_pathology_age_at_initial_pathologic_diagnosis xml_age_at_initial_pathologic_diagnosis xml_stage_event_system_version
                                                    <integer>                             <character>                    <character>
1                                                          NA                            Wall Lateral                            7th
2                                                          NA                            Wall Lateral                            7th
3                                                          NA                            Wall Lateral                            7th
4                                                          NA                            Wall Lateral                            7th
5                                                          NA                            Wall Lateral                            7th
...                                                       ...                                     ...                            ...
24                                                         NA                          Wall Posterior                            7th
25                                                         NA                            Wall Lateral                            7th
26                                                         NA                          Wall Posterior                            7th
27                                                         NA                           Wall Anterior                            7th
28                                                         NA                           Wall Anterior                            7th
    xml_stage_event_clinical_stage xml_stage_event_pathologic_stage xml_stage_event_tnm_categories xml_stage_event_psa xml_stage_event_gleason_grading
                       <character>                      <character>                    <character>         <character>                       <integer>
1                               NA                         Stage II                        T2aN0MX                  NA                              NA
2                               NA                         Stage IV                       T2T3N2MX                  NA                              NA
3                               NA                         Stage IV                      T2T3aN2MX                  NA                               6
4                               NA                         Stage IV                      T2T4bN1MX                  NA                               7
5                               NA                         Stage II                        T2aN0MX                  NA                              NA
...                            ...                              ...                            ...                 ...                             ...
24                              NA                         Stage IV                        T3bN2M0                  NA                               6
25                              NA                        Stage III                      T2T3bN0MX                  NA                              NA
26                              NA                        Stage III                      T1T3aN0MX                  NA                               7
27                              NA                         Stage IV                        T3bN3M1                  NA                               6
28                              NA                         Stage IV                        T3bN2MX                  NA                               6
    xml_stage_event_ann_arbor xml_stage_event_serum_markers xml_stage_event_igcccg_stage xml_stage_event_masaoka_stage xml_asbestos_exposure_age
                  <character>                   <character>                  <character>                   <character>                 <integer>
1                          NA                            NA                           NA                            NA                        NA
2                          NA                            NA                           NA                            NA                        NA
3                          NA                            NA                           NA                            NA                        NA
4                          NA                            NA                           NA                            NA                        NA
5                          NA                            NA                           NA                            NA                        NA
...                       ...                           ...                          ...                           ...                       ...
24                         NA                            NA                           NA                            NA                        NA
25                         NA                            NA                           NA                            NA                        NA
26                         NA                            NA                           NA                            NA                        NA
27                         NA                            NA                           NA                            NA                        NA
28                         NA                            NA                           NA                            NA                        NA
    xml_asbestos_exposure_age_last xml_birth_control_pill_history_usage_category xml_age_began_smoking_in_years xml_axillary_lymph_node_stage_method_type
                         <integer>                                   <character>                      <integer>                               <character>
1                               NA                                            NA                             12                                        NA
2                               NA                                            NA                             NA                                        NA
3                               NA                                            NA                             NA                                        NA
4                               NA                                            NA                             NA                                        NA
5                               NA                                            NA                             18                                        NA
...                            ...                                           ...                            ...                                       ...
24                              NA                                            NA                             15                                        NA
25                              NA                                            NA                             25                                        NA
26                              NA                                            NA                             NA                                        NA
27                              NA                                            NA                             NA                                        NA
28                              NA                                            NA                             NA                                        NA
    xml_axillary_lymph_node_stage_other_method_descriptive_text xml_er_level_cell_percentage_category xml_history_of_esophageal_cancer
                                                    <character>                           <character>                      <character>
1                                                            NA                                    NA                               NA
2                                                            NA                                    NA                               NA
3                                                            NA                                    NA                               NA
4                                                            NA                                    NA                               NA
5                                                            NA                                    NA                               NA
...                                                         ...                                   ...                              ...
24                                                           NA                                    NA                               NA
25                                                           NA                                    NA                               NA
26                                                           NA                                    NA                               NA
27                                                           NA                                    NA                               NA
28                                                           NA                                    NA                               NA
    xml_primary_pathology_esophageal_tumor_cental_location xml_primary_pathology_esophageal_tumor_involvement_sites
                                               <character>                                              <character>
1                                                       NA                                                       NA
2                                                       NA                                                       NA
3                                                       NA                                                       NA
4                                                       NA                                                       NA
5                                                       NA                                                       NA
...                                                    ...                                                      ...
24                                                      NA                                                       NA
25                                                      NA                                                       NA
26                                                      NA                                                       NA
27                                                      NA                                                       NA
28                                                      NA                                                       NA
    xml_primary_pathology_tumor_infiltrating_macrophages xml_cumulative_agent_total_dose xml_hydroxyurea_agent_administered_day_count
                                             <character>                       <integer>                                    <integer>
1                                                     NA                              NA                                           NA
2                                                     NA                              NA                                           NA
3                                                     NA                              NA                                           NA
4                                                     NA                              NA                                           NA
5                                                     NA                              NA                                           NA
...                                                  ...                             ...                                          ...
24                                                    NA                              NA                                           NA
25                                                    NA                              NA                                           NA
26                                                    NA                              NA                                           NA
27                                                    NA                              NA                                           NA
28                                                    NA                              NA                                           NA
    xml_person_history_nonmedical_leukemia_causing_agent_type xml_lab_procedure_blast_cell_outcome_percentage_value
                                                  <character>                                             <integer>
1                                                          NA                                                    NA
2                                                          NA                                                    NA
3                                                          NA                                                    NA
4                                                          NA                                                    NA
5                                                          NA                                                    NA
...                                                       ...                                                   ...
24                                                         NA                                                    NA
25                                                         NA                                                    NA
26                                                         NA                                                    NA
27                                                         NA                                                    NA
28                                                         NA                                                    NA
    xml_prior_tamoxifen_administered_usage_category xml_radiosensitizing_agent_administered_indicator
                                        <character>                                       <character>
1                                                NA                                                NA
2                                                NA                                                NA
3                                                NA                                                NA
4                                                NA                                                NA
5                                                NA                                                NA
...                                             ...                                               ...
24                                               NA                                                NA
25                                               NA                                                NA
26                                               NA                                                NA
27                                               NA                                                NA
28                                               NA                                                NA
    xml_person_concomitant_prostate_carcinoma_pathologic_t_stage xml_first_diagnosis_age_asth_ecz_hay_fev_mold_dust xml_first_diagnosis_age_of_food_allergy
                                                     <character>                                        <character>                             <character>
1                                                             NA                                                 NA                                      NA
2                                                             NA                                                 NA                                      NA
3                                                             NA                                                 NA                                      NA
4                                                             NA                                                 NA                                      NA
5                                                             NA                                                 NA                                      NA
...                                                          ...                                                ...                                     ...
24                                                            NA                                                 NA                                      NA
25                                                            NA                                                 NA                                      NA
26                                                            NA                                                 NA                                      NA
27                                           7thStage IVT3bN3M16                                                 NA                                      NA
28                                           7thStage IVT3bN2MX6                                                 NA                                      NA
    xml_first_diagnosis_age_of_animal_insect_allergy xml_undescended_testis_corrected_age
                                         <character>                          <character>
1                                                 NA                                   NA
2                                                 NA                                   NA
3                                                 NA                                   NA
4                                                 NA                                   NA
5                                                 NA                                   NA
...                                              ...                                  ...
24                                                NA                                   NA
25                                                NA                                   NA
26                                                NA                                   NA
27                                                NA                                   NA
28                                                NA                                   NA
> 
> ## Reproducibility information
> print('Reproducibility information:')
[1] "Reproducibility information:"
> Sys.time()
[1] "2017-02-24 13:28:27 EST"
> proc.time()
   user  system elapsed 
 20.786   1.699  28.800 
> options(width = 120)
> session_info()
Session info -----------------------------------------------------------------------------------------------------------
 setting  value                                             
 version  R Under development (unstable) (2016-10-26 r71594)
 system   x86_64, darwin13.4.0                              
 ui       AQUA                                              
 language (EN)                                              
 collate  en_US.UTF-8                                       
 tz       America/New_York                                  
 date     2017-02-24                                        

Packages ---------------------------------------------------------------------------------------------------------------
 package              * version  date       source                            
 acepack                1.4.1    2016-10-29 CRAN (R 3.4.0)                    
 AnnotationDbi          1.37.3   2017-02-09 Bioconductor                      
 assertthat             0.1      2013-12-06 CRAN (R 3.4.0)                    
 backports              1.0.5    2017-01-18 CRAN (R 3.4.0)                    
 base64enc              0.1-3    2015-07-28 CRAN (R 3.4.0)                    
 Biobase              * 2.35.1   2017-02-23 Bioconductor                      
 BiocGenerics         * 0.21.3   2017-01-12 Bioconductor                      
 BiocParallel           1.9.5    2017-01-24 Bioconductor                      
 biomaRt                2.31.4   2017-01-13 Bioconductor                      
 Biostrings             2.43.4   2017-02-02 Bioconductor                      
 bitops                 1.0-6    2013-08-17 CRAN (R 3.4.0)                    
 BSgenome               1.43.5   2017-02-02 Bioconductor                      
 bumphunter             1.15.0   2016-10-23 Bioconductor                      
 checkmate              1.8.2    2016-11-02 CRAN (R 3.4.0)                    
 cluster                2.0.5    2016-10-08 CRAN (R 3.4.0)                    
 codetools              0.2-15   2016-10-05 CRAN (R 3.4.0)                    
 colorspace             1.3-2    2016-12-14 CRAN (R 3.4.0)                    
 data.table             1.10.4   2017-02-01 CRAN (R 3.4.0)                    
 DBI                    0.5-1    2016-09-10 CRAN (R 3.4.0)                    
 DelayedArray         * 0.1.7    2017-02-17 Bioconductor                      
 derfinder              1.9.6    2017-01-13 Bioconductor                      
 derfinderHelper        1.9.3    2016-11-29 Bioconductor                      
 devtools             * 1.12.0   2016-12-05 CRAN (R 3.4.0)                    
 digest                 0.6.12   2017-01-27 CRAN (R 3.4.0)                    
 doRNG                  1.6      2014-03-07 CRAN (R 3.4.0)                    
 downloader             0.4      2015-07-09 CRAN (R 3.4.0)                    
 foreach                1.4.3    2015-10-13 CRAN (R 3.4.0)                    
 foreign                0.8-67   2016-09-13 CRAN (R 3.4.0)                    
 Formula                1.2-1    2015-04-07 CRAN (R 3.4.0)                    
 GenomeInfoDb         * 1.11.9   2017-02-08 Bioconductor                      
 GenomeInfoDbData       0.99.0   2017-02-14 Bioconductor                      
 GenomicAlignments      1.11.9   2017-02-02 Bioconductor                      
 GenomicFeatures        1.27.8   2017-02-11 Bioconductor                      
 GenomicFiles           1.11.3   2016-11-29 Bioconductor                      
 GenomicRanges        * 1.27.22  2017-02-02 Bioconductor                      
 GEOquery               2.41.0   2016-10-25 Bioconductor                      
 ggplot2                2.2.1    2016-12-30 CRAN (R 3.4.0)                    
 gridExtra              2.2.1    2016-02-29 CRAN (R 3.4.0)                    
 gtable                 0.2.0    2016-02-26 CRAN (R 3.4.0)                    
 Hmisc                  4.0-2    2016-12-31 CRAN (R 3.4.0)                    
 htmlTable              1.9      2017-01-26 CRAN (R 3.4.0)                    
 htmltools              0.3.5    2016-03-21 CRAN (R 3.4.0)                    
 htmlwidgets            0.8      2016-11-09 CRAN (R 3.4.0)                    
 httr                   1.2.1    2016-07-03 CRAN (R 3.4.0)                    
 IRanges              * 2.9.18   2017-02-02 Bioconductor                      
 iterators              1.0.8    2015-10-13 CRAN (R 3.4.0)                    
 jsonlite               1.2      2016-12-31 CRAN (R 3.4.0)                    
 knitr                  1.15.1   2016-11-22 CRAN (R 3.4.0)                    
 lattice                0.20-34  2016-09-06 CRAN (R 3.4.0)                    
 latticeExtra           0.6-28   2016-02-09 CRAN (R 3.4.0)                    
 lazyeval               0.2.0    2016-06-12 CRAN (R 3.4.0)                    
 locfit                 1.5-9.1  2013-04-20 CRAN (R 3.4.0)                    
 magrittr               1.5      2014-11-22 CRAN (R 3.4.0)                    
 Matrix                 1.2-8    2017-01-20 CRAN (R 3.4.0)                    
 matrixStats          * 0.51.0   2016-10-09 CRAN (R 3.4.0)                    
 memoise                1.0.0    2016-01-29 CRAN (R 3.4.0)                    
 munsell                0.4.3    2016-02-13 CRAN (R 3.4.0)                    
 nnet                   7.3-12   2016-02-02 CRAN (R 3.4.0)                    
 pkgmaker               0.22     2014-05-14 CRAN (R 3.4.0)                    
 plyr                   1.8.4    2016-06-08 CRAN (R 3.4.0)                    
 qvalue                 2.7.0    2016-10-23 Bioconductor                      
 R6                     2.2.0    2016-10-05 CRAN (R 3.4.0)                    
 RColorBrewer           1.1-2    2014-12-07 CRAN (R 3.4.0)                    
 Rcpp                   0.12.9   2017-01-14 CRAN (R 3.4.0)                    
 RCurl                  1.95-4.8 2016-03-01 CRAN (R 3.4.0)                    
 recount              * 1.1.18   2017-02-22 Github (leekgroup/recount@ced5db4)
 registry               0.3      2015-07-08 CRAN (R 3.4.0)                    
 rentrez                1.0.4    2016-10-26 CRAN (R 3.4.0)                    
 reshape2               1.4.2    2016-10-22 CRAN (R 3.4.0)                    
 rngtools               1.2.4    2014-03-06 CRAN (R 3.4.0)                    
 rpart                  4.1-10   2015-06-29 CRAN (R 3.4.0)                    
 Rsamtools              1.27.12  2017-01-24 Bioconductor                      
 RSQLite                1.1-2    2017-01-08 CRAN (R 3.4.0)                    
 rtracklayer            1.35.6   2017-02-19 cran (@1.35.6)                    
 S4Vectors            * 0.13.15  2017-02-14 cran (@0.13.15)                   
 scales                 0.4.1    2016-11-09 CRAN (R 3.4.0)                    
 stringi                1.1.2    2016-10-01 CRAN (R 3.4.0)                    
 stringr                1.2.0    2017-02-18 CRAN (R 3.4.0)                    
 SummarizedExperiment * 1.5.7    2017-02-23 Bioconductor                      
 survival               2.40-1   2016-10-30 CRAN (R 3.4.0)                    
 tibble                 1.2      2016-08-26 CRAN (R 3.4.0)                    
 VariantAnnotation      1.21.17  2017-02-12 Bioconductor                      
 withr                  1.0.2    2016-06-20 CRAN (R 3.4.0)                    
 XML                    3.98-1.5 2016-11-10 CRAN (R 3.4.0)                    
 xtable                 1.8-2    2016-02-05 CRAN (R 3.4.0)                    
 XVector                0.15.2   2017-02-02 Bioconductor                      
 zlibbioc               1.21.0   2016-10-23 Bioconductor  

lcolladotor commented 7 years ago

From skimming through https://github.com/Bioconductor-mirror/TCGAbiolinks/commit/8a35266df471593939538b5d63d110bfe3daca32, it looks like this was a TCGAbiolinks issue rather than an issue with the GDC data.

lcolladotor commented 7 years ago

This has been solved as of today, March 1st, 2017.