jorainer / ensembldb

This is the ensembldb development repository.
https://jorainer.github.io/ensembldb

Warning: Columns 'name', 'value' are not present in the database and have been removed #128

Open lgatto opened 2 years ago

lgatto commented 2 years ago
> library(ensembldb)
*** output flushed ***
> library(AnnotationHub)
*** output flushed ***
> ah <- AnnotationHub()
snapshotDate(): 2021-12-20
> edb105 <- ah[["AH98047"]]
loading from cache
> ensembldb:::cleanColumns(edb105, listColumns(edb105))
 [1] "seq_name"              "seq_length"            "is_circular"          
 [4] "gene_id"               "entrezid"              "exon_id"              
 [7] "exon_seq_start"        "exon_seq_end"          "gene_name"            
[10] "gene_biotype"          "gene_seq_start"        "gene_seq_end"         
[13] "seq_strand"            "seq_coord_system"      "description"          
[16] "gene_id_version"       "canonical_transcript"  "symbol"               
[19] "tx_id"                 "protein_id"            "protein_sequence"     
[22] "protein_domain_id"     "protein_domain_source" "interpro_accession"   
[25] "prot_dom_start"        "prot_dom_end"          "tx_biotype"           
[28] "tx_seq_start"          "tx_seq_end"            "tx_cds_seq_start"     
[31] "tx_cds_seq_end"        "tx_support_level"      "tx_id_version"        
[34] "gc_content"            "tx_external_name"      "tx_is_canonical"      
[37] "tx_name"               "exon_idx"              "uniprot_id"           
[40] "uniprot_db"            "uniprot_mapping_type" 
Warning message:
In ensembldb:::cleanColumns(edb105, listColumns(edb105)) :
  Columns 'name', 'value' are not present in the database and have been removed

with

> sessionInfo()
R Under development (unstable) (2021-11-10 r81172)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/libf77blas.so.3.10.3
LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] AnnotationHub_3.3.7     BiocFileCache_2.3.3     dbplyr_2.1.1           
 [4] ensembldb_2.19.6        AnnotationFilter_1.19.0 GenomicFeatures_1.47.5 
 [7] AnnotationDbi_1.57.1    Biobase_2.55.0          GenomicRanges_1.47.5   
[10] GenomeInfoDb_1.31.1     IRanges_2.29.1          S4Vectors_0.33.8       
[13] BiocGenerics_0.41.2    

loaded via a namespace (and not attached):
 [1] MatrixGenerics_1.7.0          httr_1.4.2                   
 [3] bit64_4.0.5                   shiny_1.7.1                  
 [5] assertthat_0.2.1              interactiveDisplayBase_1.33.0
 [7] BiocManager_1.30.16           blob_1.2.2                   
 [9] GenomeInfoDbData_1.2.7        Rsamtools_2.11.0             
[11] yaml_2.2.1                    progress_1.2.2               
[13] BiocVersion_3.15.0            pillar_1.6.4                 
[15] RSQLite_2.2.9                 lattice_0.20-45              
[17] glue_1.6.0                    digest_0.6.29                
[19] promises_1.2.0.1              XVector_0.35.0               
[21] httpuv_1.6.4                  htmltools_0.5.2              
[23] Matrix_1.3-4                  XML_3.99-0.8                 
[25] pkgconfig_2.0.3               biomaRt_2.51.1               
[27] zlibbioc_1.41.0               xtable_1.8-4                 
[29] purrr_0.3.4                   later_1.3.0                  
[31] BiocParallel_1.29.8           tibble_3.1.6                 
[33] KEGGREST_1.35.0               generics_0.1.1               
[35] ellipsis_0.3.2                withr_2.4.3                  
[37] cachem_1.0.6                  SummarizedExperiment_1.25.3  
[39] lazyeval_0.2.2                mime_0.12                    
[41] magrittr_2.0.1                crayon_1.4.2                 
[43] memoise_2.0.1                 fansi_0.5.0                  
[45] xml2_1.3.3                    tools_4.2.0                  
[47] prettyunits_1.1.1             hms_1.1.1                    
[49] BiocIO_1.5.0                  lifecycle_1.0.1              
[51] matrixStats_0.61.0            stringr_1.4.0                
[53] DelayedArray_0.21.2           Biostrings_2.63.0            
[55] compiler_4.2.0                rlang_0.4.12                 
[57] grid_4.2.0                    RCurl_1.98-1.5               
[59] rjson_0.2.20                  rappdirs_0.3.3               
[61] bitops_1.0-7                  restfulr_0.0.13              
[63] DBI_1.1.2                     curl_4.3.2                   
[65] R6_2.5.1                      GenomicAlignments_1.31.2     
[67] dplyr_1.0.7                   rtracklayer_1.55.3           
[69] fastmap_1.1.0                 bit_4.0.4                    
[71] utf8_1.2.2                    filelock_1.0.2               
[73] ProtGenerics_1.27.1           stringi_1.7.6                
[75] parallel_4.2.0                Rcpp_1.0.7                   
[77] vctrs_0.3.8                   png_0.1-7                    
[79] tidyselect_1.1.1             

jorainer commented 2 years ago

The columns "name" and "value" come from the metadata database table, which is generally not used in any query against the database. Where did you get this warning message? (I assume you did not call cleanColumns directly when you first saw it.)

lgatto commented 2 years ago

I suppose I wanted a way to get all possible columns and was surprised to get a warning when using seemingly straightforward (although unexported) functions. Is there another way to do that?

jorainer commented 2 years ago

No, listColumns is actually the correct function to list all available database columns - though listTables might be even better because it shows which columns are in which table. I could fix listColumns so that it does not list columns from the metadata table, because that table will usually not be queried anyway.
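
As a rough sketch of that idea (not the actual ensembldb fix): assuming listTables returns a named list of column names per table, and that the metadata table appears there under the name "metadata", the metadata columns can be dropped before the names are used in a query. The `tables` list below is a hypothetical, shortened stand-in so that the snippet runs without a database.

```r
# Hypothetical stand-in for the named list returned by listTables(edb105);
# only a few tables and columns are shown, the real content will differ.
tables <- list(
  gene     = c("gene_id", "gene_name", "gene_biotype"),
  tx       = c("tx_id", "tx_biotype", "gene_id"),
  metadata = c("name", "value")
)

# Drop the metadata table (assumed name), then flatten the remaining
# tables into a unique vector of column names.
query_tables <- tables[setdiff(names(tables), "metadata")]
query_columns <- unique(unlist(query_tables, use.names = FALSE))
print(query_columns)
```

With a real EnsDb object, `tables` would be replaced by `listTables(edb105)`.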

lgatto commented 2 years ago

Yes, that sounds reasonable. Or at least emit a simple message rather than a warning, if you think that's warranted.

lgatto commented 2 years ago

On a similar note:

> ens <- proteins(edb105, listColumns(edb105))
Warning messages:
1: In cleanColumns(object, unique(c(columns, "protein_id"))) :
  Columns 'name', 'value' are not present in the database and have been removed
2: In .local(object, ...) :
  Exon specific columns are not allowed for proteins. Columns 'exon_id', 'exon_seq_start', 'exon_seq_end', 'exon_idx' have been removed.

Is the second warning warranted? As above, I should be able to get all columns for proteins without triggering a warning.

jorainer commented 2 years ago

I think the second warning is OK. listColumns lists all database columns, but for protein annotations it makes no sense to also return exon coordinates - that would blow up the results (and in addition the join query would be rather complex and would take very long to run).
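
A possible way to avoid both warnings on the caller's side, sketched under assumptions: remove the columns named in the two warning messages before passing the vector to proteins. `all_columns` is a hypothetical, shortened stand-in for `listColumns(edb105)`, so the snippet runs without a database.

```r
# A few column names from the listing above, standing in for
# listColumns(edb105).
all_columns <- c("gene_id", "exon_id", "exon_seq_start", "exon_seq_end",
                 "exon_idx", "tx_id", "protein_id", "name", "value")

# Columns named in the two warnings: the metadata columns and the
# exon-specific columns.
dropped <- c("name", "value",
             "exon_id", "exon_seq_start", "exon_seq_end", "exon_idx")
protein_columns <- setdiff(all_columns, dropped)
print(protein_columns)

# With the real database the filtered vector would be passed on, e.g.:
# ens <- proteins(edb105, protein_columns)
```

Listing the dropped columns explicitly, rather than matching on a name pattern, keeps the filter tied to what the warnings actually report.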