jorainer / ensembldb

This is the ensembldb development repository.
https://jorainer.github.io/ensembldb

ensDbFromGtf refuses to parse GTF file #75

Closed: plijnzaad closed this issue 6 years ago

plijnzaad commented 6 years ago

I am trying to parse a valid GTF file whose name does not adhere to the expected naming convention, as follows:

edbSqliteFile <- ensDbFromGtf(gtf = 'myfile.gtf',
                              organism='Homo sapiens',
                              genomeVersion='GRCh38',
                              version='88'
                              )

When I do, I get:

Error in if (genomeVersion != Header[Header[, "name"] == "genome-version",  :
  argument is of length zero

This is rather annoying: this is, as far as I know, a valid GTF file, and I don't want (or may not be able) to edit all my GTF files, or to create symlinks that implement another naming scheme, just to comply with a naming scheme (which, incidentally, I could not find documented anywhere!).

Moreover, the source code of function .checkExtractVersions says:

    if (!missing(organism)) {
        if (!is.na(orgFromFile)) {
            if (organism != orgFromFile) {
                warning("User specified organism (", organism,
                  ") is different to the one extracted", " from the file name (",
                  orgFromFile, ")! Using the one defined by the user.")
            }
        }
        orgFromFile <- organism
    }

and analogous checks for the genomeVersion and version arguments. These checks basically just warn the user about overriding the organism (etc.), but only if it was encoded as part of the file name. If organism (etc.) is in the header, the parsing simply dies (this ought to be a warning, just like with the file naming convention), and if it is not in the header, R crashes. Or am I misunderstanding something? I'll issue a pull request that solves these two problems.
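
For reference, "argument is of length zero" is the error R raises when the condition of an if statement is empty, which is what happens when a lookup in the header returns nothing. Below is a minimal sketch of a guarded comparison that warns instead of failing. It assumes the header is a two-column table with name and value columns (the value column is a guess, because the error message above is truncated). reconcileWithHeader is a hypothetical helper for illustration only; it is not a function from ensembldb and not necessarily what the pull request does.

```r
# Reproduce the failure mode: comparing against an empty lookup result yields
# logical(0), which `if` cannot evaluate.
emptyLookup <- character(0)
errMsg <- tryCatch({
    if ("GRCh38" != emptyLookup) "differs" else "same"
}, error = function(e) conditionMessage(e))

# Hypothetical helper (not part of ensembldb): prefer the user-supplied value,
# and only warn when the header holds a different one.
# Assumption: `header` has columns "name" and "value".
reconcileWithHeader <- function(userValue, header, key) {
    fromHeader <- header[header[, "name"] == key, "value"]
    # Nothing in the header for this key: nothing to compare against.
    if (length(fromHeader) == 0)
        return(userValue)
    if (as.character(userValue) != as.character(fromHeader[1]))
        warning("User specified ", key, " (", userValue,
                ") is different to the one in the header (", fromHeader[1],
                ")! Using the one defined by the user.")
    userValue
}
```

With this kind of guard, a header without a genome-version entry simply returns the user's value, and a mismatch produces a warning rather than an error.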

The output from sessionInfo() is appended. Kind regards,

Philip

R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: OS X El Capitan 10.11.6

Matrix products: default
BLAS: /opt/local/Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.dylib
LAPACK: /opt/local/Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] C

attached base packages:
 [1] grid      stats4    parallel  stats     datasets  graphics  grDevices
 [8] utils     methods   base     

other attached packages:
 [1] chimeraviz_1.4.1       ensembldb_2.2.1        AnnotationFilter_1.3.1
 [4] GenomicFeatures_1.30.3 AnnotationDbi_1.40.0   Biobase_2.38.0        
 [7] Gviz_1.22.2            GenomicRanges_1.30.1   GenomeInfoDb_1.14.0   
[10] Biostrings_2.46.0      XVector_0.18.0         IRanges_2.12.0        
[13] S4Vectors_0.16.0       BiocGenerics_0.24.0    uuutils_1.48          
[16] gplots_3.0.1          

loaded via a namespace (and not attached):
 [1] ProtGenerics_1.10.0           bitops_1.0-6                 
 [3] matrixStats_0.53.1            bit64_0.9-7                  
 [5] RColorBrewer_1.1-2            progress_1.1.2               
 [7] httr_1.3.1                    rprojroot_1.3-2              
 [9] tools_3.4.3                   backports_1.1.2              
[11] DT_0.4                        R6_2.2.2                     
[13] rpart_4.1-11                  KernSmooth_2.23-15           
[15] Hmisc_4.1-1                   DBI_0.7-15                   
[17] lazyeval_0.2.1                colorspace_1.3-2             
[19] nnet_7.3-12                   gridExtra_2.3                
[21] prettyunits_1.0.2             RMySQL_0.10.13               
[23] curl_3.1                      bit_1.1-12                   
[25] compiler_3.4.3                htmlTable_1.11.2             
[27] DelayedArray_0.4.1            rtracklayer_1.38.3           
[29] caTools_1.17.1                scales_0.5.0                 
[31] checkmate_1.8.5               readr_1.1.1                  
[33] RCircos_1.2.0                 stringr_1.2.0                
[35] digest_0.6.15                 Rsamtools_1.30.0             
[37] foreign_0.8-69                rmarkdown_1.8                
[39] pkgconfig_2.0.1               base64enc_0.1-3              
[41] dichromat_2.0-0               htmltools_0.3.6              
[43] BSgenome_1.46.0               htmlwidgets_1.0              
[45] rlang_0.1.6                   rstudioapi_0.7               
[47] RSQLite_2.0                   BiocInstaller_1.28.0         
[49] shiny_1.0.5                   BiocParallel_1.12.0          
[51] gtools_3.5.0                  acepack_1.4.1                
[53] VariantAnnotation_1.24.5      RCurl_1.95-4.10              
[55] magrittr_1.5                  GenomeInfoDbData_1.0.0       
[57] Formula_1.2-2                 Matrix_1.2-12                
[59] Rcpp_0.12.15                  munsell_0.4.3                
[61] stringi_1.1.6                 yaml_2.1.16                  
[63] SummarizedExperiment_1.8.1    zlibbioc_1.24.0              
[65] org.Hs.eg.db_3.5.0            plyr_1.8.4                   
[67] AnnotationHub_2.10.1          blob_1.1.0                   
[69] gdata_2.18.0                  lattice_0.20-35              
[71] splines_3.4.3                 hms_0.4.1                    
[73] knitr_1.19                    pillar_1.1.0                 
[75] biomaRt_2.34.2                XML_3.98-1.9                 
[77] evaluate_0.10.1               biovizBase_1.26.0            
[79] latticeExtra_0.6-28           data.table_1.10.4-3          
[81] httpuv_1.3.5                  gtable_0.2.0                 
[83] assertthat_0.2.0              ggplot2_2.2.1                
[85] mime_0.5                      xtable_1.8-2                 
[87] survival_2.41-3               tibble_1.4.2                 
[89] GenomicAlignments_1.14.1      memoise_1.1.0                
[91] cluster_2.0.6                 interactiveDisplayBase_1.16.0
[93] BiocStyle_2.6.1

plijnzaad commented 6 years ago

PS: this was version 2.2.1, but the issue is identical in the latest commit on master:

commit 8c541e26194e2f0d1c2f8f537720765145468fcb
Author: jotsetung <johannes.rainer@gmail.com>
Date:   Tue Feb 13 09:09:36 2018 +0100

jorainer commented 6 years ago

Thanks for reporting, @plijnzaad. Yes, you should only get a warning and no error. Could you please provide the first few lines of the GTF file (say, the first 10 or so) so that I can check and fix the code that processes the header?

plijnzaad commented 6 years ago

Hi Rainer,

sorry, I did not see that you had already responded (while I was writing the patch :-). I tried to upload a sample GTF file, but just then our network failed so I had to go home and have dinner, only to find out that you had already merged the pull request. Nice!

jorainer commented 6 years ago

I'll do some more tests and also include the fix in the release version. I'll let you know when you can install and test the fixes.

jorainer commented 6 years ago

OK @plijnzaad, I've pushed the fixes to BioC3.6 and BioC3.7-devel. It will take some time until the changes are propagated, but you can install version 2.2.2 (soon in BioC3.6) from GitHub:

devtools::install_github("jotsetung/ensembldb", "RELEASE_3_6")

Let me know if it works.

plijnzaad commented 6 years ago

Hi, thanks, slight problem:

** preparing package for lazy loading
Error : package 'GenomicRanges' 1.30.1 was found, but >= 1.31.18 is required by 'ensembldb'
ERROR: lazy loading failed for package 'ensembldb'
* removing '/Users/philip/Rlibs-3.3.0/ensembldb'
* restoring previous '/Users/philip/Rlibs-3.3.0/ensembldb'
Installation failed: Command failed (1)

So for now I'll stick to my own version (which was the one closest to the 2.2.1 release) because I don't want to jump onto the Bioconductor upgrading treadmill ... Cheers, Philip

PS: my location says Rlibs-3.3.0, but it's lying, my R.version is 3.4.3 :-)
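
For anyone hitting the same message: it is a plain version comparison, in which the installed GenomicRanges (1.30.1) is older than the minimum this build of ensembldb declares (1.31.18). A small sketch of how such a requirement could be checked before attempting an install, using base R's package_version(); meetsRequirement is a hypothetical helper name:

```r
# Hypothetical helper: TRUE if `installed` satisfies the minimum `required`.
meetsRequirement <- function(installed, required) {
    package_version(installed) >= package_version(required)
}

# The two versions quoted in the error message above.
meetsRequirement("1.30.1", "1.31.18")  # FALSE

# For a package that is actually installed, its version can be looked up
# with utils::packageVersion().
if (requireNamespace("GenomicRanges", quietly = TRUE)) {
    installed <- as.character(utils::packageVersion("GenomicRanges"))
    meetsRequirement(installed, "1.31.18")
}
```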

jorainer commented 6 years ago

Sorry, the R command above was not correct. You should call:

devtools::install_github("jotsetung/ensembldb", ref = "RELEASE_3_6")

Then it installs the correct version for BioC3.6. Before (without passing the branch name as the named argument ref = ...), it was installing the master branch.
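
The difference comes down to how R matches arguments: an unnamed second argument is bound to whatever parameter comes second in the function signature, which need not be ref. The toy function below illustrates this. installSketch and its middle parameter other are hypothetical stand-ins; they do not reproduce the actual signature of devtools::install_github.

```r
# Toy function (an assumption for illustration): `ref` is the third
# parameter, so an unnamed second argument is bound to `other`, not `ref`.
installSketch <- function(repo, other = NULL, ref = "master") {
    ref  # return the branch that would be installed
}

# Positional: "RELEASE_3_6" goes to `other`, so `ref` keeps its default.
installSketch("jotsetung/ensembldb", "RELEASE_3_6")        # "master"

# Named: the branch is passed explicitly.
installSketch("jotsetung/ensembldb", ref = "RELEASE_3_6")  # "RELEASE_3_6"
```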