mapToGenome fail to map protein domain

apastore commented 6 years ago

Hi, thanks a lot for this package is extremely useful and save me a lot of code.

I am experiencing some issue when I try to map protein domain to GRCh38. many protein domain get mapped correctly, but for some protein it return a empty GrangeList even if multiple protein domain are presented. One example is human MLLT3.

Do you have any idea?

Thanks a lot!

edb <- query(AnnotationHub(), c("Ensembl 90 EnsDb", "Homo sapiens"))[[1]]

hgnc_symbol = "MLLT3"

filter_Proteins = AnnotationFilterList(GenenameFilter(hgnc_symbol))

  PROTEIN_RANGE <- Proteins(edb, filter =  filter_Proteins, 
                            columns = attributes_edB)

  PROTEIN_map <- mapToGenome(PROTEIN_RANGE, edb, idType = "tx_id", id = "tx_id", 
                           drop.empty.ranges = T)

Fetch coding region for proteins ... OK
GRangesList object of length 0:
<0 elements>

-------
seqinfo: no sequences
Warning messages:
1: Mapping failed. Returning an empty range. Last message was: Error in value[[3L]](cond): unused argument (cond)

2: Mapping failed. Returning an empty range. Last message was: Error in value[[3L]](cond): unused argument (cond)

3: Mapping failed. Returning an empty range. Last message was: Error in value[[3L]](cond): unused argument (cond)

seqnames(PROTEIN_RANGE)
[1] "ENSP00000369695" "ENSP00000369695" "ENSP00000485996" "ENSP00000369678" "ENSP00000486193"

> pcols(PROTEIN_RANGE)
DataFrame with 5 rows and 1 column
       ProteinDomains
        <IRangesList>
1 [ 8, 112] [29, 108]
2 [ 8, 112] [29, 108]
3 [ 5, 109] [26, 105]
4                    
5

> acols(PROTEIN_RANGE)
DataFrame with 5 rows and 11 columns
          gene_id   gene_name entrezid      symbol           tx_id    seq_name tx_cds_seq_start tx_cds_seq_end  uniprot_id  uniprot_db
      <character> <character>   <list> <character>     <character> <character>        <integer>      <integer> <character> <character>
1 ENSG00000171843       MLLT3     4300       MLLT3 ENST00000380338           9         20346443       20622256  A0A0S2Z448    SPTREMBL
2 ENSG00000171843       MLLT3     4300       MLLT3 ENST00000380338           9         20346443       20622256      P42568   SWISSPROT
3 ENSG00000171843       MLLT3     4300       MLLT3 ENST00000630269           9         20346443       20621775      P42568   SWISSPROT
4 ENSG00000171843       MLLT3     4300       MLLT3 ENST00000380321           9         20346443       20363588      B1APT5    SPTREMBL
5 ENSG00000171843       MLLT3     4300       MLLT3 ENST00000629733           9         20346443       20363588  A0A0D9SF09    SPTREMBL
  uniprot_mapping_type
           <character>
1       SEQUENCE_MATCH
2               DIRECT
3               DIRECT
4               DIRECT
5               DIRECT

> PROTEIN_RANGE
S4 class type     : Proteins
Class version     : 0.2
Created           : Sat Dec 30 20:11:19 2017
Number of Proteins: 5
Sequences:
  [1] ENSP00000369695 [2] ENSP00000369695 ... [4] ENSP00000369678 [5] ENSP00000486193
Protein ranges:
  ProteinDomains

> edb
EnsDb for Ensembl:
|Backend: SQLite
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.3.1
|Creation time: Fri Aug 25 13:15:42 2017
|ensembl_version: 90
|ensembl_host: localhost
|Organism: homo_sapiens
|taxonomy_id: 9606
|genome_build: GRCh38
|DBSCHEMAVERSION: 2.1
| No. of genes: 64661.
| No. of transcripts: 220144.
|Protein data available.

> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
 [1] grid      stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] AnnotationHub_2.10.1                    EnsDb.Hsapiens.v86_2.99.0               org.Hs.eg.db_3.5.0                     
 [4] TxDb.Hsapiens.UCSC.hg38.knownGene_3.4.0 BSgenome.Hsapiens.NCBI.GRCh38_1.3.1000  CRISPRseek_1.18.0                      
 [7] Pbase_0.18.0                            Gviz_1.22.2                             Rcpp_0.12.14                           
[10] ensembldb_2.2.0                         AnnotationFilter_1.2.0                  GenomicFeatures_1.30.0                 
[13] AnnotationDbi_1.40.0                    Biobase_2.38.0                          biomaRt_2.34.1                         
[16] BSgenome_1.46.0                         Biostrings_2.46.0                       XVector_0.18.0                         
[19] rtracklayer_1.38.2                      GenomicRanges_1.30.1                    GenomeInfoDb_1.14.0                    
[22] IRanges_2.12.0                          S4Vectors_0.16.0                        BiocGenerics_0.24.0                    

loaded via a namespace (and not attached):
 [1] colorspace_1.3-2              seqinr_3.4-5                  biovizBase_1.26.0             htmlTable_1.11.1             
 [5] base64enc_0.1-3               dichromat_2.0-0               rstudioapi_0.7                mzR_2.12.0                   
 [9] affyio_1.48.0                 hash_2.2.6                    bit64_0.9-7                   interactiveDisplayBase_1.16.0
[13] codetools_0.2-15              splines_3.4.2                 doParallel_1.0.11             impute_1.52.0                
[17] knitr_1.18                    ade4_1.7-10                   Formula_1.2-2                 Rsamtools_1.30.0             
[21] cluster_2.0.6                 vsn_3.46.0                    shiny_1.0.5                   compiler_3.4.2               
[25] httr_1.3.1                    backports_1.1.2               assertthat_0.2.0              Matrix_1.2-12                
[29] lazyeval_0.2.1                limma_3.34.5                  acepack_1.4.1                 htmltools_0.3.6              
[33] prettyunits_1.0.2             tools_3.4.2                   gtable_0.2.0                  GenomeInfoDbData_1.0.0       
[37] affy_1.56.0                   MALDIquant_1.17               preprocessCore_1.40.0         iterators_1.0.9              
[41] stringr_1.2.0                 mime_0.5                      XML_3.98-1.9                  zlibbioc_1.24.0              
[45] MASS_7.3-48                   scales_0.5.0                  MSnbase_2.4.1                 VariantAnnotation_1.24.2     
[49] BiocInstaller_1.28.0          pcaMethods_1.70.0             ProtGenerics_1.10.0           SummarizedExperiment_1.8.1   
[53] RColorBrewer_1.1-2            yaml_2.1.16                   curl_3.1                      memoise_1.1.0                
[57] gridExtra_2.3                 ggplot2_2.2.1                 cleaver_1.16.0                rpart_4.1-11                 
[61] latticeExtra_0.6-28           stringi_1.1.6                 RSQLite_2.0                   foreach_1.4.4                
[65] RMySQL_0.10.13                checkmate_1.8.5               BiocParallel_1.12.0           rlang_0.1.6                  
[69] pkgconfig_2.0.1               matrixStats_0.52.2            bitops_1.0-6                  mzID_1.16.0                  
[73] lattice_0.20-35               GenomicAlignments_1.14.1      htmlwidgets_0.9               bit_1.1-12                   
[77] plyr_1.8.4                    magrittr_1.5                  Pviz_1.12.0                   R6_2.2.2                     
[81] Hmisc_4.0-3                   DelayedArray_0.4.1            DBI_0.7                       pillar_1.0.1                 
[85] foreign_0.8-69                survival_2.41-3               RCurl_1.95-4.9                nnet_7.3-12                  
[89] tibble_1.4.1                  progress_1.1.2                data.table_1.10.4-3           blob_1.1.0                   
[93] digest_0.6.13                 xtable_1.8-2                  httpuv_1.3.5                  munsell_0.4.3

jorainer commented 6 years ago

Dear @apastore , thanks for reporting. In the current Bioconductor development version I've moved (and rewritten) the mapping code from Pbase to ensembldb and this also fixed the error you describe. ensembldb has now 6 functions, proteinToTranscript, proteinToGenome, transcriptToProtein, transcriptToGenome, genomeToTranscript and genomeToProtein to map between protein, transcript and genome coordinates. In order to use these functions you would need to install a current developmental version of R (R-3.5 devel) and install Bioc3.7-devel (using biocLite as you normally would do).

You can do the mapping as described below:

library(ensembldb)
library(AnnotationHub)
edb <- query(AnnotationHub(), c("Ensembl 90 EnsDb", "Homo sapiens"))[[1]]

hgnc_symbol = "MLLT3"
## Get proteins along with protein domains.
prts <- proteins(edb, filter = ~ symbol == hgnc_symbol, 
    columns = c("protein_id", "tx_id", listColumns(edb, "protein_domain")))
## You've got a DataFrame, each row representing one protein domain.
prts
DataFrame with 6 rows and 8 columns
       protein_id           tx_id protein_domain_id protein_domain_source
      <character>     <character>       <character>           <character>
1 ENSP00000369678 ENST00000380321                NA                    NA
2 ENSP00000369695 ENST00000380338           PS51037                pfscan
3 ENSP00000369695 ENST00000380338           PF03366                  pfam
4 ENSP00000485996 ENST00000630269           PS51037                pfscan
5 ENSP00000485996 ENST00000630269           PF03366                  pfam
6 ENSP00000486193 ENST00000629733                NA                    NA
  interpro_accession prot_dom_start prot_dom_end      symbol
         <character>      <integer>    <integer> <character>
1                 NA             NA           NA       MLLT3
2          IPR005033              8          112       MLLT3
3          IPR005033             29          108       MLLT3
4          IPR005033              5          109       MLLT3
5          IPR005033             26          105       MLLT3
6                 NA             NA           NA       MLLT3

## Remove proteins without protein domains
prts <- prts[!is.na(prts$protein_domain_id), ]

## Define an IRanges representing the protein domains (their start and end coords
## within the protein sequence) and use the protein ID as their names
dmns <- IRanges(start = prts$prot_dom_start, end = prts$prot_dom_end, 
    names = prts$protein_id)

## Now map the domains to the genome
dmns_gnm <- proteinToGenome(dmns, edb)

## The result is a list of GRanges, each GRanges representing the genomic coordinates of
## the protein domain
length(dmns_gnm)
[1] 4

## The genomic coordinates for the first protein domain
dmns_gnm[[1]]
GRanges object with 3 ranges and 7 metadata columns:
      seqnames               ranges strand |      protein_id           tx_id
         <Rle>            <IRanges>  <Rle> |     <character>     <character>
  [1]        9 [20620654, 20620825]      - | ENSP00000369695 ENST00000380338
  [2]        9 [20456704, 20456786]      - | ENSP00000369695 ENST00000380338
  [3]        9 [20448207, 20448266]      - | ENSP00000369695 ENST00000380338
              exon_id exon_rank    cds_ok protein_start protein_end
          <character> <integer> <logical>     <integer>   <integer>
  [1] ENSE00003665608         2      TRUE             8         112
  [2] ENSE00003628570         3      TRUE             8         112
  [3] ENSE00003670233         4      TRUE             8         112
  -------
  seqinfo: 1 sequence from GRCh38 genome

Have also a look at the new coordinate_mapping vignette of ensembldb for more details.

My sessionInfo:

> sessionInfo()
R Under development (unstable) (2017-12-14 r73916)
Platform: x86_64-apple-darwin17.3.0/x86_64 (64-bit)
Running under: macOS High Sierra 10.13.3

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] AnnotationHub_2.11.2   ensembldb_2.3.7        AnnotationFilter_1.3.0
 [4] GenomicFeatures_1.31.1 AnnotationDbi_1.41.4   Biobase_2.39.1        
 [7] GenomicRanges_1.31.3   GenomeInfoDb_1.15.1    IRanges_2.13.4        
[10] S4Vectors_0.17.16      BiocGenerics_0.25.1   

loaded via a namespace (and not attached):
 [1] SummarizedExperiment_1.9.5    progress_1.1.2               
 [3] lattice_0.20-35               htmltools_0.3.6              
 [5] rtracklayer_1.39.4            yaml_2.1.16                  
 [7] interactiveDisplayBase_1.17.0 blob_1.1.0                   
 [9] XML_3.98-1.9                  rlang_0.1.6                  
[11] pillar_1.0.1                  DBI_0.7                      
[13] BiocParallel_1.13.1           bit64_0.9-7                  
[15] matrixStats_0.52.2            GenomeInfoDbData_1.1.0       
[17] stringr_1.2.0                 zlibbioc_1.25.0              
[19] ProtGenerics_1.11.0           Biostrings_2.47.1            
[21] memoise_1.1.0                 biomaRt_2.35.5               
[23] BiocInstaller_1.29.3          httpuv_1.3.5                 
[25] curl_3.1                      Rcpp_0.12.14                 
[27] xtable_1.8-2                  DelayedArray_0.5.15          
[29] XVector_0.19.1                mime_0.5                     
[31] bit_1.1-12                    Rsamtools_1.31.1             
[33] RMySQL_0.10.13                digest_0.6.13                
[35] stringi_1.1.6                 shiny_1.0.5                  
[37] grid_3.5.0                    tools_3.5.0                  
[39] bitops_1.0-6                  magrittr_1.5                 
[41] RCurl_1.95-4.10               lazyeval_0.2.1               
[43] tibble_1.4.1                  RSQLite_2.0                  
[45] pkgconfig_2.0.1               Matrix_1.2-12                
[47] prettyunits_1.0.2             assertthat_0.2.0             
[49] httr_1.3.1                    R6_2.2.2                     
[51] GenomicAlignments_1.15.2      compiler_3.5.0

apastore commented 6 years ago

Thanks! I have seen the ensembldb devel version. I will try that.

On Jan 8, 2018, at 5:00 AM, Johannes Rainer notifications@github.com wrote:

Dear @apastore https://github.com/apastore , thanks for reporting. In the current Bioconductor development version I've moved (and rewritten) the mapping code from Pbase to ensembldb and this also fixed the error you describe. ensembldb has now 6 functions, proteinToTranscript, proteinToGenome, transcriptToProtein, transcriptToGenome, genomeToTranscript and genomeToProtein to map between protein, transcript and genome coordinates. In order to use these functions you would need to install a current developmental version of R (R-3.5 devel) and install Bioc3.7-devel (using biocLite as you normally would do).

You can do the mapping as described below:

library(ensembldb) library(AnnotationHub) edb <- query(AnnotationHub(), c("Ensembl 90 EnsDb", "Homo sapiens"))[[1]]

hgnc_symbol = "MLLT3"

Get proteins along with protein domains.

prts <- proteins(edb, filter = ~ symbol == hgnc_symbol, columns = c("protein_id", "tx_id", listColumns(edb, "protein_domain")))

You've got a DataFrame, each row representing one protein domain.

prts DataFrame with 6 rows and 8 columns protein_id tx_id protein_domain_id protein_domain_source
1 ENSP00000369678 ENST00000380321 NA NA 2 ENSP00000369695 ENST00000380338 PS51037 pfscan 3 ENSP00000369695 ENST00000380338 PF03366 pfam 4 ENSP00000485996 ENST00000630269 PS51037 pfscan 5 ENSP00000485996 ENST00000630269 PF03366 pfam 6 ENSP00000486193 ENST00000629733 NA NA interpro_accession prot_dom_start prot_dom_end symbol 1 NA NA NA MLLT3 2 IPR005033 8 112 MLLT3 3 IPR005033 29 108 MLLT3 4 IPR005033 5 109 MLLT3 5 IPR005033 26 105 MLLT3 6 NA NA NA MLLT3 ## Remove proteins without protein domains prts <- prts[!is.na(prts$protein_domain_id), ] ## Define an IRanges representing the protein domains (their start and end coords ## within the protein sequence) and use the protein ID as their names dmns <- IRanges(start = prts$prot_dom_start, end = prts$prot_dom_end, names = prts$protein_id) ## Now map the domains to the genome dmns_gnm <- proteinToGenome(dmns, edb) ## The result is a list of GRanges, each GRanges representing the genomic coordinates of ## the protein domain length(dmns_gnm) [1] 4 ## The genomic coordinates for the first protein domain dmns_gnm[[1]] GRanges object with 3 ranges and 7 metadata columns: seqnames ranges strand | protein_id tx_id | [1] 9 [20620654, 20620825] - | ENSP00000369695 ENST00000380338 [2] 9 [20456704, 20456786] - | ENSP00000369695 ENST00000380338 [3] 9 [20448207, 20448266] - | ENSP00000369695 ENST00000380338 exon_id exon_rank cds_ok protein_start protein_end [1] ENSE00003665608 2 TRUE 8 112 [2] ENSE00003628570 3 TRUE 8 112 [3] ENSE00003670233 4 TRUE 8 112 ------- seqinfo: 1 sequence from GRCh38 genome Have also a look at the new coordinate_mapping vignette of ensembldb for more details. My sessionInfo: > sessionInfo() R Under development (unstable) (2017-12-14 r73916) Platform: x86_64-apple-darwin17.3.0/x86_64 (64-bit) Running under: macOS High Sierra 10.13.3 Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets [8] methods base other attached packages: [1] AnnotationHub_2.11.2 ensembldb_2.3.7 AnnotationFilter_1.3.0 [4] GenomicFeatures_1.31.1 AnnotationDbi_1.41.4 Biobase_2.39.1 [7] GenomicRanges_1.31.3 GenomeInfoDb_1.15.1 IRanges_2.13.4 [10] S4Vectors_0.17.16 BiocGenerics_0.25.1 loaded via a namespace (and not attached): [1] SummarizedExperiment_1.9.5 progress_1.1.2 [3] lattice_0.20-35 htmltools_0.3.6 [5] rtracklayer_1.39.4 yaml_2.1.16 [7] interactiveDisplayBase_1.17.0 blob_1.1.0 [9] XML_3.98-1.9 rlang_0.1.6 [11] pillar_1.0.1 DBI_0.7 [13] BiocParallel_1.13.1 bit64_0.9-7 [15] matrixStats_0.52.2 GenomeInfoDbData_1.1.0 [17] stringr_1.2.0 zlibbioc_1.25.0 [19] ProtGenerics_1.11.0 Biostrings_2.47.1 [21] memoise_1.1.0 biomaRt_2.35.5 [23] BiocInstaller_1.29.3 httpuv_1.3.5 [25] curl_3.1 Rcpp_0.12.14 [27] xtable_1.8-2 DelayedArray_0.5.15 [29] XVector_0.19.1 mime_0.5 [31] bit_1.1-12 Rsamtools_1.31.1 [33] RMySQL_0.10.13 digest_0.6.13 [35] stringi_1.1.6 shiny_1.0.5 [37] grid_3.5.0 tools_3.5.0 [39] bitops_1.0-6 magrittr_1.5 [41] RCurl_1.95-4.10 lazyeval_0.2.1 [43] tibble_1.4.1 RSQLite_2.0 [45] pkgconfig_2.0.1 Matrix_1.2-12 [47] prettyunits_1.0.2 assertthat_0.2.0 [49] httr_1.3.1 R6_2.2.2 [51] GenomicAlignments_1.15.2 compiler_3.5.0 — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub , or mute the thread .

lgatto / Pbase

mapToGenome fail to map protein domain #48

Get proteins along with protein domains.

You've got a DataFrame, each row representing one protein domain.