lgatto / Pbase

Manipluating and exploring protein and proteomics data
8 stars 3 forks source link

Mapping to genome fails without apparent reason #46

Open jorainer opened 6 years ago

jorainer commented 6 years ago

Got the following example

> library(ensembldb)
> library(AnnotationHub)
> edb <- query(AnnotationHub(), c("Ensembl 90 EnsDb", "Homo sapiens"))[[1]]
> library(Pbase)
> ## Fetch the protein ENSP00000269305
> prot <- Proteins(edb, filter = ~ protein_id == "ENSP00000269305")
> ## Mapping the protein domains to the genome works
> mapToGenome(prot, edb)
Fetch coding region for proteins ... OK
GRangesList object of length 1:
$ENSP00000269305 
GRanges object with 26 ranges and 5 metadata columns:
           seqnames             ranges strand |           tx_id
              <Rle>          <IRanges>  <Rle> |     <character>
  SSF47719       17 [7670638, 7670715]      - | ENST00000269305
   PF07710       17 [7670638, 7670715]      - | ENST00000269305
   PR00386       17 [7670659, 7670715]      - | ENST00000269305
   PR00386       17 [7673535, 7673552]      - | ENST00000269305
  SSF47719       17 [7673535, 7673573]      - | ENST00000269305
       ...      ...                ...    ... .             ...
   PR00386       17 [7675994, 7676023]      - | ENST00000269305
  SSF49417       17 [7675994, 7676080]      - | ENST00000269305
   PF00870       17 [7675994, 7676086]      - | ENST00000269305
   PF08563       17 [7676391, 7676403]      - | ENST00000269305
   PF08563       17 [7676521, 7676582]      - | ENST00000269305
                                                                                                                                                                                                       pepseq
                                                                                                                                                                                                  <character>
  SSF47719                                                                                                                                                            KKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGK
   PF07710                                                                                                                                                            KKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGK
   PR00386                                                                                                                                                                          EYFTLQIRGRERFEMFRELNEALEL
   PR00386                                                                                                                                                                          EYFTLQIRGRERFEMFRELNEALEL
  SSF47719                                                                                                                                                            KKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGK
       ...                                                                                                                                                                                                ...
   PR00386                                                                                                                                                                        SGTAKSVTCTYSPALNKMFCQLAKTCP
  SSF49417    VPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEE
   PF00870 SSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEEN
   PF08563                                                                                                                                                                          QSDPSVEPPLSQETFSDLWKLLPEN
   PF08563                                                                                                                                                                          QSDPSVEPPLSQETFSDLWKLLPEN
                 accession exonJunctions     group
               <character>     <logical> <integer>
  SSF47719 ENSP00000269305          TRUE         2
   PF07710 ENSP00000269305          TRUE         4
   PR00386 ENSP00000269305          TRUE         7
   PR00386 ENSP00000269305          TRUE         7
  SSF47719 ENSP00000269305          TRUE         2
       ...             ...           ...       ...
   PR00386 ENSP00000269305          TRUE         8
  SSF49417 ENSP00000269305          TRUE         3
   PF00870 ENSP00000269305          TRUE         5
   PF08563 ENSP00000269305          TRUE         6
   PF08563 ENSP00000269305          TRUE         6

-------
seqinfo: 1 sequence from GRCh38 genome

So, mapping works. But if a single IRanges is added it fails:

> ## Replacing/adding ranges within the protein is a little tedious at present -
> irl <- IRangesList(ENSP00000269305 = IRanges(start = 281, end = 391))
> mcols(prot@aa)$region <- irl
> mapToGenome(prot, edb, pcol = "region")
Fetch coding region for proteins ... OK
GRangesList object of length 0:
<0 elements>

-------
seqinfo: no sequences
Warning message:
Mapping failed. Returning an empty range. Last message was: Error in value[[3L]](cond): unused argument (cond)

Interstingly, it works if one of the protein domains is selected:

> irl <- IRangesList(ENSP00000269305 = IRanges(start = 237, end = 249))
> mcols(prot@aa)$region <- irl
> mapToGenome(prot, edb, pcol = "region")
Fetch coding region for proteins ... OK
GRangesList object of length 1:
$ENSP00000269305 
GRanges object with 1 range and 5 metadata columns:
      seqnames             ranges strand |           tx_id        pepseq
         <Rle>          <IRanges>  <Rle> |     <character>   <character>
  [1]       17 [7674216, 7674254]      - | ENST00000269305 MCNSSCMGGMNRR
            accession exonJunctions     group
          <character>     <logical> <integer>
  [1] ENSP00000269305         FALSE         1

-------
seqinfo: 1 sequence from GRCh38 genome
jorainer commented 6 years ago

I've rewritten big part of the mapping functionality (based on the original code in Pbase) in ensembldb and now it works nicely:

> library(ensembldb)
> library(AnnotationHub)
> edb <- query(AnnotationHub(), c("Ensembl 90 EnsDb", "Homo sapiens"))[[1]]
snapshotDate(): 2017-10-27
downloading 0 resources
loading from cache 
    '/Users/jo//.AnnotationHub/64495'

> ## Define the region within the protein
> rng <- IRanges(start = 281, end = 391, names = "ENSP00000269305")
> ## Now map the coordinates
> proteinToGenome(rng, edb)
Fetching CDS for 1 proteins ... 1 found
Checking CDS and protein sequence lengths ... 1/1 OK
$ENSP00000269305
GRanges object with 4 ranges and 7 metadata columns:
      seqnames             ranges strand |      protein_id           tx_id
         <Rle>          <IRanges>  <Rle> |     <character>     <character>
  [1]       17 [7673701, 7673779]      - | ENSP00000269305 ENST00000269305
  [2]       17 [7673535, 7673608]      - | ENSP00000269305 ENST00000269305
  [3]       17 [7670609, 7670715]      - | ENSP00000269305 ENST00000269305
  [4]       17 [7669618, 7669690]      - | ENSP00000269305 ENST00000269305
              exon_id exon_rank    cds_ok protein_start protein_end
          <character> <integer> <logical>     <integer>   <integer>
  [1] ENSE00003725258         8      TRUE           281         391
  [2] ENSE00003786593         9      TRUE           281         391
  [3] ENSE00003545950        10      TRUE           281         391
  [4] ENSE00003605891        11      TRUE           281         391
  -------
  seqinfo: 1 sequence from GRCh38 genome