jorainer / ensembldb

This is the ensembldb development repository.
https://jorainer.github.io/ensembldb
33 stars 10 forks source link

The output from proteinToGenome has empty value #157

Closed zyh4482 closed 1 month ago

zyh4482 commented 3 months ago

As an example, I would like to map the 1-28 amino acid of ENSP00000518892 back to its genomic coordinates. According to biomaRt query result, the respective transcript sequence should be:

ATGCCCAGGGCTCAGTTGCCTGAGGACAGCAGTGCAGTTGACATGGATATTCTCTTTCCTCTGGACAGTGTTATTGGTACAGAGCTGTGCCCCAGCCCCATTCCCCAGATCATCCATTTCGTCCTCTTTGTTGTGTTCAGCCTGGTGATCCTGATTATCTTACGCCTCTACATTCCCAGGGAGCCGTCCTCAGTGCCTCCCAGAGAGGAGGACAGCGAGAATGATCAAGCTGAAGTGGGGGAATGGCTCAGGATCGGAAATAAATATATCACTTTGAAAGATTACAGAATTCTCTTGAAAGAACTGGAGAACCTTGAGATCTACACTTTCCTGTCGAAAAAGTGCCTGAAGAAGCTCTCTAGGGAGGGCAGCTCCCATCACCTTCCACGCCAAGTCCGCCCAGGGCCAGTGTACAAACCAGCACCTGCTAGGAACCACCGGCCACGTGGGGGGCGTGGGAAAGCTTCTCCCACCAGCTTCCATGTGTCCCCACGGGCTCCCCTGGCTCCTCTGGCCTCCATGCCGTCATCAGTCCCGAAGACCTCCGTAGAGTCCTTGGGGTCTCCATCATCCCTGAGCTCCTCCAAGCCACGAGAGCCTCTGTGTCCCCTGAAGCACCCTTCACACCAGCCACCTGCGAGCACCCTATCACCAAACCCGACCAGCTCCACAGAATCCTTGGGGTATCTGTCATCCCTGAGCTCCTCCCAGCCACCAGAGCCTTTGCGTCCCCTGAAGCACCCTTCACACAAGCCACGTGGGCGTTCCCTTCCCCGACGACGGAATCCTGGCTGGGTGTCCTGGTCCGACTCCATGCAGGCTGATTCCGAAACTGACACCATAATATGCCCAATGTGCAAGGCCCCTGAGCGCTCCTGTCCACACACCTGGTGGGTGCCTTCTAGCCCTCGAGTGATCCGAGGCGTTGGTCGCTGCAGTGATCCCAACCTGGGCCTCTCCTGGAGGCAGGAGGCTGCTAGAGCCTGGTGCCACTGCACCTCCTCACAGTTCCCATTCAAGCACCCTAATCTTCCCACCCACCTACCAAAGGCTTCCTTCTAG

The peptide sequence is as below:

MPRAQLPEDSSAVDMDILFPLDSVIGTELCPSPIPQIIHFVLFVVFSLVILIILRLYIPREPSSVPPREEDSENDQAEVGEWLRIGNKYITLKDYRILLKELENLEIYTFLSKKCLKKLSREGSSHHLPRQVRPGPVYKPAPARNHRPRGGRGKASPTSFHVSPRAPLAPLASMPSSVPKTSVESLGSPSSLSSSKPREPLCPLKHPSHQPPASTLSPNPTSSTESLGYLSSLSSSQPPEPLRPLKHPSHKPRGRSLPRRRNPGWVSWSDSMQADSETDTIICPMCKAPERSCPHTWWVPSSPRVIRGVGRCSDPNLGLSWRQEAARAWCHCTSSQFPFKHPNLPTHLPKASF

But I cannot get this query from the proteinToGenome output. It returns an empty record:

> genoloci.ah[["ENSP00000518892"]]
GRanges object with 0 ranges and 0 metadata columns:
   seqnames    ranges strand
      <Rle> <IRanges>  <Rle>
  -------
  seqinfo: no sequences

May I ask why this happens? I don't understand why the sequence can all be found in the Ensembl database but cannot be mapped using proteinToGenome.

Thank you.

jorainer commented 2 months ago

I tried with current Bioconductor package versions (3.19) and Ensembl release 112 and it seems to work:

library(ensembldb)
library(AnnotationHub)
edb <- ah[["AH116860"]]
rng <- IRanges(start = 1, end = 28)
names(rng) <- "ENSP00000518892"
proteinToGenome(rng, edb)
Fetching CDS for 1 proteins ... 1 found
Checking CDS and protein sequence lengths ... 1/1 OK
$ENSP00000518892
GRanges object with 1 range and 7 metadata columns:
      seqnames              ranges strand |      protein_id           tx_id
         <Rle>           <IRanges>  <Rle> |     <character>     <character>
  [1]        5 179644937-179645020      - | ENSP00000518892 ENST00000511063
              exon_id exon_rank    cds_ok protein_start protein_end
          <character> <integer> <logical>     <integer>   <integer>
  [1] ENSE00003874488         1      TRUE             1          28
  -------
  seqinfo: 1 sequence from GRCh38 genome

maybe you used a different Ensembl release?

mschubert commented 1 month ago

I stumbled over this yesterday too: The reason was that it wasn't obvious to me that the ranges need to have protein ID names.

The names are set in the Examples section of ?proteinToGenome, but it would have saved me a bit of time if either:

  1. That the records should be named with protein IDs was stated in the Description field
  2. An IRanges object with positive length and no names would raise an error
jorainer commented 1 month ago

thanks for the comment @mschubert ! When you write

was stated in the Description field

do you mean the help page of the proteinToGenome() function?

Regarding the second point, yes that makes sense. I'll look into it - and see if that has no other unwanted effects.

zyh4482 commented 1 month ago

@mschubert This is quite unexpected... I've never thought about this before. Thank you!

mschubert commented 1 month ago

do you mean the help page of the proteinToGenome() function?

Yes, exactly. I see you already added that, great, thank you!

jorainer commented 1 month ago

I'm closing this issue now - feel free to reopen if needed.