How to keep Protein ID when retrieving coding sequences

santiagoha commented 7 years ago

I have a bunch of protein IDs and I need to retrieve the corresponding coding sequences (CDSs). I have managed to retrieve the CDSs but the names of each sequence change from XP to XM, and I need to retain the XP* header for each sequence.

Basically it looks like this:

library(renters)
search1 <- entrez_search(db="protein", term="XP_012370245[Accn]")
links <- entrez_link(dbfrom="protein", db="nuccore", id=search1$ids)
rec <- entrez_fetch(db="nuccore", rettype="fasta", id=links$links$protein_nuccore_mrna[1])

And the output looks like this:

> rec
[1] ">XM_012514791.1 PREDICTED: Octodon degus Hermansky-Pudlak syndrome 6 (Hps6), mRNA
\nATGCAGAGGAAAAACTTTATGTCATTCTTCACAGGCTTCCTGGAGAAGCGGGCCTGGCCGGAGGCCCGCG\nCCGCGCCGCTGGACGCCTTCTTCCTGGCGTGGCCGGCGCAGCCCGCGGTGGCGCTGGTGTGGGAGAGCGG (...)

Is there a way to keep the protein id (XP_012370245) instead of the nucleotide id (XM_012514791.1)? Something like:

> rec
[1] ">XP_012370245
 \nATGCAGAGGAAAAACTTTATGTCATTCTTCACAGGCTTCCTGGAGAAGCGGGCCTGGCCGGAGGCCCGCG\nCCGCGCCGCTGGACGCCTTCTTCCTGGCGTGGCCGGCGCAGCCCGCGGTGGCGCTGGTGTGGGAGAGCGG (...)

Any suggestion is very welcome, thanks!

dwinter commented 7 years ago

Hi @santiagoha,

I don't think there is any way to do this on the NCBI end (nucleotide records have nucleotide IDs).

You can probably cook something up to replace the IDs through stringr or the base string manipulation functions. This works for your example (though you probably want to add some checks to make sure you are actually findings CDS sequences etc):

fetch_cds <- function(prot_acc){
    search1 <- entrez_search(db="protein", term=paste0(prot_acc, "[Accn]"))
    links <- entrez_link(dbfrom="protein", db="nuccore", id=search1$ids)
    rec <- entrez_fetch(db="nuccore", rettype="fasta", id=links$links$protein_nuccore_mrna[1])
    sub("XM_\\d+\\.\\d", prot_acc, rec)    
}

cat(substr(fetch_cds("XP_012370245"), 1, 500), "\n")

>XP_012370245 PREDICTED: Octodon degus Hermansky-Pudlak syndrome 6 (Hps6), mRNA
ATGCAGAGGAAAAACTTTATGTCATTCTTCACAGGCTTCCTGGAGAAGCGGGCCTGGCCGGAGGCCCGCG
CCGCGCCGCTGGACGCCTTCTTCCTGGCGTGGCCGGCGCAGCCCGCGGTGGCGCTGGTGTGGGAGAGCGG
CCTGGCGGAGCTCTGGGGTGCCGACCTGGGGTCCGCCTGGAGGCGGCTTCACGCCACCGAACTGTGTCCG
CGCGGCGCAGTCCGCGTGGTGGCAGCGGTGGCGCCGCGGGGCCGCCTGGTGTGGTGCGAGGAGCGCCCGG
GCGCGGGCGGACGCCGCGTGTGCGTCCGCACCCTAGAGCCTGGCGGCGAGACTGGTGCCCGCCTGGGCCG
CACGCACGTCCTGCTGCACCACTGCCCGCCCTTCCACCTGCTGGCCTCGCGCAAGGACGTCTTCC

santiagoha commented 7 years ago

Hi @dwinter, thank you so much for the hint! I will try to apply it now for a large number of IDs

santiagoha commented 7 years ago

Hi @dwinter, this may be a silly question, but as you suggested earlier I am trying to check that the CDS actually exists, and if not, the function should break. I modified your fetch_cds function like this:

fetch_cds <- function(prot_acc){
    search1 <- entrez_search(db="protein", term=paste0(prot_acc, "[Accn]"))
    links <- entrez_link(dbfrom="protein", db="nuccore", id=search1$ids)
    if (is.null(links$links$protein_nuccore_mrna[1]))
    {
        stop("No CDS found") 
    }
    else 
    {
        rec <- entrez_fetch(db="nuccore", rettype="fasta", id=links$links$protein_nuccore_mrna[1])
        sub("XM_\\d+\\.\\d", prot_acc, rec)
    }     
}

When I apply it to an extant CDS, it works fine, however if I apply it to a protein ID that don't have the CDS it returns the following error:

fetch_cds("XP_004623289.1")
Error in res[[1]] : subscript out of bounds

I have try to solve it in different ways but I always get the same error. I understand that the error is telling that I'm trying to access some element that is outside the limits of the list/array, however, I don't understand which list/array is causing the error. Am I missing something? Thank you very much for help.

dwinter commented 7 years ago

Hey @santiagoha, can you provide me with an ID that throws this error. Are you sure ther is a protein with this accession?

A couple of general debugging pointers.

You can use traceback() to get the particular series of calls that gives an error (whch could give you a clue as to what's gong on.
It's always a good idea to break the function down and run each line outside of the function scope while using the same inputs that are producing errors. That way you can see exactly when the errors occur

santiagoha commented 7 years ago

Hi @dwinter, I just checked in NCBI and apparently the record for this accession (XP_004623289.1) was removed: https://www.ncbi.nlm.nih.gov/protein/XP_004623289.1?report=genpept Maybe this is why it doesn't has a corresponding CDS?

As you suggested, I previously broke the function and tried it with this ID: XP_004623289.1. The entrez_search and entrez_link functions worked fine. I then checked what was inside links:

links$links
elink result with information from 1 databases:
[1] protein_nuccore

So, it doesn't have the _protein_nuccoremrna database. And, actually, if I try to call this database it is NULL:

links$links$protein_nuccore_mrna[1]
NULL

This is why I used this a the conditional statement to break the function.

I used traceback() and this is what I'm getting:

traceback()
3: parse_elink(record, cmd = cmd, by_id = by_id)
2: entrez_link(dbfrom = "protein", db = "nuccore", id = search1$ids) at #3
1: fetch_cds("XP_004623289.1")

But I still don't understand what is causing error, is the parse_elinkfunction? Thank you very much for your time and advises!

dwinter commented 7 years ago

Hey @santiagoha ,

There are no IDs in the search result

entrez_search(db="protein", term="XP_004623289.1[Accn]")

Entrez search result with 0 hits (object contains 0 IDs and no web_history object)
 Search term (as translated):  (XP_004623289.1[Accn])

As a result, elink is failing. We should probably check for non-zero length ID vectors within rentrez, but for now you probably want to catch these before you go hunting for links.

santiagoha commented 7 years ago

Hi @dwinter, I didn't noticed that there were no IDs! I modified the function as follows:

fetch_cds <- function(prot_acc){
    search1 <- entrez_search(db="protein", term=paste0(prot_acc, "[Accn]"))
    tryCatch({
        if (search1$count == 0) stop("No CDS found")
        else 
        {
        links <- entrez_link(dbfrom="protein", db="nuccore", id=search1$ids)
        rec <- entrez_fetch(db="nuccore", rettype="fasta", id=links$links$protein_nuccore_mrna[1])
        sub("XM_\\d+\\.\\d", prot_acc, rec)
        } 
    }, error=function(e){cat("ERROR :",conditionMessage(e), "\n")}) 
}

So, basically, it searches if there is any id for the corresponding Protein ID, and I added tryCatch to avoid the function to break. This is useful if you want to iterate the function over a list of IDs.

For example, for the list of IDs: "XP_004633320.1" "XP_004623289.1" "XP_004626331.1", the second Protein ID don't have a corresponding id with the entrez_searchfunction, but the other two are extant ids.

ids.a <- c("XP_004633320.1", "XP_004623289.1" ,"XP_004626331.1")
CDSs <- list()
for(i in 1:length(ids.a))
{
    id <- ids.a[i]
    cds <- fetch_cds(id)
    CDSs[[i]] <- cds
}

And the output looks as follows:

> CDSs
[[1]]
[1] ">XP_004633320.1 PREDICTED: Octodon degus major facilitator superfamily domain containing 9 (Mfsd9), mRNA
\nATGTGGTGCCTGAGGTTCGGGGCCCCGGGCGGGCCAGCTTCAGCCACTCCGGAGCTGCCGCCGGAGCGAG\nGCCCGGGCGCGGTGGGAGCTCGCCGCTTCCTGCTTGGCCTCTACCTGGTGGGCTTCCTGG (...)

[[2]]
NULL

[[3]]
[1] ">XP_004626331.1 PREDICTED: Octodon degus endothelin receptor type B (Ednrb), mRNA
\nCGGAGCGGCCGGTACACGCACTGGAGCAAGCAGTAGCATGCAGTCGCCAACAAGTCTGTGCAGACTCGCC\nCTGGTGGCTCTGGTTTTTGCCTGCATTTCGGCCGGGGTCTGGGGAGAGGAGAGAGGATTCCC (...)

I don't know if this is the most efficient approximation but it works, I hope this can be useful, thanks again for your help!

dwinter commented 7 years ago

closing now, feel free to sumbmit more issues if you run into anything else.

ropensci / rentrez

How to keep Protein ID when retrieving coding sequences #107