How can I return the number of nucleotide base pairs across many records?

rossmounce commented 7 years ago

I'd like to extract the number of nucleotide base pairs from each and every esummary record (from nuccore) for a given taxon. I've got to the stage where I can extract it from one individual record, but I was hoping there's a way to use extract_from_esummary() to do it across multiple records?

Any help would be much appreciated.

r_search <- entrez_search(db="nuccore", term="Ilex kaushue[ORGN]", retmax=50)
#Returns 44 hits
x_summ <- entrez_summary(db="nuccore", id=r_search$ids)
#Saves 44 esummary records
x_summ$`328926792`$statistics$count[1]
#Gives me the value I want to extract for one of the records ( but I want to extract this value from ALL esummary records )
extract_from_esummary(x_summ,elements="statistics")
 343194329    343194328    343194327    924399621    924399613    924399608    924399603   
type    Character,15 Character,15 Character,15 Character,15 Character,15 Character,15 Character,15
count   Integer,15   Integer,15   Integer,15   Integer,15   Integer,15   Integer,15   Integer,15  
subtype Character,15 Character,15 Character,15 Character,15 Character,15 Character,15 Character,15
source  Character,15 Character,15 Character,15 Character,15 Character,15 Character,15 Character,15
...
#Not sure where to go from here

( I'm using rentrez_1.0.4 )

dwinter commented 7 years ago

Hi @rossmounce,

Yeah, these nested records can get hard to deal with. Two ways around this:

Maybe you just want slen?

In addition to the detailed statistics, each summary record has an "slen" object. I gather (but maybe check if it is important) this is the total length of the sequence, and might include "N"s etc.

slens <- extract_from_esummary(x_summ, elements="slen")
head(slens)

343194329 343194328 343194327 924399621 924399613 924399608 
      484       484       484       449       404       434

List to dataframe

If you need the extra details in the statistics field it is possible to turn the nested data structure into a data.frame. Though it takes a bit of work:

stat_list <- lapply(x_summ,  "[[", "statistics")
stats_df <- do.call(rbind.data.frame, stat_list)
stats_df$uid <- rep(names(stat_list), sapply(stat_list, nrow))
head(stats_df)

                 type count      subtype source       uid
343194329.1    Length   484         <NA>   <NA> 343194329
343194329.2    Length   484      literal   <NA> 343194329
343194329.3       all     2         <NA>   <NA> 343194329
343194329.4 blob_size  3543         <NA>   <NA> 343194329
343194329.5       imp     1         <NA>   <NA> 343194329
343194329.6       imp     1 misc_feature   <NA> 343194329

Hope that helps

rossmounce commented 7 years ago

Ah thanks!

I didn't see slens had effectively the same thing I wanted. This is great. Problem solved.

(FYI, I'm trying to determine what the amount of public sequence data there is available for ALL plant species, per species!)

Cheers

dwinter commented 7 years ago

Happy to help.

re: your grand plan, you might like these (counting the number of records, rather than length of sequences): https://gist.github.com/dwinter/8b98bf57a287ac9f389a

rossmounce commented 7 years ago

Very nice! I love treemap visualisations. Not used enough imo!

ropensci / rentrez

How can I return the number of nucleotide base pairs across many records? #104

Maybe you just want slen?

List to dataframe