Accessing Biosample Attributes

ropensci / rentrez

talk with NCBI entrez using R

Hi All,

I am struggling to extract the Attributes from several BioSamples into a nice tibble.

library(rentrez)
library(XML)
library(dplyr)

df_all = NULL
for ( id in c("SAMN12414413","SAMN08472433"))
{
  entrez_fetch(db="biosample", id = id, rettype = "xml", parsed=FALSE) -> data

  data %>%
    xmlParse -> doc

  doc %>%
    xmlToDataFrame(nodes=getNodeSet(doc, "//Attributes")) -> df

  df %>%
    t() %>%
    data.frame() -> df
  colnames(df) = id
  bind_rows(df,
            df_all) -> df_all
}
df_all

But since I did not manage to extract display_name or attribute_name from the XML, the rows end up without meaningful names and I can't really make it work.

df_all

                                            SAMN08472433 SAMN12414413
Attribute...1                                 ATCC 33285         <NA>
NA....2                                         Aug-1979         <NA>
NA..1...3                              Human oral cavity         <NA>
NA..2...4                                        Unknown         <NA>
NA..3...5                                        curette         <NA>
NA..4                                USA: Blacksberg, VA         <NA>
NA..5                                    37.23 N 80.42 W         <NA>
NA..6                                       Homo sapiens         <NA>
NA..7                                               TSBY         <NA>
NA..8                                                  1         <NA>
NA..9                                         ATCC 33285         <NA>
NA..10                              Periodontitis severe         <NA>
NA..11                                          anaerobe         <NA>
NA..12                                             Black         <NA>
NA..13                                                20         <NA>
NA..14                                            female         <NA>
NA..15         type strain of Bacteroides zoogleoformans         <NA>
NA..16                                        ATCC:33285         <NA>
NA..17                                      pure culture         <NA>
Attribute...20                                      <NA>         638R
NA....21                                            <NA> Homo sapiens
NA..1...22                                          <NA>     Oct-2017
NA..2...23                                          <NA>          USA
NA..3...24                                          <NA> cell culture

Any help will be very much appreciated.

Thanks a ton

library(XML)
library(rentrez)
library(tidyverse)

summarise_biosample <- function(biosample_id){
  parsed <- entrez_fetch(db="biosample", id=biosample_id, rettype="xml", parsed=TRUE)
  attr_values <- xpathSApply(parsed, "//Attributes/Attribute", xmlValue)
  attr_names <- xpathApply(parsed, "//Attributes/Attribute", xmlAttrs)
  sample_df <- data.frame(attribute_type = unlist(lapply(attr_names, names)),
                          attribute = unlist(attr_names),
                          value = rep(attr_values, lengths(attr_names)))
  sample_df$biosample <- biosample_id
  sample_df
}

biosamples <- c("SAMN12414413","SAMN08472433")
res <- bind_rows(lapply(biosamples, summarise_biosample))
head(res)

   attribute_type attribute        value    biosample
1  attribute_name    strain         638R SAMN12414413
2 harmonized_name    strain         638R SAMN12414413
3    display_name    strain         638R SAMN12414413
4  attribute_name      host Homo sapiens SAMN12414413
5 harmonized_name      host Homo sapiens SAMN12414413
6    display_name      host Homo sapiens SAMN12414413
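If the goal is one row per BioSample rather than this long format, one option is to keep a single attribute_type and pivot the rest with tidyr. What follows is only a minimal sketch under a few assumptions: tidyr 1.0 or later is installed, the small res data frame is rebuilt by hand from the six rows printed by head(res) so that it runs without a network call, and wide is just an illustrative name. Attributes that lack a harmonized_name would be dropped by the filter, so display_name or attribute_name may be a better key for some samples.

```r
library(dplyr)
library(tidyr)

# Hand-built copy of the six rows shown by head(res); in practice use the
# res object returned by bind_rows(lapply(biosamples, summarise_biosample)).
res <- data.frame(
  attribute_type = rep(c("attribute_name", "harmonized_name", "display_name"), times = 2),
  attribute      = rep(c("strain", "host"), each = 3),
  value          = rep(c("638R", "Homo sapiens"), each = 3),
  biosample      = "SAMN12414413"
)

# Keep one naming scheme per attribute, then spread attributes into columns
# so that each BioSample occupies a single row.
wide <- res %>%
  filter(attribute_type == "harmonized_name") %>%
  select(biosample, attribute, value) %>%
  pivot_wider(names_from = attribute, values_from = value)

wide
```

With several BioSamples in res, attributes that are missing for a given sample should come out as NA in its row, which is usually what you want for a comparison table.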

Accessing Biosample Attributes #154