ropensci / rentrez

talk with NCBI entrez using R
https://docs.ropensci.org/rentrez

Accessing Biosample Attributes #154

Closed fconstancias closed 3 years ago

fconstancias commented 3 years ago

Hi All,

I am struggling to extract the Attributes of several BioSamples into a tidy tibble.

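library(rentrez)
library(XML)
library(dplyr)

df_all <- NULL
for (id in c("SAMN12414413", "SAMN08472433")) {
  # Fetch the raw BioSample XML record and parse it
  data <- entrez_fetch(db = "biosample", id = id, rettype = "xml", parsed = FALSE)
  doc <- xmlParse(data)

  # One row per <Attributes> node; only the text of each <Attribute> child is kept
  df <- xmlToDataFrame(doc, nodes = getNodeSet(doc, "//Attributes"))

  # Transpose so that each attribute value becomes a row, labelled by sample ID
  df <- data.frame(t(df))
  colnames(df) <- id
  df_all <- bind_rows(df, df_all)
}
df_all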

However, I have not managed to extract display_name or attribute_name from the XML, so the rows have no meaningful names and the values from different samples do not line up:

df_all

                                            SAMN08472433 SAMN12414413
Attribute...1                                 ATCC 33285         <NA>
NA....2                                         Aug-1979         <NA>
NA..1...3                              Human oral cavity         <NA>
NA..2...4                                        Unknown         <NA>
NA..3...5                                        curette         <NA>
NA..4                                USA: Blacksberg, VA         <NA>
NA..5                                    37.23 N 80.42 W         <NA>
NA..6                                       Homo sapiens         <NA>
NA..7                                               TSBY         <NA>
NA..8                                                  1         <NA>
NA..9                                         ATCC 33285         <NA>
NA..10                              Periodontitis severe         <NA>
NA..11                                          anaerobe         <NA>
NA..12                                             Black         <NA>
NA..13                                                20         <NA>
NA..14                                            female         <NA>
NA..15         type strain of Bacteroides zoogleoformans         <NA>
NA..16                                        ATCC:33285         <NA>
NA..17                                      pure culture         <NA>
Attribute...20                                      <NA>         638R
NA....21                                            <NA> Homo sapiens
NA..1...22                                          <NA>     Oct-2017
NA..2...23                                          <NA>          USA
NA..3...24                                          <NA> cell culture

Any help will be very much appreciated.

Thanks a ton

dwinter commented 3 years ago

Hi @fconstancias ,

I think I have worked something out for this. Let me know if it works as expected.

library(XML)
library(rentrez)
library(tidyverse)
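summarise_biosample <- function(biosample_id) {
  # Fetch one BioSample record as a parsed XML document
  parsed <- entrez_fetch(db = "biosample", id = biosample_id, rettype = "xml", parsed = TRUE)
  # Text content of each <Attribute> node
  attr_values <- xpathSApply(parsed, "//Attributes/Attribute", xmlValue)
  # XML attributes of each node (attribute_name, harmonized_name, display_name)
  attr_names <- xpathApply(parsed, "//Attributes/Attribute", xmlAttrs)
  # One row per (node, XML attribute) pair; each value is repeated once per name
  sample_df <- data.frame(attribute_type = unlist(lapply(attr_names, names)),
                          attribute = unlist(attr_names),
                          value = rep(attr_values, lengths(attr_names)))
  sample_df$biosample <- biosample_id
  sample_df
}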

biosamples <- c("SAMN12414413","SAMN08472433")
res <- bind_rows(lapply(biosamples, summarise_biosample))
head(res)
   attribute_type attribute        value    biosample
1  attribute_name    strain         638R SAMN12414413
2 harmonized_name    strain         638R SAMN12414413
3    display_name    strain         638R SAMN12414413
4  attribute_name      host Homo sapiens SAMN12414413
5 harmonized_name      host Homo sapiens SAMN12414413
6    display_name      host Homo sapiens SAMN12414413
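
If a wide table with one row per BioSample is what you are after, the long result can probably be reshaped with tidyr. The following is only a sketch: `res_head` is a hand-copied stand-in for the six rows printed above (so it runs without calling NCBI), and keeping only the `display_name` rows is an assumption about which label you want as the column name.

```r
library(dplyr)
library(tidyr)

# Stand-in for head(res): the six rows shown above, retyped so the sketch runs offline
res_head <- data.frame(
  attribute_type = rep(c("attribute_name", "harmonized_name", "display_name"), times = 2),
  attribute      = rep(c("strain", "host"), each = 3),
  value          = rep(c("638R", "Homo sapiens"), each = 3),
  biosample      = "SAMN12414413"
)

# Keep one label per attribute, then spread attributes into columns
wide <- res_head %>%
  filter(attribute_type == "display_name") %>%
  pivot_wider(id_cols = biosample, names_from = attribute, values_from = value)

wide
```

With the full `res`, samples that lack a given attribute should simply get `NA` in that column. If a sample carries two attributes with the same display name, `pivot_wider()` will warn and return list-columns, so it may be worth checking for duplicates first.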