extract_from_esummary matrix cannot be cleanly written to csv

ropensci / rentrez

talk with NCBI entrez using R

https://docs.ropensci.org/rentrez


extract_from_esummary matrix cannot be cleanly written to csv #65

Closed gadepallivs closed 8 years ago

gadepallivs commented 8 years ago

Hi David, below is an example. I don't understand why the text in the title, fulljournalname, and pubtype fields spills over into a second column of the CSV.

library(rentrez)

PM.ID <- c("26287849", "25979833", "25667274", "25430497", "24968756", "24846037", "24296758", "24281417", "24128713", "24055406", "23489023")
p.data <- entrez_summary(db = "pubmed", id = PM.ID)
pubrecord.table <- extract_from_esummary(
  esummaries = p.data,
  elements = c("uid", "title", "fulljournalname", "pubtype", "volume", "issue",
               "pages", "lastauthor", "pmcrefcount", "issn", "pubdate"))
is(pubrecord.table) # "matrix" "array" "structure" "vector" "vectorORfactor"
pubrecord.table <- t(pubrecord.table) # transpose so that each record becomes a row
write.csv(pubrecord.table, file = "test12.csv")

dwinter commented 8 years ago

This is not really a problem with rentrez, just a property of NCBI records and R objects.

In this case, the pubtype field is variably-sized:

sapply(pubrecord.table[4,], length)

26287849 25979833 25667274 25430497 24968756 24846037 24296758 24281417 
       2        2        1        2        1        3        1        2 
24128713 24055406 23489023 
       1        2        2

When you try to write the matrix, R represents those vectors the way you'd type them in (c(..., ...)). That adds commas inside the cell, which breaks the csv format.
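The structure described above can be reproduced without a network call. The following is a minimal sketch using made-up records (the names `records` and `tab` and the IDs are hypothetical, and the mock only imitates the shape of real esummary data): because each record is returned as a list, sapply builds a matrix of mode list whose cells can hold vectors of different lengths.

```r
# Mock records standing in for entrez_summary() output (hypothetical data).
records <- list(
  "111" = list(uid = "111", pubtype = c("Journal Article", "Review")),
  "222" = list(uid = "222", pubtype = "Journal Article")
)

# Each call returns a list, so sapply() builds a matrix of mode list;
# a single cell can then hold a vector of any length.
tab <- sapply(records, function(rec) rec[c("uid", "pubtype")])

is.matrix(tab)  # the object has dimensions ...
is.list(tab)    # ... but its cells are list elements, not strings

# Number of values stored in each pubtype cell
cell_lengths <- sapply(tab["pubtype", ], length)
```

Checking `cell_lengths` for each field is a quick way to find out which rows need collapsing before the table is written out.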

In this case, you can collapse the vectors:

pubrecord.table[4,] <- sapply(pubrecord.table[4,], paste, collapse=" & ")

and then unlist each matrix row so that it can be written out:

f <- tempfile()
write.csv( apply(pubrecord.table, 1, unlist), f)
re_read <- read.csv(f)
re_read$pmcrefcount

 [1]  0  1  3  2  1 26 10  4  3  2 21
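Since the variably-sized field can differ from query to query, it may be easier to collapse every cell rather than one named row. The following is a sketch under stated assumptions: `collapse_cells` is a hypothetical helper (not part of rentrez), the records are mock data, and treating an empty cell as NA is one possible choice.

```r
# Mock list-matrix shaped like the one in this thread (hypothetical data).
records <- list(
  "111" = list(uid = "111", pubtype = c("Journal Article", "Review"), pmcrefcount = 3),
  "222" = list(uid = "222", pubtype = "Journal Article", pmcrefcount = 0),
  "333" = list(uid = "333", pubtype = character(0), pmcrefcount = 1)
)
fields <- c("uid", "pubtype", "pmcrefcount")
tab <- sapply(records, function(rec) rec[fields])

# Hypothetical helper: turn every cell into exactly one string.
# Empty cells become NA; multi-valued cells are joined with `sep`.
collapse_cells <- function(tab, sep = " & ") {
  flat <- vapply(tab, function(cell) {
    if (length(cell) == 0) NA_character_ else paste(cell, collapse = sep)
  }, character(1))
  matrix(as.vector(flat), nrow = nrow(tab), dimnames = dimnames(tab))
}

flat_tab <- collapse_cells(tab)

# Transpose so that each record is a row, then write and re-read the CSV.
out <- as.data.frame(t(flat_tab), stringsAsFactors = FALSE)
f <- tempfile(fileext = ".csv")
write.csv(out, f)
re_read <- read.csv(f, stringsAsFactors = FALSE)
```

After the round trip, `re_read$pubtype` holds one string per record, and the record with no publication type comes back as NA.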

gadepallivs commented 8 years ago

Hi David, the solution above works for some PMID queries, but for others I still get an error. Depending on the PMID, the variable-length fields turn up in title, journal name, pubtype, or something else. I thought that simply removing the row number would fix the issue, but I get an error when trying to write the table in R Shiny:

pubrecord.table[,] <- sapply(pubrecord.table[,], paste, collapse=" & ")

Error in apply(pubrecord.reference, 1, unlist) : dim(X) must have a positive length

P.S. Why was extract_from_esummary designed to return a matrix? The data it extracts is a mix of character and numeric vectors, so a data frame would seem the ideal way to store it, whereas a matrix is expected to hold data of a single type.
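The thread does not establish what caused this particular error, but the message itself is what apply raises when its first argument has no dim attribute. The following is a small base-R sketch of one way that can happen (an assumption for illustration, using a toy matrix `m`): subsetting a single row drops the dimensions unless drop = FALSE is given.

```r
m <- matrix(1:4, nrow = 2)

# Subsetting one row silently drops the dim attribute ...
row_vec <- m[1, ]

# ... so apply() fails; capture the message instead of stopping.
msg <- tryCatch(
  apply(row_vec, 1, unlist),
  error = function(e) conditionMessage(e)
)

# Keeping the dimensions avoids the error.
kept <- m[1, , drop = FALSE]
ok <- apply(kept, 1, sum)
```

Checking `is.matrix()` on the object just before the apply call is a cheap way to confirm whether this is what is going on.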

dwinter commented 8 years ago

I'm not sure what you are trying to do in the example, but it seems like it's hitting empty fields?

extract_from_esummary is really a wrapper around sapply. It doesn't return data.frames because I think most users don't expect data.frame columns to contain vectors like this:

df <- as.data.frame(t(pubrecord.table))
df$pubtype

$`26287849`
[1] "Journal Article"   "Multicenter Study"

$`25979833`
[1] "Journal Article"             "Randomized Controlled Trial"

$`25667274`
[1] "Journal Article"
.
.
.

Structured data like that would seem to fit a list better than a data.frame, and you can get that by setting simplify=FALSE.
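If a flat table is still wanted, one option is to reshape such a list into long format, with one row per record and value. The following is a minimal sketch with a mock list (the `records` object is hypothetical and only stands in for list-shaped output; the real structure may differ in detail).

```r
# Hypothetical records: a named list, one element per PubMed ID.
records <- list(
  "111" = list(uid = "111", pubtype = c("Journal Article", "Review")),
  "222" = list(uid = "222", pubtype = "Journal Article")
)

# One row per (uid, pubtype) pair, so every cell holds a single value.
long <- do.call(rbind, lapply(records, function(rec) {
  data.frame(uid = rec$uid, pubtype = rec$pubtype, stringsAsFactors = FALSE)
}))
rownames(long) <- NULL
```

A table like `long` writes to CSV without any collapsing, at the cost of repeating the uid for records with several publication types.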

gadepallivs commented 8 years ago

Edited: Hi David, I found that the issue was caused by empty abstract fields for some entries.

library(rentrez)
library(XML)

PM.ID <- c("26391251", "26372702", "26372699", "26371045", "26338018", "26317919",
           "26315966", "26301800", "26301799", "26258891")
fetch.pubmed <- entrez_fetch(db = "pubmed", id = PM.ID,
                             rettype = "xml", parsed = TRUE)
abstracts <- xpathApply(fetch.pubmed, '//PubmedArticle//Article',
                        function(x) xmlValue(xmlChildren(x)$Abstract))

This results in NA for PMIDs whose abstracts are empty. When the table is rendered with R Shiny, it just shows "Processing" and never displays. I need to learn more about that, but it is not related to the rentrez package. Thank you.
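One way to rule out the missing values as a factor is to replace them before building the table. The following is a sketch only: the `abstracts` list is a hypothetical stand-in for the xpathApply result, the placeholder text is arbitrary, and nothing in this thread confirms that this resolves the Shiny rendering problem.

```r
# Hypothetical stand-in for the list of abstracts, with one missing entry.
abstracts <- list("First abstract text.", NA_character_, "Third abstract text.")

# Flatten to a character vector and replace missing abstracts with a placeholder.
abstract_text <- vapply(abstracts, function(a) {
  if (length(a) == 0 || is.na(a[1])) "No abstract available" else as.character(a[1])
}, character(1))

abstract_table <- data.frame(abstract = abstract_text, stringsAsFactors = FALSE)
```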

dwinter commented 8 years ago

OK, good luck getting to the bottom of the shiny problem :)