ropensci / webchem

Chemical Information from the Web
https://docs.ropensci.org/webchem
Other
160 stars 40 forks source link

Images #249

Closed stitam closed 4 years ago

stitam commented 4 years ago

Some web services allow us to retrieve images of substances. There are already implementations in PR #235 and PR #247. Also ChemSpider and PubChem both offer such functionality. I thought it would be good to open a small discussion about design considerations to ensure consistency.

On first thought, I think I would prefer these functions to return a list of raw character vectors. I tried png, svg, jpeg, these can all be accessed e.g. through httr::content(httr::GET(url), type = "image"). Raw vectors can then be rendered easily with image processing packages like magick, e.g. plot(as.raster(image_read(raw))) and also exported in any format, but we wouldn't have to import them as dependencies, so webchem could remain lightweight.

What do you think?

andschar commented 4 years ago

I was about to open the same issue^^

andschar commented 4 years ago

IUp to now, I haven't thought of the output argument. Why should we want to retrieve only the paths and not directly store the image to disk?

andschar commented 4 years ago

There is also the CIR service which provides this functionality. I will include a function for it in PR #247.

stitam commented 4 years ago

I thought one option could be output = "download" and then path would be used to tell the function where to save, and the other could be output = "image" and then path wouldn't be used and the function would keep the images in R to be used for analysis/reporting. Maybe calling the argument dir, as you suggested is less confusing than path. Regardless, I am leaning towards no export functionality and leaving that to other functions. What do you think?

stitam commented 4 years ago

There is also the CIR service which provides this functionality. I will include a function for it in PR #247.

Though maybe it would be better to separate it into another PR.

andschar commented 4 years ago

I thought one option could be output = "download" and then path would be used to tell the function where to save, and the other could be output = "image" and then path wouldn't be used and the function would keep the images in R to be used for analysis/reporting. Maybe calling the argument dir, as you suggested is less confusing than path. Regardless, I am leaning towards no export functionality and leaving that to other functions. What do you think?

I would rather include the export functionality b/c the image is anyway created by the web service. We just have to call download.file or curl::download_file() then to store it somewhere to disk.

Hm, not sure about keeping images in R. They can probably also get quite big. I would rather store them to disk, and read them again, when needed, using a non-webchem function. E.g. knitr::include_graphics() in an RMardown for example. This approach would also not include any new dependencies.

andschar commented 4 years ago

There is also the CIR service which provides this functionality. I will include a function for it in PR #247.

Though maybe it would be better to separate it into another PR.

Agreed.

andschar commented 4 years ago
* I agree with point one, to choose one and I also prefer `img`. However, maybe it's better to turn the name around like in `get_*()` to have such a naming convention `img_actor()`. What do you think?

What do you prefer? actor_img() or img_actor()?

stitam commented 4 years ago
* I agree with point one, to choose one and I also prefer `img`. However, maybe it's better to turn the name around like in `get_*()` to have such a naming convention `img_actor()`. What do you think?

What do you prefer? actor_img() or img_actor()?

I prefer actor_img() because it is more consistent e.g. with etox_basic(), pc_prop().

stitam commented 4 years ago

I see your point regarding image size. However, if you wanted to use images of compounds as descriptors for an ML problem (I don't know if this is a reasonable thing to do), then returning a vector of images and keeping them in R would be much simpler than importing them from hard drive.

andschar commented 4 years ago
* I agree with point one, to choose one and I also prefer `img`. However, maybe it's better to turn the name around like in `get_*()` to have such a naming convention `img_actor()`. What do you think?

What do you prefer? actor_img() or img_actor()?

I prefer actor_img() because it is more consistent e.g. with etox_basic(), pc_prop().

Ok, sounds also good to me.

Aariq commented 4 years ago

I thought one option could be output = "download" and then path would be used to tell the function where to save, and the other could be output = "image" and then path wouldn't be used and the function would keep the images in R to be used for analysis/reporting. Maybe calling the argument dir, as you suggested is less confusing than path. Regardless, I am leaning towards no export functionality and leaving that to other functions. What do you think?

Some image types can't be represented as R objects and only a download/export option would make sense. E.g. I think in #235 one of the file options is a PDF.

Aariq commented 4 years ago

I think they image functions should either return raster images, or in the case that they download image files, they should return the file paths (silently). That will make integration with things like knitr::include_graphics() or pander::pandoc.image.return() easier so we can make things like this:

library(webchem)
library(pander)
library(knitr)

id <- c("Glyphosate", "Isoproturon", "BSYNRYMUTXBXSQ-UHFFFAOYSA-N")
cir_img(id, bgcolor = "transparent", antialising = 0, width = 200, height = 200, symbolfontsize = 9, dir = here::here())
#> Querying: Glyphosate
#> https://cactus.nci.nih.gov/chemical/structure/Glyphosate/image?format=png&width=200&height=200&linewidth=2&symbolfontsize=9&bgcolor=transparent&csymbol=special&hsymbol=special
#> Image saved under: /private/var/folders/b_/2vfnxxls5vs401tmhhb3wqdh0000gp/T/RtmpH52OZ2/reprex9d0d4d96caea/Glyphosate.png
#> Querying: Isoproturon
#> https://cactus.nci.nih.gov/chemical/structure/Isoproturon/image?format=png&width=200&height=200&linewidth=2&symbolfontsize=9&bgcolor=transparent&csymbol=special&hsymbol=special
#> Image saved under: /private/var/folders/b_/2vfnxxls5vs401tmhhb3wqdh0000gp/T/RtmpH52OZ2/reprex9d0d4d96caea/Isoproturon.png
#> Querying: BSYNRYMUTXBXSQ-UHFFFAOYSA-N
#> https://cactus.nci.nih.gov/chemical/structure/BSYNRYMUTXBXSQ-UHFFFAOYSA-N/image?format=png&width=200&height=200&linewidth=2&symbolfontsize=9&bgcolor=transparent&csymbol=special&hsymbol=special
#> Image saved under: /private/var/folders/b_/2vfnxxls5vs401tmhhb3wqdh0000gp/T/RtmpH52OZ2/reprex9d0d4d96caea/BSYNRYMUTXBXSQ-UHFFFAOYSA-N.png
paths <- list.files(here::here(), pattern = "*.png", full.names = TRUE)

kable(data.frame(id = id, picture = pandoc.image.return(paths)))
id picture
Glyphosate
Isoproturon
BSYNRYMUTXBXSQ-UHFFFAOYSA-N

Created on 2020-05-06 by the reprex package (v0.3.0)

(pictures work when code is run locally with PR #250, but they don't show up in correct order. If cir_img() returned file paths, this would work correctly and with less code.)

andschar commented 4 years ago

I see your point regarding image size. However, if you wanted to use images of compounds as descriptors for an ML problem (I don't know if this is a reasonable thing to do), then returning a vector of images and keeping them in R would be much simpler than importing them from hard drive.

I have never worked with images as objects in R. Can all the common image formats (.png, .jpg, .giv, .svg) be imported as image objects? What class is that? Are these the raster images @Aariq mentioned in the above comment?

Would I have to convert the images for that? Which function?

Sorry, for all the questions, I'm a bit puzzled.

andschar commented 4 years ago

I think they image functions should either return raster images, or in the case that they download image files, they should return the file paths (silently). That will make integration with things like knitr::include_graphics() or pander::pandoc.image.return() easier so we can make things like this:

library(webchem)
library(pander)
library(knitr)

id <- c("Glyphosate", "Isoproturon", "BSYNRYMUTXBXSQ-UHFFFAOYSA-N")
cir_img(id, bgcolor = "transparent", antialising = 0, width = 200, height = 200, symbolfontsize = 9, dir = here::here())
#> Querying: Glyphosate
#> https://cactus.nci.nih.gov/chemical/structure/Glyphosate/image?format=png&width=200&height=200&linewidth=2&symbolfontsize=9&bgcolor=transparent&csymbol=special&hsymbol=special
#> Image saved under: /private/var/folders/b_/2vfnxxls5vs401tmhhb3wqdh0000gp/T/RtmpH52OZ2/reprex9d0d4d96caea/Glyphosate.png
#> Querying: Isoproturon
#> https://cactus.nci.nih.gov/chemical/structure/Isoproturon/image?format=png&width=200&height=200&linewidth=2&symbolfontsize=9&bgcolor=transparent&csymbol=special&hsymbol=special
#> Image saved under: /private/var/folders/b_/2vfnxxls5vs401tmhhb3wqdh0000gp/T/RtmpH52OZ2/reprex9d0d4d96caea/Isoproturon.png
#> Querying: BSYNRYMUTXBXSQ-UHFFFAOYSA-N
#> https://cactus.nci.nih.gov/chemical/structure/BSYNRYMUTXBXSQ-UHFFFAOYSA-N/image?format=png&width=200&height=200&linewidth=2&symbolfontsize=9&bgcolor=transparent&csymbol=special&hsymbol=special
#> Image saved under: /private/var/folders/b_/2vfnxxls5vs401tmhhb3wqdh0000gp/T/RtmpH52OZ2/reprex9d0d4d96caea/BSYNRYMUTXBXSQ-UHFFFAOYSA-N.png
paths <- list.files(here::here(), pattern = "*.png", full.names = TRUE)

kable(data.frame(id = id, picture = pandoc.image.return(paths)))

id picture Glyphosate Isoproturon BSYNRYMUTXBXSQ-UHFFFAOYSA-N

Created on 2020-05-06 by the reprex package (v0.3.0)

(pictures work when code is run locally with PR #250, but they don't show up in correct order. If cir_img() returned file paths, this would work correctly and with less code.)

Thank @Aariq for the comment. I also like the idea of downloading the images and returning the paths. Just an idea: What about returning a data.frame with paths to local files and URLs?

andschar commented 4 years ago

I have updated PR #250 to return (1) a file written to disk and (2) a data.frame containing the local path and the URL. What do you think?

stitam commented 4 years ago

I have never worked with images as objects in R. Can all the common image formats (.png, .jpg, .giv, .svg) be imported as image objects? What class is that? Are these the raster images @Aariq mentioned in the above comment?

Some image types can't be represented as R objects and only a download/export option would make sense. E.g. I think in #235 one of the file options is a PDF.

Working with images as objects:

#works with .png, .svg, .jpeg, .pdf
library(magick)
library(httr)
url <- "https://actorws.epa.gov/actorws/chemical/image?smiles=CC%3DO&w=400&h=400&fmt=png"
cont <- content(GET(url), type = "image") #object of class "raw"
img <- image_read(cont) #object of class "magick-image"
raster <- as.raster(img) #object of class "raster"
plot(img)
plot(raster)

We can get to raw without depending on another package but we cannot get to raster I think. On the other hand, image processing packages work well with images on disk, so maybe it is simpler to just export, I rest my case for keeping image objects in R.

Just an idea: What about returning a data.frame with paths to local files and URLs?

I like the idea of returning a data.frame! I think the query column would also be useful similarly to the get_*() functions. Are relative paths enough for subsequent functions or do we need absoIute paths? I don't know if including the urls in the output has a benefit, other webchem functions only print the url in verbose messages.

andschar commented 4 years ago

Ok, cool. I really think adding dependencies such as magick and raster (?) is a bit too much for webchem. Just a side note: In my experience, having everything in memory and not writing things to disc is a very "R" thing. Though I'm also not very proficient in other prog languages and this statement might not be TRUE^^

stitam commented 4 years ago

It's just very very polite:)

andschar commented 4 years ago

Ok, now that we agree to write to disk, what function should we use? As always in R, there are many possibilities. I prefer httr:: style, just because it is similar to other GET() and POST() functions in webchem. What about you?

qurl = "https://cactus.nci.nih.gov/chemical/structure/Triclosan/image?format=png&width=500&height=500&linewidth=2&symbolfontsize=16&csymbol=special&hsymbol=special"
# httr
require(httr)
h <- try(
  GET(qurl,
      timeout(5),
      write_disk(tempfile(), overwrite = TRUE))
)
# base
download.file(qurl,
              destfile = tempfile())
# curl
curl::curl_download(qurl,
                    destfile = tempfile())
stitam commented 4 years ago

I personally prefer httr, also because of the nice status messages that can be used for verbose output, but as long as it doesn't involve another dependency, I'd rather not search consensus here:)

andschar commented 4 years ago

I personally prefer httr, also because of the nice status messages that can be used for verbose output, but as long as it doesn't involve another dependency, I'd rather not search consensus here:)

Sounds like consensus to me :)

Aariq commented 4 years ago

I think I've had a bit of change of heart on this issue.

I think to keep things simple, image functions could return png files as R objects (with png::readPNG()) when no directory is specified and plot them by default (e.g. with grid::grid.raster()). If no directory is specified, then you shouldn't be allowed to get other file types. If other file types are specified, then the user must supply a directory. In the case that a directory is specified, the function should return file paths silently. Why? I think saving files to to a temporary directory is less transparent than returning R objects and more useful. If you wanted to use them externally to R, then you'd provide a directory, and if you want to use them internally, then this skips writing them to file and then reading back in. WIth defaults, the image function provides a way to preview the image, and then, once the user is satisfied, they can provide a directory and a file type for download.

andschar commented 4 years ago

I think I've had a bit of change of heart on this issue.

I think to keep things simple, image functions could return png files as R objects (with png::readPNG()) when no directory is specified and plot them by default (e.g. with grid::grid.raster()). If no directory is specified, then you shouldn't be allowed to get other file types. If other file types are specified, then the user must supply a directory. In the case that a directory is specified, the function should return file paths silently. Why? I think saving files to to a temporary directory is less transparent than returning R objects and more useful. If you wanted to use them externally to R, then you'd provide a directory, and if you want to use them internally, then this skips writing them to file and then reading back in. WIth defaults, the image function provides a way to preview the image, and then, once the user is satisfied, they can provide a directory and a file type for download.

Thanks @Aariq for the comment!

Would you store multiple outputs in an R list or directly plot them then?

Agreeing:

My concerns:

Code example for the CIR image service comparing the two approaches

# setup -------------------------------------------------------------------
require(httr)
query = c(
  "25057-89-0", "72-43-5", "640-15-3", "82-68-8", "50471-44-8", 
  "13457-18-6", "94-82-6", "93-76-5", "15972-60-8", "834-12-8", 
  "1912-24-9", "2642-71-9", "314-40-9", "1689-84-5", "470-90-6", 
  "1698-60-8", "1982-47-4", "2921-88-2", "15545-48-9", "21725-46-2", 
  "52315-07-8", "6190-65-4", "30125-63-4", "1007-28-9", "1014-69-3", 
  "333-41-5", "120-36-5", "62-73-7", "115-32-2", "83164-33-4", 
  "60-51-5", "298-04-4", "330-54-1", "38260-54-7", "122-14-5", 
  "93-72-1", "67564-91-4", "55-38-9", "96525-23-4", "51235-04-2"
)

# Option 1: Store as R object ---------------------------------------------
img_l = list()
for (i in query) {
  # url
  qurl = paste("https://cactus.nci.nih.gov/chemical/structure",
               i,
               "image?format=png&width=700&height=700",
               sep = '/')
  # query
  Sys.sleep(1.5)
  res = GET(qurl,
            timeout(5))
  img = png::readPNG(res$content)
  img_l[[i]] = img
  message('Processed: ', i, ' (stored in list).') # ~ 8KB / image: object.size(img)
}

object.size(img_l) # already 470 MB in memory

# Option 2: Save local ----------------------------------------------------
for (i in query) {
  # url
  qurl = paste("https://cactus.nci.nih.gov/chemical/structure",
               i,
               "image?format=png&width=700&height=700",
               sep = '/')
  # query
  Sys.sleep(1.5)
  GET(qurl,
      timeout(5),
      write_disk(file.path(tempdir(), paste0(i, ".png")), overwrite = TRUE))
  # ~ 8KB / image
  message('Processed: ', i, ' (saved to disk).')
}

This reflects just my opinion. If you @Aariq and @stitam prefer to including images as R objects, I can also live with it :) Feel free to build on the code snippet.

stitam commented 4 years ago

Just some ideas how others manage images: https://jcheminf.biomedcentral.com/articles/10.1186/s13321-019-0405-0. rckd::view.molecule.2d executes a java command which returns the image?

stitam commented 4 years ago

I am ok with removing tempdir(), write.csv() doesn't write to tempdir either. I would love to have images in memory but I am also a concerned about importing image processing methods, I think webchem is quite heavy already. We still have to option of keeping the images in raw format (the natural class for the content of a GET() or POST() request), providing some good examples, but otherwise and leaving it to the user to import image processing packages if they want to work with images.

require(httr)
query = c(
  "25057-89-0", "72-43-5", "640-15-3", "82-68-8", "50471-44-8", 
  "13457-18-6", "94-82-6", "93-76-5", "15972-60-8", "834-12-8", 
  "1912-24-9", "2642-71-9", "314-40-9", "1689-84-5", "470-90-6", 
  "1698-60-8", "1982-47-4", "2921-88-2", "15545-48-9", "21725-46-2", 
  "52315-07-8", "6190-65-4", "30125-63-4", "1007-28-9", "1014-69-3", 
  "333-41-5", "120-36-5", "62-73-7", "115-32-2", "83164-33-4", 
  "60-51-5", "298-04-4", "330-54-1", "38260-54-7", "122-14-5", 
  "93-72-1", "67564-91-4", "55-38-9", "96525-23-4", "51235-04-2"
)

img_l = list()
for (i in query) {
  # url
  qurl = paste("https://cactus.nci.nih.gov/chemical/structure",
               i,
               "image?format=png&width=700&height=700",
               sep = '/')
  # query
  Sys.sleep(1.5)
  res = GET(qurl,
            timeout(5))
  img = content(res, type = "image) ### this is the change.
  img_l[[i]] = img
  message('Processed: ', i, ' (stored in list).') # ~ 8KB / image: object.size(img)
}

object.size(img_l)
332192 bytes

#plot image
library(magick)
img <- image_read(img_l[[1]]) #object of class "magick-image"
plot(img)
andschar commented 4 years ago

Hm, that sounds convincing @stitam and would also greatly reduce the RAM footprint. So, something like this, where the user would only decide (1) whether to leave directory = NULL (default) and return the image as a raw vector OR (2) to provide a directory = "path" and have the images stored to disk. In the latter case, should we also return a data.frame with links?

require(httr)
require(dplyr)

query = c(
  "25057-89-0", "72-43-5", "640-15-3", "82-68-8", "50471-44-8", 
  "13457-18-6", "94-82-6", "93-76-5", "15972-60-8", "834-12-8", 
  "1912-24-9", "2642-71-9", "314-40-9", "1689-84-5", "470-90-6", 
  "1698-60-8", "1982-47-4", "2921-88-2", "15545-48-9", "21725-46-2", 
  "52315-07-8", "6190-65-4", "30125-63-4", "1007-28-9", "1014-69-3", 
  "333-41-5", "120-36-5", "62-73-7", "115-32-2", "83164-33-4", 
  "60-51-5", "298-04-4", "330-54-1", "38260-54-7", "122-14-5", 
  "93-72-1", "67564-91-4", "55-38-9", "96525-23-4", "51235-04-2"
)
# dummy function
fun_img <- function(query, directory = NULL) {
  foo <- function(query, directory) {
      qurl = paste("https://cactus.nci.nih.gov/chemical/structure",
                   query,
                   "image?format=png&width=700&height=700",
                   sep = '/')
    # query
    Sys.sleep(1.5)
    if (is.null(directory)) {
      message("Retrieving raw image: ", query)
      res <- GET(qurl,
                 timeout(5))
      content(res, type = "image")
    } else {
      message("Storing image: ", query, " to: ", directory)
      GET(qurl,
          timeout(5),
          write_disk(file.path(directory, paste0(query, ".png")), # depending on the webservice, more options here
                     overwrite = TRUE))
      data.frame(query = query,
                 qurl = qurl,
                 stringsAsFactors = FALSE)
    }
  }
  l <- lapply(query, foo, directory = directory)
  names(l) <- query
  if (is.null(directory)) {
    l
  } else {
    dplyr::bind_rows(l) 
  }
}

img_l = fun_img(query[1:2])
lapply(img_l, head)
img_df = fun_img(query[1:2], directory = tempdir())
img_df
Aariq commented 4 years ago

Good points about RAM. It looks like rcdk also uses a temporary directory with tempfile() for downloading molfiles, but it also returns an R object that you can plot with view.molecule.2d(). That's the thing---I don't think there's a point of downloading files to a temporary directory if the user can't easily do something with them in R. I don't think I know enough about raw format images to comment on that.

stitam commented 4 years ago

raw is not as universal as I thought, e.g. while image_read() can convert from raw ACToR images, it doesn't work with raw CACTUS images. png::readPNG() works with both web services, but the outputs are much larger than with image_read(). Maybe it's not a good idea to return raw afterall. BTW if we use png::readPNG() and then image_read() we can shrink the image size considerably;).

stitam commented 4 years ago

I think we should implement the functions as downloaders for now, because that works both with web services that provide urls, and with web services that don't, e.g. ChemSpider. We can always extend that functionality later.

Aariq commented 4 years ago

We should make sure @gjgetzinger is looped in on this decision, since it probably changes some of our suggestions on PR #235

gjgetzinger commented 4 years ago

I've been following the discussion. No objections from my side. I'm happy to make any adjustments to PR #235 as needed. (full disclosure, I am in the middle of a job change and relocation, so will be a minute before I make any changes)

stitam commented 4 years ago

Good luck with your new endeavor! I relocated during last August:)

andschar commented 4 years ago

I think we should implement the functions as downloaders for now, because that works both with web services that provide urls, and with web services that don't, e.g. ChemSpider. We can always extend that functionality later.

To sum up, we agree to download the images and return a data.frame with the paths to the downloaded images. If no directory is supplied they will be downloaded to tempdir(), the default. No raw images for now. Ok?

Argument examples (* mandatory ones):

Should we name the directory argument directory =, dir = or path? I prefer dir =.

Aariq commented 4 years ago

If the output is just file paths, I'd say return it as a vector, silently.

stitam commented 4 years ago

I'd also prefer data.frame because then we have query and result, similarly to get_(). I like the idea of silent return, but I have no idea how that is implemented so I'll peek at your PRs:) Didn't we agree that tempdir() is not good practice afterall? We can just leave dir as a compulsory argument. I like dir because of its brevity, also I think path is ambiguous as it can refer to a directory or a file.

andschar commented 4 years ago

I'd also prefer data.frame because then we have query and result, similarly to get_(). I like the idea of silent return, but I have no idea how that is implemented so I'll peek at your PRs:) Didn't we agree that tempdir() is not good practice afterall? We can just leave dir as a compulsory argument. I like dir because of its brevity, also I think path is ambiguous as it can refer to a directory or a file.

I also thin it's the best idea to have a mandatory dir = argument.

stitam commented 4 years ago

Do you think it is useful to return anything to the console, apart from the optional verbose messages? If we want to import the downloaded images into R later, we can easily construct the paths in the same order with paste0(dir, "/", csids, ".png"). I updated PR #253 for cs_img() by making dir mandatory and removing the tibble output. What do you think?

Aariq commented 4 years ago

Seems fine to me! Might be good to print the path and filename to the console with message() when verbose = TRUE, but I don't think it needs to output anything.