Closed stitam closed 4 years ago
I was about to open the same issue^^
img
. However, maybe it's better to turn the name around like in get_*()
to have such a naming convention img_actor()
. What do you think?query
from
dir
or directory
argument to allow the user to store the image at a specific placetmepdir()
b/c it's OS independantlocaiton
?format
(e.g. svg, png, jpeg)width
& height
verbose
argument at the endIUp to now, I haven't thought of the output
argument. Why should we want to retrieve only the paths and not directly store the image to disk?
There is also the CIR service which provides this functionality. I will include a function for it in PR #247.
I thought one option could be output = "download"
and then path
would be used to tell the function where to save, and the other could be output = "image"
and then path
wouldn't be used and the function would keep the images in R to be used for analysis/reporting. Maybe calling the argument dir
, as you suggested is less confusing than path
. Regardless, I am leaning towards no export functionality and leaving that to other functions. What do you think?
There is also the CIR service which provides this functionality. I will include a function for it in PR #247.
Though maybe it would be better to separate it into another PR.
I thought one option could be
output = "download"
and thenpath
would be used to tell the function where to save, and the other could beoutput = "image"
and thenpath
wouldn't be used and the function would keep the images in R to be used for analysis/reporting. Maybe calling the argumentdir
, as you suggested is less confusing thanpath
. Regardless, I am leaning towards no export functionality and leaving that to other functions. What do you think?
I would rather include the export functionality b/c the image is anyway created by the web service. We just have to call download.file
or curl::download_file()
then to store it somewhere to disk.
Hm, not sure about keeping images in R. They can probably also get quite big. I would rather store them to disk, and read them again, when needed, using a non-webchem function. E.g. knitr::include_graphics()
in an RMardown for example. This approach would also not include any new dependencies.
There is also the CIR service which provides this functionality. I will include a function for it in PR #247.
Though maybe it would be better to separate it into another PR.
Agreed.
* I agree with point one, to choose one and I also prefer `img`. However, maybe it's better to turn the name around like in `get_*()` to have such a naming convention `img_actor()`. What do you think?
What do you prefer? actor_img()
or img_actor()
?
* I agree with point one, to choose one and I also prefer `img`. However, maybe it's better to turn the name around like in `get_*()` to have such a naming convention `img_actor()`. What do you think?
What do you prefer?
actor_img()
orimg_actor()
?
I prefer actor_img()
because it is more consistent e.g. with etox_basic()
, pc_prop()
.
I see your point regarding image size. However, if you wanted to use images of compounds as descriptors for an ML problem (I don't know if this is a reasonable thing to do), then returning a vector of images and keeping them in R would be much simpler than importing them from hard drive.
* I agree with point one, to choose one and I also prefer `img`. However, maybe it's better to turn the name around like in `get_*()` to have such a naming convention `img_actor()`. What do you think?
What do you prefer?
actor_img()
orimg_actor()
?I prefer
actor_img()
because it is more consistent e.g. withetox_basic()
,pc_prop()
.
Ok, sounds also good to me.
I thought one option could be
output = "download"
and thenpath
would be used to tell the function where to save, and the other could beoutput = "image"
and thenpath
wouldn't be used and the function would keep the images in R to be used for analysis/reporting. Maybe calling the argumentdir
, as you suggested is less confusing thanpath
. Regardless, I am leaning towards no export functionality and leaving that to other functions. What do you think?
Some image types can't be represented as R objects and only a download/export option would make sense. E.g. I think in #235 one of the file options is a PDF.
I think they image functions should either return raster images, or in the case that they download image files, they should return the file paths (silently). That will make integration with things like knitr::include_graphics()
or pander::pandoc.image.return()
easier so we can make things like this:
library(webchem)
library(pander)
library(knitr)
id <- c("Glyphosate", "Isoproturon", "BSYNRYMUTXBXSQ-UHFFFAOYSA-N")
cir_img(id, bgcolor = "transparent", antialising = 0, width = 200, height = 200, symbolfontsize = 9, dir = here::here())
#> Querying: Glyphosate
#> https://cactus.nci.nih.gov/chemical/structure/Glyphosate/image?format=png&width=200&height=200&linewidth=2&symbolfontsize=9&bgcolor=transparent&csymbol=special&hsymbol=special
#> Image saved under: /private/var/folders/b_/2vfnxxls5vs401tmhhb3wqdh0000gp/T/RtmpH52OZ2/reprex9d0d4d96caea/Glyphosate.png
#> Querying: Isoproturon
#> https://cactus.nci.nih.gov/chemical/structure/Isoproturon/image?format=png&width=200&height=200&linewidth=2&symbolfontsize=9&bgcolor=transparent&csymbol=special&hsymbol=special
#> Image saved under: /private/var/folders/b_/2vfnxxls5vs401tmhhb3wqdh0000gp/T/RtmpH52OZ2/reprex9d0d4d96caea/Isoproturon.png
#> Querying: BSYNRYMUTXBXSQ-UHFFFAOYSA-N
#> https://cactus.nci.nih.gov/chemical/structure/BSYNRYMUTXBXSQ-UHFFFAOYSA-N/image?format=png&width=200&height=200&linewidth=2&symbolfontsize=9&bgcolor=transparent&csymbol=special&hsymbol=special
#> Image saved under: /private/var/folders/b_/2vfnxxls5vs401tmhhb3wqdh0000gp/T/RtmpH52OZ2/reprex9d0d4d96caea/BSYNRYMUTXBXSQ-UHFFFAOYSA-N.png
paths <- list.files(here::here(), pattern = "*.png", full.names = TRUE)
kable(data.frame(id = id, picture = pandoc.image.return(paths)))
id | picture |
---|---|
Glyphosate | |
Isoproturon | |
BSYNRYMUTXBXSQ-UHFFFAOYSA-N |
Created on 2020-05-06 by the reprex package (v0.3.0)
(pictures work when code is run locally with PR #250, but they don't show up in correct order. If cir_img()
returned file paths, this would work correctly and with less code.)
I see your point regarding image size. However, if you wanted to use images of compounds as descriptors for an ML problem (I don't know if this is a reasonable thing to do), then returning a vector of images and keeping them in R would be much simpler than importing them from hard drive.
I have never worked with images as objects in R. Can all the common image formats (.png, .jpg, .giv, .svg) be imported as image objects? What class is that? Are these the raster images @Aariq mentioned in the above comment?
Would I have to convert the images for that? Which function?
Sorry, for all the questions, I'm a bit puzzled.
I think they image functions should either return raster images, or in the case that they download image files, they should return the file paths (silently). That will make integration with things like
knitr::include_graphics()
orpander::pandoc.image.return()
easier so we can make things like this:library(webchem) library(pander) library(knitr) id <- c("Glyphosate", "Isoproturon", "BSYNRYMUTXBXSQ-UHFFFAOYSA-N") cir_img(id, bgcolor = "transparent", antialising = 0, width = 200, height = 200, symbolfontsize = 9, dir = here::here()) #> Querying: Glyphosate #> https://cactus.nci.nih.gov/chemical/structure/Glyphosate/image?format=png&width=200&height=200&linewidth=2&symbolfontsize=9&bgcolor=transparent&csymbol=special&hsymbol=special #> Image saved under: /private/var/folders/b_/2vfnxxls5vs401tmhhb3wqdh0000gp/T/RtmpH52OZ2/reprex9d0d4d96caea/Glyphosate.png #> Querying: Isoproturon #> https://cactus.nci.nih.gov/chemical/structure/Isoproturon/image?format=png&width=200&height=200&linewidth=2&symbolfontsize=9&bgcolor=transparent&csymbol=special&hsymbol=special #> Image saved under: /private/var/folders/b_/2vfnxxls5vs401tmhhb3wqdh0000gp/T/RtmpH52OZ2/reprex9d0d4d96caea/Isoproturon.png #> Querying: BSYNRYMUTXBXSQ-UHFFFAOYSA-N #> https://cactus.nci.nih.gov/chemical/structure/BSYNRYMUTXBXSQ-UHFFFAOYSA-N/image?format=png&width=200&height=200&linewidth=2&symbolfontsize=9&bgcolor=transparent&csymbol=special&hsymbol=special #> Image saved under: /private/var/folders/b_/2vfnxxls5vs401tmhhb3wqdh0000gp/T/RtmpH52OZ2/reprex9d0d4d96caea/BSYNRYMUTXBXSQ-UHFFFAOYSA-N.png paths <- list.files(here::here(), pattern = "*.png", full.names = TRUE) kable(data.frame(id = id, picture = pandoc.image.return(paths)))
id picture Glyphosate Isoproturon BSYNRYMUTXBXSQ-UHFFFAOYSA-N
Created on 2020-05-06 by the reprex package (v0.3.0)
(pictures work when code is run locally with PR #250, but they don't show up in correct order. If
cir_img()
returned file paths, this would work correctly and with less code.)
Thank @Aariq for the comment. I also like the idea of downloading the images and returning the paths. Just an idea: What about returning a data.frame with paths to local files and URLs?
I have updated PR #250 to return (1) a file written to disk and (2) a data.frame containing the local path and the URL. What do you think?
I have never worked with images as objects in R. Can all the common image formats (.png, .jpg, .giv, .svg) be imported as image objects? What class is that? Are these the raster images @Aariq mentioned in the above comment?
Some image types can't be represented as R objects and only a download/export option would make sense. E.g. I think in #235 one of the file options is a PDF.
Working with images as objects:
#works with .png, .svg, .jpeg, .pdf
library(magick)
library(httr)
url <- "https://actorws.epa.gov/actorws/chemical/image?smiles=CC%3DO&w=400&h=400&fmt=png"
cont <- content(GET(url), type = "image") #object of class "raw"
img <- image_read(cont) #object of class "magick-image"
raster <- as.raster(img) #object of class "raster"
plot(img)
plot(raster)
We can get to raw
without depending on another package but we cannot get to raster
I think. On the other hand, image processing packages work well with images on disk, so maybe it is simpler to just export, I rest my case for keeping image objects in R.
Just an idea: What about returning a data.frame with paths to local files and URLs?
I like the idea of returning a data.frame! I think the query column would also be useful similarly to the get_*()
functions. Are relative paths enough for subsequent functions or do we need absoIute paths? I don't know if including the urls in the output has a benefit, other webchem functions only print the url in verbose messages.
Ok, cool. I really think adding dependencies such as magick and raster (?) is a bit too much for webchem. Just a side note: In my experience, having everything in memory and not writing things to disc is a very "R" thing. Though I'm also not very proficient in other prog languages and this statement might not be TRUE^^
It's just very very polite:)
Ok, now that we agree to write to disk, what function should we use? As always in R, there are many possibilities. I prefer httr:: style, just because it is similar to other GET()
and POST()
functions in webchem. What about you?
qurl = "https://cactus.nci.nih.gov/chemical/structure/Triclosan/image?format=png&width=500&height=500&linewidth=2&symbolfontsize=16&csymbol=special&hsymbol=special"
# httr
require(httr)
h <- try(
GET(qurl,
timeout(5),
write_disk(tempfile(), overwrite = TRUE))
)
# base
download.file(qurl,
destfile = tempfile())
# curl
curl::curl_download(qurl,
destfile = tempfile())
I personally prefer httr, also because of the nice status messages that can be used for verbose output, but as long as it doesn't involve another dependency, I'd rather not search consensus here:)
I personally prefer httr, also because of the nice status messages that can be used for verbose output, but as long as it doesn't involve another dependency, I'd rather not search consensus here:)
Sounds like consensus to me :)
I think I've had a bit of change of heart on this issue.
I think to keep things simple, image functions could return png files as R objects (with png::readPNG()
) when no directory is specified and plot them by default (e.g. with grid::grid.raster()
). If no directory is specified, then you shouldn't be allowed to get other file types. If other file types are specified, then the user must supply a directory. In the case that a directory is specified, the function should return file paths silently. Why? I think saving files to to a temporary directory is less transparent than returning R objects and more useful. If you wanted to use them externally to R, then you'd provide a directory, and if you want to use them internally, then this skips writing them to file and then reading back in. WIth defaults, the image function provides a way to preview the image, and then, once the user is satisfied, they can provide a directory and a file type for download.
I think I've had a bit of change of heart on this issue.
I think to keep things simple, image functions could return png files as R objects (with
png::readPNG()
) when no directory is specified and plot them by default (e.g. withgrid::grid.raster()
). If no directory is specified, then you shouldn't be allowed to get other file types. If other file types are specified, then the user must supply a directory. In the case that a directory is specified, the function should return file paths silently. Why? I think saving files to to a temporary directory is less transparent than returning R objects and more useful. If you wanted to use them externally to R, then you'd provide a directory, and if you want to use them internally, then this skips writing them to file and then reading back in. WIth defaults, the image function provides a way to preview the image, and then, once the user is satisfied, they can provide a directory and a file type for download.
Thanks @Aariq for the comment!
Would you store multiple outputs in an R list or directly plot them then?
Agreeing:
My concerns:
if (type == "png" & is.null(dir))
).
tempdir()
might not be obvious to a novel user. One idea to prevent that, would be to throw an error (e.g. No directory supplied. Please provide one
), if no directory is supplied.# setup -------------------------------------------------------------------
require(httr)
query = c(
"25057-89-0", "72-43-5", "640-15-3", "82-68-8", "50471-44-8",
"13457-18-6", "94-82-6", "93-76-5", "15972-60-8", "834-12-8",
"1912-24-9", "2642-71-9", "314-40-9", "1689-84-5", "470-90-6",
"1698-60-8", "1982-47-4", "2921-88-2", "15545-48-9", "21725-46-2",
"52315-07-8", "6190-65-4", "30125-63-4", "1007-28-9", "1014-69-3",
"333-41-5", "120-36-5", "62-73-7", "115-32-2", "83164-33-4",
"60-51-5", "298-04-4", "330-54-1", "38260-54-7", "122-14-5",
"93-72-1", "67564-91-4", "55-38-9", "96525-23-4", "51235-04-2"
)
# Option 1: Store as R object ---------------------------------------------
img_l = list()
for (i in query) {
# url
qurl = paste("https://cactus.nci.nih.gov/chemical/structure",
i,
"image?format=png&width=700&height=700",
sep = '/')
# query
Sys.sleep(1.5)
res = GET(qurl,
timeout(5))
img = png::readPNG(res$content)
img_l[[i]] = img
message('Processed: ', i, ' (stored in list).') # ~ 8KB / image: object.size(img)
}
object.size(img_l) # already 470 MB in memory
# Option 2: Save local ----------------------------------------------------
for (i in query) {
# url
qurl = paste("https://cactus.nci.nih.gov/chemical/structure",
i,
"image?format=png&width=700&height=700",
sep = '/')
# query
Sys.sleep(1.5)
GET(qurl,
timeout(5),
write_disk(file.path(tempdir(), paste0(i, ".png")), overwrite = TRUE))
# ~ 8KB / image
message('Processed: ', i, ' (saved to disk).')
}
This reflects just my opinion. If you @Aariq and @stitam prefer to including images as R objects, I can also live with it :) Feel free to build on the code snippet.
Just some ideas how others manage images: https://jcheminf.biomedcentral.com/articles/10.1186/s13321-019-0405-0. rckd::view.molecule.2d
executes a java command which returns the image?
I am ok with removing tempdir()
, write.csv()
doesn't write to tempdir either. I would love to have images in memory but I am also a concerned about importing image processing methods, I think webchem is quite heavy already. We still have to option of keeping the images in raw format (the natural class for the content of a GET()
or POST()
request), providing some good examples, but otherwise and leaving it to the user to import image processing packages if they want to work with images.
require(httr)
query = c(
"25057-89-0", "72-43-5", "640-15-3", "82-68-8", "50471-44-8",
"13457-18-6", "94-82-6", "93-76-5", "15972-60-8", "834-12-8",
"1912-24-9", "2642-71-9", "314-40-9", "1689-84-5", "470-90-6",
"1698-60-8", "1982-47-4", "2921-88-2", "15545-48-9", "21725-46-2",
"52315-07-8", "6190-65-4", "30125-63-4", "1007-28-9", "1014-69-3",
"333-41-5", "120-36-5", "62-73-7", "115-32-2", "83164-33-4",
"60-51-5", "298-04-4", "330-54-1", "38260-54-7", "122-14-5",
"93-72-1", "67564-91-4", "55-38-9", "96525-23-4", "51235-04-2"
)
img_l = list()
for (i in query) {
# url
qurl = paste("https://cactus.nci.nih.gov/chemical/structure",
i,
"image?format=png&width=700&height=700",
sep = '/')
# query
Sys.sleep(1.5)
res = GET(qurl,
timeout(5))
img = content(res, type = "image) ### this is the change.
img_l[[i]] = img
message('Processed: ', i, ' (stored in list).') # ~ 8KB / image: object.size(img)
}
object.size(img_l)
332192 bytes
#plot image
library(magick)
img <- image_read(img_l[[1]]) #object of class "magick-image"
plot(img)
Hm, that sounds convincing @stitam and would also greatly reduce the RAM footprint.
So, something like this, where the user would only decide (1) whether to leave directory = NULL
(default) and return the image as a raw vector OR (2) to provide a directory = "path"
and have the images stored to disk. In the latter case, should we also return a data.frame with links?
require(httr)
require(dplyr)
query = c(
"25057-89-0", "72-43-5", "640-15-3", "82-68-8", "50471-44-8",
"13457-18-6", "94-82-6", "93-76-5", "15972-60-8", "834-12-8",
"1912-24-9", "2642-71-9", "314-40-9", "1689-84-5", "470-90-6",
"1698-60-8", "1982-47-4", "2921-88-2", "15545-48-9", "21725-46-2",
"52315-07-8", "6190-65-4", "30125-63-4", "1007-28-9", "1014-69-3",
"333-41-5", "120-36-5", "62-73-7", "115-32-2", "83164-33-4",
"60-51-5", "298-04-4", "330-54-1", "38260-54-7", "122-14-5",
"93-72-1", "67564-91-4", "55-38-9", "96525-23-4", "51235-04-2"
)
# dummy function
fun_img <- function(query, directory = NULL) {
foo <- function(query, directory) {
qurl = paste("https://cactus.nci.nih.gov/chemical/structure",
query,
"image?format=png&width=700&height=700",
sep = '/')
# query
Sys.sleep(1.5)
if (is.null(directory)) {
message("Retrieving raw image: ", query)
res <- GET(qurl,
timeout(5))
content(res, type = "image")
} else {
message("Storing image: ", query, " to: ", directory)
GET(qurl,
timeout(5),
write_disk(file.path(directory, paste0(query, ".png")), # depending on the webservice, more options here
overwrite = TRUE))
data.frame(query = query,
qurl = qurl,
stringsAsFactors = FALSE)
}
}
l <- lapply(query, foo, directory = directory)
names(l) <- query
if (is.null(directory)) {
l
} else {
dplyr::bind_rows(l)
}
}
img_l = fun_img(query[1:2])
lapply(img_l, head)
img_df = fun_img(query[1:2], directory = tempdir())
img_df
Good points about RAM. It looks like rcdk
also uses a temporary directory with tempfile()
for downloading molfiles, but it also returns an R object that you can plot with view.molecule.2d()
. That's the thing---I don't think there's a point of downloading files to a temporary directory if the user can't easily do something with them in R. I don't think I know enough about raw format images to comment on that.
raw
is not as universal as I thought, e.g. while image_read()
can convert from raw
ACToR images, it doesn't work with raw
CACTUS images. png::readPNG()
works with both web services, but the outputs are much larger than with image_read()
. Maybe it's not a good idea to return raw
afterall. BTW if we use png::readPNG()
and then image_read()
we can shrink the image size considerably;).
I think we should implement the functions as downloaders for now, because that works both with web services that provide urls, and with web services that don't, e.g. ChemSpider. We can always extend that functionality later.
We should make sure @gjgetzinger is looped in on this decision, since it probably changes some of our suggestions on PR #235
I've been following the discussion. No objections from my side. I'm happy to make any adjustments to PR #235 as needed. (full disclosure, I am in the middle of a job change and relocation, so will be a minute before I make any changes)
Good luck with your new endeavor! I relocated during last August:)
I think we should implement the functions as downloaders for now, because that works both with web services that provide urls, and with web services that don't, e.g. ChemSpider. We can always extend that functionality later.
To sum up, we agree to download the images and return a data.frame with the paths to the downloaded images. If no directory is supplied they will be downloaded to tempdir()
, the default. No raw images for now. Ok?
Argument examples (* mandatory ones):
query =
*from =
*dir =
*format =
width =
height =
any_other_argument =
verbose =
*Should we name the directory argument directory =
, dir =
or path
?
I prefer dir =
.
If the output is just file paths, I'd say return it as a vector, silently.
I'd also prefer data.frame because then we have query and result, similarly to get_()
. I like the idea of silent return, but I have no idea how that is implemented so I'll peek at your PRs:) Didn't we agree that tempdir()
is not good practice afterall? We can just leave dir
as a compulsory argument. I like dir
because of its brevity, also I think path
is ambiguous as it can refer to a directory or a file.
I'd also prefer data.frame because then we have query and result, similarly to
get_()
. I like the idea of silent return, but I have no idea how that is implemented so I'll peek at your PRs:) Didn't we agree thattempdir()
is not good practice afterall? We can just leavedir
as a compulsory argument. I likedir
because of its brevity, also I thinkpath
is ambiguous as it can refer to a directory or a file.
I also thin it's the best idea to have a mandatory dir =
argument.
Do you think it is useful to return anything to the console, apart from the optional verbose messages? If we want to import the downloaded images into R later, we can easily construct the paths in the same order with paste0(dir, "/", csids, ".png")
. I updated PR #253 for cs_img()
by making dir
mandatory and removing the tibble output. What do you think?
Seems fine to me! Might be good to print the path and filename to the console with message()
when verbose = TRUE
, but I don't think it needs to output anything.
Some web services allow us to retrieve images of substances. There are already implementations in PR #235 and PR #247. Also ChemSpider and PubChem both offer such functionality. I thought it would be good to open a small discussion about design considerations to ensure consistency.
*_view()
,*_viewer()
,_*img()
,*_image()
. Let's choose one and stick to it. I prefer*_img()
, as proposed by @andschar, e.g.actor_img()
,pc_img()
.query
,from
,format
,width
,height
,verbose
, in this order?output = c('image', 'download')
as suggested by @gjgetzinger, and another forpath
.On first thought, I think I would prefer these functions to return a list of raw character vectors. I tried png, svg, jpeg, these can all be accessed e.g. through
httr::content(httr::GET(url), type = "image")
. Raw vectors can then be rendered easily with image processing packages likemagick
, e.g.plot(as.raster(image_read(raw)))
and also exported in any format, but we wouldn't have to import them as dependencies, so webchem could remain lightweight.What do you think?