ropensci-archive / crminer

:no_entry: ARCHIVED :no_entry: Fetch 'Scholary' Full Text from 'Crossref'

crm_text - Specifying absolute path and file names for PDFs #27

Closed brunj7 closed 6 years ago

brunj7 commented 6 years ago

Hi @sckott,

Thank you for the great package! I was wondering if there is a way to specify an absolute path and file names for the PDFs downloaded by crm_text.

If not, would you recommend using crm_cache$list() to keep the link between the reference information and the downloaded filenames?

Thanks! Julien

sckott commented 6 years ago

hi @brunj7 yes, you can change the cache location. See the ?crm_cache manual page; in particular, crm_cache$cache_path_set() should let you set a different cache path. Let me know if that works

brunj7 commented 6 years ago

sorry @sckott , I should have mentioned that I tried crm_cache$cache_path_set(), but thought from its behavior that it was intentionally restricted to cache.

crm_cache$cache_path_get()
#[1] "/Users/brun/Library/Caches/R/crminer"

crm_cache$cache_path_set("/Users/brun/Desktop/soil_pdfs")
#[1] "/Users/brun/Library/Caches/R//Users/brun/Desktop/soil_pdfs"

crm_cache
#<hoard> 
#  path: /Users/brun/Desktop/soil_pdfs
#  cache path: /Users/brun/Library/Caches/R//Users/brun/Desktop/soil_pdfs

I do find the PDFs I download under /Users/brun/Library/Caches/R/Users/brun/Desktop/soil_pdfs, but no directory is created under the Desktop.

 devtools::session_info()
# Session info -------------------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.4.3 (2017-11-30)
 system   x86_64, darwin15.6.0        
 ui       RStudio (1.1.423)           
 language (EN)                        
 collate  en_US.UTF-8                 
 tz       America/Los_Angeles         
 date     2018-04-16       

Packages (extract)-----------------------------------------------------------------------------------------------------------
 package       * version   date       source    
 crminer       * 0.1.4     2017-08-12 CRAN (R 3.4.1)                                                                          
 curl          * 3.2       2018-03-28 cran (@3.2) 
 hoardr          0.2.0     2017-05-10 CRAN (R 3.4.0)

sckott commented 6 years ago

thanks for the further details.

here are the arguments you can pass to cache_path_set:

function (path, type = "user_cache_dir", prefix = "R")

These get put together like: type + prefix + path

type can accept a function, although it has to be given as a character string, e.g.

crm_cache$cache_path_set(path = "foobar", type = "function() '~/stuff'")
#> [1] "~/stuff/R/foobar"

so you could do

crm_cache$cache_path_set(path = "soil_pdfs", type = "function() '/Users/brun'", prefix = "Desktop")
#> [1] "/Users/brun/Desktop/soil_pdfs"

Let me know if that works

this of course could be easier, but i figured that most users wouldn't want to change the directory for caching files. if I get enough feedback otherwise, I could change this to make it easier (i.e. just pass in the full path as you tried above)
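
To make the "type + prefix + path" composition concrete, here is a minimal base R sketch. The helper `compose_cache_path()` and its `type_dir` argument are hypothetical names used only for illustration; this is not hoardr's actual code, just a plain concatenation that reproduces the paths shown in this thread.

```r
# Hypothetical helper (not part of hoardr or crminer): joins the three
# pieces in the order described above, type + prefix + path.
compose_cache_path <- function(path, type_dir, prefix = "R") {
  file.path(type_dir, prefix, path)
}

# An absolute `path` is appended as-is, which gives the nested location
# reported earlier in the thread.
nested <- compose_cache_path(
  path     = "/Users/brun/Desktop/soil_pdfs",
  type_dir = "/Users/brun/Library/Caches"
)
nested
#> [1] "/Users/brun/Library/Caches/R//Users/brun/Desktop/soil_pdfs"

# Splitting the target directory across the three pieces gives the
# intended location.
wanted <- compose_cache_path(
  path     = "soil_pdfs",
  type_dir = "/Users/brun",
  prefix   = "Desktop"
)
wanted
#> [1] "/Users/brun/Desktop/soil_pdfs"
```

The point of the sketch is only that `path` is never treated as a standalone location: whatever is passed there ends up after the other two pieces.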

brunj7 commented 6 years ago

Thanks @sckott it worked! Our main motivation to change the download location is to store the PDFs in a shared directory on a server... I do not know how often this will be relevant to crminer users' workflow, but for systematic review there are often many reviewers (or centralizing text mining ingestion). I am closing the issue. Thanks again for your help!

sckott commented 6 years ago

Thanks for more details on the use case.

brunj7 commented 6 years ago

Hi @sckott,

I have one more question about how crminer handles setting the cache. When I used a variable to set the path, it worked:

library(crminer)

#Set cache folder
outfilepath <- "~/Desktop/test_pdf"
# Get the parent directory
outfilepathfull <- dirname(outfilepath)
# Set the cache
crminer::crm_cache$cache_path_set(path = "", type = "function() outfilepathfull", prefix=basename(outfilepathfull))
#[1] "/Users/brun/Desktop/Desktop/"

However, as soon as I used this code inside a function, I got an error saying the variable was not defined:

iset_the_path <- function(cachepath){
  cachepathfull <- dirname(cachepath)
  crminer::crm_cache$cache_path_set(path = "", type = "function() cachepathfull", prefix=basename(cachepathfull))
}

iset_the_path(outfilepath)
#Error in eval(parse(text = type))() : object 'cachepathfull' not found 

Using the superassignment operator <<- to define the variable inside the function worked:

iset_the_path <- function(cachepath){
  cachepathfull <<- dirname(cachepath)
  crminer::crm_cache$cache_path_set(path = "", type = "function() cachepathfull", prefix=basename(cachepathfull))
}

iset_the_path(outfilepath)
#[1] "/Users/brun/Desktop/Desktop/"

but I was wondering if this is the best way to do this? Sorry, I am not too familiar with R6 objects and how the cache works in crminer.

Thank you for any help/insights! Julien

sckott commented 6 years ago

thanks for your question @brunj7

for now, it's not elegant, but i think something like this is easiest:

x <- crminer::crm_cache
iset_the_path <- function(path) x$.__enclos_env__$private$hoard_env$cache_path <- path
iset_the_path("~/Desktop/test_pdf")
x

I'll probably make a new method in hoardr to just set the full path directly. see https://github.com/ropensci/hoardr/issues/12
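
For anyone who would rather not touch private fields or use `<<-`, another option might be to write the value itself into the `type` string instead of a variable name. The sketch below assumes the string is evaluated with `eval(parse(text = type))()`, as the error message above suggests. `eval_type()`, `by_name()` and `by_value()` are hypothetical stand-ins written in base R, and none of this has been checked against crminer or hoardr.

```r
# Stand-in for the string evaluation shown in the error message above
# (hypothetical; not hoardr's actual source).
eval_type <- function(type) eval(parse(text = type))()

# Fails: the string only names a variable that is local to by_name(),
# so it is not visible where the string gets evaluated.
by_name <- function(cachepath) {
  cachepathfull <- dirname(cachepath)
  eval_type("function() cachepathfull")
}

# Works: deparse() writes the value into the string as a quoted literal,
# so nothing has to be looked up at evaluation time.
by_value <- function(cachepath) {
  cachepathfull <- dirname(cachepath)
  eval_type(sprintf("function() %s", deparse(cachepathfull)))
}

res_name <- tryCatch(
  by_name("/Users/brun/Desktop/test_pdf"),
  error = function(e) conditionMessage(e)
)
res_value <- by_value("/Users/brun/Desktop/test_pdf")
res_value
#> [1] "/Users/brun/Desktop"
```

The string built inside `by_value()` has the same shape as the `"function() '~/stuff'"` example earlier in the thread, so in principle it could be passed as `type` to `cache_path_set()` from inside a function.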

brunj7 commented 6 years ago

Ok, great; I'll use your suggestion. And +1 on a new method to set the path. Thank you for the great package(s)!

sckott commented 6 years ago

@brunj7 finally made the fix in hoardr, so you can now set the full cache path directly. It's going to CRAN soon