Closed mbojan closed 8 years ago
thanks for this @mbojan
Agree, would be good to allow writing to disk. Definitely interested in a PR
I'm not sure I'm sold on passing in any arbitrary function to do the writing though given that it's not likely to be clear to the user what data format they are passing to a write function, and so what function they'd need to use. Perhaps it makes more sense to have a few options for users? and document those so they are clear how to use them. E.g., looks like we have raw XML as text option, and parsed XML, we could have a set of functions as options for users
It would be nice to just write to disk in the GET
call using e.g., httr::write_disk()
, but we do need to grab the resumptiontoken from the result, so that seems to not be an option
Great.
General options for dumping that come to my mind:
as != "raw"
perhaps creating rds
files for each page (?)Options (1) and (3) seem rather straightforward.
For option (2) I am not sure whether it will be feasible because of plethora of possibilities how this could be done: different DB layouts, table and field names, writing semi-parsed results etc....
Perhaps the best option is to leave the possibility to inject any dumper function with a couple of predefined ones (e.g. like the ones above)?
As a start, I think we should do 1 and 3, then try to work in 2 later. Sound okay? we can open a separate issue for writing to a DB
Perhaps the best option is to leave the possibility to inject any dumper function with a couple of predefined ones (e.g. like the ones above)?
That sounds good.
Actually I need the DB functionality first. I will write a dumper which will be writing two fields: identifier and raw XML with the content. It is rather universal I guess and also an illustration how it could be done.
Any suggestions how to document it? while_oai
is not exported right now and dumpers do not require modifying other functions as arguments can be passed through ...
.
while_oai
? list_records
, list_identifiers
and so on? I like (3).
What do you think?
OK @sckott, I think I'm ready for a PR (branch dumpers
in my fork). However, I made two more modifications that might be useful. At this moment all three are in separate branches rebased onto your master
.
xml2::read_xml
that can catch some errors. In particular if there is an error about illegal characters in downloaded XML there is a restart that removes them. (I encountered this with working with my OAI-PMH service).handle_errors
(now as handle_errors2
) so that it takes parsed XML as input, not raw text. XML is alread parsed anyway so there is no need to do it twice.Ideally I would like to submit a PR of all three with dumpers rebased on top of (1) and (2). I need these anyway.
sorry, I was on vacation away from computer for a few days.
Any suggestions how to document it?
A combination of 2 and 3. Each function that can use these parameters should have documentation for those parameters. Also, create a separate man file to go into more details about the dumpers, and can link to from each of the main functions (e.g., list_records()
)
Ideally I would like to submit a PR of all three with dumpers rebased on top of (1) and (2). I need these anyway.
I prefer smaller PR's, so 1 and 2 separately or together, then the dumpers, would be ideal
Thanks for this pkg!
I need to harvest and process rather large requests and merging all results of a single request into one R object is a nono (not enough RAM). For that purpose a forked (https://github.com/mbojan/oai) and modified
while_oai
to call somedumper
function that might e.g. write results to a DB page-by-page (resumptionToken).Perhaps you are interested in such an enhancement? Suggestions welcome. I'm still working on it.