ropensci / oai

OAI-PMH R client
https://docs.ropensci.org/oai
Other
14 stars 4 forks source link

Handling big requests #9

Closed mbojan closed 8 years ago

mbojan commented 8 years ago

Thanks for this pkg!

I need to harvest and process rather large requests and merging all results of a single request into one R object is a nono (not enough RAM). For that purpose a forked (https://github.com/mbojan/oai) and modified while_oai to call some dumper function that might e.g. write results to a DB page-by-page (resumptionToken).

Perhaps you are interested in such an enhancement? Suggestions welcome. I'm still working on it.

sckott commented 8 years ago

thanks for this @mbojan

Agree, would be good to allow writing to disk. Definitely interested in a PR

I'm not sure I'm sold on passing in any arbitrary function to do the writing though given that it's not likely to be clear to the user what data format they are passing to a write function, and so what function they'd need to use. Perhaps it makes more sense to have a few options for users? and document those so they are clear how to use them. E.g., looks like we have raw XML as text option, and parsed XML, we could have a set of functions as options for users

It would be nice to just write to disk in the GET call using e.g., httr::write_disk(), but we do need to grab the resumptiontoken from the result, so that seems to not be an option

mbojan commented 8 years ago

Great.

General options for dumping that come to my mind:

  1. Writing raw XML text to file or a connection (http://www.rdocumentation.org/packages/base/functions/connections) in general
  2. Writing to a DB
  3. If as != "raw" perhaps creating rds files for each page (?)

Options (1) and (3) seem rather straightforward.

For option (2) I am not sure whether it will be feasible because of plethora of possibilities how this could be done: different DB layouts, table and field names, writing semi-parsed results etc....

Perhaps the best option is to leave the possibility to inject any dumper function with a couple of predefined ones (e.g. like the ones above)?

sckott commented 8 years ago

As a start, I think we should do 1 and 3, then try to work in 2 later. Sound okay? we can open a separate issue for writing to a DB

Perhaps the best option is to leave the possibility to inject any dumper function with a couple of predefined ones (e.g. like the ones above)?

That sounds good.

mbojan commented 8 years ago

Actually I need the DB functionality first. I will write a dumper which will be writing two fields: identifier and raw XML with the content. It is rather universal I guess and also an illustration how it could be done.

Any suggestions how to document it? while_oai is not exported right now and dumpers do not require modifying other functions as arguments can be passed through ....

  1. Should we export while_oai?
  2. Add dumper-related arguments to all list_records, list_identifiers and so on?
  3. Write dumpers in a separate file and explain there how they can be used?

I like (3).

What do you think?

mbojan commented 8 years ago

OK @sckott, I think I'm ready for a PR (branch dumpers in my fork). However, I made two more modifications that might be useful. At this moment all three are in separate branches rebased onto your master.

  1. Reading XMLs safely. I wrote a wrapper over xml2::read_xml that can catch some errors. In particular if there is an error about illegal characters in downloaded XML there is a restart that removes them. (I encountered this with working with my OAI-PMH service).
  2. See #10 . I slightly modified handle_errors (now as handle_errors2) so that it takes parsed XML as input, not raw text. XML is alread parsed anyway so there is no need to do it twice.

Ideally I would like to submit a PR of all three with dumpers rebased on top of (1) and (2). I need these anyway.

sckott commented 8 years ago

sorry, I was on vacation away from computer for a few days.

Any suggestions how to document it?

A combination of 2 and 3. Each function that can use these parameters should have documentation for those parameters. Also, create a separate man file to go into more details about the dumpers, and can link to from each of the main functions (e.g., list_records())

sckott commented 8 years ago

Ideally I would like to submit a PR of all three with dumpers rebased on top of (1) and (2). I need these anyway.

I prefer smaller PR's, so 1 and 2 separately or together, then the dumpers, would be ideal