Closed leonawicz closed 6 years ago
Greetings, Matt!
I am technically not a "reviewer" but one of the best "connectors" in the R univeRse connected your pkg with a quick hack of mine from earlier this year (`pubcrawl`).
First, congrats on the CRAN acceptance! You had to do the "roll up the sleeves" work to deal with file extraction to do that, while I lazily relied on the `archive` package (which is not on CRAN). Managing unarchiving is never fun code to write, so serious kudos for that. We took some rly similar approaches (which is kinda super cool IMO) and you went a few extra miles (so, again, kudos).
The only thing I rly cld suggest (besides a personal preference to remove some tidyverse dependencies) is to take a look at https://github.com/hrbrmstr/pubcrawl/blob/master/inst/xslt/justthetext.xslt as that contains a set of "cleaner" XSL template matchers garnered from spending way too much time dissecting various "readability" idioms. Yours has a good catch-all, and a "fun" (I have weird ideas of fun) exercise might be to extract tags from a large, or at least decently representative, random corpus of epubs to see which ones tend to be there, and have a secondary template that explicitly targets them. This may not be necessary at all given that epubs aren't as "evil" as general web pages, and a catch-all might be fine.
Again, great job! And, if there's anything (which is unlikely) you need from my pkg carve away (and no need for attribution since I just made mine due to a random request on Twitter). Also, if you need resources for the suggested corpus/xslt testing, def lemme know.
I will let the amazing rOpenSci folks go back to their regularly scheduled reviewing activities now :-)
Thanks a lot @hrbrmstr!
Thanks for your submission @leonawicz, I'll proceed with other checks now that the overlap with `pubcrawl` is clarified.
Thank you both! Much appreciated.
@hrbrmstr I overlooked `pubcrawl` somehow, but I had initially begun by looking at your `hgr` package, specifically the cleaner template here (looks the same as the one direct from `xslt`): https://github.com/hrbrmstr/hgr/blob/master/inst/xslt/justthetext.xslt
I sent you a couple emails about it on May 27/28 that might have fallen through the cracks but also the second email basically said to disregard the first because I'd made a mistake. Essentially it turned out that almost every line in the cleaner template had zero impact on my (non-representative) sample of epubs. At first I was surprised that almost the whole cleaner template wasn't doing anything, but then it made sense when I considered what many of the tags were named. I guess they don't show up in epubs really.
But it was not quite so simple as removing lines that had no impact. There were a small number of lines in the cleaner template that actually appeared to have a detrimental effect on what type of text was returned for some sections of an epub. I did the best I could from the personal sample I had (plus some project Gutenberg, but again, that is only representative of project Gutenberg as far as I can tell) to make a judicious selection for the filter that seemed to cause the least problems and returned text most ideally.
What I ended up with is what you see in my version in `inst/text.xml`. It's virtually empty, and this actually worked best in my tests. In fact it's so empty I wanted to remove it entirely so that I could drop another package dependency, `xslt`. But removing the call to the near-empty cleaner template resulted in epub text coming back still looking uncleaned, unfortunately.
Normally I'm partial to removing as many dependencies as possible, but in this case I favor retaining minimal tidyverse package imports. I did make `readr` optional via `Suggests`, using a call to `requireNamespace` to check for availability or fall back on base R's `readLines`. But I found it a real chore to do the rest without packages like `purrr`, `dplyr` and `tidyr`. More importantly, I wanted users to receive tidyverse tibbles as returned values, especially considering the horrid console print calls that could otherwise be made inadvertently on data frames containing millions of text characters from books. I didn't see a reason to try to reinvent those tidyverse wheels with custom print methods and whatnot. So I decided to retain functionality and presentation using familiar tidyverse tools that users reading book collections into R are very likely to be using as part of their workflow after reading in the book anyhow.
I was driving relatively blind with the cleaner template, but considering the best results I got were when using a virtually blank template, it'd be great to know if there is some way to remove the `xslt` dependency and do whatever it is doing with a small custom R function for text substitution. I don't actually know if this idea is realistic or more trouble than it's worth at the moment.
I like the idea of having a more representative sample of epubs from various publishers. I don't have any ideas on where to get this. There's also a sampling issue. Many epubs users may elect to read in may not have DRM but are still licensed books that can't be shared as part of a corpus. I don't have stats on this but I wouldn't be surprised if an available corpus of books from a variety of sources still won't represent the bulk of books most people encounter since it would be missing all the books most people purchase these days.
Even with a better non-representative sample, it's a painstaking process. You still have to kind of "check" (minimally) each book by eye to see if the filter did something unusual to a given book. Eventually I had to accept that as much as I'd prefer to have figured out all the gotchas first so that I could minimize subsequent GitHub issues being reported, I was going to need to release the package and wait to see what people report as strange.
Thanks so much again Bob and let me know if you have any more thoughts re: mine! :) Matt
Email & I have been bitter, bitter enemies of late. So it's more of a "wall" between me & it vs bits slipping through cracks :-)
I had an inkling that the style sheet transformer might be super aggressive for an epub (and epubs have, er, "diverse" XML formats, too), so your straightforward approach is likely much, much safer.
I was going to suggest using https://github.com/hrbrmstr/freebase (since it'd cover the `purrr` and some `dplyr` use cases in the pkg pretty well) but you'd still need `tidyr`. Plus, it's super likely that a majority of the pkg audience will be tidyverse users.
If it's OK with y'all I'll likely set the `pubcrawl` repo to an "archive" state and replace the README with a link to the rOpenSci'd version of `epubr` once the review is done. If so, pls @ me on the final review thread and I'll get'er done.
Again, rly nice work!
Thanks! :) I'll have to go back and look at the details when I have a chance, but I can probably remove `purrr` in a later package version by replacing it with some uglier code. I think `dplyr` is the heavier dependency, though, and unfortunately the one I particularly want to retain.
The diversity of formats is also why I could not be as aggressive as I would have liked with things like the optional attempts at chapter identification. To some degree that will have to be left to the user so they can adjust things on a case by case basis without `epubr` ruining their day. I was bummed to see that of all the epubs I looked at, go figure, the example Project Gutenberg book I included in the package has some of the most problematic formatting. I'm guessing it's probably really common, but so far it's the only book I've seen where the internal HTML files that mark the different book sections actually begin and end in the middle of chapters, as though they cared more about keeping each HTML file approximately the same size than about breaking the text into meaningful sections. So lame! :)
In a future version I would like to include a function that loads a book in the browser as a simple e-reader, using `shiny` perhaps, but after seeing the arbitrary section breaks in that book I had to put that thought on hold.
Thanks again for your help!
Nice to see such a constructive discussion even before the actual review process starts. Thanks again for chiming in @hrbrmstr
Btw my checks are still pending, but I'll probably recommend adding a link to https://github.com/ropenscilabs/gutenbergr in the README @leonawicz
@maelle Done, via latest commit. Thank you for the suggestion. It's under the Reference heading. Let me know if I should format it differently.
Thanks again for your submission @leonawicz!
The README states "Subsequent versions will include a more robust set of features.". Before I start looking for reviewers, could you
List what these features would be (e.g. as issues in your repo associated with a milestone)
Confirm you don't want to implement any of them before or during the review process.
We can put this submission on hold so that the reviewers look at a more finished product, which in my opinion would help you make the most of our onboarding process.
In any case, writing down what your ideas are will help lessen duplication of efforts, since reviewers often suggest enhancements.
Reviewers: @calebasaraba @suzanbaert Due date: 2018-07-02
@maelle Thanks so much!
I have removed that line from the readme. I do not have a clear concept at this time of specific enhancements I plan to include. I had tentatively thought about including a demo Shiny app that acts as a browser-based epub reader, but it may be quite a bit more complicated than I thought. It is not something I have considered in much detail yet.
I do not plan to add any specific functionality at this time. Pending any issues and feature requests users may submit later on via GitHub, that would probably give me a better sense of what kinds of additional functions might be helpful to other users. So far I have not received any user feedback so I do not plan to make any changes.
Thanks for your answer @leonawicz! I'll now look for reviewers!
It occurred to me that I've already been in the code and know a bit abt the data format if you do need a volunteer :-)
@hrbrmstr thanks a lot for volunteering! You've however reviewed another package recently so I asked other reviewers before seeing your message, sorry about that and thanks again for offering your expertise!
@leonawicz you can now add a peer-review badge to the README of your package
[![](https://badges.ropensci.org/222_status.svg)](https://github.com/ropensci/onboarding/issues/222)
@calebasaraba @suzanbaert thanks a lot for accepting to review this package! Your reviews are due on 2018-07-02. As a reminder here are links to the recently updated reviewing and packaging guides and to the review template.
Badge added. Thank you.
@calebasaraba @suzanbaert Friendly reminder that your reviews are due on Monday, 2018-07-02. Thank you!
Yes! I already did some experimentation and I have time set apart on Monday to finish the last few items. But first a beautiful summer Sunday.
Thanks, have a good Sunday!
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
The package includes all the following forms of documentation, with `URL`, `BugReports` and `Maintainer` (which may be autogenerated via `Authors@R`).

For packages co-submitting to JOSS:
- [ ] The package has an obvious research application according to JOSS's definition
The package contains a `paper.md` matching JOSS's requirements with:
- [ ] A short summary describing the high-level functionality of the software
- [ ] Authors: A list of authors with their affiliations
- [ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
- [ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).
Estimated hours spent reviewing: 4
First of all, I love the idea of the package and the ability to pass a vector of epubs. I was also quite amazed at the speed; I would have thought it would take longer to turn a few quite sizeable books into an R object. The functions also work as part of a pipe and play very nicely with dplyr.
A few suggestions I wrote down along the way:
Error messages
When passing an invalid file (the file doesn't exist or was misspelled), the error message is quite mysterious: `Error in if (grepl("<|>", x)) { : argument is of length zero In addition: Warning message: In utils::unzip(file, exdir = exdir) : error 1 in extracting from zip file`.
It might be worth adding a friendlier error message that just reads "file could not be found" or "file is not a valid epub file" or something similar?
Similarly, it seems that title is a mandatory field in the fields argument. When 'title' is not present, you get an error: `Error: Column 'title' must be a 1d atomic vector or a list. In addition: Warning message: Unknown or uninitialised column: 'title'`.
I would suggest either making 'title' a standard field that automatically gets added (like data) or making title unnecessary.
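The kind of up-front validation suggested here could look something like the following sketch (the helper name is hypothetical and not part of `epubr`):

```r
# Hypothetical input check that could run before utils::unzip() is attempted,
# replacing the mysterious downstream error with a clear message.
check_epub_file <- function(file) {
  if (!file.exists(file))
    stop("File not found: ", file, call. = FALSE)
  if (!grepl("\\.epub$", file, ignore.case = TRUE))
    stop("Not a valid epub file: ", file, call. = FALSE)
  invisible(file)
}
```

A missing file would then fail immediately with `File not found: ...` instead of an unzip warning followed by a `grepl` error.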
Function naming
Secondly, regarding the function names, `epub_meta` and `epub_unzip` are perfectly in line with rOpenSci's packaging guidelines, but the other two are not. I was wondering especially about `epub`; it would intuitively make more sense to name it `read_epub()`, which would make it more in line with nearly all other import functions in R that people will know about. An alternative could be `epub_import` or something similar?
Similarly with `first_nchar`: perhaps something like `epub_glimpse()`, or anything similar that starts with `epub_` and makes it clear you're getting a sneak peek at the file?
README documentation
Last but not least, though everything is in the documentation and R veterans probably will have no issues, I would make the documentation a bit more beginner-friendly.
One element in particular caught my eye: the instructions in the README to import an epub. It's probably silly, but when I first installed the package and tried to import an epub (I was in play mode and not very concentrated), I also did a `system.file()` call followed by `epub()`. I somehow thought I had to make a system file from my epub and then sort of unzip it. (Needless to say, I had never used the `system.file()` function before, so I wasn't entirely sure what it did...)
It took only a mysterious error and a bit more focus to realize my silly mistake, but I did wonder whether the README could just be a bit clearer. Perhaps something along these lines? A tiny change, but maybe a few frowns less from people first using the package?
```r
# importing an epub file
x <- epub("path/to/epubfile.epub")

# to use the example dracula.epub that is built in with epubr:
file <- system.file("dracula.epub", package = "epubr")
x <- epub(file)
```
Devtools checks
```
# results devtools::test()
test-lintr.R:4: warning: Package Style
incomplete final line found on 'E:/Rprojects/epubr/.lintr'

test-lintr.R:4: error: Package Style
Invalid DCF format.
Regular lines must have a tag.
Offending lines start with:
  inst/.lintr
1: lintr::expect_lint_free() at E:\Rprojects\epubr/tests/testthat/test-lintr.R:4
2: lint_package(...)
3: read_settings(path)
4: read.dcf(config_file, all = TRUE)
5: stop(gettextf("Invalid DCF format.\nRegular lines must have a tag.\nOffending lines start with:\n%s",
   paste0("  ", lines, collapse = "\n")), domain = NA)
```
Regarding `devtools::check()`, it complained about two packages I do not have installed on this PC, but there was also a warning regarding `qpdf`.
```
# results devtools::check()
Status: 1 WARNING, 1 NOTE

* checking package dependencies ... NOTE
Packages suggested but not available for checking: 'lintr' 'covr'

* checking for unstated dependencies in examples ... OK
WARNING
'qpdf' is needed for checks on size reduction of PDFs
```
One more thing: I originally planned to get a bit more into the code of the functions themselves, but I ran out of time. So my review above is from me as a user of the package, not a reviewer of the actual code.
Thanks a lot for your review @suzanbaert! And thanks for the disclaimer as well!
The package includes all the following forms of documentation, with `URL`, `BugReports` and `Maintainer` (which may be autogenerated via `Authors@R`).

Estimated hours spent reviewing: 6-8
I think this is a solid idea for a package and it accomplishes its main goal well: to provide users with functions to transform epub-formatted ebooks into neatly nested R dataframes that can be used for further textual analyses. Very well done!

I found that the package was easy to use and performed well at these actions using R 3.5 on Windows and Mac systems. I wrote down some of my thoughts below:
Function commenting
While working through the code in `epub.R`, I had a bit of a hard time parsing the multitude of `if` statements. I understand that they are necessary, but I was wondering if there was a stylistic way to structure them in the code to be easier on a new user sifting through the function details. Perhaps one way would be to change some of the single-line if statements into curly-braced statements with indentation, or to provide a few more in-line comments to explain what's happening. I mention this because, due to the heterogeneity of epub formats, I think a lot of your more adventurous users will likely be looking at your actual code while trying to learn more about epub processing and how they can tweak it to their own unique situation/format.
Function naming
It may make more sense to change the `epub()` function name to the `object_verb` style recommended by the rOpenSci packaging guidelines, something like `epub_import()`, or to switch its name with the internal `epub_read()`. Similarly, `first_nchar()` could be something like `epub_head()`.
Documentation note
I understand from your post above that you're trying to limit dependencies, and you've done an admirable job. I'm not sure if you avoided it for a particular reason, but I've found it easier to use `@importFrom` in `roxygen2` when documenting a particular function rather than remembering what packages I'm using throughout and adding them to the `DESCRIPTION` manually, which I think is how you did it here?
Chapter separation functionality
I didn't have any problems using the `epub()` or `epub_unzip()` functions on either Mac or Windows using R version 3.5.

I downloaded several epub files from Project Gutenberg and played around with loading them with the `epubr` package. In each instance, the metadata fields were easily readable and correctly formatted, but I could never really get the chapter separation to work very well, and I did not find the default item separation useful/meaningful. I understand how challenging it would be to get this sort of separation functionality to work across a wide variety of epub files, and you've put in effort to make it work at all -- but I wonder if it philosophically makes sense to make chapter and/or item separation the default behavior if it is not very generalizable (or, in the case of item, that useful?). Perhaps it would be better to have the default nested dataframe structured more like the output from `gutenbergr`, which seems to be a straightforward reproduction of the text present in the epub file?
Documentation / Vignettes
I like that you chose to include an epub file (Dracula) that was poorly formatted for demonstration purposes in your vignette, but it also might be instructive for your users to show an example where the chapter separation works smoothly, if possible (this would be especially true if you change the default chapter/item separation behavior and want to highlight this as optional functionality).
On the topic of chapter separation / funky formatting, you include the following paragraph that refers to some sort-of hidden functionality:
> These developmental arguments are currently undocumented, though they can be explored if you are inclined to read the package source code and pass additional arguments to `...`. They have been tested successfully on many e-books, but certainly not a representative sample of all e-books. The approaches these arguments use may also change before they are formally supported and explicitly added to a future version of the package.
I wonder if it is a necessary addition to the vignette, since without documentation or examples of test cases, it only serves to tantalize the reader! I noticed earlier in the submission issue you mentioned that you didn't have any concrete next steps for development, so perhaps some examples or explanation of these arguments could be one; alternatively, you could consider removing this reference from your documentation.
Thanks a lot for your review @calebasaraba!
@leonawicz now both reviews are in! As per our author's guide we ask that you respond to reviewers' comments within 2 weeks of the last-submitted review, but you may make updates to your package or respond at any time.
Two notes:
Regarding imports, I don't know what @calebasaraba meant since I haven't looked at the code as thoroughly as he did, but as a reminder we recommend using the `pkg::function` syntax. Cf http://r-pkgs.had.co.nz/namespace.html#imports It doesn't add packages to DESCRIPTION though. `usethis::use_package` allows one to add packages to DESCRIPTION more automatically, and this new package of Hadley Wickham's might help find all used dependencies.
If you rename the functions have a look at our guidance regarding "changing stuff in your package"
Hi all! Thank you all so much for your feedback and hard work! It is much appreciated. :) Some changes have been implemented already. Those and other replies below.
Agreed, thank you for catching that. I have made more informative error messages for missing files and non-epub files, for `epub`, `epub_meta`, and `epub_unzip`.
This was a big issue. Thank you again. I have made `title` automatic like `data`.

Context: title is important because it acts as the most likely/dependable field to uniquely identify books when `file` is a vector.

Of course, multiple books could have the same title, such as multiple editions of one book. Title is not guaranteed to be unique. However, it is the best bet. As such, I have made `title` the field that falls back on file names (the most likely unique identifier) when `title` is not a field in the metadata. `title` can also be passed as an additional argument (`...`) to `epub` so the user can override the name of the metadata entry containing title information. See the updated help doc for details.

I have not encountered any epubs yet lacking a title field, but I did my best to test this out by passing the additional `title` argument using non-existent field names.
I am open to renaming `first_nchar` to something like `epub_text`, `epub_head`, or `epub_glimpse`. Given the way it works, I am partial to `epub_head`. There could also be a tail function (though probably not much use), or a custom slice (could be useful), in addition to just looking at the first n characters. I think this makes `*head` the best suffix for the current function.
I'm not inclined to lengthen `epub` with an additional verb. In my view, `epub` is meaningful and clear, but also appropriate for such a small package with such a singular use. It's easier to view it as the master, over-arching package function. It also does not live alongside the extraneous functions like `epub_meta` and `epub_unzip` but wraps around them and encapsulates the whole package purpose. I don't think there is really anything else in such a small package to confuse `epub`'s use with.
The example using `system.file` is pretty standard, but I have added a comment above the line to make it clearer what it means.
I agree I probably did more harm than good by indicating that there is developmental functionality potentially accessible via `...` that I want to integrate explicitly later but which remains far too open to changing approaches at this time. I have removed the reference from the vignette.
I'm confused regarding this comment, sorry. I used `usethis` and explicit namespaces in the package. I used `@importFrom` only if necessary, e.g. `@importFrom magrittr %>%`.
I'm unclear about "did not find the default item separation useful / meaningful". Because chapter structure is generally unpredictable, the default is to attempt no chapter identification. `drop_sections` and `chapter_pattern` are both `NULL` unless the user says otherwise in the context of their known e-books.
The default is to simply dump each book section into its own row of the `data` data frame (you may mean this). My approach is to leave this structure as is, and users can alter it further however they want. For many books this is a very good format, because it gives a meaningful and useful breakout of sections. It is also in keeping with the general approach in `epubr` of not altering the file content (e.g., by automatic text cleaning steps that users can't prevent) other than to basically strip the XML tags. Things are made human readable but otherwise left close to the original table structure.
I think we should not pay much attention to Project Gutenberg (PG) texts, honestly. It is a convenient place to grab a public domain epub file for package testing and the ability to share the content/examples. In fact, this is why I included one in the package. However, my perspective is that PG is potentially highly unrepresentative of the e-books users will tend to load with `epubr`, as well as not a very realistic use case:
I have looked at books from a variety of publishers. I don't consider this to be representative of "all epubs" of course, but I did notice that none had formatting like the PG books I've looked at (which I admit is a far smaller sample). I have looked at a lot of licensed/copyrighted books, the type you have to purchase, from a range of publishers and years. I consider these much more representative of the types of e-books users will load with `epubr`. This was the motivation for the package.
Much of the public domain stuff can be obtained in multiple formats and pulled directly into R from the web. `epubr` is better for when you have local files, and likely nothing but the local files in this single format. For licensed/purchased books, users probably only have one copy in one format. It may not even be epub originally, but they may have locally converted it to epub (say, from .mobi after buying from Amazon, or something similar) using other software like Calibre. This is more fundamentally what `epubr` is for.
With the goal of loading a novel's text into R for a subsequent text analysis, when it comes to PG there is no reason at all to use `epubr`. To do so requires creating an unnecessary intermediary step of obtaining local epub files from PG and then loading them into R with `epubr`, but none of that is necessary. All of it can easily be skipped, for example by just using `gutenbergr`, if I'm not mistaken.
So overall, I think PG is not only not a good/gold standard for evaluation, but actually an exceptional edge case and not a representative use case for `epubr` or of epubs in general. Second, PG also has no need at all for `epubr` or any other type of intermediary epub format processing stage. It's hard to say what is most representative of "all books", or at least "all popular books", but I am pretty confident that PG is not it. I would say `epubr` is more for use cases where the files themselves, the users' collections of local epub files, represent the bottleneck of their entire analysis, in that they may be the sole source of the particular data a user has access to. Apart from specific research involving PG titles, the intersection of most users' e-book collections with the public domain works on PG is probably very small.
Thoughts?
```
test-lintr.R:4: warning: Package Style
incomplete final line found on 'E:/Rprojects/epubr/.lintr'
```
I think this can be ignored. It is possible that if you pull the repo down to some systems, the symlink stops being seen as a symlink (Windows?). On my local Windows machine it is fine (no warnings). Recreating the symlink can fix the problem on a local system. But this passes on Windows, Mac and Linux (Travis-CI, I think with warnings treated as errors). This file just links to the real lintr file in the `inst` directory. This setup may look weird, but it has to do with getting lintr to work with unit testing on systems like Travis-CI for package builds.
```
test-lintr.R:4: error: Package Style
Invalid DCF format.
Regular lines must have a tag.
Offending lines start with:
  inst/.lintr
1: lintr::expect_lint_free() at E:\Rprojects\epubr/tests/testthat/test-lintr.R:4
2: lint_package(...)
3: read_settings(path)
4: read.dcf(config_file, all = TRUE)
5: stop(gettextf("Invalid DCF format.\nRegular lines must have a tag.\nOffending lines start with:\n%s",
   paste0("  ", lines, collapse = "\n")), domain = NA)
```
I'm not sure if this is more of the same. I am unable to reproduce these errors locally or remotely.
"warning regarding qpdf."
It's been a long time since I installed qpdf, so I don't remember the exact complaint, but if I didn't have it installed then local package builds threw some kind of message or warning about it not being available.
Status: 1 WARNING, 1 NOTE
I think this is normal; you just need to have those packages on the system if you are building a package that uses them. I'm saying "I think" a lot, which is just to say: if anything I'm saying is not quite right, please feel free to jump in and set things on the right track.
```
* checking for unstated dependencies in examples ... OK
WARNING
'qpdf' is needed for checks on size reduction of PDFs
```
Yeah, I think this is the warning I get rid of by making sure qpdf is installed.
This is one area where it's possible it may have been premature for me to submit my package for review. I can't speak to what other downloaders of `epubr` are doing with it, but I myself have plenty of unofficial use cases in other projects that are dependent on the still-evolving/undocumented code inside some of the package functions. I can't remove it, but I also can't really say what it will look like later (or when I will get to it, since this is a personal project).

I published to CRAN because what was official/documented was ready to go. For the parts that are in the works, though, it's just too early to say exactly what they will look like, and trying to document them or give examples would be getting into the weeds, I think. Please let me know what you think. I can always resubmit in the future if the state of some of the internals is too problematic.
Thanks again so much! :) Please forgive me if I forgot to address some comments, or if I shouldn't have addressed both reviews in one reply, but I saw some overlap. It took a while to compile this (though probably not nearly as long as it took you to review the package code, and I am very grateful). Please let me know if I missed or confused anything.
It was a fun time taking a look at the `epubr` package, nice work! I tried to continue the exchange by addressing topics I thought were from my review:
Re: dependencies/import note
Feel free to ignore this comment of mine -- I have been using `@importFrom` as the standard documentation strategy in my own practice, but the `usethis` method both you and @maelle have mentioned is better. Always learning!
Re: function naming
Ultimately the naming of the `epub` function is up to you, but I like functions with verbs -- I think the `read_epub` suggestion by @suzanbaert was nice since there are so many import functions that follow that style in R packages already. However, I can agree to disagree about this stylistic issue!
Re: chapter separation / additional context
Thank you for giving the additional context as well as explaining the peculiarities of Project Gutenberg epub files. I think I understand more clearly what you envision the purpose of the `epubr` package to be, and that you think most users will not have a large number of Project Gutenberg style files. I suppose when I think about big text analyses I often think about what's easily available in the public domain, but there's no reason that must be the case.
I did notice that the default settings for `drop_sections` and `chapter_pattern` were `NULL`, but was, as you guessed, trying to refer to the way sections were split up into the rows of the nested data frame. It may have just been the small, unrepresentative sample of EPUB files I looked at, but I didn't find the way things were broken up to be very useful/meaningful -- instead I thought immediately about how I'd be collapsing or transforming them. Then, when I was exploring the `gutenbergr` package a little afterwards and saw the way its data frames were set up line by line, I thought that might be a more digestible format. However, I totally understand why you don't want to start excessively messing with the EPUB files and would rather just dump in the sections the way they are structured. I take your word that a lot of modern EPUB files have more digestible structures, as well.
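For what it's worth, collapsing or reshaping the nested rows is straightforward with tidyverse tools. The sketch below is only illustrative: it assumes `epubr`'s output has a list-column of per-section data frames named `data` (with columns such as `section` and `text`), and the file path is hypothetical.

```r
library(epubr)
library(tidyr)

# Hypothetical file path; any EPUB on disk would do.
book <- epub("path/to/book.epub")

# Flatten the nested per-section data into one row per section,
# alongside the book-level metadata columns.
sections <- unnest(book, data)

# From here it is easy to reshape further, e.g., one row per line of text:
# lines <- separate_rows(sections, text, sep = "\n")
```

This is roughly the line-by-line shape `gutenbergr` uses, derived from the nested default rather than imposed by the package.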
Re: function commenting/code
This last one is hardest for me, because one of the big reasons I like R is that when I'm interested in building off previous functionality from a package, I can look inside most functions to learn what they're doing. I imagine some number of `epubr` users would be interested in doing something along the same lines -- but maybe I'm over-emphasizing those edge cases. The difficulty level for understanding what's going on gets ramped up when there's a lot of undocumented development code inside of functions.
I tried to look for guidance regarding this in the rOpenSci Packaging Guide but couldn't find anything explicit. I think it's best practice to have pretty streamlined functions in the master branch and a development branch with all the in-progress code that's still evolving -- something to think about in the future, maybe. Ultimately, I don't think it's too big an issue, but it made looking through your code and trying to evaluate whether everything was working as intended a bit harder.
Thanks for your feedback :) I definitely should have used a separate branch for the more experimental features. It was difficult to anticipate, though, given the wide range of EPUB formatting, what would turn out to be common practices; it became clearer in retrospect. And once I had everything intermingled, the best option at the time seemed to be to at least move some features into being accessible only via `...`.
I know it's not ideal practice, but I have seen a lot of packages that make room for a `...` argument and, through several initial package versions, simply document `...` as not currently supported. I think with some caveat in the documentation (less strong or absolute than simply saying "not supported"), the users who do decide to look at the source code can be aware that segments which make use of `...` may be subject to change.
If you all think this approach will not suffice, please let me know. From my view it's the best option: it gives me time to make improvements when I can, without removing or substantially altering code between the current CRAN version and the next master branch CRAN release in a way that wouldn't actually make the next release any better.
Btw maybe relevant https://github.com/hadley/ellipsis
I think adding some caveats to the documentation is, as you suggest, the best option. With that, I consider my feedback addressed and have checked off my final approval above in my review.
Thanks. I have updated the documentation to indicate that, with the exception of the new optional `title` argument override now accepted via `...` (and documented in the help Details), `...` is currently developmental/unsupported.
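For a sense of what that looks like from the user side, here is a hedged sketch. The file path is hypothetical, and the exact semantics of the `title` override are whatever the package's help Details describe; this only illustrates the calling pattern of an argument passed through `...`.

```r
library(epubr)

# Default behavior: the title comes from the EPUB metadata.
x <- epub("path/to/book.epub")

# Hypothetical use of the documented `title` override, passed via `...`,
# to control which value is used for the title column.
y <- epub("path/to/book.epub", title = "some-other-title-field")
```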
I have also renamed `first_nchar` to `epub_head`. I began to make the alias and add the relevant warning to `first_nchar`, but then stopped and decided to simply rename the function, because I realized `first_nchar` had never been released. It was only added at the time of this review; it's not in the CRAN release or any official GitHub release.
Hi Matt,
I've downloaded the latest version from GitHub, and tried with latest changes.
Re: Error messages. Much clearer! Thanks for fixing that.
Re: Title automatically added. Again, thanks for fixing, it was a minor comment, but it just felt weird to not have flexibility when you thought you'd had some. I think this is clearer, thanks.
Re: epub_head. I like the new function name. Regarding `epub()` vs. `read_epub()` -- as @calebasaraba said, it's really up to you. I have no issue with it being the former given the scope of the package; it was just stylistic feedback, because it's nice as a user when similar functions have the same syntax regardless of which package they come from. No showstopper whatsoever, it's just something I thought about.
Re: types of EPUB. I actually tried about 5 other EPUBs that were not Gutenberg, and I had no issues. The text even came out cleaner than the Gutenberg ones -- I guess because Project Gutenberg adds those little lines. The only one I tried that came out impossibly messy was a technical EPUB book on PowerShell with a lot of code blocks. That one was destined to be really messy, and not really due to the package :wink:
The devtools notes and warnings did not seem to have any impact on the package properly working, so I only passed those on in case it was useful to you.
Thanks Suzan!
Yeah, I also tried a number of Star Trek books (related to my `rtrek` package) using `epubr`. Most were novels, but a few were more like manuals or reference books. They don't work as well (and some don't have what would meaningfully be called "chapters"), but I took nominal steps toward making books like those still somewhat parsable. `epubr` is definitely more interesting for popular literature. But that's just my personal opinion/biased interest.
I agree! I think people will mostly use it for popular literature, not so much technical literature. I wasn't expecting the PowerShell one to come out well, but I was curious :)
Also, just for more insight, I wanted to share an interesting test I did.
One book had not only chapters, marked by `ch\\d\\d`, but also "Part" 1, 2 and 3 sections dividing the chapters into three groups; those were marked in the metadata as `part\\d`. Each "Part" section had literature, just like the chapters did. Additionally, there were intro, epilogue, and a couple of other sections that still had the "real" book text in them.
I read the book with `epub` several times, each time adding a more inclusive regex pattern, e.g., using `"ch\\d\\d"`, then `"ch\\d\\d|part\\d"`, and so on, which demonstrates how easy it is to use the `chapter_pattern` argument to define what you want to distinguish from the rest of the book as "chapters". In this case, someone could decide they strictly want to single out the actual chapters, or they could fold in the part breaks and prologue/epilogue, since those technically include part of the actual story.
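The progression described above might look like the following sketch. The file path is hypothetical (the actual book is a licensed work), and the third, most inclusive pattern is my own illustrative extension, not one quoted from the test:

```r
library(epubr)

file <- "path/to/book.epub"  # hypothetical path to a licensed book

# Strictly the numbered chapters (section IDs like ch01, ch02, ...)
x1 <- epub(file, chapter_pattern = "ch\\d\\d")

# Fold in the "Part" breaks as well (part1, part2, part3)
x2 <- epub(file, chapter_pattern = "ch\\d\\d|part\\d")

# Hypothetical further step: also treat prologue/epilogue-style
# sections as chapters, since they contain real story text.
x3 <- epub(file, chapter_pattern = "ch\\d\\d|part\\d|logue")
```

Each successive pattern widens what counts as a "chapter" without touching the underlying file structure.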
It's difficult to show these nice examples when books are licensed works that can't be distributed.
Also, one thing I shared with @maelle on Slack that I think might not have been mentioned elsewhere: when you do perform chapter identification on books using a regex pattern, `epubr` is, in general, able to reorder the sections appropriately by chapter sequence. Many books I have seen, even when they have really nice metadata formatting, have their book sections occur in an arbitrary order, giving a scrambled appearance to the resulting `data` rows. So `epubr` can handle this elegantly in a lot of cases.
Other text parsing packages, like the more general `tika` package I saw recently, just dump contents into a single character string, and in such cases I found that the text was, of course, all out of order. `epubr`'s output structure allows text order to be maintained if important to the user, and chronological order can actually be established even when reading the raw file would otherwise load text in a more scrambled order.
@calebasaraba I see you've now checked the box "The author has responded to my review and made changes to my satisfaction. I recommend approving this package." in your review, perfect, thanks. @suzanbaert can you do that as well if you're happy with all changes?
I'll do last editor's checks next week at the latest. Thanks everyone for a very constructive process!
Done!
Approved! Thanks @leonawicz for submitting and thanks @calebasaraba @suzanbaert for your reviews!
To-dos:

* [![ropensci_footer](https://ropensci.org/public_images/ropensci_footer.png)](https://ropensci.org)
* [![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/ropensci/pkgname?branch=master&svg=true)](https://ci.appveyor.com/project/individualaccount/pkgname)
* Mention pubcrawl with a link to the repo (under the ropensci organization), since @hrbrmstr offered to archive his repo in this comment.

Should you want to acknowledge your reviewers in your package DESCRIPTION, you can do so by making them `"rev"`-type contributors in the `Authors@R` field (with their consent). More info on this here.
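For illustration only, a DESCRIPTION `Authors@R` field crediting reviewers might look like the sketch below; the reviewer names, email, and comments are placeholders, not the actual reviewers' details.

```r
# Hypothetical Authors@R field in DESCRIPTION; reviewer entries are placeholders.
Authors@R: c(
    person("Matthew", "Leonawicz", email = "someone@example.com",
           role = c("aut", "cre")),
    person("Jane", "Reviewer", role = "rev",
           comment = "Reviewed the package for rOpenSci"),
    person("John", "Reviewer", role = "rev",
           comment = "Reviewed the package for rOpenSci")
  )
```

The `"rev"` role is what marks a person as a reviewer; it should only be added with the reviewers' consent, as noted above.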
Welcome aboard! We'd also love a blog post about your package, either a short-form intro to it (https://ropensci.org/tech-notes/) or long-form post with more narrative about its development. (https://ropensci.org/blog/). If you are interested, @stefaniebutland will be in touch about content and timing.
Thank you! @maelle I've added the footer and the related packages section. I used the `@` syntax for authors you suggested, thinking that when it appears on GitHub it was intended to act as a link to each author's profile or something. But maybe you did not mean that. I'm thinking I should switch it to people's proper names then? There is already a link to each package mentioned.
I added the `codemeta`.
I'll do the transfer soon.
I'm not sure what the pull request to `pubcrawl` should entail; I have not done anything like this before regarding archiving packages. Also, aside from the AppVeyor line mentioned above, are the other substitutions as simple as swapping out my account name for `ropensci`? I will also update the URL in the package hex sticker logo.
Thanks!
Thanks!
No, for authors it means using the `Authors@R` field in DESCRIPTION; sorry it was unclear.
Update the thread once the transfer is done. Yes it should actually be that simple. There are two URLs to update in DESCRIPTION too (URL and BugReports fields).
Ok will do the PR once you've transferred the repo and tag you!
@leonawicz this link will give you many examples of blog posts by authors of onboarded packages so you can get an idea of the style and length: https://ropensci.org/tags/review/. Technotes are here: https://ropensci.org/technotes/.
Here are some technical and editorial guidelines for contributing a blog post: https://github.com/ropensci/roweb2#contributing-a-blog-post.
Happy to answer any questions.
The blog post discussion can continue, but closing this since all transfer steps have now been done!
@leonawicz In light of your tweet https://twitter.com/leonawicz/status/1064224408900820992 we'd welcome a technote on this update if you're interested. (thanks for the heads up @maelle) https://ropensci.org/technotes/
Summary
Extract, read and parse EPUB format e-book file archive metadata and book text into tidy data frames to prepare for subsequent text analysis.
https://github.com/leonawicz/epubr
Data extraction. This package focuses on importing EPUB file metadata and data into R in a useful format. It strips xml tags and formatting to focus on the readable, meaningful text. While future package versions will expand functionality around the edges, the core purpose will remain the data extraction described.
Data analysts or researchers performing text mining and other language analysis involving individual books or book collections in a typical, unrestricted EPUB file format.
I have not found other R packages that do this.
This package is already on CRAN (v0.4.0).
Requirements
Confirm each of the following by checking the box. This package:
Publication options

* `paper.md` matching JOSS's requirements with a high-level description in the package root or in `inst/`.

Detail
* [x] Does `R CMD check` (or `devtools::check()`) succeed? Paste and describe any errors or warnings:

* [x] Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:
To the best of my knowledge, though it's quite possible I missed something small or have something not exactly as required.
If this is a resubmission following rejection, please explain the change in circumstances:
If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names: