Closed leonawicz closed 6 years ago
Greetings, Matt!
I am technically not a "reviewer" but one of the best "connectors" in the R univeRse connected your pkg with a quick hack of mine from earlier this year (`pubcrawl`).
First, congrats on the CRAN acceptance! You had to do the "roll up the sleeves" work to deal with file extraction to do that, while I lazily relied on the `archive` package (which is not on CRAN). Managing unarchiving is never fun code to write, so serious kudos for that. We took some rly similar approaches (which is kinda super cool IMO) and you went a few extra miles (so, again, kudos).
The only thing I rly cld suggest (besides a personal preference to remove some tidyverse dependencies) is to take a look at https://github.com/hrbrmstr/pubcrawl/blob/master/inst/xslt/justthetext.xslt as that contains a set of "cleaner" XSL template matchers garnered from spending way too much time dissecting various "readability" idioms. Yours has a good catch-all, and a "fun" (I have weird ideas of fun) exercise might be to extract tags from a large, or at least decently representative, random corpus of epubs to see which ones tend to be there, and have a secondary template that explicitly targets them. This may not be necessary at all given that epubs aren't as "evil" as general web pages, and a catch-all might be fine.
Again, great job! And, if there's anything (which is unlikely) you need from my pkg carve away (and no need for attribution since I just made mine due to a random request on Twitter). Also, if you need resources for the suggested corpus/xslt testing, def lemme know.
I will let the amazing rOpenSci folks go back to their regularly scheduled reviewing activities now :-)
Thanks a lot @hrbrmstr!
Thanks for your submission @leonawicz, I'll proceed with other checks now that the overlap with `pubcrawl` is clarified.
Thank you both! Much appreciated.
@hrbrmstr I overlooked `pubcrawl` somehow, but I had initially begun by looking at your `hgr` package, specifically the cleaner template here (looks the same as the one direct from `xslt`): https://github.com/hrbrmstr/hgr/blob/master/inst/xslt/justthetext.xslt
I sent you a couple emails about it on May 27/28 that might have fallen through the cracks but also the second email basically said to disregard the first because I'd made a mistake. Essentially it turned out that almost every line in the cleaner template had zero impact on my (non-representative) sample of epubs. At first I was surprised that almost the whole cleaner template wasn't doing anything, but then it made sense when I considered what many of the tags were named. I guess they don't show up in epubs really.
But it was not quite so simple as removing lines that had no impact. There were a small number of lines in the cleaner template that actually appeared to have a detrimental effect on what type of text was returned for some sections of an epub. I did the best I could from the personal sample I had (plus some project Gutenberg, but again, that is only representative of project Gutenberg as far as I can tell) to make a judicious selection for the filter that seemed to cause the least problems and returned text most ideally.
What I ended up with is what you see in my version in `inst/text.xml`. It's virtually empty, and this actually worked best in my tests. In fact it's so empty I wanted to remove it entirely so that I could drop another package dependency, `xslt`. But removing the call to the near-empty cleaner template resulted in epub text coming back still looking uncleaned, unfortunately.
Normally I'm partial to removing as many dependencies as possible, but in this case I favor retaining minimal tidyverse package imports. I did make `readr` optional via `Suggests`, using a call to `requireNamespace` to check for availability or fall back on base R's `readLines`. But I found it a real chore to do the rest without packages like `purrr`, `dplyr` and `tidyr`. More importantly, I wanted users to receive tidyverse tibbles as returned values, especially considering the horrid console print calls that could otherwise be made inadvertently on data frames containing millions of text characters from books. I didn't see a reason to try to reinvent those tidyverse wheels with custom print methods and whatnot. So I decided to retain functionality and presentation using familiar tidyverse tools that users reading book collections into R are very likely to be using as part of their workflow after reading in the book anyhow.
I was driving relatively blind with the cleaner template, but considering the best results I got were when using a virtually blank template, it'd be great to know if there is some way to remove the `xslt` dependency and do whatever it is doing with a small custom R function for text substitution. I don't actually know if this idea is realistic or more trouble than it's worth at the moment.
I like the idea of having a more representative sample of epubs from various publishers. I don't have any ideas on where to get this. There's also a sampling issue. Many epubs users may elect to read in may not have DRM but are still licensed books that can't be shared as part of a corpus. I don't have stats on this but I wouldn't be surprised if an available corpus of books from a variety of sources still won't represent the bulk of books most people encounter since it would be missing all the books most people purchase these days.
Even with a better non-representative sample, it's a painstaking process. You still have to kind of "check" (minimally) each book by eye to see if the filter did something unusual to a given book. Eventually I had to accept that as much as I'd prefer to have figured out all the gotchas first so that I could minimize subsequent GitHub issues being reported, I was going to need to release the package and wait to see what people report as strange.
Thanks so much again Bob and let me know if you have any more thoughts re: mine! :) Matt
Email & I have been bitter, bitter enemies of late. So it's more of a "wall" between me & it vs bits slipping through cracks :-)
I had an inkling that the style sheet transformer might be super aggressive for an epub (and epubs have, er, "diverse" XML formats, too), so your straightforward approach is likely much, much safer.
I was going to suggest using https://github.com/hrbrmstr/freebase (since it'd cover the `purrr` and some `dplyr` use cases in the pkg pretty well) but you'd still need `tidyr`. Plus, it's super likely that a majority of the pkg audience will be tidyverse users.
If it's OK with y'all I'll likely set the `pubcrawl` repo to an "archive" state and replace the README with a link to the rOpenSci'd version of `epubr` once the review is done. If so, pls @ me on the final review thread and I'll get'er done.
Again, rly nice work!
Thanks! :) I'll have to go back and look at the details when I have a chance, but I can probably remove `purrr` in a later package version by replacing it with some uglier code. I think `dplyr` is the heavier dependency, though, and unfortunately the one I particularly want to retain.
The diversity of formats is also why I could not be as aggressive as I would have liked with things like the optional attempts at chapter identification. To some degree that will have to be left to the user so they can adjust things on a case by case basis without `epubr` ruining their day. I was bummed to see that of all the epubs I looked at, go figure, the example Project Gutenberg book I included in the package has some of the most problematic formatting. I'm guessing it's probably really common, but so far it's the only book I've seen where the internal HTML files that mark the different book sections actually begin and end in the middle of chapters, as though they cared more about keeping each HTML file approximately the same size than about breaking the text into meaningful sections. So lame! :)
In a future version I would like to include a function that loads a book in the browser as a simple e-reader, using `shiny` perhaps, but after seeing the arbitrary section breaks in that book I had to put that thought on hold.
Thanks again for your help!
Nice to see such a constructive discussion even before the actual review process starts. Thanks again for chiming in @hrbrmstr
Btw my checks are still pending, but I'll probably recommend adding a link to https://github.com/ropenscilabs/gutenbergr in the README @leonawicz
@maelle Done, via latest commit. Thank you for the suggestion. It's under the Reference heading. Let me know if I should format it differently.
Thanks again for your submission @leonawicz!
The README states "Subsequent versions will include a more robust set of features.". Before I start looking for reviewers, could you
List what these features would be (e.g. as issues in your repo associated with a milestone)
Confirm you don't want to implement any of them before or during the review process.
We can put this submission on hold so that the reviewers look at a more finished product, which in my opinion would help you make the most of our onboarding process.
In any case, writing down what your ideas are will help lessen duplication of efforts, since reviewers often suggest enhancements.
Reviewers: @calebasaraba @suzanbaert Due date: 2018-07-02
@maelle Thanks so much!
I have removed that line from the readme. I do not have a clear concept at this time of specific enhancements I plan to include. I had tentatively thought about including a demo Shiny app that acts as a browser-based epub reader, but it may be quite a bit more complicated than I thought. It is not something I have considered in much detail yet.
I do not plan to add any specific functionality at this time. Pending any issues and feature requests users may submit later on via GitHub, that would probably give me a better sense of what kinds of additional functions might be helpful to other users. So far I have not received any user feedback so I do not plan to make any changes.
Thanks for your answer @leonawicz! I'll now look for reviewers!
It occurred to me that I've already been in the code and know a bit abt the data format if you do need a volunteer :-)
@hrbrmstr thanks a lot for volunteering! You've however reviewed another package recently so I asked other reviewers before seeing your message, sorry about that and thanks again for offering your expertise!
@leonawicz you can now add a peer-review badge to the README of your package
[![](https://badges.ropensci.org/222_status.svg)](https://github.com/ropensci/onboarding/issues/222)
@calebasaraba @suzanbaert thanks a lot for accepting to review this package! Your reviews are due on 2018-07-02. As a reminder here are links to the recently updated reviewing and packaging guides and to the review template.
Badge added. Thank you.
@calebasaraba @suzanbaert Friendly reminder that your reviews are due on Monday, 2018-07-02. Thank you!
Yes! I already did some experimentation and I have time set apart on Monday to finish the last few items. But first a beautiful summer Sunday.
Thanks, have a good Sunday!
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
The package includes all the following forms of documentation, with `URL`, `BugReports` and `Maintainer` (which may be autogenerated via `Authors@R`).

For packages co-submitting to JOSS:
- [ ] The package has an obvious research application according to JOSS's definition
The package contains a `paper.md` matching JOSS's requirements with:
- [ ] A short summary describing the high-level functionality of the software
- [ ] Authors: A list of authors with their affiliations
- [ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
- [ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).
Estimated hours spent reviewing: 4
First of all, I love the idea of the package and the ability to pass a vector of epubs. I was also quite amazed at the speed; I would have thought it would take longer to turn a few quite sizeable books into an R object. The functions also work as part of a pipe and play very nicely with dplyr.
A few suggestions I wrote down along the way:
Error messages
When passing an invalid file (the file doesn't exist or was misspelled), the error message is quite mysterious: `Error in if (grepl("<|>", x)) { : argument is of length zero In addition: Warning message: In utils::unzip(file, exdir = exdir) : error 1 in extracting from zip file`.
It might be worth adding a friendlier error message that just reads "file could not be found" or "file is not a valid epub file" or something similar?
Similarly, it seems that title is a mandatory field in the fields argument. When 'title' is not present, you get an error: `Error: Column 'title' must be a 1d atomic vector or a list. In addition: Warning message: Unknown or uninitialised column: 'title'`.
I would suggest either making 'title' a standard field that automatically gets added (like data) or making title unnecessary.
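The kind of up-front validation suggested here could look something like the following sketch (the helper name is hypothetical and not part of `epubr`):

```r
# Hypothetical input check that could run before utils::unzip() is attempted,
# replacing the mysterious downstream error with a clear message.
check_epub_file <- function(file) {
  if (!file.exists(file))
    stop("File not found: ", file, call. = FALSE)
  if (!grepl("\\.epub$", file, ignore.case = TRUE))
    stop("Not a valid epub file: ", file, call. = FALSE)
  invisible(file)
}
```

A missing file would then fail immediately with `File not found: ...` instead of an unzip warning followed by a `grepl` error.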
Function naming
Secondly, regarding the function names, `epub_meta` and `epub_unzip` are perfectly in line with rOpenSci's packaging guidelines, but the other two are not. I was wondering especially about `epub`; it would intuitively make more sense to name it `read_epub()`, which would make it more in line with nearly all other import functions in R that people will know about. An alternative could be `epub_import` or something similar?
Similarly with `first_nchar`: perhaps something like `epub_glimpse()`, or anything similar that starts with `epub_` and makes it clear you're getting a sneak peek at the file?
README documentation
Last but not least, though everything is in the documentation and R veterans probably will have no issues, I would make the documentation a bit more beginner-friendly.
One element in particular caught my eye: the instructions in the README to import an epub. It's probably silly, but when I first installed the package and tried to import an epub (I was in play mode and not very concentrated), I also did a `system.file()` call followed by `epub()`. I somehow thought I had to make a system file from my epub and then sort of unzip it. (Needless to say, I had never used the `system.file()` function before, so I wasn't entirely sure what it did...)
It took only a mysterious error and a bit more focus to realize my silly mistake, but I did wonder whether the README could just be a bit clearer. Perhaps something along these lines? A tiny change, but maybe a few frowns less from people first using the package?
```r
# importing an epub file
x <- epub("path/to/epubfile.epub")

# to use the example dracula.epub that is built in with epubr:
file <- system.file("dracula.epub", package = "epubr")
x <- epub(file)
```
Devtools checks
```
# results devtools::test()
test-lintr.R:4: warning: Package Style
incomplete final line found on 'E:/Rprojects/epubr/.lintr'

test-lintr.R:4: error: Package Style
Invalid DCF format.
Regular lines must have a tag.
Offending lines start with:
  inst/.lintr
1: lintr::expect_lint_free() at E:\Rprojects\epubr/tests/testthat/test-lintr.R:4
2: lint_package(...)
3: read_settings(path)
4: read.dcf(config_file, all = TRUE)
5: stop(gettextf("Invalid DCF format.\nRegular lines must have a tag.\nOffending lines start with:\n%s",
   paste0("  ", lines, collapse = "\n")), domain = NA)
```
Regarding `devtools::check()`, it complained about two packages I do not have installed on this PC, but there was also a warning regarding `qpdf`.
```
# results devtools::check()
Status: 1 WARNING, 1 NOTE

* checking package dependencies ... NOTE
Packages suggested but not available for checking: 'lintr' 'covr'

* checking for unstated dependencies in examples ... OK
WARNING
'qpdf' is needed for checks on size reduction of PDFs
```
One more thing: I originally planned to get a bit more into the code of the functions themselves, but I ran out of time. So my review above is from me as a user of the package, not a reviewer of the actual code.
Thanks a lot for your review @suzanbaert! And thanks for the disclaimer as well!
The package includes all the following forms of documentation, with `URL`, `BugReports` and `Maintainer` (which may be autogenerated via `Authors@R`).

Estimated hours spent reviewing: 6-8
I think this is a solid idea for a package and it accomplishes its main goal well: to provide users with functions to transform epub-formatted ebooks into neatly nested R dataframes that can be used for further textual analyses. Very well done!

I found that the package was easy to use and performed well at these actions using R 3.5 on Windows and Mac systems. I wrote down some of my thoughts below:
Function commenting
While working through the code in `epub.R`, I had a bit of a hard time parsing the multitude of `if` statements. I understand that they are necessary, but I was wondering if there was a stylistic way to structure them in the code to be easier on a new user sifting through the function details. Perhaps one way would be to change some of the single-line if statements into curly-braced statements with indentation, or to provide a few more in-line comments to explain what's happening. I mention this because, due to the heterogeneity of epub formats, I think a lot of your more adventurous users will likely be looking at your actual code while trying to learn more about epub processing and how they can tweak it to their own unique situation/format.
Function naming
It may make more sense to change the `epub()` function name to the `object_verb` style recommended by the rOpenSci packaging guidelines, something like `epub_import()`, or to switch its name with the internal `epub_read()`. Similarly, `first_nchar()` could be something like `epub_head()`.
Documentation note
I understand from your post above that you're trying to limit dependencies, and you've done an admirable job. I'm not sure if you avoided it for a particular reason, but I've found it easier to use `@importFrom` in `roxygen2` when documenting a particular function rather than remembering what packages I'm using throughout and adding them to the `DESCRIPTION` manually, which I think is how you did it here?
Chapter separation functionality
I didn't have any problems using the `epub()` or `epub_unzip()` functions on either Mac or Windows using R version 3.5.

I downloaded several epub files from Project Gutenberg and played around with loading them with the `epubr` package. In each instance, the metadata fields were easily readable and correctly formatted, but I could never really get the chapter separation to work very well, and I did not find the default item separation useful/meaningful. I understand how challenging it would be to get this sort of separation functionality to work across a wide variety of epub files, and you've put in effort to make it work at all -- but I wonder if it philosophically makes sense to make chapter and/or item separation the default behavior if it is not very generalizable (or, in the case of item, that useful?). Perhaps it would be better to have the default nested dataframe structured more like the output from `gutenbergr`, which seems to be a straightforward reproduction of the text present in the epub file?
Documentation / Vignettes
I like that you chose to include an epub file (Dracula) that was poorly formatted for demonstration purposes in your vignette, but it also might be instructive for your users to show an example where the chapter separation works smoothly, if possible (this would be especially true if you change the default chapter/item separation behavior and want to highlight this as optional functionality).
On the topic of chapter separation / funky formatting, you include the following paragraph that refers to some sort-of hidden functionality:
> These developmental arguments are currently undocumented, though they can be explored if you are inclined to read the package source code and pass additional arguments to `...`. They have been tested successfully on many e-books, but certainly not a representative sample of all e-books. The approaches these arguments use may also change before they are formally supported and explicitly added to a future version of the package.
I wonder if it is a necessary addition to the vignette, since without documentation or examples of test cases, it only serves to tantalize the reader! I noticed earlier in the submission issue you mentioned that you didn't have any concrete next steps for development, so perhaps some examples or explanation of these arguments could be one; alternatively, you could consider removing this reference from your documentation.
Thanks a lot for your review @calebasaraba!
@leonawicz now both reviews are in! As per our author's guide we ask that you respond to reviewers' comments within 2 weeks of the last-submitted review, but you may make updates to your package or respond at any time.
Two notes:
Regarding imports, I don't know what @calebasaraba meant since I haven't looked at the code as thoroughly as he did, but as a reminder we recommend using the `pkg::function` syntax. Cf http://r-pkgs.had.co.nz/namespace.html#imports It doesn't add packages to DESCRIPTION though. `usethis::use_package` allows one to add packages to DESCRIPTION more automatically, and this new package of Hadley Wickham's might help find all used dependencies.
If you rename the functions have a look at our guidance regarding "changing stuff in your package"
Hi all! Thank you all so much for your feedback and hard work! It is much appreciated. :) Some changes have been implemented already. Those and other replies below.
Agreed, thank you for catching that. I have made more informative error messages for missing files and non-epub files, for `epub`, `epub_meta`, and `epub_unzip`.
This was a big issue. Thank you again. I have made `title` automatic like `data`.

Context: title is important because it acts as the most likely/dependable field to uniquely identify books when `file` is a vector.

Of course, multiple books could have the same title, such as multiple editions of one book. Title is not guaranteed to be unique. However, it is the best bet. As such, I have made `title` the field that falls back on file names (the most likely unique identifier) when `title` is not a field in the metadata. `title` can also be passed as an additional argument (`...`) to `epub` so the user can override the name of the metadata entry containing title information. See the updated help doc for details.

I have not encountered any epubs yet lacking a title field, but I did my best to test this out by passing the additional `title` argument using non-existent field names.
I am open to renaming `first_nchar` to something like `epub_text`, `epub_head`, or `epub_glimpse`. Given the way it works, I am partial to `epub_head`. There could also be a tail function (though probably not much use), or a custom slice (could be useful), in addition to just looking at the first n characters. I think this makes `*head` the best suffix for the current function.
I'm not inclined to lengthen `epub` with an additional verb. In my view, `epub` is meaningful and clear, but also appropriate for such a small package with such a singular use. It's easier to view it as the master, over-arching package function. It also does not live alongside the extraneous functions like `epub_meta` and `epub_unzip` but wraps around them and encapsulates the whole package purpose. I don't think there is really anything else in such a small package to confuse `epub`'s use with.
The example using `system.file` is pretty standard, but I have added a comment above the line to make it clearer what it means.
I agree I probably did more harm than good by indicating that there is developmental functionality potentially accessible via `...` that I want to integrate explicitly later but which remains far too open to changing approaches at this time. I have removed the reference from the vignette.
I'm confused regarding this comment, sorry. I used `usethis` and explicit namespaces in the package. I used `@importFrom` only if necessary, e.g. `@importFrom magrittr %>%`.
I'm unclear about "did not find the default item separation useful / meaningful". Because chapter structure is generally unpredictable, the default is to attempt no chapter identification. `drop_sections` and `chapter_pattern` are both `NULL` unless the user says otherwise in the context of their known e-books.
The default is to simply dump each book section into its own row of the `data` data frame (you may mean this). My approach is to leave this structure as is, and users can alter it further however they want. For many books this is a very good format, because it gives a meaningful and useful breakout of sections. It is also in keeping with the general approach in `epubr` of not altering the file content (e.g., by automatic text cleaning steps that users can't prevent) other than to basically strip the XML tags. Things are made human readable but otherwise left close to the original table structure.
I think we should not pay much attention to Project Gutenberg (PG) texts, honestly. It is a convenient place to grab a public domain epub file for package testing and the ability to share the content/examples. In fact, this is why I included one in the package. However, my perspective is that PG is potentially highly unrepresentative of the e-books users will tend to load with `epubr`, as well as not a very realistic use case:
I have looked at books from a variety of publishers. I don't consider this to be representative of "all epubs" of course, but I did notice that none had formatting like the PG books I've looked at (which I admit is a far smaller sample). I have looked at a lot of licensed/copyrighted books, the type you have to purchase, from a range of publishers and years. I consider these much more representative of the types of e-books users will load with `epubr`. This was the motivation for the package.
Much of the public domain stuff can be obtained in multiple formats and pulled directly into R from the web. `epubr` is better for when you have local files, and likely nothing but the local files in this single format. For licensed/purchased books, users probably only have one copy in one format. It may not even be epub originally, but they may have locally converted it to epub (say, from .mobi after buying from Amazon, or something similar) using other software like Calibre. This is more fundamentally what `epubr` is for.
With the goal of loading a novel's text into R for a subsequent text analysis, when it comes to PG there is no reason at all to use `epubr`. To do so requires creating an unnecessary intermediary step of obtaining local epub files from PG and then loading them into R with `epubr`, but none of that is necessary. All of it can easily be skipped, for example by just using `gutenbergr`, if I'm not mistaken.
So overall, I think PG is not only not a good/gold standard for evaluation, but actually an exceptional edge case and not a representative use case for `epubr` or of epubs in general. Second, PG also has no need at all for `epubr` or any other type of intermediary epub format processing stage. It's hard to say what is most representative of "all books", or at least "all popular books", but I am pretty confident that PG is not it. I would say `epubr` is more for use cases where the files themselves, the users' collections of local epub files, represent the bottleneck of their entire analysis, in that they may be the sole source of the particular data a user has access to. Apart from specific research involving PG titles, the intersection of most users' e-book collections with the public domain works on PG is probably very small.
Thoughts?
```
test-lintr.R:4: warning: Package Style
incomplete final line found on 'E:/Rprojects/epubr/.lintr'
```
I think this can be ignored. It is possible that if you pull the repo down to some systems, the symlink stops being seen as a symlink (Windows?). On my local Windows machine it is fine (no warnings). Recreating the symlink can fix the problem on a local system. But this passes on Windows, Mac and Linux (Travis-CI, I think with warnings treated as errors). This file just links to the real lintr file in the `inst` directory. This setup may look weird, but it has to do with getting lintr to work with unit testing on systems like Travis-CI for package builds.
```
test-lintr.R:4: error: Package Style
Invalid DCF format.
Regular lines must have a tag.
Offending lines start with:
  inst/.lintr
1: lintr::expect_lint_free() at E:\Rprojects\epubr/tests/testthat/test-lintr.R:4
2: lint_package(...)
3: read_settings(path)
4: read.dcf(config_file, all = TRUE)
5: stop(gettextf("Invalid DCF format.\nRegular lines must have a tag.\nOffending lines start with:\n%s",
   paste0("  ", lines, collapse = "\n")), domain = NA)
```
I'm not sure if this is more of the same. I am unable to reproduce these errors locally or remotely.
"warning regarding qpdf."
It's been a long time since I installed qpdf, so I don't remember the exact complaint, but if I didn't have it installed then local package builds threw some kind of message or warning about it not being available.
Status: 1 WARNING, 1 NOTE
I think this is normal; you just need to have those packages on the system if you are building a package that uses them. I'm saying "I think" a lot, which is just to say: if anything I'm saying is not quite right, please feel free to jump in and set things on the right track.
```
* checking for unstated dependencies in examples ... OK
WARNING
'qpdf' is needed for checks on size reduction of PDFs
```
Yeah, I think this is the warning I get rid of by making sure qpdf is installed.
This is one area where it's possible it may have been premature for me to submit my package for review. I can't speak to what other downloaders of `epubr` are doing with it, but I myself have plenty of unofficial use cases in other projects that are dependent on the still-evolving/undocumented code inside some of the package functions. I can't remove it, but I also can't really say what it will look like later (or when I will get to it, since this is a personal project).

I published to CRAN because what was official/documented was ready to go. For the parts that are in the works, though, it's just too early to say exactly what they will look like, and trying to document them or give examples would be getting into the weeds, I think. Please let me know what you think. I can always resubmit in the future if the state of some of the internals is too problematic.
Thanks again so much! :) Please forgive me if I forgot to address some comments, or if I shouldn't have addressed both reviews in one reply, but I saw some overlap. It took a while to compile this (though probably not nearly as long as it took you to review the package code, and I am very grateful). Please let me know if I missed or confused anything.
It was a fun time taking a look at the `epubr` package, nice work! I tried to continue the exchange by addressing topics I thought were from my review:
Re: dependencies/import note
Feel free to ignore this comment of mine -- I have been using `@importFrom` as the standard documentation strategy in my own practice, but the `usethis` method both you and @maelle have mentioned is better. Always learning!
Re: function naming
Ultimately the naming of the `epub` function is up to you, but I like functions with verbs -- I think the `read_epub` suggestion by @suzanbaert was nice since there are so many import functions that follow that style in R packages already. However, I can agree to disagree about this stylistic issue!
Re: chapter separation / additional context
Thank you for giving the additional context as well as explaining the peculiarities of Project Gutenberg epub files. I think I understand more clearly what you envision the purpose of the `epubr` package to be, and that you think most users will not have a large number of Project Gutenberg style files. I suppose when I think about big text analyses I often think about what's easily available in the public domain, but there's no reason that must be the case.
I did notice that the default settings for `drop_sections` and `chapter_pattern` were `NULL`, but was, as you guessed, trying to refer to the way sections were split up into the rows of the nested data frame. It may have just been the small, unrepresentative sample of EPUB files I looked at, but I didn't find the way things were broken up to be very useful/meaningful -- instead I thought immediately about how I'd be collapsing or transforming them. Then, when I was exploring the `gutenbergr` package a little afterwards and saw the way its data frames were set up line by line, I thought that might be a more digestible format. However, I totally understand why you don't want to start excessively messing with the EPUB files and would rather just dump in the sections the way they are structured. I take your word that a lot of modern EPUB files have more digestible structures, as well.
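For what it's worth, collapsing or reshaping the nested rows is straightforward with tidyverse tools. The sketch below is only illustrative: it assumes `epubr`'s output has a list-column of per-section data frames named `data` (with columns such as `section` and `text`), and the file path is hypothetical.

```r
library(epubr)
library(tidyr)

# Hypothetical file path; any EPUB on disk would do.
book <- epub("path/to/book.epub")

# Flatten the nested per-section data into one row per section,
# alongside the book-level metadata columns.
sections <- unnest(book, data)

# From here it is easy to reshape further, e.g., one row per line of text:
# lines <- separate_rows(sections, text, sep = "\n")
```

This is roughly the line-by-line shape `gutenbergr` uses, derived from the nested default rather than imposed by the package.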
Re: function commenting/code
This last one is hardest for me, because one of the big reasons I like R is that when I'm interested in building off previous functionality from a package, I can look inside most functions to learn what they're doing. I imagine some number of `epubr` users would be interested in doing something along the same lines -- but maybe I'm over-emphasizing those edge cases. The difficulty level for understanding what's going on gets ramped up when there's a lot of undocumented development code inside of functions.
I tried to look for guidance regarding this in the rOpenSci Packaging Guide but couldn't find anything explicit. I think it's best practice to have pretty streamlined functions in the master branch and a development branch with all the in-progress code that's still evolving -- something to think about in the future, maybe. Ultimately, I don't think it's too big an issue, but it made looking through your code and trying to evaluate whether everything was working as intended a bit harder.
Thanks for your feedback :) I definitely should have used a separate branch for the more experimental features. It was difficult to anticipate, though, given the wide range of EPUB formatting, what would turn out to be common practices; it became clearer in retrospect. And once I had everything intermingled, the best option at the time seemed to be to at least move some features into being accessible only via `...`.
I know it's not ideal practice, but I have seen a lot of packages that make room for a `...` argument and, through several initial package versions, simply document `...` as not currently supported. I think with some caveat in the documentation (less strong or absolute than simply saying "not supported"), the users who do decide to look at the source code can be aware that segments which make use of `...` may be subject to change.
If you all think this approach will not suffice, please let me know. From my view it's the best option: it gives me time to make improvements when I can, without removing or substantially altering code between the current CRAN version and the next master branch CRAN release in a way that wouldn't actually make the next release any better.
Btw maybe relevant https://github.com/hadley/ellipsis
I think adding some caveats to the documentation is, as you suggest, the best option. With that, I consider my feedback addressed and have checked off my final approval above in my review.
Thanks. I have updated the documentation to indicate that, with the exception of the new optional `title` argument override now accepted via `...` (and documented in the help Details), `...` is currently developmental/unsupported.
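For a sense of what that looks like from the user side, here is a hedged sketch. The file path is hypothetical, and the exact semantics of the `title` override are whatever the package's help Details describe; this only illustrates the calling pattern of an argument passed through `...`.

```r
library(epubr)

# Default behavior: the title comes from the EPUB metadata.
x <- epub("path/to/book.epub")

# Hypothetical use of the documented `title` override, passed via `...`,
# to control which value is used for the title column.
y <- epub("path/to/book.epub", title = "some-other-title-field")
```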
I have also renamed `first_nchar` to `epub_head`. I began to make the alias and add the relevant warning to `first_nchar`, but then stopped and decided to simply rename the function, because I realized `first_nchar` had never been released. It was only added at the time of this review; it's not in the CRAN release or any official GitHub release.
Hi Matt,
I've downloaded the latest version from GitHub, and tried with latest changes.
Re: Error messages. Much clearer! Thanks for fixing that.
Re: Title automatically added. Again, thanks for fixing, it was a minor comment, but it just felt weird to not have flexibility when you thought you'd had some. I think this is clearer, thanks.
Re: epub_head. I like the new function name. Regarding `epub()` vs. `read_epub()` -- as @calebasaraba said, it's really up to you. I have no issue with it being the former given the scope of the package; it was just stylistic feedback, because it's nice as a user when similar functions have the same syntax regardless of which package they come from. No showstopper whatsoever, it's just something I thought about.
Re: types of EPUB. I actually tried about 5 other EPUBs that were not Gutenberg, and I had no issues. The text even came out cleaner than the Gutenberg ones -- I guess because Project Gutenberg adds those little lines. The only one I tried that came out impossibly messy was a technical EPUB book on PowerShell with a lot of code blocks. That one was destined to be really messy, and not really due to the package :wink:
The devtools notes and warnings did not seem to have any impact on the package properly working, so I only passed those on in case it was useful to you.
Thanks Suzan!
Yeah, I also tried a number of Star Trek books (related to my `rtrek` package) using `epubr`. Most were novels, but a few were more like manuals or reference books. They don't work as well (and some don't have what would meaningfully be called "chapters"), but I took nominal steps toward making books like those still somewhat parsable. `epubr` is definitely more interesting for popular literature. But that's just my personal opinion/biased interest.
I agree! I think people will mostly use it for popular literature, not so much technical literature. I wasn't expecting the PowerShell one to come out well, but I was curious :)
Also, just for more insight, I wanted to share an interesting test I did.
One book had not only chapters, marked by `ch\\d\\d`, but also "Part" 1, 2 and 3 sections dividing the chapters into three groups; those were marked in the metadata as `part\\d`. Each "Part" section had literature, just like the chapters did. Additionally, there were intro, epilogue, and a couple of other sections that still had the "real" book text in them.
I read the book with `epub` several times, each time adding a more inclusive regex pattern, e.g., using `"ch\\d\\d"`, then `"ch\\d\\d|part\\d"`, and so on, which demonstrates how easy it is to use the `chapter_pattern` argument to define what you want to distinguish from the rest of the book as "chapters". In this case, someone could decide they strictly want to single out the actual chapters, or they could fold in the part breaks and prologue/epilogue, since those technically include part of the actual story.
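The progression described above might look like the following sketch. The file path is hypothetical (the actual book is a licensed work), and the third, most inclusive pattern is my own illustrative extension, not one quoted from the test:

```r
library(epubr)

file <- "path/to/book.epub"  # hypothetical path to a licensed book

# Strictly the numbered chapters (section IDs like ch01, ch02, ...)
x1 <- epub(file, chapter_pattern = "ch\\d\\d")

# Fold in the "Part" breaks as well (part1, part2, part3)
x2 <- epub(file, chapter_pattern = "ch\\d\\d|part\\d")

# Hypothetical further step: also treat prologue/epilogue-style
# sections as chapters, since they contain real story text.
x3 <- epub(file, chapter_pattern = "ch\\d\\d|part\\d|logue")
```

Each successive pattern widens what counts as a "chapter" without touching the underlying file structure.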
It's difficult to show these nice examples when books are licensed works that can't be distributed.
Also, one thing I shared with @maelle on Slack that I think might not have been mentioned elsewhere: when you do perform chapter identification on books using a regex pattern, `epubr` is, in general, able to reorder the sections appropriately by chapter sequence. Many books I have seen, even when they have really nice metadata formatting, have their book sections occur in an arbitrary order, giving a scrambled appearance to the resulting `data` rows. So `epubr` can handle this elegantly in a lot of cases.
Other text parsing packages, like the more general `tika` package I saw recently, just dump contents into a single character string, and in such cases I found that the text was, of course, all out of order. `epubr`'s output structure allows text order to be maintained if important to the user, and chronological order can actually be established even when reading the raw file would otherwise load text in a more scrambled order.
@calebasaraba I see you've now checked the box "The author has responded to my review and made changes to my satisfaction. I recommend approving this package." in your review, perfect, thanks. @suzanbaert can you do that as well if you're happy with all changes?
I'll do last editor's checks next week at the latest. Thanks everyone for a very constructive process!
Done!
Approved! Thanks @leonawicz for submitting and thanks @calebasaraba @suzanbaert for your reviews!
To-dos:

* [![ropensci_footer](https://ropensci.org/public_images/ropensci_footer.png)](https://ropensci.org)
* [![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/ropensci/pkgname?branch=master&svg=true)](https://ci.appveyor.com/project/individualaccount/pkgname)
* Mention pubcrawl with a link to the repo (under the ropensci organization), since @hrbrmstr offered to archive his repo in this comment.

Should you want to acknowledge your reviewers in your package DESCRIPTION, you can do so by making them `"rev"`-type contributors in the `Authors@R` field (with their consent). More info on this here.
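For illustration only, a DESCRIPTION `Authors@R` field crediting reviewers might look like the sketch below; the reviewer names, email, and comments are placeholders, not the actual reviewers' details.

```r
# Hypothetical Authors@R field in DESCRIPTION; reviewer entries are placeholders.
Authors@R: c(
    person("Matthew", "Leonawicz", email = "someone@example.com",
           role = c("aut", "cre")),
    person("Jane", "Reviewer", role = "rev",
           comment = "Reviewed the package for rOpenSci"),
    person("John", "Reviewer", role = "rev",
           comment = "Reviewed the package for rOpenSci")
  )
```

The `"rev"` role is what marks a person as a reviewer; it should only be added with the reviewers' consent, as noted above.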
Welcome aboard! We'd also love a blog post about your package, either a short-form intro to it (https://ropensci.org/tech-notes/) or long-form post with more narrative about its development. (https://ropensci.org/blog/). If you are interested, @stefaniebutland will be in touch about content and timing.
Thank you! @maelle I've added the footer and the related packages section. I used the `@` syntax for authors you suggested, thinking that when it appears on GitHub it was intended to act as a link to each author's profile or something. But maybe you did not mean that. I'm thinking I should switch it to people's proper names then? There is already a link to each package mentioned.
I added the `codemeta`.
I'll do the transfer soon.
I'm not sure what the pull request to `pubcrawl` should entail; I have not done anything like this before regarding archiving packages. Also, aside from the AppVeyor line mentioned above, are the other substitutions as simple as swapping out my account name for `ropensci`? I will also update the URL in the package hex sticker logo.
Thanks!
Thanks!
No, for authors it means using the `Authors@R` field in DESCRIPTION; sorry it was unclear.
Update the thread once the transfer is done. Yes it should actually be that simple. There are two URLs to update in DESCRIPTION too (URL and BugReports fields).
Ok will do the PR once you've transferred the repo and tag you!
@leonawicz this link will give you many examples of blog posts by authors of onboarded packages so you can get an idea of the style and length: https://ropensci.org/tags/review/. Technotes are here: https://ropensci.org/technotes/.
Here are some technical and editorial guidelines for contributing a blog post: https://github.com/ropensci/roweb2#contributing-a-blog-post.
Happy to answer any questions.
The blog post discussion can continue, but closing this since all transfer steps have now been done!
@leonawicz In light of your tweet https://twitter.com/leonawicz/status/1064224408900820992 we'd welcome a technote on this update if you're interested. (thanks for the heads up @maelle) https://ropensci.org/technotes/
Summary
Extract, read and parse EPUB format e-book file archive metadata and book text into tidy data frames to prepare for subsequent text analysis.
https://github.com/leonawicz/epubr
Data extraction. This package focuses on importing EPUB file metadata and data into R in a useful format. It strips xml tags and formatting to focus on the readable, meaningful text. While future package versions will expand functionality around the edges, the core purpose will remain the data extraction described.
Data analysts or researchers performing text mining and other language analysis involving individual books or book collections in a typical, unrestricted EPUB file format.
I have not found other R packages that do this.
This package is already on CRAN (v0.4.0).
Requirements
Confirm each of the following by checking the box. This package:
Publication options

* `paper.md` matching JOSS's requirements with a high-level description in the package root or in `inst/`.

Detail
* [x] Does `R CMD check` (or `devtools::check()`) succeed? Paste and describe any errors or warnings:

* [x] Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:
To the best of my knowledge, though it's quite possible I missed something small or have something not exactly as required.
If this is a resubmission following rejection, please explain the change in circumstances:
If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names: