Closed cboettig closed 6 years ago
👋 @cboettig Thanks for the submission. Since it would be a COI for any of the four editors to handle your submission, I have asked @lmullen to serve as the ad hoc editor on this one and he has graciously agreed. I will let Lincoln take it from here. 🚀
Passes devtools::check() without issue.
R CMD check results
0 errors | 0 warnings | 0 notes
Here is the result of goodpractice::gp().
It is good practice to
✖ write unit tests for all functions, and all
package code in general. 97% of code lines are covered by
test cases.
R/rdf.R:274:NA
R/rdf.R:275:NA
✖ not import packages as a whole, as this can cause
name clashes between the imported packages. Instead, import
only the specific functions you need.
97% coverage is excellent.
The advice not to import entire packages is up to the discretion of the reviewers.
Note one misspelling (occured) from devtools::spell_check().
@karthik I am now approaching reviewers.
Two reviewers have agreed to review this package. Reviewers, thanks for being willing, and I'll ask you to have your reviews in within three weeks. Here is the reviewer's guide. Feel free to let me know if you have any questions.
Reviewer: Anna Krystalli, @annakrystalli Reviewer: Bryce Mecum, @amoeba Due date: 2018-01-25
@cboettig Could you please add the rOpenSci under review badge to the README for this package? Here is the snippet.
[![](https://badges.ropensci.org/169_status.svg)](https://github.com/ropensci/onboarding/issues/169)
The package includes all the following forms of documentation:
URL, BugReports and Maintainer (which may be autogenerated via Authors@R).
For packages co-submitting to JOSS
- [x] The package has an obvious research application according to JOSS's definition
The package contains a paper.md matching JOSS's requirements with:
- [x] A short summary describing the high-level functionality of the software
- [x] Authors: A list of authors with their affiliations
- [x] A statement of need clearly stating problems the software is designed to solve and its target audience.
- [x] References: with DOIs for all those that have one (e.g. papers, datasets, software).
Estimated hours spent reviewing: 2
This looks like an excellent package for inclusion in the ropensci ecosystem. I have personal experience with the problem this package is trying to solve, namely working with RDF in R from the analyst's perspective. The RDF package this package wraps, redland, is intended more for developers to write packages on top of than for end users to do things like run SPARQL queries or manipulate RDF graphs. I would use this package.
The package is laid out in a non-surprising manner, most functions are short and well-scoped, and, overall, the code is very readable. The accompanying test suite is reasonable and provides 100% coverage, and the single vignette is well-written and useful.
I did find the documentation could use some polish in some places (see comments below). I suspect a pass or two by the author would make some good improvements without much work.
I have left two checkboxes unchecked due to the following issues:
but I otherwise found everything else to be in order.
How are you dealing with differentiating between resource, literal, and blank nodes in rdf_add? As far as I can see, it looks like I can't create a new triple with a resource or blank node as the subject:
> rdf_add(x, "test", "test", "test")
> x
<test> <test> "test" .
I'd like to see more usage of the RDF capabilities of rdflib in the vignette. At present it's centered around JSON-LD.
These were written out as I went through the checklist:
Documentation could benefit from a general pass for capitalization just to make things look nicer
\link{}s in docs for rdf_serialize and rdf_query link to wrong version of the parse function
Noticed funky encoding issues from vita.json. "Fernández" => "Fern\u00E1ndez" Not sure if this is an rdflib problem or a redland one.
I think any mentions of "RDF+XML" outside of MIME types should be "RDF/XML", not "RDF+XML"
I thought arg doc in rdf_serialize could be path instead as it's a little more clear.
I'm generally not a fan of arguments named x, though it feels warranted in some cases. Could a more descriptive name be used in these instances? For example, rdf_query has a first arg of x which could, instead, be rdf.
Perhaps a little more error checking...
If I send a malformed query, I get a useful but somewhat cryptic response from rdf_query:
> coi <- vita %>% rdf_query(sparql)
librdf error - syntax error, unexpected a
rdf_query_results.c:100: (librdf_query_results_finished) assertion failed: object pointer of type librdf_query_results is NULL.
rdf_query_results.c:100: (librdf_query_results_finished) assertion failed: object pointer of type librdf_query_results is NULL.
This is caused by failing to catch an error in the call to redland::executeQuery.
queryResult <- redland::executeQuery(queryObj, x$model)
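One way to surface a clearer message is to wrap the low-level call in tryCatch() and to check the result before using it. The following is only a minimal base R sketch: execute_query() is a hypothetical stand-in for redland::executeQuery() so the example runs without redland installed, and safe_query() is an invented name, not part of rdflib. Whether librdf signals the syntax error as an R condition or only prints it is not confirmed here, which is why the sketch also guards against a NULL result.

```r
# Hypothetical stand-in for redland::executeQuery(): returns NULL when the
# query is malformed (mimicking a failed librdf call), a result otherwise.
execute_query <- function(query, model) {
  if (!grepl("^[[:space:]]*SELECT", query, ignore.case = TRUE)) {
    return(NULL)
  }
  list(query = query, model = model)
}

# Wrap the low-level call so callers get one readable R error instead of
# repeated assertion messages from the C library.
safe_query <- function(query, model) {
  result <- tryCatch(
    execute_query(query, model),
    error = function(e) {
      stop("SPARQL query failed: ", conditionMessage(e), call. = FALSE)
    }
  )
  if (is.null(result)) {
    stop("SPARQL query could not be executed; check the query syntax.", call. = FALSE)
  }
  result
}

# A well-formed query returns a result object.
ok <- safe_query("SELECT ?s WHERE { ?s ?p ?o }", model = "demo-model")

# A malformed query produces a single, catchable error message.
bad <- tryCatch(
  safe_query("a SELECT", model = "demo-model"),
  error = function(e) conditionMessage(e)
)
print(bad)
```

The same pattern could sit directly around the real call inside rdf_query, assuming redland reports the failure in a way R can detect.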
browseVignettes(package = "rdflib") doesn't return any vignettes. I'm not that familiar with this functionality so I didn't debug it.
Noticed the test "we can parse from a url" calls out to the web but the skip_on_cran() guard is commented out. Should it not be commented out?
Found documentation for rdf() to be too terse, especially by comparison to other functions in the package
rdf_parse() mentions the return is an 'rdf S3 object'. I might just call it an 'object' and omit the class system it was defined in
Verbiage in rdf_serialize: 'Serialize RDF docs' maybe should be 'Serialize an RDF Document' (no plural)
In rdf_serialize, I wonder if the return should not be the raw numeric return code from redland but either a logical or the document path itself (the latter would allow for piping)
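To illustrate the piping argument, here is a small base R sketch of a writer that returns its output path invisibly. write_doc() is a hypothetical helper that stands in for a serializer; it is not rdflib's rdf_serialize and makes no claim about how that function is implemented.

```r
# Hypothetical writer: stands in for a serializer that writes text to a file.
# It returns the path invisibly so the call can sit in the middle of a pipeline.
write_doc <- function(lines, path) {
  writeLines(lines, path)
  invisible(path)
}

path <- tempfile(fileext = ".nt")
triples <- '<http://example.org/a> <http://example.org/p> "test" .'

# The returned path feeds straight into the next step.
round_trip <- readLines(write_doc(triples, path))
print(identical(round_trip, triples))
```

Returning the path with invisible() keeps the console quiet when the function is called for its side effect, while still letting a downstream step pick the file up.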
rdf_query states the return type is 'a list of all query results' when it actually appears to be a 'data.frame' from what I can tell
What the common argument 'x' is documented as across functions varies. "rdf graph object" "rdf object". Please use common language for these
RE: JOSS submission: I wonder if they won't find this package too simple/thin for acceptance. It does have clear research application though!
Thanks for getting your review in, @amoeba.
Hello all and apologies for the delay! Here is my review:
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
The package includes all the following forms of documentation:
URL, BugReports and Maintainer (which may be autogenerated via Authors@R).
For packages co-submitting to JOSS
- [ ] The package has an obvious research application according to JOSS's definition
The package contains a paper.md matching JOSS's requirements with:
- [ ] A short summary describing the high-level functionality of the software
- [ ] Authors: A list of authors with their affiliations
- [ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
- [ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).
Estimated hours spent reviewing: 7
This package is a great and lightweight addition to working with rdf and linked data in R. Coming after my review of the codemetar package, which introduced me to linked data, I found this a great learning experience in a topic I've become really interested in but am still quite a novice in, so I hope my feedback helps you appreciate that particular POV.
Overall I feel package functionality is complete and self-contained (apart from one error identified below). My main feedback is regarding documentation, specifically how it could be improved to help novice users to grasp the value of semantic data and better understand how the package works.
The only install comment I'll add is that when I first ran install(pkg_dir, dependencies = T, build_vignettes = T), the building of the vignettes threw an error because suggests package ‘jqr’ had not been installed yet? It worked without build_vignettes = T
pkg_dir <- "/Users/Anna/Documents/workflows/rOpenSci/reviews/rdflib-review/../rdflib"
devtools::install(pkg_dir, dependencies = T, build_vignettes = T)
#> Installing rdflib
#> '/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file \
#> --no-environ --no-save --no-restore --quiet CMD build \
#> '/Users/Anna/Documents/workflows/rOpenSci/reviews/rdflib' \
#> --no-resave-data --no-manual
#>
#> Error: Command failed (1)
with the console output:
* checking for file ‘/Users/Anna/Documents/workflows/rOpenSci/reviews/rdflib/DESCRIPTION’ ... OK
* preparing ‘rdflib’:
* checking DESCRIPTION meta-information ... OK
* installing the package to build vignettes
* creating vignettes ... ERROR
Quitting from lines 21-38 (rdflib.Rmd)
Error: processing vignette 'rdflib.Rmd' failed with diagnostics:
there is no package called 'jqr'
Execution halted
Installing without building the vignette results in successful installation of jqr.
pkg_dir <- "/Users/Anna/Documents/workflows/rOpenSci/reviews/rdflib-review/../rdflib"
devtools::install(pkg_dir, dependencies = T)
#> Installing rdflib
#> '/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file \
#> --no-environ --no-save --no-restore --quiet CMD INSTALL \
#> '/Users/Anna/Documents/workflows/rOpenSci/reviews/rdflib' \
#> --library='/Users/Anna/Library/R/3.4/library' --install-tests
#>
#> Installing jqr
#> '/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file \
#> --no-environ --no-save --no-restore --quiet CMD INSTALL \
#> '/private/var/folders/8p/87cqdx2s34vfvcgh04l6z72w0000gn/T/RtmpbYeNu9/devtools6d3b4a7582a1/jqr' \
#> --library='/Users/Anna/Library/R/3.4/library' --install-tests
#>
If jqr is installed, installation and vignette building proceed successfully.
All OK
My main suggestion is to try to define some terms and improve the concept map for the tools by adding some detail and broader context to the documentation. The following suggestions could also be addressed with links to further details if you think they are too superfluous for explicit documentation with the package.
a brief intro to the semantic web could be useful (eg something like):
The semantic web aims to link data in a machine readable way through the web, making data more alignable and interoperable, much easier to search, enriching and compute on.
what a graph format for data is (eg triples etc).
the structure of an rdf S3 object
(ie you introduced some aspects of the data format here: (user does not have to manage world, model and storage objects by default just to perform standard operations and conversions) which we are told we can ignore (which is great) but actually creates more questions... what is this mysterious "world" object that forms an opaque slot of an rdf S3 object?) Would be nice to explain the structure of the S3 rdf briefly. Is there useful metadata that can be extracted from the structure? (see comment later)
rdf file formats.
I think it would especially aid in appreciating the rdf_serialize function to expand briefly (and potentially signpost to a resource like this) on the various serialization formats, perhaps even why one would use one over another, and particularly, why serialization involves writing a file out. I feel these are important concepts to help appreciate use cases of the function. Indeed the file out aspect of the function could do with being flagged more prominently in the function man page. Just by looking at the (quite cryptic if you don't know what serialization is) description and running the example, you've ended up writing a file without realising. But that's also just my "test code, ask questions later" approach 😜
Similarly, parsing can then be seen/described as reading in/encoding an rdf from their specific string formats.
Spelling a few things out in plain English and explicitly could really help folks follow what's going on better and understand what file types are inputs or outputs of different functions.
Some signposting/guidance on how I can find information on the semantics dictating what information I can extract from an rdf object would be really useful. eg. with a df or list you could use str to get an idea of how you could start indexing these objects. If confronted with a local rdf file, how would one go about figuring out even what they can query? I appreciate this is really one of the difficulties of working with rdf and semantic data in general (the flipside to the ease of being able to make unstructured queries is that we need to know how data are labelled) but I feel some brief guidance or demo on how one would approach this would go a long way.
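As a rough illustration of that exploration step, an RDF graph can be pictured as a three-column table of subject, predicate and object. The sketch below uses a hand-made base R data frame with invented example.org URIs rather than rdflib itself; listing the distinct predicates plays roughly the role that str() plays for a list.

```r
# A toy graph written as a table of triples (all URIs and values are invented).
triples <- data.frame(
  subject = c("http://example.org/paper1", "http://example.org/paper1",
              "http://example.org/paper2", "http://example.org/person1"),
  predicate = c("http://schema.org/name", "http://schema.org/author",
                "http://schema.org/name", "http://schema.org/familyName"),
  object = c("A first paper", "http://example.org/person1",
             "A second paper", "Doe"),
  stringsAsFactors = FALSE
)

# Step 1: see which properties exist at all; these are what can be queried.
predicates <- sort(unique(triples$predicate))
print(predicates)

# Step 2: pick one property and pull out the matching subject/object pairs.
paper_names <- triples[triples$predicate == "http://schema.org/name",
                       c("subject", "object")]
print(paper_names)
```

Against a real graph, the analogous first step would be a SPARQL query such as SELECT DISTINCT ?p WHERE { ?s ?p ?o }, which lists every predicate in use.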
For clarity to the reader who may not have looked at function documentation yet, I recommend using the full argument names when supplying arguments to functions (if not always, at least the first time an argument is introduced) in vignettes.
At the end of the intro to the section, you write:
Here is a query that for all papers where I am an author, returns a table of given name, family name and year of publication:
Am I right in thinking though that you are co-author on all papers in the rdf but the query is in fact filtering the names of your co-authors? (through FILTER ( ?coi_family != "Boettiger" ))
It would be nice, if possible, to see sample printouts of the conversion of the different files, or at least of the effect of compaction.
rdf_add man page: Would be nice to see a demo of using one or more of the additional arguments.
I think an additional, more detailed motivating example might illustrate a more direct use case in a researcher's workflow. In particular it would be good to highlight the great potential of triplestore APIs (and celebrate the efforts of many cool eg governmental linked data initiatives). So an example that incorporates a query to a triplestore and then enrichment of a researcher's data could be a cool example. This could be a longer term project or even just an rOpenSci blogpost but see comment re: rdf_query function below.
Serialising to turtle or trig throws an error
library(magrittr)
library(rdflib)
doc <- system.file("extdata", "dc.rdf", package="redland")
doc %>%
rdf_parse() %>%
rdf_serialize(doc = "test.turtle", format = "turtle")
#> librdf error - serializer 'turtle' not found
#> rdf_serializer.c:597: (librdf_serializer_serialize_model_to_file) assertion failed: object pointer of
#> type librdf_serializer is NULL.
doc <- system.file("extdata", "dc.rdf", package="redland")
doc %>%
rdf_parse() %>%
rdf_serialize(doc = "test.trig", format = "trig")
#> librdf error - serializer 'trig' not found
#> rdf_serializer.c:597: (librdf_serializer_serialize_model_to_file) assertion failed: object pointer of
#> type librdf_serializer is NULL.
In rdf_query, is there a way to return a non regularised query result ie return an rdf instead?
I'm thinking about a use case when maybe it's better to enrich data by merging rdfs? ie, researcher queries a triplestore through an API (yeyyy open data!), combines their not fully matching but interoperable rdf data with rdf_add (ie try to show how triplestore is better than tabular non-linked data for merging) and then queries the merged rdf to extract an enriched analytical tabular dataset?
Add tests for being able to serialise to trig and turtle, which at the moment is throwing an error?
Perhaps a test for parsing/serialising each format would be good. Also, perhaps worth checking whether eg rdf_parse(format="turtle") is working.
👍
Thanks for getting your review in, @annakrystalli.
Now that both reviews are in, could you respond to the reviews and make changes as necessary, @cboettig? If possible, please do so within 2 weeks, which would be February 13.
@lmullen @annakrystalli @amoeba Thanks for your reviews!
I've just about finished addressing the issues raised at this point, which I've summarized in:
A summary of the changes can be found in NEWS.md, which ended up being reasonably involved because the reviews got me thinking about a bunch of stuff, which was awesome.
However, most substantive is perhaps the development of a new vignette, which I've liberally titled A tidyverse lover’s intro to RDF. This tries to address the big-picture issues Anna in particular highlights regarding documenting and motivating the broader context of RDF. This is still a bit more of a draft than a polished document, but given that my two weeks are up I think it might be a good time to get feedback on this (and the other changes) from the reviewers. In particular, I would love to hear what the reviewers think of this as a broader introduction.
If the reviewers are interested and think it would be worthwhile, I believe it might be nice to overhaul this new vignette into a more general purpose intro to RDF for R users (both the relevant packages and concepts) that might be suitable for a submission to something like the R Journal. I'd love to entice Anna and Bryce to be co-authors if they are interested...
@cboettig Thanks for getting your response in on time, and for the thoroughness of the changes and how you reported them. I'm looking forward to reading the new vignette.
@amoeba and @annakrystalli: Could you please go over the changes to the package and report back within one week? That would be by Thursday, February 22. I'll do the same.
Hey @lmullen and @cboettig: I've reviewed the responses and changes @cboettig has made in response to my review and every issue I raised has been addressed. I have no remaining issues and recommend the submission be accepted as modified. @lmullen would you like us to review the new vignette before acceptance? That's fine with me and I can certainly do that within the week.
@cboettig I'm super excited with the direction you're taking. I'd certainly like to continue working on this package and a paper. In particular, this clicks for me:
you can just about always get things down to about three columns,
I'd never before seen the equivalence between tidy data principles and RDF. I'll follow up with you elsewhere.
Hi all 😃
I am really happy with the changes made and the direction of the vignette! Triplestores are indeed the ultimate tidy data! A great way to sell it. It's already a great resource and I am also happy to contribute to both the vignette and a paper on it. I'll feed back on some of the discussions raised in @cboettig's response in the rdflib issue.
So ✅ and big 👍 from me also.
@amoeba: Yes, if you could please offer whatever suggestions you think are necessary on the new vignette that would be great, but it seems like we are very close to being done.
Okay, will do. I'll get those comments in this week.
Hi @cboettig. Thanks for the thoroughness of your response to the reviews. At this point I don't see any reason to delay accepting the package into rOpenSci. Of course it looks like you are still figuring out the final form of a few things, especially in the new vignette, so it will be your call when to submit to CRAN.
@karthik I don't have access to rOpenSci admin accounts, so could you please begin the process of moving this package into the rOpenSci organization?
After that happens, @cboettig, could you please do the following?
Once the repository is moved I will close this issue.
Thank you @lmullen! Since Carl has ownership rights on the org (unlike most authors) he should be able to move this himself.
Thanks @lmullen and @karthik! I've added the ropensci footer and migrated the repo to ropensci. 🚀
CI seems to be working (if I recall correctly I shouldn't migrate the appveyor since it only links to individual accounts?)
I assume there's nothing I need to do to update the onboarding badge, that happens automatically via tags on this issue, right?
I'll leave it to you editors to close out this thread when ready.
Thanks @cboettig & @lmullen!!
@cboettig Great! Looking forward to using this myself the next time I need to deal with RDF.
Special thanks to @amoeba and @annakrystalli for doing the review.
Summary
rdflib is simply a wrapper around two existing ropensci packages: redland and jsonld, which should be a user-friendly complement to the low-level interface already provided by redland for working with RDF (semantic/linked data).
https://github.com/cboettig/rdflib
data extraction, because this package parses scientific data file formats. (specifically, formats already parsed by existing rOpenSci packages). This package also enables graph queries using the SPARQL language, somewhat analogous to the rOpenSci jqr package, but for JSON-LD and other linked data formats.
Anyone working with semantic data, including the wide array of scientific ontologies and knowledge-bases. These include reproducibility-focused ontologies like PROV, and a large number of biological ontologies ranging from genes to traits to environmental features.
As described above, this package overlaps significantly with the redland package, but should be easier to use.
Requirements
Confirm each of the following by checking the box. This package:
Publication options
paper.md matching JOSS's requirements with a high-level description in the package root or in inst/.
Detail
- [x] Does R CMD check (or devtools::check()) succeed? Paste and describe any errors or warnings:
- [x] Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:
If this is a resubmission following rejection, please explain the change in circumstances:
If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names:
Scott Chamberlain, @sckott
Peter Slaughter, @gothub
Bryce Mecum, @amoeba
Anna Krystalli, @annakrystalli