programminghistorian / ph-submissions

The repository and website hosting the peer review process for new Programming Historian lessons
http://programminghistorian.github.io/ph-submissions

Review Ticket for 'Making a Small RDF Database' #31

Closed jerielizabeth closed 6 years ago

jerielizabeth commented 8 years ago

The Programming Historian has received the following tutorial on 'Making a Small RDF Database' by @whanley. This lesson is now under review and can be read at:

http://programminghistorian.github.io/ph-submissions/lessons/making-a-small-rdf-database

I will act as editor for the review process. My role is to solicit two reviews from the community and to manage the discussions, which should be held here on this forum.

Members of the wider community are also invited to offer constructive feedback, which should be posted to this message thread, but they are asked to first read our Reviewer Guidelines (http://programminghistorian.org/reviewer-guidelines) and to adhere to our anti-harassment policy (below). We ask that all reviews stop after the second formal review has been submitted so that the author can focus on any revisions. I will make an announcement on this thread when that has occurred.

I will endeavor to keep the conversation open here on Github. If anyone feels the need to discuss anything privately, you are welcome to email me. You can always turn to @ianmilligan1 or @amandavisconti if you feel there's a need for an ombudsperson to step in.

Anti-Harassment Policy

This is a statement of the Programming Historian's principles and sets expectations for the tone and style of all correspondence between reviewers, authors, editors, and contributors to our public forums.

The Programming Historian is dedicated to providing an open scholarly environment that offers community participants the freedom to thoroughly scrutinize ideas, to ask questions, to make suggestions, or to request clarification, but also provides a harassment-free space for all contributors to the project, regardless of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, age or religion, or technical experience. We do not tolerate harassment or ad hominem attacks on community participants in any form. Participants violating these rules may be expelled from the community at the discretion of the editorial board. If anyone witnesses or feels they have been the victim of the above-described activity, please contact our ombudspeople (Ian Milligan and Amanda Visconti - http://programminghistorian.org/project-team). Thank you for helping us to create a safe space.

jerielizabeth commented 8 years ago

@whanley Thank you for this lesson! I have a couple of comments and suggestions that I will add here, hopefully by the end of the day today, and then we will move on to the "formal" peer review phase of the journey.

jerielizabeth commented 8 years ago

@whanley Thank you again for this lesson. I have read it through a couple times now, and already feel like I have a better handle on the goals of RDF and how it can be used.

A couple of questions and thoughts for you before I solicit external peer reviewers:

Logistics

  1. Files for download and running SPARQL queries: I notice that you have opportunities at each step for readers to download a completed version of the file and to practice with queries. For the files, we can accommodate a copy of them on Programming Historian so that everything is in one place. For the SPARQL server, I am having trouble reaching yours and I am hesitant about the idea of running one as part of the journal (though I will follow up on this). We might want to rethink that element, or perhaps the order of events, so that people can test on their own systems as they go.
  2. Difficulty level for the lesson: I think this is at least an intermediate tutorial, as different elements assume that users are comfortable with the command line and SPARQL. What do you think?

Structure

Overall, I think the scope of this tutorial works really well - it guides the reader through the process of thinking through their data in a structure suited for RDF and also how to work with that data to clean it and to analyze it. At the end, it would be helpful to provide an outline of some next steps for the reader, but I think the stopping point is right.

One thing we like to do to structure lessons is to include a short overview at the beginning to give readers a sense of what they will create by the end of the tutorial and what software is required to complete the lesson. If you wouldn't mind adding a couple sentences along those lines, that would be great.

Questions

And finally, there were a couple places where I ran into some trouble while working through the lesson.

  1. I got a bit lost in step 3 with the assigning of an arbitrary URL for our made-up schema (in this case, http://mydb.org#) and the referencing of other schema URLs. If it isn't opening too large a can of worms, it would be helpful to have a little more context about what is going on here and why a made-up URL works.
  2. In the same section, it might be helpful to have a small discussion about the different categories (for lack of a better term) of items that are being represented with "/id/" and "/doc/" etc. (if this can be done without drifting too far into schema design and the like.)
  3. I am noticing some inconsistencies in the IDs assigned to the annotation and doc elements between the code examples and the completed files you've linked to. I am not sure it matters, but to avoid confusion it should be consistent.
  4. As someone who works with Python and not XML, does the whitespace matter when creating the Turtle file?
  5. When creating the data in Fuseki, is it recommended to use "In-memory" or "Persistent" as the dataset type?
  6. In step 5, the first full query is missing the rdfs prefix. Also, having to call the prefixes was curious to me - I know it is covered in the SPARQL tutorial, but it would be useful to have a one sentence reminder of why they are necessary.
  7. You mention using the update endpoint before the second update query, but not before the first update. I ran into some trouble here, so it would be useful to provide that information, and even a little context about the different endpoints, a little earlier in the section.
  8. After the second update, it would be useful to provide the query to see the results of the update.
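
To make a few of these concrete for our discussion (the made-up namespace in question 1, whitespace in question 4, and the prefixes in question 6), here is a rough sketch of the kind of thing I mean — the names and data below are hypothetical stand-ins, not the lesson's actual content:

```turtle
# A made-up namespace works because RDF only requires that URIs be
# unique identifiers; nothing needs to be served at the address itself.
@prefix mydb: <http://mydb.org#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# Whitespace and line breaks are insignificant in Turtle; this
# statement could equally be written on a single line.
mydb:person1
    rdfs:label "Example person" .
```

And on question 6, my understanding is that a query must re-declare any prefixes it uses because each query is parsed on its own, independently of the data it runs against:

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?label
WHERE { ?subject rdfs:label ?label }
```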

Happy to talk through any of these!

jerielizabeth commented 7 years ago

Hi @whanley! I hope the new semester is off to a decent start for you. I want to check in on the status of the lesson and see if you are ready for me to solicit peer reviewers. I think this lesson will be a very useful addition, so I hope that we can keep moving forward on it.

Best!

jerielizabeth commented 7 years ago

Greetings @whanley! I am checking to see if you have had time to work on edits for this lesson and if you are still interested in moving forward.

While we can always reopen it in the future, I will close this ticket on June 15 if I have not heard from you.

whanley commented 7 years ago

Thanks @jerielizabeth for your input and your patience! I've uploaded a new version of the lesson, as well as some auxiliary files.

Here are answers to your specific questions:

Logistics

  1. Yes, .ttl files can be hosted with the lesson. Where should I put these--with the images? I understand the concern about problems with referring to an external SPARQL endpoint, especially considering what happened with the SPARQL lesson last year. Also, the endpoint I host was down for a couple months last fall, when you looked for it. That said, I think that having a chance to experiment a bit with the data via an endpoint is a useful way to extend the lesson, and I think it may be hard to find any externally supported endpoint that would do this reliably, and even harder for PH to contrive to host one themselves. So perhaps we should leave the link in with the understanding that it is not essential to the lesson, and I should try harder to make sure that the endpoint is always running?

  2. Yes, intermediate sounds about right--but reviewers might be better judges. I think the lesson follows on well from the two existing LOD lessons. I'm going to start work on an advanced lesson on ontologies.

Structure

I've added an overview and next steps.

Questions

  1. I've added a paragraph earlier on in step three that aims to clarify URI vs URL.
  2. I've also tried to clarify this with a slightly longer explanation.
  3. I think (hope!) I've resolved these discrepancies.
  4. I added a mention that whitespace doesn't matter.
  5. Added "in-memory" suggestion.
  6. Thanks for catching the mistake, and I've added a sentence explaining the declaration.
  7. Yes, thanks--I've reversed the order of the examples and made this much clearer.
  8. Added this now too.

I've made a good number of other changes as well. I hope it all makes more sense now.

jerielizabeth commented 7 years ago

@whanley Great to hear from you! I'm a bit swamped until the end of next week, but at that point I will look things over again and, I anticipate, start recruiting reviewers.

jerielizabeth commented 7 years ago

Hi @whanley this is looking great! Thank you for making the changes. For the hosting of the files on Programming Historian, the best place would be in the assets folder, in a folder named 'making-a-small-rdf-database'. And I'll ask the team for additional ideas on strategies for the SPARQL endpoint and how best to proceed for sustainability purposes.

The last concern I have before recruiting reviewers is with the very last section (paragraphs 55-57). This might be easily resolved with wording, but it's rather disappointing to get to the end and find that the very useful feature of being able to associate fields (rather than standardize them) is not something that I can do with the information in the lesson. Perhaps a "next steps" section where we can entice people to go through the effort of installing one of the more robust engines with the promise of this type of linking?

Thank you for all your work on this! I think this is a great lesson and I'm excited to see it nearing completion!

jerielizabeth commented 7 years ago

Hi again @whanley. I conferred with our resident sustainability and SPARQL experts, and the consensus is that it would be best to avoid the hosted endpoint server and focus instead on making sure that the reader has their own server running locally and is exploring their RDF data that way. That reduces the maintenance burden on everyone involved and puts the lesson on the strongest ground for long-term use.

I think to keep the lesson at intermediate, you should probably keep the focus on the data structure and the simpler interactions that users can accomplish with the Fuseki server. If the reader installed the Fuseki server earlier, would they be able to use the files you provide to experiment with the data manipulations? (Moving in this direction might make the lesson a bit longer, but I think it will strengthen it overall.)
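
To sketch what I have in mind (assuming the reader has downloaded Apache Jena Fuseki and is in the distribution directory; the dataset name `/ds` is arbitrary):

```shell
# Start Fuseki with an empty in-memory dataset at /ds and with
# SPARQL Update enabled; the web interface then runs at
# http://localhost:3030/
./fuseki-server --mem --update /ds
```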

whanley commented 6 years ago

Hi @jerielizabeth. I decided to change the server program I recommend in the lesson. I've substituted GraphDB for Fuseki. It's easier to install, and it offers quite a few more features, most importantly inference support. I've changed the set of sample queries as a result, and the last query works now. I'm working on a next lesson, as well, which will extend the work on inferencing.
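
To give a sense of what inference support adds, a quick made-up illustration (the class and instance names here are hypothetical, not the lesson's data):

```turtle
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/> .

ex:Consul rdfs:subClassOf ex:Official .
ex:person1 rdf:type ex:Consul .

# With RDFS inference enabled, a query for instances of ex:Official
# will also return ex:person1, even though that triple is never
# explicitly asserted in the data.
```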

Hope it's ready for review.

jerielizabeth commented 6 years ago

Hi @whanley. Thank you for making those changes!

I plan to look over the lesson next week and will be in touch!

jerielizabeth commented 6 years ago

Hi @whanley! Thank you so much for these changes. I know it's been a bit of a moving target, but I think the lesson is in great shape as a result.

I think we are indeed ready to send the lesson out for review! I'll send out a few inquiries for potential reviewers, who will have a month to complete the review. I'll let you know once the reviewers are confirmed. Once both reviews are completed, I'll look them over, summarize the feedback, and make some final suggestions. You'll then have four weeks to respond, get clarification on suggestions, and make any necessary changes.

Thank you again for all your work on this!

jerielizabeth commented 6 years ago

Update on the review for the lesson: James Smith and Bronwen Masemann have agreed to review the lesson, due February 5. They will post their feedback here.

jgsmith commented 6 years ago

Overall, I think this is a great lesson. It's definitely aimed at someone who isn't familiar with all of the technologies, but also isn't afraid of a text editor. The choice of GraphDB seems reasonable. I'll probably point my students to this sequence of tutorials as extra material.

Here are some quick notes and reactions as I read through the lesson. I've grouped them by section heading. Keep in mind that they are from someone who knows a bit more than the expected audience. Just as advanced study of a subject draws out things that are glossed over in less advanced courses, some of the things I point out can probably be left out to come back in future lessons. The main thing is to make sure readers can't poke holes in what is here.

Overview

Why RDF?

An example

Step 3: Translate into Machine-readable Language

Conclusion and next steps

jerielizabeth commented 6 years ago

Thank you, @jgsmith! These are really helpful comments and suggestions.

@whanley I will wait to hear from Bronwen before summarizing the reviews and offering guidance. You are welcome to chat with @jgsmith about his suggestions, but please don't make any changes to the lesson until the second review is in!

Thanks!

BronwenMasemann commented 6 years ago

I found this lesson overall very clear and interesting. I think the example of transcribing and storing primary source data will be intriguing to a variety of potential users. In contrast to the other reviewer, I came to this process as a person who had less knowledge of specific tools, but a fair bit of experience of teaching (information school) students about data models, metadata, and RDF. So my comments have more to do with presentation and communication.

Here they are:

para. 1 - To whet the appetite of the reader, I would suggest clarifying that this process enables not just recording of data but also manipulation.

para. 2 - I would suggest using a term other than “serial record.” What are the characteristics of the records that lead you to call them “serial records”? To librarians, “serial” generally means “issued periodically” (and a “serial record” means “the catalog record for a serial publication”). So I would avoid using the term “serial”. My understanding is that the characteristic of these records that make them appropriate for this kind of treatment is that they contain structured data. Therefore I think it would be more clear if you said “I often come across documents that contain information that is structured” or perhaps “documents that contain information that is structured and repetitive.”

para. 4. This paragraph introduces terminology and concepts, only to state that they will be skipped over. I think it would be more helpful to readers if you eliminated or moved much of this paragraph, and just included the sentences beginning “This tutorial . . .” and “It employs. . . "

The heading “an example”: I would provide a stronger heading here to explain the goal of this section. Perhaps “The problem of transcribing and storing structured data in documents.”

para. 8 - Instead of stating that this record was “already a database” I would suggest making the weaker claim that the data was already structured. I think that some readers would argue that whatever structured storage and retrieval system was being used (register? card file?) cannot be properly called a database.

para. 14 - I agree with the other reviewer’s comment that the use of XML would not necessarily require an “elaborate customized schema” and I as well encourage my students to consider what is available and then tweak it rather than reinventing the wheel. I am not sure if this is intentional but the structure you have set up here, of examining the three options, is very similar to that used in Hooland, S. van, & Verborgh, R. (2014). Chapter 2: Modelling. In Linked data for libraries, archives and museums: how to clean, link and publish your metadata. Chicago: ALA Editions. My students consistently tell me that this chapter is extremely clear and useful, and it may be helpful to you in sorting out how to express the distinctions between the options you present.

para. 20 and ff. - Overall I think that what would most strengthen your already excellent walk-through of your example is to include visual representations of the relationships between the entities. I’d recommend using the same format as the visuals in the Linked Open Data tutorial: https://programminghistorian.org/images/intro-to-linked-data/intro-to-linked-data-fig5.png.

para. 22 - I like your idea of thinking about what you are doing to enable a machine to read your data. However I would argue that starting right with step 1, the data is already machine readable - a machine could for example count the number of characters in the file. I would clarify here that what you are making machine readable is the semantic structure of the data - the specific identity of each entity, and the nature of the relationships between them.

para. 28 - Just to remind your reader, I would state at the end of this paragraph that person 1 is Mirzan Marie.

para. 35 - “to try it out.” What process will the reader be trying out?

para. 38 - I would link here to a tutorial on regular expressions.

para. 43 - I think a less confident reader would find the inclusion of links to other tools in this paragraph distracting. I would shift the intro to other options to the end of the tutorial. Similarly, I don’t think it’s necessary to qualify GraphDB here as “far from the last word”, as this distracts from what it is actually able to do.

jerielizabeth commented 6 years ago

Thank you, @BronwenMasemann! I am glad to see the feedback covering both the technical and the presentation aspects of the project!

I will read through the reviews in the next few days and get a summary to you, @whanley, by the end of the week!

jerielizabeth commented 6 years ago

Alright! Thank you, @jgsmith and @BronwenMasemann, for these excellent reviews!

This is a good combination of specific ideas and some general patterns to consider. As the author, @whanley, you of course have the final say on if and how the suggestions are incorporated.

In terms of general patterns, both reviews express concern about the balance between customized and standard schemas, encouraging a stronger emphasis on standardized vocabularies. I think that is a good point to consider, though I don't think it would require restructuring, just increasing the emphasis in places. Both reviewers also expressed concern about overwhelming readers with concepts in the opening section. I am going to pull in @alsalin to confirm whether external links to Wikipedia fit our sustainability practices, but if she agrees, I think those links make sense (fingers crossed that the articles don't change often).

@BronwenMasemann's suggestion about linking to other Programming Historian lessons is, of course, strongly encouraged, and I like her suggestion of using a similar format and visual strategies. For next steps, I think @jgsmith's suggestion about encouraging file distribution in places like Github is good, and it gives a next step that does not require another lesson.

Overall, though, these reviews are very positive and offer great suggestions for refining the lesson and making it that much stronger.

The next step is revision, with a preferred timeline of 4 weeks, making the due date March 15. @whanley, feel free to ask for clarification from myself or the reviewers as you edit, and let me know if you need additional time for the final changes.

Thank you again to everyone!

whanley commented 6 years ago

Thank you very much @jerielizabeth @jgsmith @BronwenMasemann. Your suggestions really improve this piece, and I think I'll be able to integrate almost all of them. I hope to get to this next week. Much appreciation.

alsalin commented 6 years ago

@jerielizabeth apologies for the delay, but this notification got buried in my email. As for Wikipedia links: using them for definitions of common terms is still considered good sustainable practice. We should encourage authors to use the permalink for an article (a snapshot of the wiki article at a given time) instead of the general URL (Wikipedia's preference for citation).

jerielizabeth commented 6 years ago

Greetings @whanley! I am checking in to see how the revisions are going and to see whether you need more time.

When you do push the changes, please include the issue number (#31) in the commit message so that I can track it back easily. (https://blog.github.com/2011-10-12-introducing-issue-mentions/)
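
A hypothetical commit message along these lines would do it (a plain mention like "#31" links the commit back to this thread; keywords such as "fixes #31" would also close the issue when merged, so a plain mention is safer while the review is still open):

```shell
git commit -m "Revise 'Making a Small RDF Database' per review feedback (#31)"
```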

jerielizabeth commented 6 years ago

Hi @whanley! I hope the semester has wrapped up (or is wrapping up) smoothly! Any updates from you as to when you'll be able to push up your revisions for this lesson? I would love to see this published in the next few weeks, as I am starting the process of rotating off the editorial board. Thanks!

jerielizabeth commented 6 years ago

@whanley checking in one last time before I rotate off the editorial team. I would really like to see your lesson through to publication, so I hope you can check back in and let me know where you're at within the next week. Thanks!

acrymble commented 6 years ago

I propose closing this submission. @jerielizabeth you have attempted many times to reach out to the author.

jerielizabeth commented 6 years ago

Thank you again to the reviewers on this lesson - @BronwenMasemann and @jgsmith - for your excellent feedback. I am going to close this issue, as I have not heard from @whanley. The lesson can be revived by the author in the future, but will be published under a new editor.

Best,

Jeri