programminghistorian / ph-submissions

The repository and website hosting the peer review process for new Programming Historian lessons
http://programminghistorian.github.io/ph-submissions

Review Ticket for 'Making a Small RDF Database' #31

Closed jerielizabeth closed 6 years ago

jerielizabeth commented 8 years ago

The Programming Historian has received the following tutorial on 'Making a Small RDF Database' by @whanley. This lesson is now under review and can be read at:

http://programminghistorian.github.io/ph-submissions/lessons/making-a-small-rdf-database

I will act as editor for the review process. My role is to solicit two reviews from the community and to manage the discussions, which should be held here on this forum.

Members of the wider community are also invited to offer constructive feedback, which should be posted to this message thread, but they are asked to first read our Reviewer Guidelines (http://programminghistorian.org/reviewer-guidelines) and to adhere to our anti-harassment policy (below). We ask that all reviews stop after the second formal review has been submitted so that the author can focus on any revisions. I will make an announcement on this thread when that has occurred.

I will endeavor to keep the conversation open here on Github. If anyone feels the need to discuss anything privately, you are welcome to email me. You can always turn to @ianmilligan1 or @amandavisconti if you feel there's a need for an ombudsperson to step in.

Anti-Harassment Policy

This is a statement of the Programming Historian's principles and sets expectations for the tone and style of all correspondence between reviewers, authors, editors, and contributors to our public forums.

The Programming Historian is dedicated to providing an open scholarly environment that offers community participants the freedom to thoroughly scrutinize ideas, to ask questions, to make suggestions, or to request clarification, but also provides a harassment-free space for all contributors to the project, regardless of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, age or religion, or technical experience. We do not tolerate harassment or ad hominem attacks on community participants in any form. Participants violating these rules may be expelled from the community at the discretion of the editorial board. If anyone witnesses or feels they have been the victim of the above-described activity, please contact our ombudspeople (Ian Milligan and Amanda Visconti - http://programminghistorian.org/project-team). Thank you for helping us to create a safe space.

jerielizabeth commented 8 years ago

@whanley Thank you for this lesson! I have a couple of comments and suggestions that I will add here, hopefully by the end of the day today, and then we will move on to the "formal" peer review phase of the journey.

jerielizabeth commented 8 years ago

@whanley Thank you again for this lesson. I have read it through a couple times now, and already feel like I have a better handle on the goals of RDF and how it can be used.

A couple of questions and thoughts for you before I solicit external peer reviewers:

Logistics

  1. Files for download and running SPARQL queries: I notice that you have opportunities at each step for readers to download a completed version of the file and to practice with queries. For the files, we can accommodate a copy of them on Programming Historian so that everything is in one place. For the SPARQL server, I am having trouble reaching yours and I am hesitant about the idea of running one as part of the journal (though I will follow up on this). We might want to rethink that element, or perhaps the order of events, so that people can test on their own systems as they go.
  2. Difficulty level for the lesson: I think this is at least an intermediate tutorial, as different elements assume that users are comfortable with the command line and SPARQL. What do you think?

Structure

Overall, I think the scope of this tutorial works really well - it guides the reader through the process of thinking through their data in a structure suited for RDF and also how to work with that data to clean it and to analyze it. At the end, it would be helpful to provide an outline of some next steps for the reader, but I think the stopping point is right.

One thing we like to do to structure lessons is to include a short overview at the beginning to give readers a sense of what they will create by the end of the tutorial and what software is required to complete the lesson. If you wouldn't mind adding a couple sentences along those lines, that would be great.

Questions

And finally, there were a couple places where I ran into some trouble while working through the lesson.

  1. I got a bit lost in step 3 with the assigning of an arbitrary URL for our made-up schema (in this case, http://mydb.org#) and the referencing of other schema URLs. If it isn't opening too large a can of worms, it would be helpful to have a little more context about what is going on here and why a made-up URL works.
  2. In the same section, it might be helpful to have a small discussion about the different categories (for lack of a better term) of items that are being represented with "/id/" and "/doc/" etc. (if this can be done without drifting too far into schema design and the like.)
  3. I am noticing some inconsistencies in the IDs assigned to the annotation and doc elements between the code examples and the completed files you've linked to. I am not sure it matters, but to avoid confusion it should be consistent.
  4. As someone who works with Python and not XML, does the whitespace matter when creating the Turtle file?
  5. When creating the data in Fuseki, is it recommended to use "In-memory" or "Persistent" as the dataset type?
  6. In step 5, the first full query is missing the rdfs prefix. Also, having to call the prefixes was curious to me - I know it is covered in the SPARQL tutorial, but it would be useful to have a one sentence reminder of why they are necessary.
  7. You mention using the update endpoint before the second update query, but not before the first update. I ran into some trouble here, so it would be useful to provide that information, and even a little context about the different endpoints, a little earlier in the section.
  8. After the second update, it would be useful to provide the query to see the results of the update.
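
To make a few of these concrete for our discussion (the made-up namespace in question 1, whitespace in question 4, and the prefixes in question 6), here is a rough sketch of the kind of thing I mean — the names and data below are hypothetical stand-ins, not the lesson's actual content:

```turtle
# A made-up namespace works because RDF only requires that URIs be
# unique identifiers; nothing needs to be served at the address itself.
@prefix mydb: <http://mydb.org#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# Whitespace and line breaks are insignificant in Turtle; this
# statement could equally be written on a single line.
mydb:person1
    rdfs:label "Example person" .
```

And on question 6, my understanding is that a query must re-declare any prefixes it uses because each query is parsed on its own, independently of the data it runs against:

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?label
WHERE { ?subject rdfs:label ?label }
```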

Happy to talk through any of these!

jerielizabeth commented 7 years ago

Hi @whanley! I hope the new semester is off to a decent start for you. I want to check in on the status of the lesson and see if you are ready for me to solicit peer reviewers. I think this lesson will be a very useful addition, so I hope that we can keep moving forward on it.

Best!

jerielizabeth commented 7 years ago

Greetings @whanley! I am checking to see if you have had time to work on edits for this lesson and if you are still interested in moving forward.

While we can always reopen it in the future, I will close this ticket on June 15 if I have not heard from you.

whanley commented 7 years ago

Thanks @jerielizabeth for your input and your patience! I've uploaded a new version of the lesson, as well as some auxiliary files.

Here are answers to your specific questions:

Logistics

  1. Yes, .ttl files can be hosted with the lesson. Where should I put these--with the images? I understand the concern about problems with referring to an external SPARQL endpoint, especially considering what happened with the SPARQL lesson last year. Also, the endpoint I host was down for a couple months last fall, when you looked for it. That said, I think that having a chance to experiment a bit with the data via an endpoint is a useful way to extend the lesson, and I think it may be hard to find any externally supported endpoint that would do this reliably, and even harder for PH to contrive to host one themselves. So perhaps we should leave the link in with the understanding that it is not essential to the lesson, and I should try harder to make sure that the endpoint is always running?

  2. Yes, intermediate sounds about right--but reviewers might be better judges. I think the lesson follows on well from the two existing LOD lessons. I'm going to start work on an advanced lesson on ontologies.

Structure

I've added an overview and next steps.

Questions

  1. I've added a paragraph earlier on in step three that aims to clarify URI vs URL.
  2. I've also tried to clarify this with a slightly longer explanation.
  3. I think (hope!) I've resolved these discrepancies.
  4. I added a mention that whitespace doesn't matter.
  5. Added "in-memory" suggestion.
  6. Thanks for catching the mistake, and I've added a sentence explaining the declaration.
  7. Yes, thanks--I've reversed the order of the examples and made this much clearer.
  8. Added this now too.

I've made a good number of other changes as well. I hope it all makes more sense now.

jerielizabeth commented 7 years ago

@whanley Great to hear from you! I'm a bit swamped until the end of next week, but at that point I will look things over again and, I anticipate, start recruiting reviewers.

jerielizabeth commented 7 years ago

Hi @whanley this is looking great! Thank you for making the changes. For the hosting of the files on Programming Historian, the best place would be in the assets folder, in a folder named 'making-a-small-rdf-database'. And I'll ask the team for additional ideas on strategies for the SPARQL endpoint and how best to proceed for sustainability purposes.

The last concern I have before recruiting reviewers is with the very last section (paragraphs 55-57). This might be easily resolved with wording, but it's rather disappointing to get to the end and find that the very useful feature of being able to associate fields (rather than standardize them) is not something that I can do with the information in the lesson. Perhaps a "next steps" section where we can entice people to go through the effort of installing one of the more robust engines with the promise of this type of linking?

Thank you for all your work on this! I think this is a great lesson and I'm excited to see it nearing completion!

jerielizabeth commented 7 years ago

Hi again @whanley. I conferred with our resident sustainability and SPARQL experts, and the consensus is that it would be best to avoid the hosted endpoint server and focus instead on making sure that the reader has their own server running locally and is exploring their RDF data that way. That reduces the maintenance burden on everyone involved and puts the lesson on the strongest ground for long-term use.

I think to keep the lesson at intermediate, you should probably keep the focus on the data structure and the simpler interactions that users can accomplish with the Fuseki server. If the reader installed the Fuseki server earlier, would they be able to use the files you provide to experiment with the data manipulations? (Moving in this direction might make the lesson a bit longer, but I think it will strengthen it overall.)
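
To sketch what I have in mind (assuming the reader has downloaded Apache Jena Fuseki and is in the distribution directory; the dataset name `/ds` is arbitrary):

```shell
# Start Fuseki with an empty in-memory dataset at /ds and with
# SPARQL Update enabled; the web interface then runs at
# http://localhost:3030/
./fuseki-server --mem --update /ds
```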

whanley commented 6 years ago

Hi @jerielizabeth. I decided to change the server program I recommend in the lesson. I've substituted GraphDB for Fuseki. It's easier to install, and it offers quite a few more features, most importantly inference support. I've changed the set of sample queries as a result, and the last query works now. I'm working on a next lesson, as well, which will extend the work on inferencing.
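
To give a sense of what inference support adds, a quick made-up illustration (the class and instance names here are hypothetical, not the lesson's data):

```turtle
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/> .

ex:Consul rdfs:subClassOf ex:Official .
ex:person1 rdf:type ex:Consul .

# With RDFS inference enabled, a query for instances of ex:Official
# will also return ex:person1, even though that triple is never
# explicitly asserted in the data.
```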

Hope it's ready for review.

jerielizabeth commented 6 years ago

Hi @whanley. Thank you for making those changes!

I plan to look over the lesson next week and will be in touch!

jerielizabeth commented 6 years ago

Hi @whanley! Thank you so much for these changes. I know it's been a bit of a moving target, but I think the lesson is in great shape as a result.

I think we are indeed ready to send the lesson out for review! I'll send out a few inquiries for potential reviewers, who will have a month to complete the review. I'll let you know once the reviewers are confirmed. Once both reviews are completed, I'll look them over, summarize the feedback, and make some final suggestions. You'll then have four weeks to respond, get clarification on suggestions, and make any necessary changes.

Thank you again for all your work on this!

jerielizabeth commented 6 years ago

Update on the review for the lesson: James Smith and Bronwen Masemann have agreed to review the lesson, due February 5. They will post their feedback here.

jgsmith commented 6 years ago

Overall, I think this is a great lesson. It's definitely aimed at someone who isn't familiar with all of the technologies, but also isn't afraid of a text editor. The choice of GraphDB seems reasonable. I'll probably point my students to this sequence of tutorials as extra material.

Here are some quick notes and reactions as I read through the lesson. I've grouped them by section heading. Keep in mind that they are from someone who knows a bit more than the expected audience. Just as advanced study of a subject draws out things that are glossed over in less advanced courses, some of the things I point out can probably be left out to come back in future lessons. The main thing is to make sure readers can't poke holes in what is here.

Overview

Why RDF?

An example

Step 3: Translate into Machine-readable Language

Conclusion and next steps

jerielizabeth commented 6 years ago

Thank you, @jgsmith! These are really helpful comments and suggestions.

@whanley I will wait to hear from Bronwen before summarizing the reviews and offering guidance. You are welcome to chat with @jgsmith about his suggestions, but please don't make any changes to the lesson until the second review is in!

Thanks!

BronwenMasemann commented 6 years ago

I found this lesson overall very clear and interesting. I think the example of transcribing and storing primary source data will be intriguing to a variety of potential users. In contrast to the other reviewer, I came to this process as a person who had less knowledge of specific tools, but a fair bit of experience of teaching (information school) students about data models, metadata, and RDF. So my comments have more to do with presentation and communication.

Here they are:

para. 1 - To whet the appetite of the reader, I would suggest clarifying that this process enables not just recording of data but also manipulation.

para. 2 - I would suggest using a term other than “serial record.” What are the characteristics of the records that lead you to call them “serial records”? To librarians, “serial” generally means “issued periodically” (and a “serial record” means “the catalog record for a serial publication”). So I would avoid using the term “serial”. My understanding is that the characteristic of these records that make them appropriate for this kind of treatment is that they contain structured data. Therefore I think it would be more clear if you said “I often come across documents that contain information that is structured” or perhaps “documents that contain information that is structured and repetitive.”

para. 4. This paragraph introduces terminology and concepts, only to state that they will be skipped over. I think it would be more helpful to readers if you eliminated or moved much of this paragraph, and just included the sentences beginning “This tutorial . . .” and “It employs. . . "

The heading “an example”: I would provide a stronger heading here to explain the goal of this section. Perhaps “The problem of transcribing and storing structured data in documents.”

para. 8 - Instead of stating that this record was “already a database” I would suggest making the weaker claim that the data was already structured. I think that some readers would argue that whatever structured storage and retrieval system was being used (register? card file?) cannot be properly called a database.

para. 14 - I agree with the other reviewer’s comment that the use of XML would not necessarily require an “elaborate customized schema” and I as well encourage my students to consider what is available and then tweak it rather than reinventing the wheel. I am not sure if this is intentional but the structure you have set up here, of examining the three options, is very similar to that used in Hooland, S. van, & Verborgh, R. (2014). Chapter 2: Modelling. In Linked data for libraries, archives and museums: how to clean, link and publish your metadata. Chicago: ALA Editions. My students consistently tell me that this chapter is extremely clear and useful, and it may be helpful to you in sorting out how to express the distinctions between the options you present.

para. 20 and ff. - Overall I think that what would most strengthen your already excellent walk-through of your example is to include visual representations of the relationships between the entities. I’d recommend using the same format as the visuals in the Linked Open Data tutorial: https://programminghistorian.org/images/intro-to-linked-data/intro-to-linked-data-fig5.png.

para. 22 - I like your idea of thinking about what you are doing to enable a machine to read your data. However I would argue that starting right with step 1, the data is already machine readable - a machine could for example count the number of characters in the file. I would clarify here that what you are making machine readable is the semantic structure of the data - the specific identity of each entity, and the nature of the relationships between them.

para. 28 - Just to remind your reader, I would state at the end of this paragraph that person 1 is Mirzan Marie.

para. 35 - “to try it out.” What process will the reader be trying out?

para. 38 - I would link here to a tutorial on regular expressions.

para. 43 - I think a less confident reader would find the inclusion of links to other tools in this paragraph distracting. I would shift the intro to other options to the end of the tutorial. Similarly, I don’t think it’s necessary to qualify GraphDB here as “far from the last word”, as this distracts from what it is actually able to do.

jerielizabeth commented 6 years ago

Thank you, @BronwenMasemann! I am glad to see the feedback covering both the technical and the presentation aspects of the project!

I will read through the reviews in the next few days and get a summary to you, @whanley, by the end of the week!

jerielizabeth commented 6 years ago

Alright! Thank you, @jgsmith and @BronwenMasemann, for these excellent reviews!

This is a good combination of specific ideas and some general patterns to consider. As the author, @whanley, you of course have the final say on if and how the suggestions are incorporated.

In terms of general patterns, both reviews express concern about the balance between customized and standard schemas, encouraging a stronger emphasis on standardized vocabularies. I think that is a good point to consider, though I don't think it would require restructuring, just increasing the emphasis in places. Both reviewers also expressed concern about overwhelming readers with concepts in the opening section. I am going to pull in @alsalin to confirm whether external links to Wikipedia fit our sustainability practices, but if she agrees, I think those links make sense (fingers crossed that the articles don't change often).

@BronwenMasemann's suggestion about linking to other Programming Historian lessons is, of course, strongly encouraged, and I like her suggestion of using a similar format and visual strategies. For next steps, I think @jgsmith's suggestion about encouraging file distribution in places like Github is good, and it gives a next step that does not require another lesson.

Overall, though, these reviews are very positive and offer great suggestions for refining the lesson and making it that much stronger.

The next step is revision, with a preferred timeline of 4 weeks, making the due date March 15. @whanley, feel free to ask for clarification from myself or the reviewers as you edit, and let me know if you need additional time for the final changes.

Thank you again to everyone!

whanley commented 6 years ago

Thank you very much @jerielizabeth @jgsmith @BronwenMasemann. Your suggestions really improve this piece, and I think I'll be able to integrate almost all of them. I hope to get to this next week. Much appreciation.

alsalin commented 6 years ago

@jerielizabeth apologies for the delay, but this notification got buried in my email. As for Wikipedia links: using them for definitions of common terms is still considered good sustainable practice. We should encourage authors to use the permalink for an article (a snapshot of the wiki article at a given time) instead of the general URL (Wikipedia's preference for citation).

jerielizabeth commented 6 years ago

Greetings @whanley! I am checking in to see how the revisions are going and to see whether you need more time.

When you do push the changes, please include the issue number (#31) in the commit message so that I can track it back easily. (https://blog.github.com/2011-10-12-introducing-issue-mentions/)
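
A hypothetical commit message along these lines would do it (a plain mention like "#31" links the commit back to this thread; keywords such as "fixes #31" would also close the issue when merged, so a plain mention is safer while the review is still open):

```shell
git commit -m "Revise 'Making a Small RDF Database' per review feedback (#31)"
```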

jerielizabeth commented 6 years ago

Hi @whanley! I hope the semester has wrapped up (or is wrapping up) smoothly! Any updates from you as to when you'll be able to push up your revisions for this lesson? I would love to see this published in the next few weeks, as I am starting the process of rotating off the editorial board. Thanks!

jerielizabeth commented 6 years ago

@whanley checking in one last time before I rotate off the editorial team. I would really like to see your lesson through to publication, so I hope you can check back in and let me know where you're at within the next week. Thanks!

acrymble commented 6 years ago

I propose closing this submission. @jerielizabeth you have attempted many times to reach out to the author.

jerielizabeth commented 6 years ago

Thank you again to the reviewers on this lesson - @BronwenMasemann and @jgsmith - for your excellent feedback. I am going to close this issue, as I have not heard from @whanley. The lesson can be revived by the author in the future, but will be published under a new editor.

Best,

Jeri