w3c / scholarly-html

Repository for the Scholarly HTML Community Group

Improve RDFa examples #4

Open csarven opened 8 years ago

csarven commented 8 years ago

Utilise RDFa in the examples without repeating or hiding data. Reformulate the examples.

At the moment, the RDFa examples appear to encourage:

  1. repetition of data (violating the DRY principle with no gain) e.g., meta with schema:givenName, instead of factoring it within schema:name. If the intention is to have different purposes, i.e., the display 'name' being different than the givenName and familyName combination, describe the examples better instead of making premature generalisation with meta.
  2. hidden data, which is always at risk of getting desynchronised, and difficult to maintain.

For example, instead of doing stuff like:

<span typeof="schema:Person" resource="http://orcid.org/0000-0003-1279-3709">
  <meta property="schema:givenName" content="Bruce">
  <meta property="schema:familyName" content="Banner">
  <a href="http://orcid.org/0000-0003-1279-3709">
    <span property="schema:name">Dr. Bruce Banner</span>
  </a>
</span>

consider demonstrating along the lines of:

<a resource="http://orcid.org/0000-0003-1279-3709" typeof="schema:Person"
  href="http://orcid.org/0000-0003-1279-3709">
  <span property="schema:name">Dr.
    <span property="schema:givenName">Bruce</span>
    <span property="schema:familyName">Banner</span>
  </span>
</a>

The latter uses fewer nodes and is completely human-visible and machine-readable.

The examples should try to utilise as much of the HTML as possible e.g., the @href and @resource are repeated above - this is not wrong per se. The examples should formulate or at least consider the difference between the thing that the subject is about, and perhaps the relation to the @href e.g., schema:page, schema:url or whatever.

darobin commented 8 years ago

As the initial document explained (a part that we still need to bring back here) the format is not intended for human authoring. If that were the case, your example above would be a solid improvement but then again if human authoring were intended I would simply not use RDFa at all.

As input the processor is likely to get something like:

{
  "@id": "http://orcid.org/0000-0003-1279-3709",
  "@type": "Person",
  "name": "Dr. Bruce Banner",
  "givenName": "Bruce",
  "familyName": "Banner"
}

Obviously, it is possible to design an algorithm so that the latter could be generated, but I would be far more concerned about getting it wrong than I am with the simple data model dump.
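
For illustration, the "simple data model dump" could look something like the Python sketch below. It emits one element per property and never tries to match the name parts against the display name. The function name `render_person` and the choice of Python are assumptions made for this example; nothing in the thread says how the actual processor is written.

```python
from html import escape


def render_person(person: dict) -> str:
    """Dump a JSON-LD-like person record as HTML+RDFa, one element per property.

    Hypothetical helper for illustration only; not part of any real tool.
    """
    iri = escape(person["@id"], quote=True)
    rdf_type = escape(person["@type"], quote=True)
    lines = [f'<span typeof="schema:{rdf_type}" resource="{iri}">']
    # Structured name parts go into <meta>; they are never compared with "name".
    for key in ("givenName", "familyName"):
        if key in person:
            value = escape(person[key], quote=True)
            lines.append(f'  <meta property="schema:{key}" content="{value}">')
    lines.append(f'  <a href="{iri}">')
    display = escape(person["name"])
    lines.append(f'    <span property="schema:name">{display}</span>')
    lines.append("  </a>")
    lines.append("</span>")
    return "\n".join(lines)


person = {
    "@id": "http://orcid.org/0000-0003-1279-3709",
    "@type": "Person",
    "name": "Dr. Bruce Banner",
    "givenName": "Bruce",
    "familyName": "Banner",
}
print(render_person(person))
```

With the record above, this prints the same `meta`-based markup as the first example in this issue.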

It also has knock-on effects that are likely to prove surprising to people. If I style schema:givenName separately I would not expect it to have an influence on schema:name.

I am also considering an authoring format, but that does not need to be standardised.

csarven commented 8 years ago

Why did you assume that what I've suggested is for "human authoring" (hand-coding)?

The examples in the document should aim for data with high human-visibility and machine-readability and avoid repetition as much as possible. If a tool wants to produce hidden, cluttered markup and data, that's its own calling. However, we don't need to arbitrarily bring everything else down to that level. I strongly prefer to see high quality markup/data and smart tooling which can generate it.

If simplicity is the goal, we can of course output a hidden table with three columns - no please don't do this - or embed Turtle or JSON-LD in HTML (which have different use-cases, but the idea holds). But IMO, it is not all about simplicity.

darobin commented 8 years ago

I assumed you're targeting human authoring because otherwise I don't understand the motivation for what you're proposing, beyond coding style preferences. Considering that it requires more complex tooling to produce reliably and that it may lead to processing issues, what are the measurable advantages that offset the added cost?

csarven commented 8 years ago

The measurable advantages are as I've said in both comments above, i.e., there is redundant markup and data, and some of that data is hidden from humans when there is absolutely no need - that's simply bad practice. If the focus is on serving the machines, alternative RDF syntaxes would be more appropriate. By hiding and duplicating data, it is not taking advantage of RDFa to its full potential. The point of the examples is about the practice to publish. Whether you arrive at that by hand coding, using tools or by other means is entirely orthogonal.

Can you point me to a study which suggests that what I'm proposing is "more complex tooling to produce reliably and that it may lead to processing issues"? What are the measurable advantages that offset the added cost of duplicating and hiding data? What you are doing is basically dictating this whole scenario towards the development of simple and dumb tools. And, the main focus here is not about the tools, but about the quality of the publication of "scholarly HTML". It is important for the examples to reflect that.

darobin commented 8 years ago

I don't need a study to know that an algorithm that takes three arbitrary strings and finds non-overlapping occurrences of the first two in the last one in order to mark it up correctly, properly taking into account small details like word boundary detection in arbitrary (possibly unknown) languages is a lot more complicated and brittle than listing the same three strings with a different element name depending on whether the third one is present or not. This paragraph could be the whole study: option A requires internationalisation and algorithmic thinking, whereas option B is a single if-else statement; option B is much simpler.
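
To make the comparison concrete, here is a rough Python sketch of what "option A" might involve. The function name `nest_name_parts` is hypothetical, and the regular-expression `\b` check is a deliberately naive stand-in for real word-boundary detection, which would need language-aware segmentation. The function gives up (returns `None`) whenever a part is missing, ambiguous, or overlapping, at which point a tool would have to fall back to listing the strings separately anyway.

```python
import re
from html import escape


def nest_name_parts(name, given, family):
    """Try to wrap given/family inside the display name.

    Returns the nested markup, or None when it cannot be done safely.
    Hypothetical sketch; not an algorithm taken from any real tool.
    """
    spans = []
    for prop, part in (("givenName", given), ("familyName", family)):
        # Naive boundary test: \b is unreliable for scripts that do not
        # separate words with spaces.
        pattern = r"\b" + re.escape(part) + r"\b"
        matches = list(re.finditer(pattern, name))
        if len(matches) != 1:
            return None  # part is absent from, or ambiguous in, the display name
        spans.append((matches[0].start(), matches[0].end(), prop))
    spans.sort()
    if spans[0][1] > spans[1][0]:
        return None  # the two parts overlap
    out, pos = [], 0
    for start, end, prop in spans:
        out.append(escape(name[pos:start]))
        inner = escape(name[start:end])
        out.append(f'<span property="schema:{prop}">{inner}</span>')
        pos = end
    out.append(escape(name[pos:]))
    body = "".join(out)
    return f'<span property="schema:name">{body}</span>'


print(nest_name_parts("Dr. Bruce Banner", "Bruce", "Banner"))
print(nest_name_parts("R. Berjon", "Robin", "Berjon"))  # None: "Robin" is not in the display name
```

Even this short version needs three separate failure paths, and it still says nothing about names written in a different order or script than the parts.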

I am not sure where you see "redundant markup" and "duplicating data". Maybe you are under the impression that there is some form of mapping between givenName/familyName/etc. and the display name? There isn't. People often use a given/family concatenation but even in that case the order is not predictable. It certainly changes between languages, and in fact it even changes inside locales (e.g. in French older or more traditional/conservative people will use "family, given" instead of "given family" because that's how it was done until ~1960s). People may initial their given or additional names (or any combination thereof), or use an additional name as the one they prefer to be addressed as. Then there are hypocoristic forms or nicknames that people commonly use, like Tim or Dan. There are married women whose familyName comes from their husband but doesn't appear in the name (or vice-versa after divorce). And you can write books on preferences in honorifics.

So the alternative you suggest is not only more complex to produce and more likely to trip up styling, it is also potentially incorrect no matter what. Name modelling keeps these properties distinct, for good reason.

If you have it, what should be shown is the display name. But that is no reason not to share the information you have about the structured name if you happen to have access to it. It is useful for search since people might know that person under different handles or the display name might lack some useful information (notably if it is pseudonymic). But you also generally don't want to render names as "Robin Berjon Robin Berjon" if you can avoid it. From there, there are several options:

meta

<span property="schema:name">Dr. Bruce Banner</span>
<meta property="schema:givenName" content="Bruce">
<meta property="schema:familyName" content="Banner">

CSS

<span property="schema:name">Dr. Bruce Banner</span>
<span property="schema:givenName">Bruce</span>
<span property="schema:familyName">Banner</span>

with

[typeof="schema:Person"] [property="schema:name"] ~ [property="schema:givenName"],
[typeof="schema:Person"] [property="schema:name"] ~ [property="schema:familyName"] {
    display: none;
}

JSON-LD Island

<span property="schema:name">Dr. Bruce Banner</span>
<script type="application/ld+json"> 
{
  "@context": "link to correct context",
  "@id": "repeat URL already used above",
  "@type": "Person",
  "givenName": "Bruce",
  "familyName": "Banner"
}
</script>

So it's unclear to me how the first one isn't the most robust and the best practice. There's a reason why RDFa has meta. It's unclear to me if you're simply mistaken about name modelling or if you're arguing that it is bad practice to share the data you have in situ.

Also, you say "simple and dumb tools" like it's a bad thing.

csarven commented 8 years ago

I think you again went on in a direction based on false assumptions about my position. This is still even after I've explicitly said in my first comment:

If the intention is to have different purposes, i.e., the display 'name' being different than the givenName and familyName combination, describe the examples better instead of making premature generalisation with meta.

That translates to: use different examples to illustrate your point. If "Dr. Foo Bar" is the "display name", then use "X" for givenName and "Y" for familyName to make your point clear.

If 'display name' is composed of givenName and familyName, there is no reason not to reuse that information. So, the code examples (with markup and styling) you've provided are completely misleading because they are again based on another set of assumptions. There is no need to suggest that a single markup pattern should cover both of those cases (if not more).

Both of the examples below work:

<span property="schema:name">Dr. 
  <span property="schema:givenName">Bruce</span>
  <span property="schema:familyName">Banner</span>
</span>

<span property="schema:name">Dr. Bruce Banner</span>
<span property="schema:givenName">Robin</span>
<span property="schema:familyName">Berjon</span>

Is the difference clear? Both are completely human-visible and machine-readable. Parsers get exactly the same information as the human.

darobin commented 8 years ago

I beg to differ. Your first example there conveys the fundamentally wrong notion that the display name is anything but orthogonal to the structured name. Showing that the content looks repeated when they overlap exemplifies that that is the expected manner in which one should encode name information in HTML+RDFa. Otherwise (and this happens every time a developer thinks they can "simplify" names) they might be tempted to resort to the kind of compression trick you show here, or even worse to automate it poorly.

csarven commented 8 years ago

I'm afraid you are still missing my core point. Whatever the information is and how that should be represented, try to capture that 1) by precisely describing what that is and what you are about to mark up 2) start by doing it in a fashion that's.. [see above I don't want to repeat myself.]

Please try not to handwave the point on various ways of writing something simply because you have a different preference. Try not to generalize a mark up pattern for all cases without discussing or at least acknowledging their differences. There is absolutely no need to prematurely throw a blanket over all of them. "The way".

I'm trying to help you write the examples and the code which reflect that. If you want to describe an example where it makes sense to hide and repeat data, knock yourself out - which will at least bring it up to 3 UCs as discussed in this issue alone!

iherman commented 8 years ago

Sigh... This discussion sounds a little bit over the top to me... What it shows is that there are so many different aspects that influence the style of "coding" (is it easy to handle with CSS, for example, like @darobin's example shows) that picking one principle (like 'DRY') as the governing one is not really working either.

Let me say why I do not like the following:

<a resource="http://orcid.org/0000-0003-1279-3709" typeof="schema:Person"
  href="http://orcid.org/0000-0003-1279-3709">
  <span property="schema:name">Dr.
    <span property="schema:givenName">Bruce</span>
    <span property="schema:familyName">Banner</span>
  </span>
</a>

The problem I have is that it relies on a "trick" in RDFa which is far from being obvious, namely how the exact definition of RDFa works with regards to nesting. It is not at all obvious for a casual reader that the external <span property="schema:name"> generates a triple using all the text in that DOM node, and then the children <span> generate separate triples using the same subject for further textual content (all the more because nesting becomes very different with other RDFa attributes!). Unfortunately, RDFa has some elements of over-engineering to avoid, among other things, DRY violation, but the price to pay is that the structure is often difficult to grasp for a lambda user.
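
The nesting behaviour described here can be illustrated with a small Python sketch. It is emphatically not a conformant RDFa processor: it ignores `resource`, `typeof`, `href` and every other attribute, assumes well-formed markup, and only collects (property, text) pairs. The class name `PropertyTextCollector` and the helper `property_pairs` are made up for this example. Whitespace is collapsed for readability; a real processor may well keep it in the literal.

```python
from html.parser import HTMLParser


class PropertyTextCollector(HTMLParser):
    """Collect (property, text) pairs. Illustration only, not real RDFa."""

    def __init__(self):
        super().__init__()
        self.open = []   # one [property-or-None, text chunks] entry per open element
        self.pairs = []  # collected (property, text) pairs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            # Void element: the value comes from @content, not from text.
            if "property" in attrs:
                self.pairs.append((attrs["property"], attrs.get("content", "")))
            return
        self.open.append([attrs.get("property"), []])

    def handle_data(self, data):
        # Text counts towards EVERY open element that carries @property,
        # which is why the outer schema:name sees the nested text too.
        for prop, chunks in self.open:
            if prop is not None:
                chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "meta" or not self.open:
            return
        prop, chunks = self.open.pop()
        if prop is not None:
            text = " ".join("".join(chunks).split())  # collapse whitespace
            self.pairs.append((prop, text))


def property_pairs(markup):
    collector = PropertyTextCollector()
    collector.feed(markup)
    collector.close()
    return collector.pairs


nested = """
<a resource="http://orcid.org/0000-0003-1279-3709" typeof="schema:Person"
  href="http://orcid.org/0000-0003-1279-3709">
  <span property="schema:name">Dr.
    <span property="schema:givenName">Bruce</span>
    <span property="schema:familyName">Banner</span>
  </span>
</a>
"""
for pair in property_pairs(nested):
    print(pair)
```

Under these simplifications the nested markup yields three pairs, with the outer `schema:name` value assembled from the text of its children. That is the non-obvious part for a casual reader.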

On the other hand:

<span typeof="schema:Person" resource="http://orcid.org/0000-0003-1279-3709">
  <meta property="schema:givenName" content="Bruce">
  <meta property="schema:familyName" content="Banner">
  <a href="http://orcid.org/0000-0003-1279-3709">
    <span property="schema:name">Dr. Bruce Banner</span>
  </a>
</span>

is clear and unambiguous with even a moderate or almost no knowledge of RDFa (it is, actually, more or less the copy of what one would do in microdata).

I am afraid this is the case where my personal choice goes for readability even if it goes against DRY...

pjohnston-wiley commented 8 years ago

I prefer @iherman's version, but for business reasons. In this specific example, the idea that schema:name always derives from a combination of schema:givenName and schema:familyName is, I think, flawed. The schema:name here reflects how a person wants to be known in their public persona as an author - it is a displayed name. In non-academic publishing you could think of this as a nom de plume. ORCID calls this 'published name' (see below). The name parts are then what are used for more official purposes. The use of <meta/> is then appropriate - you can derive other display properties from these, but they are exactly that, derived.

image