w3c / scholarly-html

Repository for the Scholarly HTML Community Group

Reuse of existing ontologies #6

Open essepuntato opened 8 years ago

essepuntato commented 8 years ago

While the use of Schema.org could be a good option given its current wide usage, I would like to suggest that we not redefine classes again and again (e.g., sa:Abstract, sa:formula, etc.) if they have already been defined in existing and well-established ontologies used by the community, e.g., the Document Components Ontology (DoCO) (see also a brief presentation of it on the SPAR Ontologies website).

[Disclosure: I'm one of the co-authors of DoCO]

csarven commented 8 years ago

Strong +1 from me to reuse and exemplify with existing vocabs. Not sure what the ns is for sa, but if I'm not mistaken, SPAR predates it and has wider, mature, and intended coverage.

Most importantly, the ideas should be expressed in a way that is agnostic of the vocabs. That permits different ways to publish (e.g., with various vocabs and markup) and to participate, instead of prematurely implying the best vocab or modelling to use. Perhaps this point should be emphasised early on. The examples would then merely serve the general ideas and not strictly be about "how to mark up x".

darobin commented 8 years ago

One important aspect to consider is that this is a document format before it is an RDF format. Syntax matters a lot because it can readily be expected that a lot, if not most, of the processing will happen at that level. For instance, styling is done by targeting RDFa attribute selectors. It's bad enough (and has been a recurrent source of bugs) to have to remember whether something is schema: or sa: (which is why I would like to get rid of sa: altogether). Having to remember for each selector whether it also ought to be doco, deo, fabio, etc. would make the system impractical. If you can't give a few simple rules to a CSS developer who doesn't understand RDF (and couldn't address the RDF model if she did) then a lot of the value of this format disappears.

Likewise, many if not most transformation and extraction processes are likely to operate on the DOM rather than at the RDF layer. Similar concerns apply.
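To make the DOM-level concern concrete, here is a minimal sketch of an extraction step that selects elements by the literal tokens in @typeof, much like a CSS `[typeof~="sa:Formula"]` selector would. The sample markup, the `select_by_typeof` helper, and the pairing of `sa:Formula` with `doco:FormulaBox` are assumptions made for illustration, not taken from the Scholarly HTML draft.

```python
from html.parser import HTMLParser


class TypeofCollector(HTMLParser):
    """Collect (tag, id) pairs for elements whose @typeof contains a given token."""

    def __init__(self, token):
        super().__init__()
        self.token = token
        self.matches = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # @typeof is a whitespace-separated token list; match on the literal string.
        tokens = (attributes.get("typeof") or "").split()
        if self.token in tokens:
            self.matches.append((tag, attributes.get("id")))


def select_by_typeof(html, token):
    """Hypothetical helper: return elements typed with exactly this prefixed token."""
    collector = TypeofCollector(token)
    collector.feed(html)
    return collector.matches


# Made-up sample: two figures typed with different prefixes.
SAMPLE = """
<article typeof="schema:ScholarlyArticle">
  <figure typeof="sa:Formula" id="f1"></figure>
  <figure typeof="doco:FormulaBox" id="f2"></figure>
</article>
"""

print(select_by_typeof(SAMPLE, "sa:Formula"))  # [('figure', 'f1')]
```

The match is purely syntactic: the second figure is not found, whatever equivalences might be declared at the RDF level, so every selector has to know which prefix was used.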

Having said that, @essepuntato, addressing prefix proliferation at the syntactic level does not mean that we can't do the right thing at the RDF level! A fair number of the SA classes are aliased to SPAR, and more aliasing is certainly welcome. The section-typing classes might not stay if there is a way for DPUB ARIA (or similar) to capture them, but for whatever stays in the ontology I'm certainly more than happy to see equivalences expressed.

@csarven I don't see how we could be vocabulary-agnostic yet interoperable. Obviously, people who wish to add more data using other vocabularies can certainly do so, but at this stage my concern with the current document is that it is insufficiently constrained to drive production systems (e.g. I couldn't carry out citation formatting without a more stringent definition of what goes into a citation). I don't see how we could make it abstract and agnostic but still actually useful for scholarly interchange.

TzviyaSiegman commented 8 years ago

It's important to note that document structure is handled purely by HTML elements and ARIA/DPUB-ARIA roles (example: the fact that a list item is an endnote). The only information that is added with RDFa (schema.org and the sa: prefix) are those elements that should be picked up by RDFa browsers, such as citations and author information. The only exception to this is the abstract. This component of the article occupies a special space in the scholarly world in that it is both content and metadata. The DPUB-ARIA module recognized the need for users of all abilities to find the abstract quickly. SA: recognizes the need for RDFa browsers to expose the abstract. (Disclosure: I am one of the authors of DPUB-ARIA)

csarven commented 8 years ago

@darobin Sorry, I still don't see the argument as to why use sa. Who maintains it? Who has access to it? Which serialisations are available when I dereference it? Will the namespace be around in 5 years? What's the reason to not use SPAR? I use SPAR. But, again, our preferences shouldn't dictate what goes into this document. If it needs to have minimal dictation (and I can certainly come to terms with that), then I suggest sticking to schema.org and SPAR - they are sufficiently well-known/used, and developed over the years. I'm sorry that I can't say the same nor trust sa at this time.

Interoperability in this case happens when publishers use well-known vocabularies which might also have mappings to other vocabularies, or when consumers are inclined to write application logic to work with the data. We don't magically have interoperability because a document tells a developer to use vocabulary x.

darobin commented 8 years ago

@csarven Well, I've provided solid reasons not to bring in a host of SPAR vocabularies, otherwise you end up in situations like this (just one example amongst many) and you start feeling the need for a spaghetti diagram. If you have another unitary namespace designed to be folded into schema.org, I'm certainly interested.

The goal of SA is to be absorbed into schema.org, or to see its need reduced through other means. It is currently published here and maintained over there. It is being provided to the community in the same way as Scholarly HTML, so its longevity potential is identical.

Again if you have an alternative proposal that keeps the whole of our needs inside of two prefixes (rdf and xsd are also used but aren't typically style or processing targets) and has a path forward to a single-prefix world, I'm certainly interested to hear it.

Concerning interoperability, if SPAR had anywhere near schema.org-level adoption we could consider commonality as trumping usability (since developers would be likely to know it). But SPAR is only known inside the RDF world, and even there the SA ontology is probably the system I've encountered that uses it the most.

csarven commented 8 years ago

I haven't heard the argument for a monolithic vocabulary. I'm not implying that that's right or wrong, but I'm forced to ask because you seem to suggest that without strong evidence why that'd be preferable. I don't see anything wrong with multiple small vocabularies designed to cover what they are specifically made for. In actuality, it is cheaper to load/reuse small vocabularies instead of loading up giant vocabularies and then having to use only a few terms.

With sa, what you are saying is that the vocabulary is maintained by a company which has a particular interest and investment towards x,y,z and so the vocabulary that should be recommended and exemplified by a W3C CG should reflect that company's interests?

I hope you do realize that it is trivial to just republish all of SPAR under a single ns. Should we use that argument to use SPAR since it happens to have better coverage? Would you feel comfortable if I create a vocabulary which closely reflects the needs of my tool and put that up on equal grounds here?

Why would anyone want to use sa and then maybe if it gets subsumed into schema.org as you suggest, then they'll happily go back and update all the ns/prefixes in their data? What evidence do you have for that? If sa was serious about empowering schema.org with the new additions, 1) the complete discussion (of all the terms that are currently in sa) would take place at github/schemaorg, and 2) not bother with the sa vocab to begin with.

So, pardon me, but I find it quite distasteful to see a company barging in on a W3C CG, 1) naming the group after their tooling or whatever, 2) insisting on using their ns. Seriously?

essepuntato commented 8 years ago

Hi @TzviyaSiegman,

I totally agree that the use of DPUB-ARIA is a very good option for specifying structural semantics of HTML elements, and I really like the approach, as you know.

The only information that is added with RDFa (schema.org and the sa: prefix) are those elements that should be picked up by RDFa browsers, such as citations and author information. The only exception to this is the abstract.

I see your point. But then, in order to foster reusability and applicability of the Scholarly HTML format/guidelines with existing implemented tools, it would be better to reuse existing, well-known and widely-used terms instead of inventing a new vocabulary for expressing the same things. I'm not speaking only about SPAR here, but also about Dublin Core, and other standards of this kind.

And, @darobin,

I've provided solid reasons not to bring in a host of SPAR vocabularies, otherwise you end up in situations like this (just one example amongst many) and you start feeling the need for a spaghetti diagram. If you have another unitary namespace designed to be folded into schema.org, I'm certainly interested.

Usually, I prefer to reuse an existing vocabulary/ontology if it has been already developed and even used by different communities for describing a particular purpose/domain. I understand your concern about the "spaghetti diagram" but, honestly, it is not an issue to me: in this context, reusability of shared knowledge wins against a brand-new reinvention of the same things – in particular if you consider that the Scholarly HTML guidelines/format is for machine interoperability first, not for human understanding.

Anyway, if the idea of using a limited number of "prefixes" is the only way for explicitly including (RDF) semantic data within Scholarly HTML, and we decide we really need it, then I see only one option (as @csarven already suggested): to propose an extension of Schema.org (with all the alignments to all the other well-known existing models) from the very beginning, instead of introducing yet another vocabulary into the game.

I have to confess the latter is not my preferred option – I still believe reuse of shared and well-known existing terms is the right path – but, still, it is an option that we could investigate somehow.

darobin commented 8 years ago

@csarven If you're unhappy it's fine to point fingers, but it's generally better to do one's homework first. You seem to have some kind of conspiracy theory about a company carrying out some namespace takeover, yet right in the relevant section of the document there is an inlined issue that states: "The current URL for the Scholarly Article vocabulary is http://ns.science.ai/. It may be desirable (should the vocabulary persist) to use a different URL. But this issue might go away if schema.org steps up."

Hey, I wrote it so maybe I'm wrong, but I don't think that sounds like "insisting" on anything. I'm not sure what you mean by naming the CG after our tooling, we have no tooling called "Scholarly HTML", a term that predates the creation of the company by several years. But then I know your first proposal was to rename the group to match your own project so maybe you believe everyone thinks that way?

We've built stuff, we think it's useful, we're trying to share it. I'm not sure what's wrong with that.

I don't care about monolithic vs non-monolithic vocabularies, that's largely an orthogonal issue; I do care about prefix proliferation because experience shows it's a source of bugs. You suggest aliasing SPAR into a single prefix, well... that's what the majority of SA classes do (and if we're missing aliases, I'm happy to take PRs). I'm glad we had the same idea on this one.

Updating prefixes: if sa were folded into schema I would update prefixes across all of our production systems the same day. Simplicity is a win.

We've worked with the schema community before (e.g. https://github.com/schemaorg/schemaorg/issues/975, https://github.com/schemaorg/schemaorg/issues/383, and quite a few others) but so far on individual bits and pieces. It seems more productive to interact with them using more context. "We tried to model all the core parts of a scholarly article using schema.org; these are the bits that are missing." There's a reason Dan is ack'ed in the original proposal: we like them, we think it's useful, and we'd like to figure out more effective ways of collaborating and exposing scholarly information on the Web.

So to conclude it's okay to be angry and find things "distasteful", I actually enjoy strong opinions, but it works better when such feelings aren't rooted in imagined conspiracies.

darobin commented 8 years ago

@essepuntato I completely understand your reluctance where reuse is concerned here. As I was telling hadley the other day, part of the problem (to me) is that a long table of specialised vocabularies reminds me a lot of SOAP. At some point (I don't know how old you are, you may or may not recall this era), there was a WS-* for pretty much every single type of interaction that could take place between two pieces of software. In terms of reuse it had pretty neat properties, but in practice there were so many namespaces involved that developers quickly lost their kittens.

One of the great things with the RDF universe, unlike the XML (or WS-*) universes, is that there are mechanisms for describing equivalences and derivation. As far as I'm concerned, SA uses SPAR, it's just not immediately obvious :)

Regarding folding into schema, I agree, but as I was just saying it's easier to first figure out what is needed and then have that conversation than to have it in small parts. Context helps make things clearer.

essepuntato commented 8 years ago

Hi @darobin

I don't know how old you are, you may or may not recall this era

Let's say I remember it introduced in the Web Services course when I was a student, but honestly I don't have any practical experience of its use and implementation ;-)

One of the great things with the RDF universe, unlike the XML (or WS-*) universes, is that there are mechanisms for describing equivalences and derivation

Eh... this is a discussion I've already done several times with other Semantic Web guys. Basically, there are two main approaches to ontology development:

  1. To create a new ontology by including directly all the entities coming from already defined vocabs/ontologies if they fit the purpose. Pros: high and fast reusability of the data described according to that model (less computational effort for enabling interoperability). Cons: prefix proliferation (more cognitive effort for humans, more difficult to maintain).
  2. To create a new ontology where each term is defined within a unique namespace, and then align them to other external entities defined in existing vocabs/ontologies if they have the same actual semantics. Pros: just one prefix (less cognitive effort for humans, easy to maintain). Cons: I need to follow, in some way, all the possible equivalent/subclass/subproperty/etc. relations to link potentially-related data (more computational effort for enabling interoperability).

Thus, since RDF has been made mainly for enabling machine-processing, I strongly prefer option 1 since it is more machine-friendly and makes it easier, without using additional tools (e.g., reasoners), to connect data at the model level – to me, easy and fast reuse is always the winning strategy. However, others, who mainly think about maintenance of the ontology, prefer the second option. It's just a matter of perspective, I think.
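The extra work that option 2 puts on consumers can be sketched roughly as follows. The `ALIGNMENTS` table and the helper names are hypothetical and only illustrate the trade-off; they are not the actual SA or SPAR alignments, and the sketch assumes the alignment links contain no cycles.

```python
# Hypothetical alignments (option 2): terms in a single namespace mapped to
# terms defined in external vocabularies. Illustrative only.
ALIGNMENTS = {
    "sa:Figure": "doco:FigureBox",
    "schema:ScholarlyArticle": "fabio:Article",
}


def canonical(term, alignments):
    """Follow alignment links until a term with no further mapping is reached."""
    seen = set()
    # The 'seen' set only prevents an endless loop; acyclic links are assumed.
    while term in alignments and term not in seen:
        seen.add(term)
        term = alignments[term]
    return term


def same_concept(a, b, alignments):
    """Under option 2, two terms must be resolved before they can be compared."""
    return canonical(a, alignments) == canonical(b, alignments)


# Option 1: both datasets use the shared term directly, so a plain comparison works.
print("doco:FigureBox" == "doco:FigureBox")

# Option 2: the comparison only succeeds after following the alignment.
print("sa:Figure" == "doco:FigureBox")
print(same_concept("sa:Figure", "doco:FigureBox", ALIGNMENTS))
```

A real consumer would need more than a lookup table (subclass and subproperty relations, for instance), which is the "more computational effort" mentioned in the cons of option 2.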

As far as I'm concerned, SA uses SPAR, it's just not immediately obvious :)

I was aware of that – you are already on my update list of SPAR adopters I have to add to the website.

I was just saying it's easier to first figure out what is needed and then have that conversation than to have it in small parts.

I see. Honestly, I'm not (yet) familiar with how to interact with the schema.org folks. However, if we decide to follow the "one prefix only" strategy, it would be good to have at least a draft schema.org vocabulary (aligned with the rest of the ontology world) available for inclusion in the first official draft of the Scholarly HTML format.

darobin commented 8 years ago

Hi @essepuntato,

I'm sorry you had to go through a course on Web Services, but at least you didn't have to use them ;)

I think you make a great distinction in your 1 vs 2 options above. I think that may be where we can "agree to disagree". A format like SH somewhat painfully straddles the HTML/RDF divide and up to a point the human/machine divide. Notably, I don't just want RDF for the data, I also want it for human-oriented purposes such as styling. If you look at https://github.com/scienceai/scholarly-css/ (still very much work in progress, but it's the styling behind https://research.science.ai/), everything in there targets semantics: there are no class or id selectors.

This is not only convenient (no need to agree on shared class names in addition to the shared ontology in order to have reusable styling) but it's also strongly supportive of increasing how much data you share (it also helps find data bugs since things can lose styling because of that). RDF becomes pretty!

It's sort of Microformats in reverse: instead of using styling information to infer semantics, we use semantics to apply styles.

Based on that I guess you can understand my strong human-orientation :)

I would very, very much prefer to solve the ontological issue before v1; I think it should be a requirement (or at the very least it should be a requirement to try — if it fails we'll look at a backup plan). I don't know about "official" drafts, there's nothing official about a CG. Mostly I just find it easier to have discussions based on a concrete draft (which seems to work!). There's a lot that I'm not completely happy with in the draft we've been writing, I expect the same for others.

essepuntato commented 8 years ago

Hi @darobin,

I see what you mean, and I totally agree on using neither ids nor classes for the CSS – in RASH we have started to follow this approach as well (when possible).

However, agreeing to disagree, I still see two main issues here.

The first one is about using or not using RDFa for expressing structural semantics of elements (which is closely related to issue #5). For instance, in the current documentation figure types (i.e., their intended structural semantics) are specified by means of @typeof, while in other cases (e.g., endnotes) the use of @role according to the ARIA/DPUB-ARIA spec is preferred. To me, only one approach should be used for structural semantics, and the ARIA one looks more promising, in particular from the Web accessibility point of view. I'm not saying RDFa shouldn't be used at all, but that I would prefer to avoid it for structural semantics.

Note that using RDFa for structural semantics would add to the attributes one has to specify on HTML elements, since the use of @id+@about for identifying the link between a fragment and a particular semantics is mandatory. <figure typeof="sa:Formula" id="formula_1"> is translated as [] a sa:Formula, which basically does not explicitly link that fragment with the concept of being a formula box, while <figure typeof="sa:Formula" id="formula_1" about="#formula_1"> conveys the right information, i.e., <#formula_1> a sa:Formula.

(This mechanism holds for any HTML element on which one wants to express structural semantics by means of RDFa)
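The difference between the two figure examples can be sketched with a deliberately simplified subject rule: use @about if present, otherwise mint a blank node for a typed element. This is an assumption-laden illustration, not an RDFa processor: it ignores @resource, @href, @src, @vocab, base IRI resolution, and subjects inherited from parent elements, and the `typeof_triples` helper is hypothetical.

```python
from html.parser import HTMLParser
from itertools import count


class TypeofTriples(HTMLParser):
    """Emit (subject, 'a', type) triples for elements carrying @typeof."""

    def __init__(self):
        super().__init__()
        self.triples = []
        self._blank_ids = count(1)

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        types = (attributes.get("typeof") or "").split()
        if not types:
            return
        about = attributes.get("about")
        # Simplified rule: @about names the subject; otherwise use a fresh blank node.
        # Note that @id plays no part in choosing the subject.
        subject = f"<{about}>" if about else f"_:b{next(self._blank_ids)}"
        for rdf_type in types:
            self.triples.append((subject, "a", rdf_type))


def typeof_triples(html):
    parser = TypeofTriples()
    parser.feed(html)
    return parser.triples


print(typeof_triples('<figure typeof="sa:Formula" id="formula_1"></figure>'))
# [('_:b1', 'a', 'sa:Formula')]
print(typeof_triples('<figure typeof="sa:Formula" id="formula_1" about="#formula_1"></figure>'))
# [('<#formula_1>', 'a', 'sa:Formula')]
```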

The other issue is still about the reuse of existing ontologies vs. the use of a monolithic one (even if aligned with the rest of the world). Your argument in the previous comment, towards the human-orientation, still does not convince me. Just to clarify, I like doing things that are easy for humans to understand, but honestly I don't see how using "schema:ScholarlyArticle" and "sa:Figure" (or even a possible future "schema:Figure") is more human-oriented than "fabio:Article" and "doco:FigureBox". CSS rules, as well as any other kind of Javascript manipulation, work well even with the latter ones.

So, again, here there is a choice that I think we cannot impose – it is still a community group after all – and should be appropriately discussed by all the people involved. (Guys, any comment?)

Finally, while I don't think that using RDFa is a good choice for defining structural semantics, it can be a good choice for providing a semantic description of the bibliographic references. But then why reinvent the wheel, when a lot of proposed approaches already use existing ontologies developed exactly for that purpose?

Summarising, two things to discuss with all the participants:

  1. for defining structural semantics: ARIA (i.e., @role) vs. RDFa (i.e., @typeof + @about + @id)
  2. about ontologies: reuse ontologies vs. monolithic ontology

My position, as stated before, would be 1) ARIA, and 2) reuse ontologies. Or, even more drastic, I would also like to go with just 1) ARIA, without caring about 2) – thus, without imposing on all the producers of Scholarly HTML documents any particular vocabulary for describing, semantically, data within the documents themselves. Is that approach so wrong? Is it such a big issue that would prevent the Scholarly HTML format from being interoperable?

I'm really looking forward to hearing others' opinions.

iherman commented 8 years ago

Wow! Let the religious war begin! Or maybe not, let us try to avoid it...

Some history: @darobin referred to his ws-* experience; well, I am (un)happy to say that I even remember when ws-* did not even exist and we all were convinced that namespaces are the greatest invention since sliced bread! (That may be what @darobin had courses on at university.) We both definitely remember, however, the big HTML5 vs. XHTML2 war, and many of us still bear the scars of that war. This debate is very much related to that conflict, too.

More seriously:

(B.t.w., one of the reasons of schema.org's success, I believe, is that they recognized the failure of namespaces in practice, and that HTML authors, webmasters, etc, would only go with semantic markup if it was very very simple.)

Where does this lead us? If we were dealing with JSON-LD only, then we would have no problem. JSON-LD's @context is a wonderful tool to hide the complexity of namespaces from the average user. Both communities are happy, and JSON-LD is the bridge. This approach has been used by many other groups by now (Web Annotations, CSV on the Web metadata, to name just two). It just works.

RDFa, alas, does not have this. As far as I can see, SA is an attempt to provide some sort of an intermediary solution: create a vocabulary that, conceptually, is a large set of owl:sameAs statements into SPAR, or DEO, or Schema, or maybe other vocabularies. (I realize that there are some genuinely SA terms which are subclasses of, say, SPAR classes; these can be discussed individually and, maybe, those can migrate into, say, SPAR. This does not change the general approach.) In other words, SA seems to fix the missing JSON-LD @context feature for RDFa in this particular area; ain't pretty, but it works. Actually, if this is the only vocabulary used in an HTML context, then the @vocab attributes make the CURIEs disappear, too, much like JSON-LD and @context. I see SA as a pragmatic approach; purebred RDF users will have to hold their nose a bit, but we should realize that we are talking about the acceptance of these by a community that has antibodies for RDF in the first place!
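How @vocab makes the CURIEs disappear can be sketched as a small term-expansion function. This is a simplification written for illustration: the `expand` helper is hypothetical, the `sa` namespace IRI is borrowed from the Turtle example later in this thread, and returning `None` for an unresolvable bare term is a choice made for this sketch only.

```python
SA_NS = "http://ns.science.ai#"  # taken from the Turtle example in this thread
PREFIXES = {"sa": SA_NS, "schema": "http://schema.org/"}


def expand(token, prefixes, vocab=None):
    """Expand a @typeof token to a full IRI, using a prefix map or a default vocabulary."""
    if ":" in token:
        prefix, local = token.split(":", 1)
        if prefix in prefixes:
            # CURIE form: the author has to remember and write the prefix.
            return prefixes[prefix] + local
        return token  # not a known prefix; treat it as an IRI already
    if vocab is not None:
        # Bare term: resolved against the default vocabulary, no prefix needed.
        return vocab + token
    return None  # sketch-only behaviour for a bare term without a vocabulary


# typeof="sa:Formula" with a declared prefix ...
print(expand("sa:Formula", PREFIXES))
# ... and typeof="Formula" under a default vocabulary give the same IRI.
print(expand("Formula", {}, vocab=SA_NS))
```

With a single vocabulary, authors and stylesheet writers would only ever see the bare term, which is the point being made about hiding namespaces.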

So yes (and I will probably be considered a traitor by the RDF community), my vote here is to make namespaces as invisible as possible. In the long term, I also agree that the goal should be to get all these terms into schema.org. Let us realize that, for scholarly papers and for their authors, being found on the Web by tools like Google Scholar is absolutely vital.

A few more specific notes from the thread:

@essepuntato:

My position, as stated before, would be 1) ARIA, and 2) reuse ontologies. Or, even more drastic, I would also like to go with just 1) ARIA, without caring about 2) – thus, without imposing on all the producers of Scholarly HTML documents any particular vocabulary for describing, semantically, data within the documents themselves. Is that approach so wrong? Is it such a big issue that would prevent the Scholarly HTML format from being interoperable?

As I stated elsewhere, I agree with the predominance of ARIA when appropriate. But for the places where RDF(a) is appropriate, I believe one vocabulary is important and, actually, I would look at schema.org as our goal sometime in the future.

@csarven

Why would anyone want to use sa and then maybe if it gets subsumed into schema.org as you suggest, then they'll happily go back and update all the ns/prefixes in their data? What evidence do you have for that? If sa was serious about empowering schema.org with the new additions, 1) the complete discussion (of all the terms that are currently in sa) would take place at github/schemaorg, and 2) not bother with the sa vocab to begin with.

Yes, ideally we should talk to schema.org as soon as possible because this vocabulary should indeed end up there. That being said, there are examples of "external" vocabularies that ended up in schema.org, which led to some period of "double" namespace usage. GoodRelations is probably the best example.

Ivan

P.S.1.: @darobin, I also realized that the JSON-LD @context is not 100% o.k. in the larger picture. A precise vocabulary of SA in, say, Turtle, would contain something like:

<http://ns.science.ai#Abbreviations> owl:sameAs <http://purl.org/spar/doco/Glossary> .

however, the JSON-LD says something like:

"Abbreviation": { "@id": "http://purl.org/spar/doco/Glossary" }

These are not the same; alas, the RDF expressed in Turtle makes use of the OWL semantics for identity, whereas the JSON-LD makes a direct mapping on URLs... I.e., Turtle expresses a semantic equivalence, whereas the JSON-LD @context is simply a syntactic trick...
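The distinction can be sketched by comparing the triples each approach would leave in a graph. The helper names are hypothetical, `rdf:type` and `owl:sameAs` are kept as short strings for readability, and no real JSON-LD or OWL library is involved; the two IRIs are the ones from the snippets above.

```python
RDF_TYPE = "rdf:type"
OWL_SAME_AS = "owl:sameAs"
SA_TERM = "http://ns.science.ai#Abbreviations"
DOCO_GLOSSARY = "http://purl.org/spar/doco/Glossary"

# Context-style mapping, shaped like the JSON-LD fragment above.
CONTEXT = {"Abbreviation": {"@id": DOCO_GLOSSARY}}


def triples_from_context(subject, term, context):
    """Syntactic mapping: the term is rewritten to the target IRI; the SA IRI never appears."""
    return {(subject, RDF_TYPE, context[term]["@id"])}


def triples_from_ontology(subject):
    """Semantic equivalence: data uses the SA IRI, and a separate statement links it to DoCO."""
    return {
        (subject, RDF_TYPE, SA_TERM),
        (SA_TERM, OWL_SAME_AS, DOCO_GLOSSARY),
    }


def asserted_types(subject, triples):
    """Types stated directly in the graph, without any OWL reasoning."""
    return {o for s, p, o in triples if s == subject and p == RDF_TYPE}


print(asserted_types("_:x", triples_from_context("_:x", "Abbreviation", CONTEXT)))
print(asserted_types("_:x", triples_from_ontology("_:x")))
```

In the first graph the resource is directly a DoCO glossary; in the second it is only an SA abbreviations section until something applies the owl:sameAs statement.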

P.S.2.: while we are having historical reminiscences: at some point the RDFa WG introduced the notion of a @profile: it was a reference to a separate file in which one could define terms to an arbitrary URI, could define loads of prefixes, etc. Conceptually, an RDFa processor could download such a profile to then control the interpretation of terms and prefixes. Exactly like @context in JSON-LD. (I even had an implementation, using some local cache, for my RDFa Distiller!). There was a violent push back from the HTML5 WG on this, calling the feature (and probably the authors) by all kinds of names. So we removed the feature. Luckily, the JSON-LD guys did not chicken out...

darobin commented 8 years ago

@iherman Wow, thanks for the very good post :) I think you hit the nail very much on the head.

Your description of where I'm coming from here is very correct. I think that technology is usable when you only need to understand the underlying theory for advanced cases. JSON-LD is great for that, people just use it without ever knowing what RDF means, and it just works. Things like Facebook's GraphQL and Netflix's Falcor are attempts at coming to terms with the fact that data of increasing complexity will tend towards a graph; JSON-LD is a couple steps ahead of what they're building.

RDFa isn't at that level of simplicity. Indeed vocab can help but it also impacts rel, which is problematic. Something like profile would have been nice, but we don't have it. So we need to make things simple in another way.

Hey, I'm a big big fan of small and simple things that get combined together. I completely understand that slurping everything under a single prefix may feel "brutal". But the only concrete alternative I see is linked data that remains confidential and confined.

@essepuntato You mention that you can address an arbitrary number of prefixes with CSS and the DOM, and technically that's true. Machines would handle that without even thinking about it. But in practice even having only two prefixes has already been a source of bugs, and that with high-profile CSS hackers who understand what an ontology is and all. If you need to print out a prefix map to duct tape to your office wall in order to use the system, it just doesn't work. If that's the approach, we're much better off dropping RDF altogether and sticking to classes — we can make those regular and learnable easily. But I'm not ready to give up on a linked data world!

Regarding structural typing I think we should separate out the case for figures and that for sections.

For figures (of any kind), it is actually pretty common to have a fair bit of information that can get attached to them. It's not rare that they might have authors different from those of the paper, a distinct license and copyright, they may be used "with permission", they may be based on code and/or data, or relate to it (e.g. a formula modelling some data). As such I think it makes a lot of sense to treat them as distinct resources, and therefore to type them using RDFa (and make them schema:hasPart of the article, or of their parent figure for multifigures).

For sections, I agree that the case is less clear. Mostly, the problem is the absence of usable ARIA. For body sections, I don't care so much because it does not impact styling (though it's nice to have that information if it's being produced) but for the contentinfo sections it is more problematic. It's something worth mulling over.

That being said there may be an argument for making sections their own resources: annotations. I can see how it could make sense to ask for "all the materials and methods sections that are tagged with neurology and criticality" (imagining that annotations aren't just for user comments but also for classification). I need to think about it some more, WDYT?

@iherman We can fix the SA ontology if you want. We don't use OWL internally but some of our customers do (but I don't think they've started applying it in earnest on this yet). Being nice to OWL is good :)

Completely side story: I started working in 1996 (gosh...) so no I didn't learn about namespaces in university but in the field :) Funnily enough, my XML::NamespaceSupport Perl module, which was designed to make handling namespaces as transparent as possible, made it to the top ten most-depended-upon modules on CPAN. Maybe the (very ephemeral) glory from hiding namespaces went to my head ;-)

essepuntato commented 8 years ago

@iherman, @darobin,

Thanks for the discussion, it has been very instructive to me.

Since having a single namespace seems to be a requirement, so as to use the @vocab approach as suggested by @iherman, should we start a subgroup of SH that would take care of extending schema.org appropriately so as to include everything we need? I would be particularly happy to be directly involved in this.

What do you think?