scienceai / scholarly.vernacular.io

A vernacular of HTML for scholarly publishing
http://scholarly.vernacular.io/
Apache License 2.0

Why slow down page delivery for 99.9% of users when only 0.1% of cases need the rich data #42

Closed: liborvalenta closed this issue 8 years ago

liborvalenta commented 8 years ago

Have you considered the CrossRef proposals? http://crosstech.crossref.org/2010/03/dois_and_linked_data_some_conc.html

It is way easier to expose, in the HTML, a link to alternative forms of the document in machine-readable formats. Either with an HTML <link> element:

<link rel="alternate" type="application/rdf+xml" href="http://dx.doi.org/10.1126/science.1157784" title="RDF/XML version of this document"/>

or with an HTTP Link header:

Link: <http://dx.doi.org/10.1126/science.1157784>; rel="alternate"; type="application/rdf+xml"; title="big print"

darobin commented 8 years ago

There are several points of disagreement here. I will try to address them one by one.

Page Weight

The impact on page weight is actually minimal. Making use of embedded linked data allows us to style everything by targeting the semantic properties, and therefore to use no classes whatsoever. Not only is the weight difference compared to using classes minimal (in fact it might be favourable), but it also enforces good styling practice, since styles stick to the semantics instead of taking the "easy way out". Additionally, a baseline CSS can be shared (and cached) across all uses, further reducing the impact (see #25).
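As a rough illustration of class-free styling (a minimal sketch, assuming schema.org terms carried by RDFa attributes; the element structure and property names here are illustrative, not quoted from the spec):

```html
<!-- Illustrative fragment: semantics carried by RDFa attributes, no class attributes anywhere. -->
<article typeof="schema:ScholarlyArticle">
  <h1 property="schema:name">A Study of Embedded Linked Data</h1>
  <p property="schema:abstract">Short abstract text goes here.</p>
</article>

<style>
  /* Selectors target the semantic attributes directly, so the markup needs no classes. */
  [typeof~="schema:ScholarlyArticle"] { max-width: 40em; margin: 0 auto; }
  [property~="schema:name"]           { font-size: 1.5em; }
  [property~="schema:abstract"]       { font-style: italic; }
</style>
```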

Templating Difficulty

I have already implemented this twice, in two different templating systems. I don't see the difficulty.

Data Extraction

You are not comparing apples to apples. Getting only the metadata about a document from a separate file will always be easier than getting all the data. But our use case is to expose all the data. You could design a format that would feature the full content of a Scholarly HTML document, but then it would lose all the advantages of being HTML.

Link Headers

Link headers are authoritative metadata that live outside the document: whenever the content is saved, cached, or passed along without its HTTP response, the headers are lost, and with them the ability to process the content correctly. It's an antipattern. See this thread for further discussion.

Who the Consumers Are

You seem to assume that the consumers of the data are only interested in the metadata, and are some sort of specialised crawler that could support ad-hoc rules discussed in a CrossRef blog post. In practice, the consumers of this information want the whole thing, are general-purpose implementations, or both. For the former, authoring tools and scholarly tools in general need to operate on the full content with the full information; getting just the metadata is of limited use. As for the latter, our crawlers are general-purpose schema.org processors: they can actually use the content, which they wouldn't do with the CrossRef hack.
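To sketch the difference (hypothetical markup, not taken from the spec): when the body content itself carries schema.org annotations, a generic RDFa/schema.org processor can consume the figures, citations, and text, not just the front-matter metadata.

```html
<!-- Illustrative only: the content, not just the metadata, is machine-readable. -->
<figure typeof="schema:ImageObject" resource="#figure-1">
  <img property="schema:contentUrl" src="figure-1.png" alt="Measured response over time">
  <figcaption property="schema:caption">Figure 1: Measured response over time.</figcaption>
</figure>
<p>
  As reported by
  <span typeof="schema:ScholarlyArticle" resource="#ref-1">
    <cite property="schema:name">an earlier study</cite></span>,
  the effect is small but consistent.
</p>
```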

CrossRef Centralisation

The whole point of using the Web is decentralisation. Building an architecture for scholarly information around a single point of failure (what's more, one that already often times out) is a recipe for failure.

Changing the HTML Structure

I don't know why you would have the HTML structure changing on a daily basis, and certainly not why there would be multiple flavours. That is just not how Web sites are built.

Content Negotiation

Content Negotiation is not a solution to anything; it just adds an extra problem. Keeping multiple semantic formats in alignment is a very risky idea: it's a great way to produce hard-to-find bugs.