Closed iherman closed 6 years ago
The only thing I find a little weird is that it's not required to embed the manifest in the primary entry page, but the title must be pulled from the primary entry page. Is primary entry page what we want here, should we restrict embedding to the primary entry page, or should it be the title from the page in which the manifest is embedded?
I don't like having requirements that are tied to the primary entry page when processing the manifest.
I'd rather handle this like @id since the title is not a requirement for our infoset. UAs may do whatever they want but I don't want super specific processing rules in our draft.
It also feels weird that we're making the title optional both in our infoset and our manifest, yet end up with processing rules that will always extract a value from HTML.
@mattgarrish
The only thing I find a little weird is that it's not required to embed the manifest in the primary entry page, but the title must be pulled from the primary entry page. Is primary entry page what we want here, should we restrict embedding to the primary entry page, or should it be the title from the page in which the manifest is embedded?
To be honest, I never even considered seriously to embed a manifest into a different file than the primary entry page but you are right, this is not specifically said.
My option would be to restrict the embedded manifest to the primary entry page.
It also feels weird that we're making the title optional both in our infoset and our manifest, yet end up with processing rules that will always extract a value from HTML.
I do not see a contradiction. The title element is not required in HTML either.
I'd rather handle this like @id since the title is not a requirement for our infoset. UAs may do whatever they want but I don't want super specific processing rules in our draft.
I think that, in practice, there is a difference. If I want to turn a single-document HTML publication into a WP, I would normally use an embedded manifest and forcing the author to repeat the same information twice (the title element and the manifest's name) is error prone and unnecessary.
The usage of canonical link is much less frequent and, as we saw in the separate discussion, its semantics is not clear. Hence my agreement of dropping it. This is different imho.
I think that, in practice, there is a difference. If I want to turn a single-document HTML publication into a WP, I would normally use an embedded manifest and forcing the author to repeat the same information twice (the title element and the manifest's name) is error prone and unnecessary.
They're not forced to repeat it (name is not required).
In practice, I don't think that the value will even be the same very often. It's very common for <title> to also contain the name of the website which is hosting such single-document publications. Looking at examples using AMP (here's one), you can see that http://schema.org/headline is used specifically for that reason.
Using <title> for a single-resource publication is also the default behavior for the UAs that we care about the most (browsers), which is why I don't think we need to have spec language that applies to every type of publication just for that use case.
To be honest, I never even considered seriously to embed a manifest into a different file than the primary entry page
That sounds like rational thinking. You can't do that! ;)
If the primary entry page must link to the manifest, I think it's fair to add that it must be the document that embeds the manifest, when embedding. I just want to make sure that's the general consensus.
@mattgarrish you should include that with the overall editing on the primary entry page issue (discussed elsewhere)
I feel the proposed wording is too complex. Our wording should be consistent between sections. If we look at:
4.4.1.1 The default reading order is specified directly in the manifest. However, if the reading order consists of only a single resource, namely the primary entry page of the Web Publication, the default reading order need not be specified.
We could adopt something like:
The title is specified directly in the manifest. However, if the publication consists of only a single HTML resource, namely the primary entry page of the Web Publication, user agents MAY use the value of the title element of this resource.
It means that the title of the primary entry page may be used as publication title independently of the embedded/detached state of the manifest (which is not the point for this infoset requirement).
On another level, I'm no fan of this MAY for user agents, because it breaks the consistency between user agents behaviors that authors expect when they create publications. But I understand Hadrien's comment that the HTML title may be different from what we consider a standard publication title.
@llemeurfr
Your formulation actually makes it stronger than what I originally proposed, because it talks only about a single-document publication (whereas the original version talked only about an embedded manifest). The single-document case is probably the clearest use case for this, but I could see a reason why the user would rely on a rich primary entry page, re-using HTML for many things, for a multi-document publication as well (that document can also be used for the navigation, for example).
I would say that a MAY, for something like that, is meaningless. Or, rather, could be considered harmful because, as you say, it creates inconsistency among implementations. Which may mean that a mixture of the current text and yours may be better, insofar as removing the case of a separate manifest altogether.
I understand Laurent's revision, but I don't understand 4.4.1.1 in the context of the specification.
Shouldn't there be a statement somewhere that the entry page must be added to the reading order if the reading order is not specified, or am I missing it somewhere? How is it required in the infoset but optional to specify, in other words?
Otherwise, it seems arbitrary that the reading order can be omitted only when it would otherwise contain the primary entry page. What if I have only one other document I would have put in it, why can't it be omitted, too? (i.e., why doesn't it apply to any single-resource reading order)
@mattgarrish the eagle-eye:-)
Shouldn't there be a statement somewhere that the entry page must be added to the reading order if the reading order is not specified,
Yes, good catch! I am not sure why this is missing, we did define it that way. The text in the 4.4.1.1. seems to be incomplete, though true: if we have only the primary entry page, then by default that becomes the reading order, ie, it is not necessary to explicitly specify it...
This is independent of this PR, though, I think this is a change you could make on the main branch directly...
Thx!
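The default described above (when no reading order is given, the primary entry page on its own becomes the reading order) could be sketched roughly as follows. This is only an illustration, not text from the draft: the function name `default_reading_order`, the dictionary-shaped manifest, and the `readingOrder` key are assumptions made for the example.

```python
def default_reading_order(manifest: dict, primary_entry_page_url: str) -> list:
    """Return the reading order, falling back to the primary entry page.

    Hypothetical helper: `manifest` is assumed to be an already-parsed
    manifest, and "readingOrder" is assumed to be the key that holds the
    default reading order.
    """
    reading_order = manifest.get("readingOrder")
    if reading_order:
        # An explicit, non-empty reading order always wins.
        return list(reading_order)
    # Nothing specified: the primary entry page is the whole reading order.
    return [primary_entry_page_url]
```

With this shape, an author of a single-document publication can leave the reading order out, while any explicit list in the manifest is used unchanged.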
Can we merge this now?
@mattgarrish there is no consensus...
<title> element in general (and I do believe that we should have this, albeit restricted to embedded manifest cases in the primary page, based on the DRY principle):-(
I'm not focused on single-document. I can propose, to be usable in multiple-documents:
The title is specified directly in the manifest. However, if the title is missing from the manifest, user agents MAY use the value of the title element of the primary entry page of the Web Publication.
It's more or less back to square one, but does not require the manifest to be embedded.
@llemeurfr I still do not like it, due to the MAY.
The title is specified directly in the manifest. However, if the title is missing from the manifest and the manifest is embedded in the primary entry page, the value of the title element (if not empty) of the primary page of the Web Publication MUST be used.
The MAY is, in a way, meaningless: authors should not and cannot rely on that.
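To make the proposed MUST wording concrete, here is a minimal sketch of the fallback, under stated assumptions: the names `resolve_title` and `html_title`, the `embedded` flag, and the dictionary-shaped manifest are all hypothetical, and the title is extracted with Python's standard `html.parser` rather than a full HTML parser.

```python
from html.parser import HTMLParser
from typing import Optional


class _TitleParser(HTMLParser):
    """Collects the text content of the first <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self._done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.parts.append(data)


def html_title(html: str) -> Optional[str]:
    """Return the whitespace-normalized <title> text, or None if empty."""
    parser = _TitleParser()
    parser.feed(html)
    parser.close()
    text = " ".join("".join(parser.parts).split())
    return text or None


def resolve_title(manifest: dict, entry_page_html: str, embedded: bool) -> Optional[str]:
    """Apply the proposed rule: manifest name first, <title> as fallback.

    The fallback only applies when the manifest is embedded in the
    primary entry page and the title element is not empty.
    """
    name = manifest.get("name")
    if name:
        return name
    if embedded:
        return html_title(entry_page_html)
    # Linked manifest without a name: the title stays undefined here.
    return None
```

Note that an explicit `name` always takes precedence in this sketch, so an author whose `<title>` carries SEO text can still override it in the manifest.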
I don't see a lot of controversy for a MUST for the embedded case, since this is only a fallback when the author omits the title. The user agent is already processing the page, so it's not forcing additional content to be retrieved and parsed. It's not any different than what bookmarking produces. Having the second MAY that says do whatever you want for the linked case is an appropriate counterbalance.
I agree with @HadrienGardeur that it's probably not going to prove to be a great alternative to properly specifying the title, but it's also a less bad alternative than having the UA call the document "untitled", or whatever placeholder it uses, when the title element is easily available.
Re-reading the whole thread, Hadrien's position may be the best: the title is optional in the infoset, optional in the manifest. And the html title in the entry page will in many cases be semantically inadequate. UA's may display a default (like
If there is a consensus about the fact that a single-document must be easily transformed to a WP without duplicating data, we could alternatively settle on (MAY replaced by SHOULD):
The title is specified directly in the manifest. However, if the publication consists of only a single HTML resource, namely the primary entry page of the Web Publication, user agents SHOULD use the value of the title element of this resource.
@llemeurfr
I won't lie down in the road for the MUST on the title, but I am not sure what SHOULD brings in this case. To be very specific, does the canonicalization of a manifest create a 'name' term, if there isn't any, out of the title element? If it is a SHOULD then I do not think it should (sic!). But then I expect applications to do the canonicalization (which, for example, takes over the document base or default language from the context, which is similar to the title element story) and if this is not part of the canonicalization, then it won't happen. Authors should not rely on this, so the DRY principle will be broken.
So far we have never had a separate variant in the draft for the case when the "publication consists of only a single HTML resource". We would be introducing yet another publication variant, which just complicates the draft. Whether SHOULD or MUST, I certainly prefer to stick to the choice whether the manifest is embedded or not, a differentiation that is already used elsewhere (eg, the base for relative URIs, overall language and base direction). I would expect the vast majority of 'single HTML document publications' to embed the manifest anyway.
@iherman
I understand 4.4.1.1 as the equivalent of
The default reading order is specified directly in the manifest. However, if the publication consists of only a single HTML resource, namely the primary entry page of the Web Publication, the default reading order need not be specified.
... which is another special case of rule applying to single-document publications.
That said, my favorite solution is still the one with less specification = same as Hadrien. Let UA choose a fallback when there is no title in the manifest.
That said, my favorite solution is still the one with less specification = same as Hadrien. Let UA choose a fallback when there is no title in the manifest.
Which means that an average scholarly article, for example, will have to repeat the title in the <title> element as well as the name value in the manifest, because nothing is guaranteed. We are excluding a large and important use case.
I would think this needs a clear WG discussion and possible vote on a call, I do not think this is something we can get a consensus on here. Should be put on a call agenda: @GarthConboy @TzviyaSiegman
To make the call easier, here is a summary, as I see it. The issue is what the relationship is between the <title> HTML element and the name attribute of the manifest (which is the embodiment of the "title" infoset item). More exactly, what happens when there is an HTML title in the primary page, and there is no name item set in the manifest. The two clear-cut choices are:
... name but that is all we say.
... name for the manifest. This behavior is required.
For me, <title> in the HTML entry page and name in the manifest are semantically two different things.
There's a good reason why the title is not required in the infoset and having steps in our processing of the manifest that re-use the <title> in HTML defeats that purpose IMO.
In general, I really dislike the fact that we conflate the entry page with the publication (this extends to other infoset items as well).
I just discovered a very interesting article thanks to @JayPanoz that is worth reading in the context of this issue: https://www.ctrl.blog/entry/browser-reading-mode-metadata
I just discovered a very interesting article
wow, this shows clearly that the html title element is not the place where most browsers will look for an article title for their reading mode. Which leaves us with a need to define what the title of a web publication is, without taking as a fact that authors, today, use the html title element for expressing such information (or, taken differently, if they do, browser vendors don't use it).
The html title is the info that appears in the results of a search, is an info that is used for SEO purposes... that may be enough of a burden for this field.
(Just peeking in from my vacations...) all these arguments are perfectly fine if the title element was the only place to set the title of the publication. But that is not the case, it is only a fallback. That gives all this a different twist...
Given the fact that the infoset does not require a title, my proposal is to simply remove the fallback on the <title> element in the entry page.
To repeat what's been said before:
<title> element on the entry page is semantically different from the title of the publication, <title> in HTML is often used for SEO and would contain text strings that are not fit for a publication's title
... <title> element for their reading mode (which is probably the best starting point in most browsers for a publication reading mode)
The publication address (URL) returns an HTML page (entry page) which contains a <title> element (because HTML requires it). How is that not the title of the publication?
@BigBlueHat, that's assuming we require the content creators to have that entry page's title be the title of the publication, not "Introduction" or "Abstract" or "my big splash page in my book about splash parks!!!🌊"
@BigBlueHat, that's assuming we require the content creators to have that entry page's title be the title of the publication, not "Introduction" or "Abstract" or "my big splash page in my book about splash parks!!!🌊"
They can put whatever they'd like, but if the publication address (which is a thing we've defined) returns HTML which itself "binds" the publication (via the manifest, etc), then is that not the publication itself--whatever else it's named?
And, consequently, won't the sensible folks out there (for SEO reasons among many others) give those publications a proper title, just as they would via the manifest if targeting Google search results intended for schema:Book or schema:ComicIssue?
The publication address (URL) returns an HTML page (entry page) which contains a <title> element (because HTML requires it). How is that not the title of the publication?
The <title> of that entry page could be something like "Title of the publication - Publisher Inc - September 2018" purely for SEO and that's fine, that's what <title> is for. As we've said over and over, we're not in the business of forcing how content creators should or shouldn't use HTML.
As pointed out in my previous comment, browsers do not agree on <title> for their current reading mode and for a good reason: they understand that this is not necessarily the same information as what's in <title>.
Why are we trying to force feed <title> when it's semantically different, has no consensus from browsers for this use case and isn't required in the first place in our infoset?
@HadrienGardeur because we want to display/present something as the name/title of the publication. If it's not required in the manifest, the <title> of the HTML returned by the publication address seems like the natural next candidate.
@BigBlueHat why did we make the title optional in the infoset if we make it almost a requirement through the processing of the manifest?
With the current PR, the only way to end up with an undefined title would be using an external manifest that does not contain the name key.
The only situation where I would be comfortable using <title> would be:
... <title> became a fallback as part of a failure mode, not as a step in our algorithm meant to process the manifest
Under the current scenario, I think it's inconsistent with our infoset and goes against some of the principles that we've established in other issues (letting the content creator do what it wants with the content).
@BigBlueHat why did we make the title optional in the infoset if we make it almost a requirement through the processing of the manifest?
No idea. I've always felt they should be required.
With the current PR, the only way to end up with an undefined title would be using an external manifest that does not contain the name key.
This is another reason why #333 ("manifest must be embedded in primary entry page") appeals to me. It makes that problem go away, and it also gives the manifest its intended SEO-related value.
No idea. I've always felt they should be required.
Didn't it come from the (arguably now) somewhat misplaced idea that a packaged web publication had to be a valid web publication, so we couldn't enforce anything in WP that wasn't in PWP? As I recall, the argument was that someone creating a one-off document might not want to bother creating a title for it.
Oops, did I close that? :)
I have no issue with making title the preferred fallback in the embedded case, but I am worried that we're not being specific enough. We should be clear what "must use" means here - for example, does that mean verbatim or does it mean to use the kinds of heuristics that others are already resorting to? I'd like to allow some flexibility for intelligent parsing, as I doubt every case of a forgotten name will be a deliberate leveraging of this authoring simplification.
@HadrienGardeur, I would be perfectly happy requiring the title in the infoset (per https://github.com/w3c/wpub/pull/331#issuecomment-422099975). To be honest, I am not sure why this was not the case before; I have a hard time imagining a proper publication without a title...
@mattgarrish I would not want to get into the issues of "interpreting" the content of the title element; I do not think we would be able to write a specification on this (or that it would be worthwhile).
Also, the article referred to above, though interesting, may not be all that relevant for us. The browser's reading mode is meant to "interpret" any kind of Web site, essentially getting rid of the "noise" (advertisements, unimportant menus, etc). Facing such a diversity it is quite normal that the content of the title element is messy. However, in the case of a WP, we are talking about a properly curated Web content which is meant to be, well, a publication. Although mistakes happen, I would expect the title to be more carefully chosen for that case (which is also the reason of this whole issue, which is primarily meant to follow the DRY principle).
@iherman this is not an issue of "carefully chosen" or not.
The <title> in HTML is used well beyond our use case in WP and it's semantically not the same thing.
As for re-opening the discussion about requiring a title, I remember that @lrosenthol was strongly against this idea during our early metadata discussions. Key participants in the early metadata effort may want to chime in as well (@baldurbjarnason and @laudrain).
Yeah I can also think of say CMS-plugins and or services allowing bloggers/websites to create a web publication, similar to the ones already available for EPUB. It’s really hard to tell what the title will be if it’s undefined in the config…
The <title> in HTML is used well beyond our use case in WP and it's semantically not the same thing.
May be true (although I am not convinced) but when it is really different, the author can use the name property without further ado. That is why the 'curation' is relevant.
The question is where the 80/20 cut is. I.e., if we talk about Web Publications (and not average Web pages), what is the estimated probability that the title used for the primary entry page and the name of the publication will be different (we are talking about the embedded case)? Put another way, what is the percentage of authors that will be forced to unnecessarily duplicate the same data?
@JayPanoz I am not worried about CMS plugins. Those are programs that can generate the name property. My worry is the author of, say, a scholarly article who will be asked to produce a final version of his/her paper in HTML/WP, and will be asked to choose a suitable title.
@iherman Well I personally am because they very often choose the most practical way out, which isn’t necessarily the spec-compliant one. If say you’re designing a WordPress plugin, then the title in the settings is probably the easiest/most sensible fallback when the title is undefined in the plugin’s UI – and it can be significantly different from a publication’s title.
If I remember well, as we are in a JSON-LD context, don't we have already the name property for the title in any of our types ? Or our WP context doesn't inherit from Thing?
If it's not required in the manifest, the <title> of the HTML returned by the publication address seems like the natural next candidate.
@HadrienGardeur What if we make title required? cc @iherman ?
@TzviyaSiegman having the title required in our infoset would reduce the inconsistency of de-facto requiring the title through the processing while not requiring it in the infoset at the same time.
This would not change my perception that the title of an HTML page and the title of a Web Publication are not semantically the same thing.
title of an HTML page and the title of a Web Publication are not semantically the same thing.
Not necessarily. But for the use case we have (embedded manifest to a, say, article) I suggest the two are mostly identical.
Here's a proposal to solve the current situation:
The first and second points solve an inconsistency in our spec language:
It's a weird situation because while the title is not required, it is de facto always present because of the canonicalization.
As for my third point, I still believe that it's important to have the ability to override the value of the HTML primary entry page if you want to and I don't see any reason to force authors to use the same value. With these new requirements, authors won't have to duplicate the information anyway if it's the same.
@HadrienGardeur this is even stronger than my original version, but I am fine with this approach.
(Admin comment: if we do this, I will close this PR and create a new one, because merging this PR would become an incredible mess...)
Adopted the wording proposed in #325 (and updated the diagrams)
fix #325