w3c / epubcheck

The conformance checker for EPUB publications
https://www.w3.org/publishing/epubcheck/
BSD 3-Clause "New" or "Revised" License
1.65k stars 402 forks source link

check that the manifest only lists publication resources #1452

Closed rdeltour closed 1 year ago

rdeltour commented 1 year ago

The spec says:

the manifest MUST only list publication resources

I don't think this is covered by EPUBCheck.

For each manifest item, we could lookup the reference registry to see if a reference to the resource was found, and issue an error if not. We would issue a mere usage message if the EPUB contains scripted content, as the item could potentially be a legit publication resource used in scripting.

rdeltour commented 1 year ago

For each manifest item, we could lookup the reference registry to see if a reference to the resource was found, and issue an error if not. We would issue a mere usage message if the EPUB contains scripted content, as the item could potentially be a legit publication resource used in scripting.

@mattgarrish do you think that check would be useful and/or reasonable?

Existing EPUBs may list things in the manifest that are not strictly speaking publication resources (e.g. unused images). As we did not check this in the past, so I'm wondering if it's opening a can of worms…

mattgarrish commented 1 year ago

The requirement is only there to keep linked resources and resources travelling in the container from littering the manifest, since reading systems don't need these (I guess it might keep them from being extracted for reading). It was never meant to trap former publication resources that are still listed but have stopped being referenced, so I agree this does seem like a can of worms to implement now.

It is useful to know there are unused resources still kicking around in the publication from a cleanup perspective, but I don't know this is worth the headache of making errors. And I don't think there's any simple fix, either here or for the spec. We're just too far along in EPUB 3.

What if we deliberately ignore the spec and make it a usage message so at least publishers will know if a resource is still used but without adding new errors?

rdeltour commented 1 year ago

What if we deliberately ignore the spec and make it a usage message so at least publishers will know if a resource is still used but without adding new errors?

Yes, it looks like a good compromise, I was thinking the same.

The requirement is only there to keep linked resources and resources travelling in the container from littering the manifest, since reading systems don't need these

IMO the requirement is too strong. We already say that publication resources MUST be listed; we don't need to add that others MUST NOT, especially when this cannot be checked.

Also, EPUBCheck reports files present in the container but not listed in the manifest as a WARNING (OPF-003). It has for a looong time. I know it's contrary to the spec, but I think it was added back in the day to warn about littering system files (.DSStore and the like).
But the current state of affair is the EPUBCheck implicitly enforces (by a warning) a spec violation 🙃.

In any case, I for one always have a little difficulty understanding what is a publication resource.

For instance, when the exemption section says:

This exemption allows EPUB creators to include resources in the EPUB container that are not for use by EPUB reading systems.

if a resource is not for use by RS, is it even a publication resource, and in that case is it even relevant to create exemption rules for it?

mattgarrish commented 1 year ago

if a resource is not for use by RS, is it even a publication resource

That's what makes wading into this so complicated. We've carved out a lot of niche cases over the years for resources that don't quite conform to the standard.

But I was forgetting last night that resources travelling in the container still have to be listed in the manifest. It's just that they're exempt from fallback requirements. How do you differentiate a resource travelling in the container from one that used to be a publication resource and is no longer in use, though? That still remains the problem I don't see a way of writing around.

But that particular problem aside, the only resources in the container that shouldn't be in the manifest are linked resources (the ones that aren't also publication resources!) and the stuff in META-INF.

It's not clear why you don't list files in META-INF in the manifest, though -- there should probably be a statement that because they're reserved files for processing the OCF, they don't need listing.

If we add that statement, then I think we could take the problematic "must not" here and turn it into a note that just mentions the resources that don't get listed (or even just remove it altogether, as at that point the manifest requirements would cover all cases).

That would only leave the problem of former publication resources / resources travelling in the container, but I think that's a legit case for a usage message.

rdeltour commented 1 year ago

Ok, so assuming w3c/epub-specs#2506 is merged, that leaves us with the following EPUBCheck behavior:

About the last point: EPUBCheck historically warns about those (since #58, which was released in v1.1, 12 years ago). According to the clarified spec, this warning would become a mere usage message. This was also discussed a few years back in https://github.com/w3c/epub-specs/issues/563#issuecomment-231504335:

@mattgarrish

the working group resolved on the July 6 [2016] call not to change the rules but that epubcheck should be changed to only emit a info message about resources that are not in the manifest, linked or "manifest free" (rendition mapping doc)

I'm still slightly hesitant to change this, based on @dauwhe's original comment in #58 (and the following discussion):

A major new player in the eBook marketplace is now checking this, and rejecting titles with "unmanifested" files.

mattgarrish commented 1 year ago

I'm still slightly hesitant to change this

Ya, I thought it was a new warning, but if it's always been around...

But, the problem is the new "allowed to travel in the container without being used" resources (data sets, etc.). We can't make one camp happy without annoying the other.

We probably should have added a manifest property for the new case to make clear they've been added intentionally. What do you think @iherman? Doesn't seem like too big an ask, and it's not like there's an established base using the container to carry files around that we need to worry about yet.

I just have no idea what we'd name the property. <item ... properties="data">, maybe?

iherman commented 1 year ago

I had two worries:

  1. Are we creating a new normative feature
  2. Are the results of testing such that we may have to remove (or weaken) the fallback feature from the spec; it is the explanation of fallback that led to the introduction of this category of resources in the first place... (although I realize that even without fallbacks the issue would arise).

It seems that both of my worries are moot, so I am fine with this change. I have added some additional (minor) comments to https://github.com/w3c/epub-specs/pull/2507.

B.t.w., if the group decides it is not a good time to make the change, we can postpone it for later. It would not create any practical problems.