w3c / web-annotation

Web Annotation Working Group repository, see README for links to specs
https://w3c.github.io/web-annotation/

Dropping type from ... what? #67

Closed azaroth42 closed 8 years ago

azaroth42 commented 9 years ago

While it is always good to drop boilerplate content that merely states something that is necessarily true, which types can we drop and which must we retain?

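I propose:

  • Keep Annotation so that systems can know that it's an annotation at all!
  • Drop SpecificResource, especially if it's the only valid object of hasBody / hasTarget
  • Drop EmbeddedContent and use a specific property rather than rdf:value, such as oa:text
  • Keep Selector, State, Style and Multiplicity subclasses, as knowing what sort of thing it is determines how the client will process it. Also, it provides flexibility for extension, where further communities can feel secure in creating new types.
  • Keep the distinction between human, organization and software agents.
  • Keep the difference between an Image and a Video so that clients know how to render the resource, even if the Annotation doesn't give a specific format (which may not be known, and may not be important to capture. Is it a jpg or a png? The client doesn't care, it's going to put it in an <img> tag regardless)
  • Tag and SemanticTag will go away anyway.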

And thus at the end of the list ... it seems like type is actually reasonably important to keep around?

tcole3 commented 9 years ago

-0.5 for the current SHOULD for dctypes (Dataset, Image, MovingImage, Sound, Text), section 3.2.2 (if this is what you meant in your 6th bullet). In my experience, those tasked with providing the dctype will get it wrong a surprising amount of the time (hence the need for the caveat about when not to use Text). Additionally, these days, dctypes like Image and Text are not actionable and so not helpful: quite likely I will want to treat gif, jp2 and svg bodies (all Image) each differently in my application, so even if you tell me it's an Image, I'm going to have to check before doing anything. Often you need at least the format (i.e., MIME type) to really act on the information (and even this is not always enough), and if I have format then dctype is unnecessary. While intuitively some information about the nature of a body or target may seem better than none, I often find that it just encourages questionable assumptions. We can't (and shouldn't) discourage the use of dctype, but neither should we recommend it. My two cents.

Otherwise +1 for the rest of your list and your conclusion.

csarven commented 9 years ago

I would suggest a name along the lines of oa:content over oa:text and rdf:value to represent any type of text content or markup.

iherman commented 9 years ago

@azaroth42, you said:

I propose:

  • Keep Annotation so that systems can know that it's an annotation at all!

I am fine with that.

  • Drop SpecificResource, especially if it's the only valid object of hasBody / hasTarget
  • Drop EmbeddedContent and use a specific property rather than rdf:value, such as oa:text

I am fine with both

  • Keep Selector, State, Style and Multiplicity subclasses, as knowing what sort of thing it is determines how the client will process it. Also, it provides flexibility for extension, where further communities can feel secure in creating new types.

I presume the Selector as a class is (almost) never used by itself, only through its subclasses. I am fine with that.

For States I am not sure using, e.g., HTTPRequestState brings too much. I would rather use another property (instead of value) to denote the real meat of the state (also to avoid multiplexing meaning for a property) and not use the typing. But I am a bit neutral on this.

I am also not sure about Style. We have a property called stylesheet. What additional information do I get if I say:

"stylesheet": {
  "@id": "http://example.org/style1",
  "@type": "oa:CssStyle"
},

or even if I say:

"stylesheet": {
  "@id": "http://example.org/style1",
  "@type": ["oa:CssStyle", "oa:EmbeddedContent"],
  "value": ".red { color: red }",
  "format": "text/css"
},

I am afraid none. I would prefer to simply use the "stylesheet" property. We can say that usage of that class is a MAY (because, maybe, we will have other types of stylesheets around, though I do not really see that coming), but certainly not stronger.

I am fine with the Multiplicity class, I guess this is probably necessary.

  • Keep the distinction between human, organization and software agents.

Yes, that is probably fine.

  • Keep the difference between an Image and a Video so that clients know how to render the resource, even if the Annotation doesn't give a specific format (which may not be known, and may not be important to capture. Is it a jpg or a png? The client doesn't care, it's going to put it in an <img> tag regardless)

I am a bit undecided on this. First of all, we do use Dublin Core concepts in some places, and we use media types in other places (see the reference to "text/css" above), i.e., we do have an inconsistency. I also consider @tcole3's argument compelling, i.e., that media types are often better.

As you said in another mail, MAY is probably the maximum we should go.

  • Tag and SemanticTag will go away anyway.

Yep.

Admin issue: this is actually closely related to issue #61. Should we close that one?

azaroth42 commented 9 years ago

I presume the Selector as a class is (almost) never used by itself, only through its subclasses. I am fine with that.

Agreed -- the main classes don't contribute anything directly.

I am also not sure about Style.

I'm fine with dropping the class from CSSStyle, as there's unlikely to be a contender there. Selectors and States are much more likely to have further subclasses from different domains.

Keep the difference between an Image and a Video

As you said in another mail, MAY is probably the maximum we should go.

Agreed.

I'll leave this open, but I think so far Rob, Tim and Ivan are in agreement.

stain commented 9 years ago

(This message was also sent by email, but GitHub didn't pick it up)

-1 to drop the dctypes.

I think the dctypes, at least on the target, can be important for understanding what kind of annotation we're talking about; e.g. an Image has quite different kinds of selectors than a Text. The classes are very broad, and can't reliably be used for MIME-type selection of a rendering mechanism. But consider an annotation where the target has since gone 404 (or is a protected resource), so we can't check its actual MIME type. To understand the annotation body (say a textual comment), it can still matter whether it's about a dctypes:MovingImage or a dctypes:Text, e.g. the comment might say "Too much violence".

But this assumes you can do it all in black and white. What is the type of https://www.youtube.com/watch?v=ZOpUL_hqNlU ? If you do it literally by MIME type, it's text/html. By content it's a video. But actually the target is not particularly a MovingImage, as it is one of those "music on YouTube with a still image" videos, so really semantically it is a dctypes:Sound - "a resource primarily intended to be heard." -- or, with a bit of faith, a (representation of a) https://schema.org/MusicAlbum

So if I comment on https://www.youtube.com/watch?v=ZOpUL_hqNlU with "This is great, my uncle used to sell this in the 80s", I don't mean he used to sell the HTML page or the YouTube video. But I might be able to select a type saying that it's Sound or Music or something like that, and then you would understand, possibly even if the YouTube video goes down (as they often do).

stain commented 9 years ago

I say -1 to drop SpecificResource - unless we are going for always having it as the object of hasBody / hasTarget (e.g. SpecificResource or equivalent specified as their rdfs:range). Why? Because SpecificResource is a placeholder - and so this should be prominently marked beyond just a oa:hasSource.

Where's the issue suggesting this change to hasBody and hasTarget? It would be incompatible with earlier OA model - which should be part of the consideration.

How would oa:Choice etc. be used? Subclassing (whatever replaces) SpecificResource?

iherman commented 9 years ago

On 24 Aug 2015, at 14:31 , Stian Soiland-Reyes notifications@github.com wrote:

I say -1 to drop SpecificResource - unless we are going for always having it as the object of hasBody / hasTarget (e.g. SpecificResource or equivalent specified as their rdfs:range).

You should look at:

https://lists.w3.org/Archives/Public/public-annotation/2015Aug/0209.html

this is the direction we may be going…

Ivan


shepazu commented 9 years ago

I like Tim's suggestion to use MIME types instead of dctypes, for a few reasons:

  • It prevents the duplication of information (MIME types in the headers, and dctypes in the annotation); duplication of information, especially when it can get out of sync, is dangerous; as Tim described, a mismatched dctype could easily be chosen
  • It reuses a well-known and predictable mechanism that is universal to the Web and the Internet, rather than an RDF/LD-specific mechanism
  • It helps with direct processing (again, as Tim said).

If content negotiation is necessary, we could perhaps allow multiple values; this is going to be set by the UA (usually the client) anyway, not the user, so it can easily establish the correct MIME type when the resource is inserted.

I would go farther than Tim, and suggest that dctype not be included in the spec; if others want to use it, or any other custom property, they are free to do so, but having it in the spec encourages its use, which I suspect is a bad pattern.

Stian mentions the case of a YouTube video, and makes the claim that it's a video, not an HTML page; but that's not correct: the URL he provided points to an HTML page that contains a video, and we shouldn't stray from the Web in this abstracted way. We cannot hope for interoperability in that behavior unless the UA forces the user to select the dctype (how would the user choose?), or unless we somehow mandate that UAs always and consistently choose the media dctype when presented with mixed-MIME-type resources (like HTML pages with videos or images). I simply don't see how we can realistically use dctypes in a helpful way, while MIME types are clearly and pragmatically useful.

jjett commented 9 years ago

-1 from me for the reasons that Stian has mentioned.

I strongly believe that we're conflating file object types with content types here. .jpg, .tif, .svg, etc. can just as easily contain text as images, and similarly .html, .pdf, .docx, etc. can just as easily contain images as text.

The inclusion of dctype (or some equivalent) is going to be an invaluable indicator of annotator intent in cases where specifiers fail to be resolvable and only the entire source document can be rendered to the end user. Reserving a cue for them, that the annotation body is intended to target the video and not the entire html document, is likely to be our best bet for a graceful failure, without which it may become difficult to figure out what portion of the document the annotation body was intended to remark upon.

Regards,

Jacob

BigBlueHat commented 9 years ago

@stain's got an interesting point with the YouTube URL with the Sound-related annotation. However, I'm not sure providing a dctypes value would actually make the accuracy of the annotation any better--as these days there's likely to be more than one video (ads, etc) on a YouTube (etc) page and therefore more than one "Sound."

Allowing the video itself within that page to be annotated while not only referencing the buried URL of the actual video--which you can't actually find/get without some serious XHR sniffing.

Additionally, there'd be a need to express that one was talking about the sound of the embedded "actual" video vs. just some "Sound" on the URL which loaded the page in which the video was originally found.

Here's where it gets fun, though. :smile:

That YouTube URL--which actually does contain microdata using schema.org says the following (output from copy/pasting microdata HTML into http://foolip.org/microdatajs/live/ and then nursing it into JSON-LD with help of http://json-ld.org/playground/):

{
  "@context": "http://schema.org/",
  "@id": "http://www.youtube.com/watch?v=ZOpUL_hqNlU",
  "properties": {
    "author": [
      {
        "properties": {
          "url": "http://www.youtube.com/user/Quart35"
        },
        "type": "http://schema.org/Person"
      },
      {
        "properties": {
          "url": "https://plus.google.com/110198108412316165411"
        },
        "type": "http://schema.org/Person"
      }
    ],
    "channelId": "UCgiagVp0FL612Q2nRHJAtQw",
    "datePublished": "2010-05-01",
    "description": "Kool & The Gang: Get Down On It Tavares: C'est La Vie Eddy Huntington: Up & Down Italian Boys: Forever Lovers Morgana: Ready For Love Kool & The Gang: Fresh ...",
    "duration": "PT9M27S",
    "embedURL": "https://www.youtube.com/embed/ZOpUL_hqNlU",
    "genre": "Music",
    "height": "360",
    "interactionCount": "89144",
    "isFamilyFriendly": "True",
    "name": "Max Mix 5 (2ª Parte)",
    "paid": "False",
    "playerType": "HTML5 Flash",
    "regionsAllowed": "AD,AE,AF,AG,AI,AL,AM,AO,AQ,AR,AS,AT,AU,AW,AX,AZ,BA,BB,BD,BE,BF,BG,BH,BI,BJ,BL,BM,BN,BO,BQ,BR,BS,BT,BV,BW,BY,BZ,CA,CC,CD,CF,CG,CH,CI,CK,CL,CM,CN,CO,CR,CU,CV,CW,CX,CY,CZ,DE,DJ,DK,DM,DO,DZ,EC,EE,EG,EH,ER,ES,ET,FI,FJ,FK,FM,FO,FR,GA,GB,GD,GE,GF,GG,GH,GI,GL,GM,GN,GP,GQ,GR,GS,GT,GU,GW,GY,HK,HM,HN,HR,HT,HU,ID,IE,IL,IM,IN,IO,IQ,IR,IS,IT,JE,JM,JO,JP,KE,KG,KH,KI,KM,KN,KP,KR,KW,KY,KZ,LA,LB,LC,LI,LK,LR,LS,LT,LU,LV,LY,MA,MC,MD,ME,MF,MG,MH,MK,ML,MM,MN,MO,MP,MQ,MR,MS,MT,MU,MV,MW,MX,MY,MZ,NA,NC,NE,NF,NG,NI,NL,NO,NP,NR,NU,NZ,OM,PA,PE,PF,PG,PH,PK,PL,PM,PN,PR,PS,PT,PW,PY,QA,RE,RO,RS,RU,RW,SA,SB,SC,SD,SE,SG,SH,SI,SJ,SK,SL,SM,SN,SO,SR,SS,ST,SV,SX,SY,SZ,TC,TD,TF,TG,TH,TJ,TK,TL,TM,TN,TO,TR,TT,TV,TW,TZ,UA,UG,UM,US,UY,UZ,VA,VC,VE,VG,VI,VN,VU,WF,WS,YE,YT,ZA,ZM,ZW",
    "thumbnail": {
      "properties": {
        "height": "360",
        "url": "https://i.ytimg.com/vi/ZOpUL_hqNlU/hqdefault.jpg",
        "width": "480"
      },
      "type": "http://schema.org/ImageObject"
    },
    "thumbnailUrl": "https://i.ytimg.com/vi/ZOpUL_hqNlU/hqdefault.jpg",
    "unlisted": "False",
    "url": "http://www.youtube.com/watch?v=ZOpUL_hqNlU",
    "videoId": "ZOpUL_hqNlU",
    "width": "480"
  },
  "type": "http://schema.org/VideoObject"
}

Things of note:

So, given that (common) scenario (and in this case there's actually metadata!), would dctypes be sufficient for providing clarity?

Certainly providing format here would not get you any useful value / clarity as the YouTube URL can (afaik) only return HTML, so adding "format": "video/ogg" (as if...) wouldn't make any difference in accuracy...and in fact be worse.

It's a tangled mess to be certain. Perhaps lowering the SHOULD to a MAY would do the trick. Good discussion regardless. :smile:

azaroth42 commented 9 years ago

-1 to dropping the content types and relying exclusively on media types. Unless I'm out of touch (and @tilgovi @BigBlueHat please correct or confirm), you cannot get access to the media type of a resource in javascript. To be concrete, you cannot determine (for example) whether an image is image/jpeg or image/png in javascript in the browser.

Thus 99% of clients would not be able to actually generate the format information at all, and thus they would not be able to render the target or body resources, as there would be no way to determine how to generate the HTML that would include them. For example:

{
  "body": {"id": "http://example.org/something"},
  "target": {"id": "http://example.com/somethingElse"}
}

If the body is an image, you would want to include it in an img tag, but that would fail if it wasn't an image.
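A minimal sketch of the rendering decision described here, written in Python so it is easy to run outside a browser. It assumes the resource carries a "type" key holding a dctypes class name; that key, the mapping table and the function name `render_resource` are illustrative assumptions and are not taken from the spec.

```python
from html import escape

# Hypothetical mapping from dctypes class names to HTML elements. The thread
# only mentions the img case; the other entries are assumptions.
ELEMENT_FOR_CLASS = {
    "dctypes:Image": "img",
    "dctypes:MovingImage": "video",
    "dctypes:Sound": "audio",
}


def render_resource(resource):
    """Return an HTML snippet for a body or target described as a dict."""
    uri = escape(resource["id"], quote=True)
    types = resource.get("type", [])
    if isinstance(types, str):
        types = [types]
    for cls in types:
        tag = ELEMENT_FOR_CLASS.get(cls)
        if tag == "img":
            return f'<img src="{uri}">'
        if tag:
            return f'<{tag} src="{uri}" controls></{tag}>'
    # No usable class was declared: fall back to a plain link rather than guess.
    return f'<a href="{uri}">{uri}</a>'
```

With only an id, as in the JSON above, the sketch has nothing to go on and can do no better than emit a link.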

shepazu commented 9 years ago

Rob, you can ask the server to tell you the MIME type of a resource that it's serving. You can also detect MIME types client-side via the FileReader API: http://stackoverflow.com/questions/18299806/how-to-check-file-mime-type-with-javascript-before-upload

shepazu commented 9 years ago

Jacob, Stian, can you please explain the UI/UX workflow a UA would use to set dctypes that are different than the MIME type of the resource?

I do understand your intent, but I'm less certain that we will be able to get interoperability on this feature, especially in v1 of the spec. The model allows for custom properties, so UAs could choose to add dctype even if it's not part of the official Data Model, and if it later turns out that we can get interop, we can add it to v2 of the spec.

azaroth42 commented 9 years ago

Sorry, I should have been clearer. You can certainly make a HEAD request to the URI and pull the media type value from the Content-Type header... but if you just have the image in the DOM, the information isn't available. So the requirement would be that clients must make those requests in order to add the media type.

I guess I'm really -0, as it seems unnecessary and inefficient to require those requests, but indeed it is possible.
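For illustration, the HEAD-request approach might look roughly like the sketch below. It uses Python's standard library rather than browser JavaScript, so it says nothing about what a page script is allowed to do; the function names are hypothetical.

```python
import urllib.request


def parse_media_type(content_type):
    """Strip parameters such as charset from a Content-Type header value."""
    if not content_type:
        return None
    return content_type.split(";", 1)[0].strip().lower() or None


def fetch_media_type(uri, timeout=10.0):
    """Issue a HEAD request and return the media type the server reports."""
    request = urllib.request.Request(uri, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_media_type(response.headers.get("Content-Type"))
```

Each call to `fetch_media_type` costs one extra round trip per resource, which is the inefficiency noted above.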

BigBlueHat commented 9 years ago

@shepazu sadly "asking the server" involves a separate XMLHttpRequest which may get a completely different response or be prevented altogether (CORS, CSP, etc). The strange fact is that after 20+ years of browser building no one's actually made the originating request data available to code running inside the browser on the results of that request. I find it baffling...

There's some likelihood that the new WHATWG Fetch API will help, but even that is primarily focused on additional requests--not the original one made by the user. Even within browser extension development it's non-trivial to accomplish, and often relies on watching every request that comes through (regardless of it being annotated or not) rather than just asking the current Window/Tab what its headers were for the current request.... See what I mean? Mind numbing...

And, as @azaroth42 just pointed out while I was typing this, you'd have to do an XHR HEAD request for every included element in the page that was being annotated. I'd agree that it's unnecessary and inefficient. If a system wants to store it, by all means, but it shouldn't be required.

BigBlueHat commented 9 years ago

@shepazu also FileReader and Blob only work on files added to a loaded document and don't give you access to the document itself. So you'd have to build an in-browser client for people to use a custom built file loader vs. their using "File | Open..." or otherwise opening the file directly in their browser (generating a file:/// URL, etc).

In the end, browsers haven't cared nearly enough about Content Negotiation for people to be able to do serious things with it--which is why folks keep "sniffing" URLs, hoping that the .pdf stands for a file type and not any of these things.

Also, on this note, who should we be talking to in the W3C to get that fixed? :smiley:

shepazu commented 9 years ago

There's also less sophisticated, brute-force ways of detecting the likely MIME type, like looking at the file extension…

Ultimately, I think it's a mistake to have duplicate information about media type, and if we're going to have a type identifier, and have to choose only one mechanism, I think MIME type is a far Webbier solution than dctype. But I'm not going to die on this particular hill.

BigBlueHat commented 9 years ago

The Web doesn't have a feature called "file extension." :smile: http://www.w3.org/DesignIssues/Axioms.html#opaque

What's the media type of http://github.com/ or https://github.com/mozilla/pdf.js ?

Regardless, I think we need to narrow in on some use cases for both dctypes and format. My preference would be to keep them both, lower them from SHOULD to MAY, and go from there.

Would that put this issue to rest? If not, what's needed to help close the loop?

shepazu commented 9 years ago

In 1996, when that was written, the Web didn't have many features that it has today. Today, we have webapps and the FileReader API. W3C Recommendations regularly include file extensions, such as *.svg: http://www.w3.org/TR/SVG/intro.html#MIMEType

HTML5 defines an algorithm that includes the concept of file extensions, in 4.8.3 "Downloading resources": http://www.w3.org/TR/html5/links.html#concept-extension

IANA includes file extensions as part of the registry for a MIME type: https://www.iana.org/form/media-types

IETF defines file name extensions in RFC6838: http://tools.ietf.org/html/rfc6838#section-4.12

File extensions are not universal, but it is a pragmatic heuristic for many resources on the web, if there's no other way to get the MIME type. It's pedantic to pretend otherwise.

But my real concern is interoperability, and I'm skeptical we can get there on this feature for v1. I'd be fine with having them as a MAY; I could live with having them as a SHOULD. I'd be unhappy having them as a MUST.

In any case, it ultimately comes down to test suites and implementation reports. If we define how dctypes and format should be used, and we have multiple interoperable implementations, then that's good enough for me.

jjett commented 9 years ago

+1 from me to only relax the dctypes to a MAY.

Having reread sections 3.2.2 and 3.2.3 more carefully, it's clear that the points about how dctypes:Class and dc:format can be exploited are a bit backwards. Typically we are going to want to look for format first, if available, and then fall back to dctype if that format is an opaque document format (like html) for the kind of rendering that 3.2.2 is suggesting.
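The "format first, dctype as fallback" order could be sketched as follows. This is only an illustration under assumptions: the "format" and "type" keys, the contents of `OPAQUE_FORMATS` and the function name are hypothetical, and only HTML is named as an opaque format in this thread.

```python
# Formats treated as opaque containers. The thread names only HTML; the other
# entries are assumptions for illustration.
OPAQUE_FORMATS = {"text/html", "application/xhtml+xml", "application/pdf"}


def rendering_hint(resource):
    """Pick the value a client would act on: format first, then class."""
    media_type = resource.get("format")
    declared_class = resource.get("type")
    if media_type and media_type not in OPAQUE_FORMATS:
        return ("format", media_type)
    if declared_class:
        return ("type", declared_class)
    if media_type:
        # Opaque format and no class: the format is still all we have.
        return ("format", media_type)
    return None
```

A client would then dispatch on the returned pair, treating the class as a hint only when the format cannot say what the content is.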

I suggest that the language regarding applications determining and rendering resources be moved to 3.2.3 and new text describing dctypes as the backup plan for rendering resources be added. We should leave a note that at the developer's preference it can be used as a shortcut for determining how to render certain kinds of resources exactly as currently described.

I further suggest swapping the locations of 3.2.2 and 3.2.3 so that body and target metadata are discussed first and body and target classes (I think we mean the type of content they contain, but possibly we mean to repeat their type of format) are discussed second.

With regard to UI/UX, I was imagining both the scenario that is already described (the client is given a dctypes:Image resource and knows to wrap it in an <img> element with the appropriate src attribute) and also scenarios where dctypes:Image is leveraged by an annotation system's IR feature to retrieve, at the user's request, all of the annotations targeting images, or all of the annotations that annotate something with images, or all of the annotations that have images as a part of them.

Regards,

Jacob



BigBlueHat commented 9 years ago

@shepazu let's move this particular debate to a different forum. :smile: Parting thoughts below.

In 1996, when that was written, the Web didn't have many features that it has today. Today, we have webapps and the FileReader API.

FileReader API, fwiw, is about the file system and not Web related. It's a point where the user agent does something other than the Web--hence the need to support other ways of doing things than Content-Type headers, etc.

W3C Recommendations regularly include file extensions, such as *.svg: http://www.w3.org/TR/SVG/intro.html#MIMEType

Right. For SVG files stored on a file system.

HTML5 defines an algorithm that includes the concept of file extensions, in 4.8.3 "Downloading resources": http://www.w3.org/TR/html5/links.html#concept-extension

Correct. For downloading files to a filesystem via a browser.

IANA includes file extensions as part of the registry for a MIME type: https://www.iana.org/form/media-types

Yep. This registry is not just for the Web or HTTP or email. It's also for file systems which use extensions in place of content negotiation.

IETF defines file name extensions in RFC6838: http://tools.ietf.org/html/rfc6838#section-4.12

Right. Again, these registrations have a wider use than just the Web or email. It's also been done this way since at least 1996: http://tools.ietf.org/html/rfc2048#section-2.2.9

File extensions are not universal, but it is a pragmatic heuristic for many resources on the web, if there's no other way to get the MIME type.

There are a few ways to do media type detection. There need to be more.

It's pedantic to pretend otherwise.

Hardly. https://github.com/BigBlueHat/this-is-not-a.pdf

Happy to chat (elsewhere) about it more fully if you'd like. It's an important feature of the Web that's kept it ticking for as long as it has. Breaking it has some wide reaching consequences.
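Both sides of the file-extension exchange can be seen in a small sketch. It uses Python's standard `mimetypes` module as a stand-in for whatever lookup a client might use; the function name is hypothetical.

```python
import mimetypes


def guess_media_type(uri):
    """Guess a media type from the extension in the URI path, or None."""
    media_type, _encoding = mimetypes.guess_type(uri)
    return media_type


# A path ending in .png is guessed as an image, with no request made.
print(guess_media_type("http://example.org/picture.png"))
# No extension, so the heuristic has nothing to offer.
print(guess_media_type("http://github.com/"))
# Guessed as a PDF from the suffix alone, whatever the server actually returns.
print(guess_media_type("https://github.com/BigBlueHat/this-is-not-a.pdf"))
```

The heuristic is cheap and often right, but it reads the URI rather than the response, so it cannot tell a repository page from a PDF when the path happens to end in .pdf.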

shepazu commented 9 years ago

Jacob, I'm assuming IR means "Information Retrieval"? That could just as easily work with a set of MIME types as it could with a set of dctypes, right?

Your example isn't really what I meant by UA UI/UX, but if we're converging on a MAY, I don't think it matters.

shepazu commented 9 years ago

@BigBlueHat, let's agree to disagree about whether a webapp can usefully use the concept of file extensions; I don't think either of us is going to persuade the other. But nobody was suggesting using file extensions, MIME types, or dctypes to do something that would break the Web, so I think we can de-escalate the danger level.

jjett commented 9 years ago

Yes, IR = Information Retrieval

No, MIME types will be too granular and are not actually specific to content type, which is usually what end users want when searching by type. For instance, when searching for images, the last thing that they want to see is a bunch of .tifs containing scanned text.
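To make the retrieval point concrete, here is a small sketch with invented sample data: two annotations whose targets share the media type image/tiff but differ in declared class. The annotation shapes, URIs and function names are assumptions for illustration only.

```python
def _as_list(value):
    """Normalise a missing, single or multiple value to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def filter_by_target_class(annotations, wanted_class):
    """Keep annotations whose target declares the given class."""
    return [a for a in annotations
            if wanted_class in _as_list(a["target"].get("type"))]


def filter_by_target_format(annotations, wanted_format):
    """Keep annotations whose target declares the given media type."""
    return [a for a in annotations
            if a["target"].get("format") == wanted_format]


# Invented sample data: a photograph and a scanned page, both stored as TIFF.
SAMPLE = [
    {"id": "http://example.org/anno1",
     "target": {"id": "http://example.org/photo.tif",
                "type": "dctypes:Image", "format": "image/tiff"}},
    {"id": "http://example.org/anno2",
     "target": {"id": "http://example.org/scan.tif",
                "type": "dctypes:Text", "format": "image/tiff"}},
]

by_class = filter_by_target_class(SAMPLE, "dctypes:Image")
by_format = filter_by_target_format(SAMPLE, "image/tiff")
```

Filtering on the class returns only the photograph, while filtering on the media type returns both, including the scanned text.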

I'd appreciate it if you could provide an illustrative example of UA UI/UX. I've this feeling that every time you say users you really mean developers.

Regards,

Jacob



tkanai commented 9 years ago

I'm afraid I'm missing something, but in my understanding, both body and target just provide IDs (URIs, Uniform Resource Identifiers). Sometimes an ID could be an ISBN ("urn:isbn:4-8399-0454-5", for example). If a body needs to point to a remote resource location, I think it would be much better to use another property for the location, for instance url as YouTube did, instead of using the ID. MIME type or any other resource information should then be set as part of the resource data. In cases where the resource data, or information, is stored in the same graph database, the id property is still useful to identify the resource data node.

iherman commented 9 years ago

@tkanai, this is a much larger issue. I am not saying that it is not a valid issue, on the contrary, but I believe it would deserve a separate discussion at some point.

The problem is:

I am not sure what the right approach is but it is probably something we should document somewhere.

Should we move this into a separate issue?

tkanai commented 9 years ago

@iherman I'm afraid that, in the Linked Data world, even the resolver cannot tell URIs used as locations from URIs used as IDs. For example, "http://www.gutenberg.org/ebooks/11" is provided as the URI, or unique identifier, in an epub file. With the same URL we can also reach the web page from which we can download the epub. I'm wondering whether we can distinguish if an annotation whose target URI is "http://www.gutenberg.org/ebooks/11" is for the epub or for the web page. I might be wrong, but I don't think we can. Besides, I don't think epub is the only case which uses a URI as an ID; there are many objects which do (e.g. Linked Data nodes), and that's why I think it is worth telling them apart explicitly, rather than relying on any "type" properties. As for further discussion about URIs: yes, we should have it.

iherman commented 9 years ago

@tkanai this is a good example, thanks. In fact, it is the same issue as the youtube example used earlier in this thread, right? The "http://www.gutenberg.org/ebooks/11" is used to identify the book, but when resolved it actually leads to a web page with different representations (epub, pdf, etc) of the book, just like the video on a youtube page...

(We can complain about the sloppy usages of these ID-s, but this is the reality out there...)

I am not sure whether the thread got to an equilibrium point on how to handle the issue:-(

stain commented 9 years ago

Right, this is very much the same example. And this is of course also recognized as the HTTP Range-14 problem of separating the resource and the web page about that resource - https://en.wikipedia.org/wiki/HTTPRange-14

... In this case it is complicated by the fact that the identifier <http://www.gutenberg.org/ebooks/11> stands as the abstraction for various information resources (pdf, epub), rather than for a real-life non-information resource like a person or a building.

In this example assigning a type of dctypes:Text does not help to disambiguate. A dc:format "text/html" would help to show you mean the web page though, as it would not make sense to provide dc:format on the abstract resource.

Still, whether you use typing or dc:format, the statement would be applied directly to http://www.gutenberg.org/ebooks/11. So if you had two annotations in the same graph, one targeting the web page about the ebook and the other targeting the identifier for the ebook, you wouldn't know which one is which.
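To make the merge problem concrete, here is a minimal Python sketch. It is not from the spec: triples are modelled as plain tuples, a graph as a set of triples, and the annotation identifiers `urn:example:anno1` and `urn:example:anno2` are made up for illustration.

```python
# Hypothetical sketch: a graph is just a set of (subject, predicate, object) tuples.
EBOOK = "http://www.gutenberg.org/ebooks/11"

# Annotation 1 means the web page, so it describes its target as HTML.
anno1 = {
    ("urn:example:anno1", "oa:hasTarget", EBOOK),
    (EBOOK, "dc:format", "text/html"),
}

# Annotation 2 means the abstract ebook, so it only assigns a type.
anno2 = {
    ("urn:example:anno2", "oa:hasTarget", EBOOK),
    (EBOOK, "rdf:type", "dctypes:Text"),
}

# Loading both annotations into the same graph is a plain set union.
graph = anno1 | anno2


def targets(graph, anno):
    """Return the targets of the given annotation."""
    return [o for s, p, o in graph if s == anno and p == "oa:hasTarget"]


def describe(graph, subject):
    """Return every (predicate, object) pair stated about subject."""
    return sorted((p, o) for s, p, o in graph if s == subject)


for anno in ("urn:example:anno1", "urn:example:anno2"):
    (target,) = targets(graph, anno)
    print(anno, describe(graph, target))
```

Both lines print the same description, because the type and the format are attached to the shared URI and not to either annotation.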

If you are talking about the identifier you can also provide downlinks to the representations, e.g.

    <http://www.gutenberg.org/ebooks/11> prov:generalizationOf
        <http://www.gutenberg.org/files/11/11-h/11-h.htm>,
        <http://www.gutenberg.org/ebooks/11.epub.images>,
        <...> .
    <http://www.gutenberg.org/ebooks/11> dcterms:hasFormat
        <http://www.gutenberg.org/files/11/11-h/11-h.htm>,
        <http://www.gutenberg.org/ebooks/11.epub.images>,
        <...> .
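As a rough sketch of how a client might use such downlinks, the Python below walks from the abstract identifier to one representation. The `HAS_FORMAT` table mirrors only the two URIs spelled out above (the elided entries are left out). The media types in `MEDIA_TYPE` and the helper `representation_for` are assumptions for illustration, not something the thread or the spec states.

```python
ABSTRACT = "http://www.gutenberg.org/ebooks/11"

# dcterms:hasFormat downlinks as listed above; the elided "<...>" entries are omitted.
HAS_FORMAT = {
    ABSTRACT: [
        "http://www.gutenberg.org/files/11/11-h/11-h.htm",
        "http://www.gutenberg.org/ebooks/11.epub.images",
    ],
}

# Assumed media types for each representation (not given in the thread).
MEDIA_TYPE = {
    "http://www.gutenberg.org/files/11/11-h/11-h.htm": "text/html",
    "http://www.gutenberg.org/ebooks/11.epub.images": "application/epub+zip",
}


def representation_for(target, wanted_type):
    """Return the first representation of target with the wanted media type, or None."""
    for candidate in HAS_FORMAT.get(target, []):
        if MEDIA_TYPE.get(candidate) == wanted_type:
            return candidate
    return None


print(representation_for(ABSTRACT, "application/epub+zip"))
```

A client that cannot find a matching representation gets `None` back and has to fall back to the abstract identifier.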

Perhaps we could formalize this in the specification? Obviously, if you use the abstract http://www.gutenberg.org/ebooks/11 as the target when annotating the epub book, any selectors get trickier. I would have preferred the annotation to say which representation was actually being annotated, and then to link from there to the abstract identifier. The abstract ID could then be used for discovery, while the representation URI tells you how to apply selectors, rendering, etc.

This is a challenge similar to the one we had with content negotiation, where we introduced oa:hasState: http://www.w3.org/TR/2014/WD-annotation-model-20141211/#request-header-state Could oa:hasState and friends also be appropriate here?
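For illustration only, this is roughly what the oa:hasState route could look like. It assumes the request-header state carries its headers as a plain "Name: value" string; the dictionary layout, the `headers_from_state` helper and the `application/epub+zip` value are hypothetical. Nothing here claims that the Gutenberg server actually performs content negotiation on this URI.

```python
def headers_from_state(state_value):
    """Split 'Name: value' header lines from a request-header state into a dict."""
    headers = {}
    for line in state_value.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers


# Hypothetical target: the abstract URI plus a state naming the wanted representation.
specific_target = {
    "type": "oa:SpecificResource",
    "source": "http://www.gutenberg.org/ebooks/11",
    "state": {
        "type": "oa:HttpRequestState",
        "value": "Accept: application/epub+zip",
    },
}

request_headers = headers_from_state(specific_target["state"]["value"])
print(specific_target["source"], request_headers)
```

The state only helps if the server honours the header, which is why it would complement rather than replace explicit links to the representations.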


Stian Soiland-Reyes Apache Taverna (incubating), Apache Commons RDF (incubating) http://orcid.org/0000-0001-9842-9718

tkanai commented 9 years ago

@iherman, @stain Actually it is slightly different. Sorry for the unclear explanation. My point is that even if the epub file is stored on a local PC, in Dropbox or in a reading system, or is distributed by Apple or Kobo, it still uses the same URI, because the URI is written in the epub metadata as its ID. So as long as the annotation target node points to that ID, readers of the title can share annotations with each other.
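A small Python sketch of that sharing scenario, under stated assumptions: the field names `target_id` and `location`, the reader names and the file paths are all invented for illustration. The only value taken from the thread is the book URI.

```python
BOOK_ID = "http://www.gutenberg.org/ebooks/11"

# Each reader holds a copy of the book in a different place, but the ID written
# in the epub metadata is the same, so every annotation targets that ID.
annotations = [
    {"reader": "reader-a", "target_id": BOOK_ID,
     "location": "file:///home/reader-a/books/book.epub", "note": "first note"},
    {"reader": "reader-b", "target_id": BOOK_ID,
     "location": "https://storage.example/reader-b/book.epub", "note": "second note"},
    {"reader": "reader-c", "target_id": "urn:isbn:4-8399-0454-5",
     "location": "file:///home/reader-c/other.epub", "note": "different title"},
]


def shared_annotations(annotations, book_id):
    """Return all annotations on a title, regardless of where each copy is stored."""
    return [a for a in annotations if a["target_id"] == book_id]


for anno in shared_annotations(annotations, BOOK_ID):
    print(anno["reader"], anno["note"])
```

Matching on the ID alone ignores `location` entirely, which is what lets readers with different copies see each other's notes.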

If the epub is accessible only from the URL then yes, you are right: it is exactly the same case as YouTube, and as @stain suggested, we should assign some kind of fragment identifier to the URL.

iherman commented 9 years ago

@tkanai you are right. In this respect the identification of e-books is some sort of a hyper HTTPRange-14 issue:-)

My take is: we will not solve this problem; it is way beyond what this WG can handle. I believe that if we stop at the point of saying MAY or SHOULD for the usage of dctypes, we should be fine and this is as far as we can go. Let us leave the HTTPRange-14 issue (hyper or plain) to some other groups...

azaroth42 commented 8 years ago

Resolved in Oct 2015 WD