Make async clipboard APIs (read/write) to sanitize interoperably with setData/getData for text/html

gked commented 3 years ago

Hi All, Excel online team reached out to us on several issues they found with clipboard formats. Sanitization across legacy and async clipboard APIs is not consistent for some mime types. For example, while text/html when set with setData/getData, keeps meta tags in Chrome, async clipboard APIs strip them down. This creates an issue for online text editing applications. Target apps, processing paste, rely on meta tags to infer information about the clipboard payload source. @snianu has wrote a detailed analysis which can be found here. A matrix view of formats can be found here. We propose that we make async clipboard read/write serialization behaviors consistent with legacy clipboard API such as setData/getData. The reason being browsers already expose this information through setData/getData and making async clipboard read/write to behave in a compatible way which will ease its adoption over time.

If we can all agree on this, we will follow up with a PR clarifying the behavior in the Clipboard Spec which will also take care of #140

CC: @rniwa @whsieh @dway123 @garykac @BoCupp-Microsoft @megangardner @johanneswilm

dway123 commented 3 years ago

I'm supportive of having datatransfer and navigator.clipboard (async clipboard) APIs be consistent in their sanitization behavior of text/html (and other formats). Thank you for the detailed analysis, and for offering to help update the spec to clarify this!

annevk commented 3 years ago

How did the discrepancy arise in Chrome? Part of the plan with https://wicg.github.io/sanitizer-api/ is to get this all standardized. Does that meet the requirements you have?

cc @mozfreddyb @otherdaniel

otherdaniel commented 3 years ago

Part of the plan with https://wicg.github.io/sanitizer-api/ is to get this all standardized. Does that meet the requirements you have?

I don't know much about the specific question here, but on clipboard vs Sanitizer API: This is a v2 item. One major block of functionality missing is that the clipboard sanitizers do some form of style normalization, while the Sanitizer is presently HTML only.

If you want to help this along: I haven't found any documentation from any browser about how the style normalization actually works. It seems all browser engines have implemented something along those lines, and somehow they all forgot to document what they actually do. Describing the existing functionality would help a lot in getting this into Sanitizer API.

snianu commented 3 years ago

How did the discrepancy arise in Chrome? Part of the plan with https://wicg.github.io/sanitizer-api/ is to get this all standardized. Does that meet the requirements you have?

The DataTransfer APIs are legacy APIs that probably have this behavior due to IE and backward compatibility. This is the reason why Firefox & Chromium browsers return the HTML markup string as is when queried via getData method i.e. they don't strip out style and meta tags from the HTML markup string.

I think the issue here is that due to strict sanitization performed by async clipboard APIs (that don't mutate the DOM directly) websites lose style information, meta tags etc when content is copied from Chromium and pasted into native apps. Office online & Office native apps use proprietary style formats in the HTML and the browser's in-built sanitizer removes this content from the HTML markup when it's read/written via async clipboard APIs.

The sanitizer API can certainly be used by web authors to sanitize the HTML content if they want to, but we should give them the flexibility to decide whether to use the Browser sanitizer API or their own proprietary sanitizer service.

rniwa commented 3 years ago

How did the discrepancy arise in Chrome? Part of the plan with https://wicg.github.io/sanitizer-api/ is to get this all standardized. Does that meet the requirements you have?

No. One of the reasons we sanitize the markup content is to strip way any content that's invisible to the user so we strip away any invisible content as determined by the style & layout information. We're not gonna change that.

snianu commented 3 years ago

@rniwa I don't think that is the issue here. Although Safari doesn't remove any meta, style etc tags from the custom pasteboard format which it creates during setData, and reads from the custom pasteboard format during getData. The standard HTML format has the sanitized version, but if a site is writing and reading via DataTransfer APIs then they get the unsanitized HTML content.

rniwa commented 3 years ago

@rniwa I don't think that is the issue here. Although Safari doesn't remove any meta, style etc tags from the custom pasteboard format which it creates during setData, and reads from the custom pasteboard format during getData. The standard HTML format has the sanitized version, but if a site is writing and reading via DataTransfer APIs then they get the unsanitized HTML content.

We only apply sanitization when the content is read or written across cross origin or cross applications, not when read or written within a single origin.

dway123 commented 3 years ago

How did this discrepancy arise in Chrome?

If you want to help this along: I haven't found any documentation from any browser about how the style normalization actually works.

@snianu's understanding seems correct (this was recently documented in lots of misc CL comments).

fwiw I've drafted some documentation for my understanding of Chrome's clipboard, including a brief history/summary of how/what we sanitize.

snianu commented 3 years ago

@rniwa

We only apply sanitization when the content is read or written across cross origin or cross applications, not when read or written within a single origin.

This means you trust the clipboard content during read? I think clipboard content should always be treated as untrusted. It doesn't matter if Safari wrote the html payload to the clipboard or some malicious or trusted native apps did.

snianu commented 3 years ago

We want to standardize the clipboard sanitization procedure as it has caused issues for many popular apps while using async clipboard HTML read/write. The spec says that during write, we should only write the sanitized payload of the item, but doesn't say anything about what the process of sanitization looks like and what kind of data is expected in the clipboard item. e.g. when writing HTML payload using the async clibpoard write, what should be the format of the HTML markup string? Should it be a document fragment or complete HTML document? The read operation doesn't specify anything about sanitization and I found that Chromium is performing strict sanitization for both HTML read & write operation which breaks sites like Excel online as mentioned in this bug. I agree that we should at least strip out tags such as script & elements that have javascript: protocols associated with them as they are harmful, but we should mention explicitly what tags are stripped out in this process so it sets an expectation for the web developers & native apps who are reading these formats.

Also FWIW currently at least in Chromium & Firefox, we write unsanitized HTML content to the clipboard when web developers use setData method, and AFAIK there are no known security threats as we still use the strict sanitization when user pastes the data into an editable region of the website. Web developers have proper sanitization process when they query the HTML markup using getData[4]. Currently websites like Office online use setData & getData methods to read/write html content that don't do any sanitization to support high fidelity copy/paste content. Native apps(including Office) are already exposed to these types of HTML content from the clipboard. In the browser, we can perhaps do some level of sanitization during read (as it might affect the DOM when sites insert the fragment) if that is a concern, but we shouldn't be aggressively sanitizing content during write (to be in parity with DataTransfer APIs).

To better describe the sanitization & clipboard write process for HTML payload we propose the following:

Currently we process the HTML string as a document fragment that basically ignores the content inside the <head> element. This leads to loss of styles[1], meta tag etc. while parsing the document fragment. Instead we are proposing to parse the string provided to the async write API as an HTML document. Insert the start & end fragment comment tags within the body element, and then create a well formed HTML document.
Use the sanitizer APIs default configuration[2] to sanitize the HTML document. This makes the sanitization process consistent with what is being proposed in the sanitizer API[3].
Create the HTML header info (corresponding to each platform) and then write the HTML format to the clipboard.

[1] Here is a GIF that shows the style loss when user copies and paste from Excel online to win32 Excel app: https://drive.google.com/file/d/1Nsyp1rUKc_NF4l0n-O05snAKabHAKeiG/view [2] https://wicg.github.io/sanitizer-api/#default-configuration [3] https://wicg.github.io/sanitizer-api/#sanitizer-algorithms [4] https://github.com/ckeditor/ckeditor5/blob/9ad13ec94d77e10e4ddf678e86577acb107db3fb/packages/ckeditor5-clipboard/docs/framework/guides/deep-dive/clipboard.md

a-sully commented 3 years ago

Not entirely related, but I'll chime in about the state of image (PNG) sanitization to provide a +1 for updating the spec to make the sanitization process more explicit. The spec mentions that images should be "safe", but (as you noted) is silent on the specifics of this sanitization, leading to inconsistencies across browsers.

Currently in Chrome, images are sanitized on both read and write. Skia is used to strip almost all metadata, retaining only what fits in an SkBitmap. This behavior is inconsistent with other major non-Chromium browsers, which do not sanitize images on read (see https://crbug.com/1177229).

We've proposed to change Chrome to only sanitize images on write to align with other browsers. Note that this allows a site to read an unsanitized PNG directly from the system clipboard, so the "only sanitize on write" approach there is not acceptable for HTML data (and seems to be the opposite direction of what's being proposed above for HTML). In updating the spec to be more explicit about sanitization procedures (which I agree we should do) we'll have to spell out the process for each type we care about, since there is no prescriptive sanitization process that works for all types.

annevk commented 3 years ago

@snianu thanks for tackling this! Could you detail step 1 a bit more? If I do something like document.body.innerHTML = "<meta><style>test</style>" it shows that the fragment parser is also capable of preserving certain elements so I'd like to see more detail on the exact problem and also the proposed solution (e.g., what's a fragment comment tag?).

mozfreddyb commented 3 years ago

Early on during our work on the sanitizer, we've been eying the issue of incoherent, implicit sanitizer usage during copy&paste. I'm very much in favor of specifying things more explicitly.

mbrodesser commented 3 years ago

We want to standardize the clipboard sanitization procedure as it has caused issues for many popular apps while using async clipboard HTML read/write. The spec says that during write, we should only write the sanitized payload of the item, but doesn't say anything about what the process of sanitization looks like and what kind of data is expected in the clipboard item. e.g. when writing HTML payload using the async clibpoard write, what should be the format of the HTML markup string? Should it be a document fragment or complete HTML document? The read operation doesn't specify anything about sanitization and I found that Chromium is performing strict sanitization for both HTML read & write operation which breaks sites like Excel online as mentioned in this bug. I agree that we should at least strip out tags such as script & elements that have javascript: protocols associated with them as they are harmful, but we should mention explicitly what tags are stripped out in this process so it sets an expectation for the web developers & native apps who are reading these formats.

Also FWIW currently at least in Chromium & Firefox, we write unsanitized HTML content to the clipboard when web developers use setData method, and AFAIK there are no known security threats as we still use the strict sanitization when user pastes the data into an editable region of the website. Web developers have proper sanitization process when they query the HTML markup using getData[4]. Currently websites like Office online use setData & getData methods to read/write html content that don't do any sanitization to support high fidelity copy/paste content. Native apps(including Office) are already exposed to these types of HTML content from the clipboard. In the browser, we can perhaps do some level of sanitization during read (as it might affect the DOM when sites insert the fragment) if that is a concern, but we shouldn't be aggressively sanitizing content during write (to be in parity with DataTransfer APIs).

To better describe the sanitization & clipboard write process for HTML payload we propose the following:
1. Currently we process the HTML string as a document fragment that basically ignores the content inside the `<head>` element. This leads to loss of styles[1], meta tag etc. while parsing the document fragment. Instead we are proposing to parse the string provided to the async write API as an HTML document. Insert the start & end fragment comment tags within the `body` element, and then create a well formed HTML document.

Be aware that start and end fragment comment tags are Windows-specific. On other operating systems (e.g. Ubuntu), that will have to be handled differently.

2. Use the sanitizer APIs default configuration[2] to sanitize the HTML document. This makes the sanitization process consistent with what is being proposed in the sanitizer API[3].

3. Create the HTML header info (corresponding to each platform) and then write the HTML format to the clipboard.

How is the HTML header info platform-specific?

[1] Here is a GIF that shows the style loss when user copies and paste from Excel online to win32 Excel app: https://drive.google.com/file/d/1Nsyp1rUKc_NF4l0n-O05snAKabHAKeiG/view [2] https://wicg.github.io/sanitizer-api/#default-configuration [3] https://wicg.github.io/sanitizer-api/#sanitizer-algorithms [4] https://github.com/ckeditor/ckeditor5/blob/9ad13ec94d77e10e4ddf678e86577acb107db3fb/packages/ckeditor5-clipboard/docs/framework/guides/deep-dive/clipboard.md

mbrodesser commented 3 years ago

@snianu thanks for tackling this! Could you detail step 1 a bit more? If I do something like document.body.innerHTML = "<meta><style>test</style>" it shows that the fragment parser is also capable of preserving certain elements so I'd like to see more detail on the exact problem and also the proposed solution (e.g., what's a fragment comment tag?).

Presumably  and  were meant, see https://docs.microsoft.com/en-us/previous-versions/windows/internet-explorer/ie-developer/platform-apis/aa767917(v=vs.85).

snianu commented 3 years ago

@snianu thanks for tackling this! Could you detail step 1 a bit more?

Sure. Step 1 basically creates the document object from the HTML string provided by the web authors. It could just be a document fragment or a full HTML document . We use the DOMParser to create a well formed HTML document from the input, and then use the sanitizer API to strip out harmful contents from the markup. After this step, we insert platform specific header info into the serialized document and then write it to the clipboard.

Here is a more detailed proposal for the HTML sanitization:

First we create a DOMParser object that can be used to parse the string provided by the web authors and get a document object. We use the algorithm defined in parseFromString for text/html type to parse the html string.
Then, we want to use the Sanitizer API to strip out harmful contents such as script tags. Currently we don't have a way to parse and get a full HTML document using the sanitize method, but we want to do something like how sanitizeFor would parse an HTML document element using the html as element & doc.documentElement.innerHTML (where doc is the document from step 1) as the input

In the last step, we want to create HTML platform specific header and add it to the markup string which is the HTML document from step 2 serialized into string. Examples of platform specific headers are given below: On Windows, we have:

Version:0.9
StartHTML:<start offset of the start html tag>
EndHTML:<start offset of the end html tag>
StartFragment:<start offset of the start fragment comment tag>
EndFragment:<start offset of the end fragment comment tag>
<html>
<body>
<!--StartFragment-->
<body content goes here>
<!--EndFragment-->
</body>
</html>

On Linux we have:

<meta http-equiv="content-type" content="text/html; charset=utf-8">

etc... Tagging @mkruisselbrink @a-sully if this proposal looks good to them.

annevk commented 3 years ago

@snianu I would also like to understand why the fragment parser does not suffice, as it seems to preserve the elements you indicated were dropped.

snianu commented 3 years ago

@annevk

@snianu I would also like to understand why the fragment parser does not suffice, as it seems to preserve the elements you indicated were dropped.

If we parse the HTML markup string in the context of body element, then it wouldn't parse the contents of the html & head elements properly. We want to process the string as a complete HTML document using the DOMParser so it parses the contents of html & head in the context that the authors intended. Also, if we inline each and every style that are applicable to the content of the body, then it would bloat the HTML markup on the clipboard. e.g. If you copy a grid from Excel Online where the cells have different formats, inlining the styles would produce a huge markup where each td & tr elements would have a different style. Having just a style element in the head not only preserves the styles applicable to all the cells, but also produces a small HTML markup on the clipboard.

annevk commented 3 years ago

But as I showed elements such as style and meta are preserved by the fragment parser. And if you put everything as a child of the body element in the end anyway I fail to see the point.

travisleithead commented 3 years ago

Sept. WG Meeting: discussed @snianu's proposal above. Good learnings. No immediate actions/resolutions.

BoCupp-Microsoft commented 3 years ago

@annevk maybe I can clarify...

@snianu's goal is process markup like this: some text into: <html><head></head><body>some text</body></html> and also markup like this:

<!DOCTYPE html>
<html dir="ltr" lang="en">
<head>
<meta http-equiv=Content-Type content="text/html; charset=utf-8">
<meta name=ProgId content=Excel.Sheet>
<meta name=Generator content="Microsoft Excel 15">
<style>
table
    {mso-displayed-decimal-separator:"\.";
    mso-displayed-thousand-separator:"\,";}
tr
    {mso-height-source:auto;}
col
    {mso-width-source:auto;}
td
    {padding-top:1px;
    padding-right:1px;
    padding-left:1px;
    mso-ignore:padding;
    color:black;
    font-size:11.0pt;
    font-weight:400;
    font-style:normal;
    text-decoration:none;
    font-family:Calibri, sans-serif;
    mso-font-charset:0;
    text-align:general;
    vertical-align:bottom;
    border:none;
    white-space:nowrap;
    mso-rotate:0;}
.xl16
    {color:black;
    font-family:Calibri;
    mso-generic-font-family:auto;
    mso-font-charset:0;
    background:#E7E6E6;
    mso-pattern:black none;}
.xl17
    {color:black;
    font-family:Calibri;
    mso-generic-font-family:auto;
    mso-font-charset:0;
    background:#D9E2F3;
    mso-pattern:black none;}
.xl18
    {color:black;
    font-family:Calibri;
    mso-generic-font-family:auto;
    mso-font-charset:0;
    background:#E2EFD9;
    mso-pattern:black none;}
</style>
</head>

<body link="#0563C1" vlink="#954F72">

<table width=192 style='border-collapse:collapse;width:144pt'>
<!--StartFragment-->
 <col width=64 style='width:48pt' span=3>
 <tr height=20 style='height:15.0pt'>
  <td width=64 height=20 class=xl16 style='width:48pt;height:15.0pt'>One</td>
  <td width=64 class=xl17 style='width:48pt'>Two</td>
  <td width=64 class=xl18 style='width:48pt'>Three</td>
 </tr>
 <tr height=20 style='height:15.0pt'>
  <td height=20 align=right style='height:15.0pt'>1</td>
  <td align=right>2</td>
  <td align=right>3</td>
 </tr>
<!--EndFragment-->
</table>
</body>
</html>

Into a DOM that represents all the content basically as written above including the attributes of the HTML element.

If we use the fragment parser we need to provide a context element. Let me know if I'm mistaken but I don't see anything in the spec about how we could we can initialize the fragment parser so that it sets up the HTML parser in the initial insertion mode. So the best we could do with the fragment parser given the second markup example I provided is to use an html element as our context and then have the parser throw away the DOCTYPE and original html node along with its attributes.

rniwa commented 3 years ago

I must point out that those start fragment & end fragment marker is Windows specific construct that doesn't apply to Apple's platforms. In Apple's platforms, it is pasteboard data provider's responsibility to serialize HTML in such a way as to preserve all its styles (typically with inline style content attributes).

BoCupp-Microsoft commented 3 years ago

I must point out that those start fragment & end fragment marker is Windows specific...

@rniwa I agree... I wasn't trying to make any particular point by including those, I was just grabbing some real markup as supplied by Excel Online to setData.

rniwa commented 3 years ago

I must point out that those start fragment & end fragment marker is Windows specific...

@rniwa I agree... I wasn't trying to make any particular point by including those, I was just grabbing some real markup as supplied by Excel Online to setData.

I get that. But your primary point is that we need some special way of initializing the fragment parser. I'm saying that this is not the case in Apple's platforms so this issue is at best applicable to some platforms; put it differently, the issue presents itself in different ways in different platforms. I don't think there is a way to create an interoperable behavior here given what gets put into the system's pasteboard / clipboard is dependent on that specific platform.

BoCupp-Microsoft commented 3 years ago

But your primary point is that we need some special way of initializing the fragment parser...

I'm glad you brought that up. Actually, the discussion on the fragment parser is a bit of a tangent and a means to an end. The primary point is captured in Anupam's first sentence in this comment:

We want to standardize the clipboard sanitization procedure...

I do think it's possible to standardize what authors should expect to be sanitized away from the data that they supply to navigator.clipboard.write for our well-known mime-types. So the key point in Anupam's post is really this:

...we want to use the Sanitizer API to strip out harmful contents...

I believe his initial thinking is to sanitize clipboard using the Sanitizer's default configuration, which leaves things like style and meta tags but removes harmful script tags and javascript: URLs etc. (more detail available in the Sanitizer spec).

@whsieh covered today in the Web Editing WG meeting that Safari prefers to also remove content that would be invisible like comments, display:none, opacity: 0, etc. So maybe we could define a configuration and/or upgrades to the sanitization algorithm that does that too if we decide that's important aspect to keep.

Any objection to standardizing what a "sanitized copy" means to remove ambiguity from the clipboard API's write algorithm?

rniwa commented 3 years ago

Any objection to standardizing what a "sanitized copy" means to remove ambiguity from the clipboard API's write algorithm?

This is an area of the browser engine (and the operating system) that we'd like to continue to make improvements on so I don't think we want to prematurely shoehorned into a very specific algorithm. Furthermore, even if someone were to make a concrete proposal and we were to agree with the generic algorithm in principle, it's very much unlikely that anything remotely close to it will be implemented in WebKit in the foreseeable feature since there are specific concerns and constraints we're dealing with Apple's platforms.

mozfreddyb commented 3 years ago

I'd be very much in favor of removing ambiguity from the term "sanitized copy". It's very vague right now.

@rniwa Can you share some of the constraints?

Maybe there's a way to remove some grade of ambiguity while respecting those constraints?

BoCupp-Microsoft commented 3 years ago

Tagging @whsieh... we have a follow-up meeting 9/24/21 for the Web Editing WG. I wonder if you could help bring forward the list of the specific concerns and constraints mentioned above?

whsieh commented 3 years ago

Tagging @whsieh... we have a follow-up meeting 9/24/21 for the Web Editing WG. I wonder if you could help bring forward the list of the specific concerns and constraints mentioned above?

If I understand correctly, the issue is really that the process of creating a "sanitized copy" is fraught with complications — I suppose these three details immediately stand out to me, though there are likely more nuances:

As discussed in the TF meeting, when sanitizing markup, WebKit loads the markup in a separate (offscreen) page and browsing context, and then serializes the loaded result into markup. This page that we use for sanitization is special, in that we forbid any script execution, but still allow script tags to be parsed as if they were going to be executed. This discrepancy is necessary in order to ensure that the page cannot craft a payload that is deemed "safe" when loaded in a browser that disables script, but is unsafe when loaded in a browser that enables script. I'm not sure this behavior is something that can or should be specified.
For compatability with older versions of Microsoft Office, we may preserve attributes on the html element in a narrow case where it contains the text xmlns:o="urn:schemas-microsoft-com:office:office". Would the specification allow for user agents to selectively preserve content like this?
The process of serializing "visible content" in the page we use for sanitization is also pretty difficult to (exactly) specify, since we rely on editing code in WebKit that determines which DOM positions are "visible" to the user (and, importantly, visually distinct from other such DOM positions) to figure out the range in the sanitized page that we should include in the final sanitized markup. For instance, if we're sanitizing <div><div>Hello</div></div>, we won't attempt to preserve the fact that there are nested div elements, since the first user-visible position is right before the "H" in the inner text node.

BoCupp-Microsoft commented 3 years ago

Thanks for sharing. Some initial reactions below for you to digest in advance of our next meetup.

As discussed in the TF meeting, when sanitizing markup, WebKit loads the markup in a separate (offscreen) page and browsing context, and then serializes the loaded result into markup. This page that we use for sanitization is special, in that we forbid any script execution, but still allow script tags to be parsed as if they were going to be executed. This discrepancy is necessary in order to ensure that the page cannot craft a payload that is deemed "safe" when loaded in a browser that disables script, but is unsafe when loaded in a browser that enables script. I'm not sure this behavior is something that can or should be specified.

What is the result of doing that? I may be missing something subtle about what you mean by "as if they were going to be executed". Are there any special characteristics that would make that document different than one produced by new DOMParser().parseFromString(htmlToSanitize, "text/html")?

For compatibility with older versions of Microsoft Office, we may preserve attributes on the html element in a narrow case where it contains the text xmlns:o="urn:schemas-microsoft-com:office:office". Would the specification allow for user agents to selectively preserve content like this?

I suppose we could allow a decision like that to be user agent specific, but my preference would be for us to discuss what the Sanitizer API allows or doesn't that we would consider harmful and then address those cases so that the attribute list can be part of the spec.

The process of serializing "visible content" in the page we use for sanitization is also pretty difficult to (exactly) specify, since we rely on editing code in WebKit that determines which DOM positions are "visible" to the user (and, importantly, visually distinct from other such DOM positions) to figure out the range in the sanitized page that we should include in the final sanitized markup. For instance, if we're sanitizing
Hello
, we won't attempt to preserve the fact that there are nested div elements, since the first user-visible position is right before the "H" in the inner text node.

You mentioned in our last working group meeting that Safari effectively does a "Select All" operation on the offscreen document and serializes the resulting range... did I get that right? I agree that editing heuristics relating to normalized selection positions would be hard to specify without some other foundational work coming first. My preference would be to understand what threat is being mitigated and see if we can propose a simpler step that could still mitigate the same threat.

rniwa commented 3 years ago

The process of serializing "visible content" in the page we use for sanitization is also pretty difficult to (exactly) specify, since we rely on editing code in WebKit that determines which DOM positions are "visible" to the user (and, importantly, visually distinct from other such DOM positions) to figure out the range in the sanitized page that we should include in the final sanitized markup. For instance, if we're sanitizing Hello, we won't attempt to preserve the fact that there are nested div elements, since the first user-visible position is right before the "H" in the inner text node.

You mentioned in our last working group meeting that Safari effectively does a "Select All" operation on the offscreen document and serializes the resulting range... did I get that right? I agree that editing heuristics relating to normalized selection positions would be hard to specify without some other foundational work coming first. My preference would be to understand what threat is being mitigated and see if we can propose a simpler step that could still mitigate the same threat.

Fundamentally, I don't think this is something we want to standardize at this point in time.

css-meeting-bot commented 2 years ago

The Web Editing Working Group just discussed Make async clipboard APIs (read/write) to sanitize interoperably.

The full IRC log of that discussion

<Travis> topic: Make async clipboard APIs (read/write) to sanitize interoperably
<Travis> github: https://github.com/w3c/clipboard-apis/issues/150
<Travis> BoCupp: pull request open in chromium to write unsanitized to the clipboard (for well-known HTML format) AND reading with well-known format if unsanitized flag provided.
<whsieh> q+
<Travis> .. specific action for 150?
<Travis> anupam: we decided that reading unsanitized HTML, web authors can opt-in to unsanitized text/html using new option (that takes a list of mime-types to read unsanitized).
<Travis> BoCupp: That's the current plan for chromium
<Travis> .. from @rniwa last comment is that this isn't something we want standardized.
<Travis> ack whsieh
<Travis> whsieh: when we last discussed, @annevk noted an API that blink would implement, but webkit would implement (but perhaps as a no-op).
<Travis> .. we wanted unsanitized flag for compat with those who already adopted the read API.
<Travis> .. how widespread is the adoption?
<Travis> .. seems more sensible to move forward without sanitization for Blink?
<Travis> .. just have the clients use the unsanitzed data?
<Travis> anupam: I check the usage, it's 0.001 / 0.002
<Travis> BoCupp: read is giving an "insert-ready" content... if these sites were relying on the browser to sanitize, could create a XSS vuln that we are creating...
<Travis> whsieh: potential for XSS is present anyway (where there is no sanitization)
<Travis> BoCupp: but we're changing the contract; there's an expectation that they have to write a sanitizer.. when they switch to read, they may have been glad that sanitizer is provided by the browser.
<Travis> .. but if update the contract--nothing will signal to these people that they need to change to start sanitizing.
<Travis> .. agree with whsieh points... but tough to change from sanitized->unsanitized.
<Travis> whsieh: so does this need to be put into the spec?
<Travis> .. seems more of a problem just for an implementation (blink)
<Travis> .. maybe something that's not in the spec could be put in place for this case?
<Travis> BoCupp: does webkit ever return unsanitized?
<Travis> whsieh: copy/paste between same origin does this.
<Travis> BoCupp: Oh, I thought you always produced a sanitized read?
<Travis> whsieh: yes, that's for getData... we have a bug to do this for clipboard...
<Travis> BoCupp: so you might have the same concern for that when you make that future change?
<Travis> whsieh: in async clipboard, they always get sanitized.
<Travis> .. don't expect there to be huge problems with getting unsanitized data.
<Travis> BoCupp: we have a security review issue open; chromium security folks were saying we needed to be explicit. I think for read? or was it write?
<Travis> .. in any case, need to cross-review with security folks.
<johanneswilm> q+
<Travis> .. suspect perhaps might not fly
<Travis> whsieh: fwiw, that's webkit's stance.
<Travis> .. maybe just not something that's in the spec.
<Travis> BoCupp: there's a usefulness to this also--having the browser produce an "insert ready" fragment (like the default action for paste)
<Travis> .. there's no primitive that exposes the browser's default processing.
<Travis> .. Can get the clipboard data from a read today without needing complicated workarounds.
<Travis> ack johanneswilm
<Travis> johanneswilm: if usage is low, and text editing is happening in only a handful of libraries--could we check with them on if that's the major usage?
<BoCupp> q+
<Travis> .. the "they have to just put it in the DOM without thinking" is dangerous--leads to them allowing everything to go through--but tables could break editor scenarios... JS sanitiation is usually needed anyway.
<Travis> ack BoCupp
<Travis> BoCupp: The default browser behavior for paste is to just insert it into contenteditable
<Travis> .. agree with johanneswilm that the editor must be written to have a filter (for what can be edited)
<Travis> .. however, it's the current default behavior for paste!
<Travis> .. often we're exposing the internal of the browser... this is one more of those things.
<Travis> .. there is a sanitizer that we employ but we're not giving access to it. Maybe through a sanitizer API with configuration that allows it to work like the paste to contenteditable? (Could be interesting.)
<Travis> ... but we do something today that authors can't get today; additionally we have a web compat concern. So why not leave it?
<snianu> Base: https://github.com/w3c/clipboard-apis/pull/158
<snianu> Pickling: https://github.com/w3c/clipboard-apis/pull/162
<Travis> action: wait for snianu's PRs (base: https://github.com/w3c/clipboard-apis/pull/158 pickling: https://github.com/w3c/clipboard-apis/pull/162). (It re-writes the spec with better algorithmic language. Then there's some pickling work on top of that... Would like folks in the WG to read/review the upcoming PRs.

annevk commented 2 years ago

So part of the problem here that became more clear to me during a meeting is that some browsers have different behaviors for the "legacy" and "new" clipboard APIs. In particular the former does not sanitize and the latter does. It's not clear to me we want to enshrine that difference. Either we should sanitize consistently or not at all. (I could sanitization differing based on the origin (although that's not a fleshed out concept in the specification today), but not based on the API in use. Especially as the APIs do not have names that suggest these semantics.)

rniwa commented 2 years ago

A key feature of sanitization WebKit implements is stripping invisible contents and converting CSS rules to inline styles. Stripping invisible contents is important to avoid pasting privacy sensitive information such as user's home directory path (e.g. may contain the user name of the local computer, or even full mailing address in the case of some Microsoft Office application users). Converting CSS rules to inline style is important so that the pasted content won't adversely affect the rest of the document. We had to preserve some Microsoft Word's list item annotations in order to support TinyMCE's converter code but we consider that to be more of a quirk that we want to get rid of in the long term; e.g. by handling the list item annotations ourselves.

snianu commented 2 years ago

Converting CSS rules to inline style is important so that the pasted content won't adversely affect the rest of the document.

Some native applications and sites like Excel online have a restriction on the HTML payload size, and inlining the styles also bloats the HTML. Maybe copying/pasting small fragments aren't an issue, but if we want to support sites that require high fidelity HTML content (with custom styles etc) inlining may not be ideal. Moreover, there are custom styles being used by sites that get stripped out during this inlining process that breaks functionality. e.g. Custom list styles are written to the clipboard by Word Online to preserve Word specific list semantics.

rniwa commented 2 years ago

Converting CSS rules to inline style is important so that the pasted content won't adversely affect the rest of the document.

Some native applications and sites like Excel online have a restriction on the HTML payload size, and inlining the styles also bloats the HTML. Maybe copying/pasting small fragments aren't an issue, but if we want to support sites that require high fidelity HTML content (with custom styles etc) inlining may not be ideal. Moreover, there are custom styles being used by sites that get stripped out during this inlining process that breaks functionality. e.g. Custom list styles are written to the clipboard by Word Online to preserve Word specific list semantics.

Sure. This is a trade off between that and user privacy and we chose to value user's privacy more than those concerns. This is precisely why I don't think we should be standardizing the same behavior across browsers.

sspi commented 2 years ago

Converting CSS rules to inline style is important so that the pasted content won't adversely affect the rest of the document.

Seems a good idea in the browser's native implementation of "pasting" into an editable element, but not when a script wants to get the content from the clipboard Api. It's up to the js developer to manage these merge / conflict rules.

Maybe you should distinguish between the "sanitize" algorithm (used by the clipboard Api) and the "default paste" algorithm?

snianu commented 2 years ago

Sure. This is a trade off between that and user privacy and we chose to value user's privacy more than those concerns.

Well, broken functionality is worse than user "privacy". Quoting "privacy" because I haven't seen any real-world examples of the privacy risks that you are citing in all your responses. We have discussed this in the Editing WG meetings as well and we are yet to see an example of sites exposing user mailing address. Local file path that you referred to in your earlier response is a temporary folder exposed by Office native apps to support image pasting because converting into base64 encoded image has other issues related to perf, clipboard payload size restriction, delay rendering issue related to resources being available at a later point in time, which I think is orthogonal to this issue.

whsieh commented 2 years ago

[…] Local file path that you referred to in your earlier response is a temporary folder exposed by Office native apps to support image pasting

After some quick testing, this is a snippet of the raw markup saved to NSPasteboard after copying from Microsoft Word for Mac (version 16.53):

<html xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns:m="http://schemas.microsoft.com/office/2004/12/omml"
xmlns="http://www.w3.org/TR/REC-html40">

<head>
<meta http-equiv=Content-Type content="text/html; charset=utf-8">
<meta name=ProgId content=Word.Document>
<meta name=Generator content="Microsoft Word 15">
<meta name=Originator content="Microsoft Word 15">
<link rel=File-List
href="file:////Users/whsieh/Library/Group%20Containers/UBF8T346G9.Office/TemporaryItems/msohtmlclip/clip_filelist.xml">
<link rel=Edit-Time-Data
href="file:////Users/whsieh/Library/Group%20Containers/UBF8T346G9.Office/TemporaryItems/msohtmlclip/clip_editdata.mso">
<!--[if !mso]>

First, it looks like this (at a minimum) leaks the name of the user's home directory. Secondly, it looks like the name of this temporary folder (UBF8T346G9 in my case) persists, even after closing and relaunching Word and copying text again.

The combination of these two pieces of information alone seems like it might uniquely identify a user.

rniwa commented 2 years ago

Sure. This is a trade off between that and user privacy and we chose to value user's privacy more than those concerns.

Well, broken functionality is worse than user "privacy".

That's just your opinion.

Quoting "privacy" because I haven't seen any real-world examples of the privacy risks that you are citing in all your responses. We have discussed this in the Editing WG meetings as well and we are yet to see an example of sites exposing user mailing address. Local file path that you referred to in your earlier response is a temporary folder exposed by Office native apps to support image pasting because converting into base64 encoded image has other issues related to perf, clipboard payload size restriction, delay rendering issue related to resources being available at a later point in time, which I think is orthogonal to this issue.

I've mentioned how to get this time and time again. You can literally launch Microsoft Office products on macOS and copy stuff and see what kind of markup is put into the pasteboard.

I'd emphasize my point once again: Different browser vendors have different priorities and as such, we're not interested in standardizing copy & paste sanitization behavior at this point since we clearly have differing opinion about what the proper level of protection for user's privacy or how much value should be put into the web & app compatibilities.

snianu commented 2 years ago

After some quick testing, this is a snippet of the raw markup saved to NSPasteboard after copying from Microsoft Word for Mac (version 16.53):

Yes, Word and other Office native apps do provide a local image urls, but that doesn't expose user's mailing address. I think we have a separate issue to address this which might alleviate your concern regarding the user's home directory and temp folder.

rniwa commented 2 years ago

I'm done explaining our position here. We are under no obligation to agree to your proposals. We're not gonna change our implementation, and that's the end of this discussion.

css-meeting-bot commented 2 years ago

The Web Editing Working Group just discussed Sanitize clipboard issue.

The full IRC log of that discussion

<Travis> Topic: Sanitize clipboard issue
<Travis> github: https://github.com/w3c/clipboard-apis/issues/150
<Travis> BoCupp: what do we need from the WG on this issue?
<Travis> snianu: We will be taking this discussion to the chromium security group (for chromium only).
<Travis> BoCupp: pastes are unpredictable...
<Travis> .. can't tell if a 3rd party messes with the clipboard.
<Travis> .. the question is whether there's some expected behavior from App to browser...
<Travis> whsieh: We were thinking about adding things to the spec to fix compatibility in the short term. But, if we keep this the same (left to user agent discretion), we are OK with that.
<Travis> BoCupp: We don't think that keeping the spec as-is will be harmful, etc. Propose we can close the issue now.
<Travis> ACTION: BoCupp to close this issue; keep whsieh appraised of the security review (as FYI).

BoCupp-Microsoft commented 2 years ago

Per the log of the last Web Editing WG meeting above, we agreed to leave the details of sanitization up to the UA, i.e. we won't attempt to standardize it at this time. Additionally, to leave room for Safari to implement cross-origin restrictions or whatever policy they would like to use to govern access to the pickled formats (which are also unsanitized) @snianu will land the PR for clipboard pickling without mentioning the usanitized option for ClipboardItems and remove any language that talks about when to sanitize or not or allow access to pickled data or not. I think adding a note saying that those particular behaviors are UA specific would be a good addition in the spec.

We don't think this will affect the developer experience. We do expect that depending on whatever sanitization level the UA chooses to apply will just change the fidelity of the markup available to a paste operation, but developers don't need to write different code and do need to deal with any shape of markup from the clipboard already.

Closing this issue. If you think there needs to be any further discussion please reactivate. Thanks!

annevk commented 2 years ago

Even if we don't define the details of sanitization, which seems like a mistake and something that needs revisiting in due course, I do think we should require the legacy and new API to behave in an equivalent manner. I think it's very surprising that they end up with different results.

BoCupp-Microsoft commented 2 years ago

@annevk I'm not clear on what you're proposing. If we aren't standardizing (at least for now) whether or not to sanitize or what sanitizing is then what are you proposing we should change? Do you just want us to write down that both APIs should be aligned in terms of whatever sanitization steps they choose to apply?

snianu commented 2 years ago

In addition to @BoCupp-Microsoft 's comment above, we also want to set some expectations here. We are NOT going to change anything in the legacy DataTransfer APIs as that would break our first party apps and legacy Win32 Office apps that are still being used in enterprises. If the expectation here is to align async clipboard APIs with the DataTransfer APIs, then I propose to open a new issue as from the comments above it is pretty clear that we are not going to come to an agreement. At least on Windows we want to provide the best developer and user experience. We are not going to compromise on functionality that would worsen the experience on our platform.

annevk commented 2 years ago

@BoCupp-Microsoft right.

@snianu how would moving things to a new issue help matters?

snianu commented 2 years ago

@snianu how would moving things to a new issue help matters?

It probably wouldn't and that is why we decided to close this issue. I'm not sure why you reopened it?

snianu commented 2 years ago

@annevk @mbrodesser Could you please clarify Mozilla's position on providing the sanitization behavior for async read/write methods for well-known formats? From @annevk's proposal here, it sounds like you agree that we shouldn't be sanitizing text/html during read/write as the DataTransfer APIs don't. I think we'd support that proposal as it aligns with our needs. If you agree with that, then I think we can add it to the clipboard API spec since there are two implementers that support this proposal. Thoughts?

w3c / clipboard-apis

Make async clipboard APIs (read/write) to sanitize interoperably with setData/getData for text/html #150