[Bug] Break Line (\n) added in replacement of HTML tags in quoted text

Steps to reproduce

I've highlighted an HTML text, and the raw version of the quoted text in the annotation contains incorrect characters substitution with apparently all intermediary HTML elements of the selection replaced by a \n character.

The selected text:

<p class="MsoNormal">The OPDS Page Streaming Extension (OPDS-PSE) is an <b><i>unofficial</i></b>
extension of the <a href="http://opds-spec.org/specs/opds-catalog-1-1-20110627/">Open
Distribution Publication System</a>. Its goal is to enrich the OPDS feed with
information allowing the client to request a specific page of a document
without having to download it completely.<br>
This extension was designed primarily for comic books, to allow reading them on
connected devices without having to wait for the book to be completely
downloaded.</p>

The raw version of the text returned by the API:

{
  "type": "TextQuoteSelector",
  "exact": "The OPDS Page Streaming Extension (OPDS-PSE) is an unofficial\nextension of the Open\nDistribution Publication System. Its goal is to enrich the OPDS feed with\ninformation allowing the client to request a specific page of a document\nwithout having to download it completely.\nThis extension was designed primarily for comic books, to allow reading them on\nconnected devices without having to wait for the book to be completely\ndownloaded.",
  "prefix": "-Share Alike.\n\nIntroduction\n\n \n\n",
  "suffix": "\n\nOPDS-PSE adds new attributes a"
 }

Sample: unofficial\nextension

The quoted text is then displayed this way in HTML, using the CSS property pre-line:white-space :

<blockquote style="pre-line:white-space">
The OPDS Page Streaming Extension (OPDS-PSE) is an unofficial
extension of the Open
Distribution Publication System. Its goal is to enrich the OPDS feed with
information allowing the client to request a specific page of a document
without having to download it completely.
This extension was designed primarily for comic books, to allow reading them on
connected devices without having to wait for the book to be completely
downloaded.
</blockquote>

It's a bug in the selection or in process of the selected text, isn't it ?

The new lines you mention really are present in the text content of the element. HTML tags are not being replaced by new lines, they just get omitted entirely. If you look at the textContent property of the  element you selected in the browser console, and you'll see the same new lines. Also if you select the text and run window.getSelection().getRangeAt(0).toString() in the browser console you'll see the same new lines.

In summary, this is working as it is currently expected to. What I think may have been surprising here is that the captured text is not the same as what would be copied to the clipboard. When copying to the clipboard, new lines in the source get replaced with spaces, and   tags get converted to new lines. Browser specifications distinguish the original text content of HTML "in the source" as returned by element.textContent from the text content "as rendered" returned by element.innerText. Hypothesis has always captured quotes from and searched for quotes in the "source" text content rather than the "rendered" text. This behavior causes issues with line breaks as well.

It might make sense for us to look at capturing the rendered text (as copied to the clipboard) rather than the source text in future. We'd need to be careful to handle all the places where this distinction comes up, and also make sure that all existing annotations anchor properly. Also we should talk to other parties interested in the Web Annotations specifications to discuss how this impacts interoperability.

I'm going to close this issue as "working as expected". If there is a practical problem that this behavior is creating, as opposed to just not being what you expected to see, please feel free to comment here and I can look at creating a separate issue or expanding https://github.com/hypothesis/client/issues/3692.

The new lines you mention really are present in the text content of the element. HTML tags are not being replaced by new lines, they just get omitted entirely. If you look at the textContent property of the  element you selected in the browser console, and you'll see the same new lines. Also if you select the text and run window.getSelection().getRangeAt(0).toString() in the browser console you'll see the same new lines.

Thank you very much for the thorough explanation.

I'm going to close this issue as "working as expected". If there is a practical problem that this behavior is creating, as opposed to just not being what you expected to see, please feel free to comment here and I can look at creating a separate issue or expanding https://github.com/hypothesis/client/issues/3692.

Alright.

From a user consuming data via the API, at first glance, there's nothing that could be really done to display the quote in a fashionable manner.

Apparently, \r\n\ are not captured, so it's not possible to selectively apply regex with them.

And the only practical solution is either to keep \n as is, with content unexpectedly broken in the display, or to replace all break lines with spaces.

That's a bit of a pity.

P.S : Talking about highlighting, I've quoted your answer this way:

<blockquote>What I think may have been surprising here is that the captured text is not the same as what would be copied to the clipboard. When <mark>copying to the clipboard, <strong><mark style="background-color: #8000314f">new lines in the source</mark> get <mark style="background-color:#00800030">replaced with spaces</mark></strong>, and <br> tags get converted to new lines</mark>.
</br>
<mark>Browser specifications distinguish <strong><mark style="background-color: #00800036">the original text content of HTML "in the source"</mark> as returned by <mark style="background-color: #00800036"/>element.textContent</mark></strong> from <strong><mark style="background-color: #ffa500a1">the text content "as rendered" returned by element.innerText.</mark></strong></mark> Hypothesis has always captured quotes from and searched for quotes in the "source" text content rather than the "rendered" text.</blockquote>

And it looks quite neat (well, apart from the choice of colors) displayed with highlights

Having come across the Open Annotation in EPUB spec, there's an example of text selection (highlights) with CSS style applied :

Appendix A - Examples : 11. Styling of Selection

{
  "@context": "http://www.idpf.org/epub/oa/1.0/context.json",
  "@id": "http://example.org/epub/annotations.json",
  "@type": "epub:AnnotationCollection",
  "annotations": [
    {
      "@id": "urn:uuid:E7E3799F-3CD5-4F69-87C6-5478B22873D6",
      "@type": "oa:Annotation",
      "styledBy": {
        "@type": "oa:CssStyle",
        "format": "text/css",
        "chars": ".red { border: 1px solid red; }"
      }, 
      "hasTarget": {
        "@type": "oa:SpecificResource",
        "hasSelector": {
          "@type": "oa:FragmentSelector",
          "value": "epubcfi(/6/4[chap01ref]!/4[body01]/10[para05]/3:10)"
        },
        "hasSource": {
          "@type": "dctypes:Text",
          "uniqueIdentifier": "isbn:123456789x",
          "originURL": "http://www.example.com/publisher/book/",
          "dc:identifier": "urn:uuid:A1B0D67E-2E81-4DF5-9E67-A64CBE366809",
          "dcterms:modified": "2011-01-01T12:00:00Z"
        },
        "styleClass": "red"
      },
      "hasBody": {
        "@type": "dctypes:Text",
        "format": "application/xhtml+xml",
        "chars": "<div xml:lang='en' xmlns='http://www.w3.org/1999/xhtml'>I love this part of the text</div>",
        "language": "en"
      },
      "motivatedBy": "oa:commenting"
    }
  ]
}

That would be awesome to be able to highlight a snippet of text with multiple . There could be some kind of highlighter button which would capture text selections and let us apply specific colors per highlights. IMHO, there's lot of room of improvement for highlights and their display. A kind of subliminal feature request. :smiley:

Cheers !

From a user consuming data via the API, at first glance, there's nothing that could be really done to display the quote in a fashionable manner.

By "fashionable" do you mean "preserving line breaks" (as created by   elements)?

By "fashionable" do you mean "preserving line breaks" (as created by   elements)?

Yes, like in the source page instead of:

The OPDS Page Streaming Extension (OPDS-PSE) is an unofficial
extension of the Open
Distribution Publication System. Its goal is to enrich the OPDS feed with
information allowing the client to request a specific page of a document
without having to download it completely.
This extension was designed primarily for comic books, to allow reading them on
connected devices without having to wait for the book to be completely
downloaded.

It's ugly but it's the best I can get, otherwise the text is wrapped and it becomes difficult to distinguish content, specially paragraphs.

Even the highlight of your answer lacks the   (using react-markdown-editor):

Screenshot 2022-05-25 at 14-17-24@kael

N.B: I'm using CSS white-space:pre-line otherwise all text is wrapped and it looks like this:

Screenshot 2022-05-25 at 14-31-42 @kael

Is there some regex that would mitigate this ?

It's ugly but it's the best I can get, otherwise the text is wrapped and it becomes difficult to distinguish content, specially paragraphs.

I understand. Fundamentally you want the text "as rendered" or "as copied" rather than "as in the source". This could be seen as an addition to https://github.com/hypothesis/client/issues/3692, as it would involve capturing paragraphs in addition to   tags as new lines. In the meantime, the only solution I can see is to fetch the original HTML from the annotated URL and find the quote in the source, so that you can reconstruct "rich" aspects of the quoted text (formatting, new lines, paragraph breaks etc.). The client's code in src/annotator/anchoring may help with this. I'm afraid we don't have it available as a library.

In the meantime, the only solution I can see is to fetch the original HTML from the annotated URL and find the quote in the source, so that you can reconstruct "rich" aspects of the quoted text (formatting, new lines, paragraph breaks etc.).

I was actually thinking of fetching the content of pages for "indirect content annotation" like annotating a RSS feed entry with the selection extracted on the "remote" original source based on the selection of the "local" RSS entry content. So it might make sense to use it also for matches of highlights selections, but a bit expensive perhaps for each annotation - need to assess it.

The client's code in src/annotator/anchoring may help with this. I'm afraid we don't have it available as a library.

Thank you very much. You gave me lots of very useful infos. :+1: And I keep diving into the code time to time. :smiley:

Cheers

I've noticed a difference for the same highlighted content, with one version containing a - random ? - \n\, with different RangeSelector and TextPositionSelector, and almost the same TextQuoteSelector (the highlighted content is an HN post from yesterday):

You can see the difference in the TextQuoteSelector exact value.

Highlighted content with \n:

"target": [
{
  "source": "https://news.ycombinator.com/item?id=33504467",
  "selector": [
    {
      "type": "RangeSelector",
      "endOffset": 222,
      "startOffset": 64,
      "endContainer": "/center[1]/table[1]/tbody[1]/tr[3]/td[1]/table[1]/tbody[1]/tr[1]/td[3]/div[2]/span[1]",
      "startContainer": "/center[1]/table[1]/tbody[1]/tr[3]/td[1]/table[1]/tbody[1]/tr[1]/td[3]/div[2]/span[1]"
    },
    {
      "end": 643,
      "type": "TextPositionSelector",
      "start": 485
    },
    {
      "type": "TextQuoteSelector",
      "exact": "It seems to me that they hide behind\n progressive social stances and say \"see we're nice guys!\" when they're \njust as rude and indignant than any other group.",
      "prefix": "ile community most of the time. ",
      "suffix": " I think it goes \nhand-in-hand w"
    }
  ]
}
],

Highlighted content without \n - a test I've just run using Via with Chrome 107 (can't use the bookmarklet with HN):

"target": [
{
  "source": "https://news.ycombinator.com/item?id=33504467",
  "selector": [
    {
      "type": "RangeSelector",
      "endOffset": 156,
      "startOffset": 0,
      "endContainer": "/center[1]/table[1]/tbody[1]/tr[3]/td[1]/table[1]/tbody[1]/tr[1]/td[3]/div[2]/span[1]/hypothesis-highlight[1]/hypothesis-highlight[1]/hypothesis-highlight[1]",
      "startContainer": "/center[1]/table[1]/tbody[1]/tr[3]/td[1]/table[1]/tbody[1]/tr[1]/td[3]/div[2]/span[1]/hypothesis-highlight[1]/hypothesis-highlight[1]/hypothesis-highlight[1]"
    },
    {
      "end": 638,
      "type": "TextPositionSelector",
      "start": 482
    },
    {
      "type": "TextQuoteSelector",
      "exact": "It seems to me that they hide behind progressive social stances and say \"see we're nice guys!\" when they're just as rude and indignant than any other group.",
      "prefix": "ile community most of the time. ",
      "suffix": " I think it goes hand-in-hand wi"
    }
  ]
}
],

BTW, the RangeSelector paths contain hypothesis-highlight elements, is this normal ?

I've noticed a difference for the same highlighted content, with one version containing a - random ? - \n\, with different RangeSelector and TextPositionSelector, and almost the same TextQuoteSelector (the highlighted content is an HN post from yesterday):

Following up, the original highlighted content is not processed the same way by Chrome and Firefox, though.

These are excerpts from the HTML pages of the HN post saved on disk:

Chrome:

<div class="comment">
              <span class="commtext c00">I maintain Rust has a very hostile community most of the time. It seems to me that they hide behind progressive social stances and say "see we're nice guys!" when they're just as rude and indignant than any other group. I think it goes hand-in-hand with the intellectual-superiority-claims that <i>other</i> code is unsafe, but <i>our</i> code is safe because we're Rust-people.<p>It is a real shame, because the language itself is pretty cool.
          </p><div class="reply"></div></span></div>

Firefox:

<div class="comment">
              <span class="commtext c00">I maintain Rust has a very 
hostile community most of the time. It seems to me that they hide behind
progressive social stances and say "see we're nice guys!" when they're 
just as rude and indignant than any other group. I think it goes 
hand-in-hand with the intellectual-superiority-claims that <i>other</i> code is unsafe, but <i>our</i> code is safe because we're Rust-people.<p>It is a real shame, because the language itself is pretty cool.
          </p><div class="reply"></div></span></div>

There's also an ancient ticket opened on Bugzilla about line breaks not being stripped by Firefox on copy/paste, apparently specifically when the content is served as application/xhtml+xml.

Firefox selection would be buggy ?

hypothesis / client

[Bug] Break Line (\n) added in replacement of HTML tags in quoted text #4532

Steps to reproduce

Appendix A - Examples : 11. Styling of Selection