mtmse / pipeline

Super-project that aggregates all Pipeline related code, provides a common tracker for Pipeline related issues and holds the Pipeline website
http://daisy.github.io/pipeline

Sentence detection problems around spans #12

Open martinpub opened 3 years ago

martinpub commented 3 years ago

Using the sentence detection, the following input:

<a href="DTB38216-007-preface.xhtml#p-1">
  <span class="lic">
    <strong>Preface to the First Edition</strong>
  </span> <span class="lic">
    <strong>xi</strong>
  </span>
</a>

Resulted in the following output:

          <a href="DTB38216-007-preface.xhtml#p-1">
            <span class="lic">
              </span><span id="st5-3" class="sentence"><span class="lic"><strong>Preface to the First Edition</strong>
            </span> <span class="lic">
              <strong>xi</strong></span></span><span class="lic">
            </span>
          </a>

This creates empty spans, which make the output confusing for talking book production software to parse.

A related question is, do phrases that are identified as sentences but already isolated in a containing element really need extra sentence span elements?

For instance, in <div aria-label="v" role="doc-pagebreak" epub:type="pagebreak" title="v" id="page-v" class="page-front"><span id="st5-2" class="sentence">v</span></div>, or in headings, it is not necessarily needed to add specific markup for syncing. Perhaps the detector could add an ID to the parent element if it is missing, but nothing else?

Let me know what you think @bertfrees. Not sure how the sentence detection is constructed, if e.g. TTS needs all these sentence spans even if they are somewhat redundant, markup-wise.

bertfrees commented 3 years ago

Hi Martin. I've occasionally seen white space only spans too. I guess the sentence detection should ideally aim to split up as few elements as possible. One might even wish to avoid splitting certain elements altogether.

I don't know if this is an easy fix. I had to deal with a similar problem when generating section elements for implied sections in an HTML document, and that wasn't so easy.

Actually, in this case I would say the best result would be to limit the scope of sentences to the boundaries of lic elements. Maybe you should consider using another element than span for lic? A div?

To answer your second question: No, sentences do not really need to be wrapped inside a span if there is only one sentence within the containing element. It's just a side effect of the current implementation. This one is probably easier to fix.
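As a rough illustration of that second point (not the Pipeline implementation), the following Python sketch drops a sentence span when it is the only content of its parent and, as suggested above, moves the ID to the parent if the parent has none. The function name `unwrap_sole_sentence` is hypothetical, and namespaces are ignored for brevity, so real XHTML would need namespace-aware tag checks.

```python
import xml.etree.ElementTree as ET


def unwrap_sole_sentence(parent):
    """Drop a sentence span that is the only content of `parent`.

    Returns True if the tree was changed. Hypothetical helper: it only
    illustrates the idea and ignores XML namespaces.
    """
    children = list(parent)
    if len(children) != 1:
        return False
    span = children[0]
    if span.tag != "span" or span.get("class") != "sentence":
        return False
    # Any real text outside the span means the span is not redundant.
    if (parent.text or "").strip() or (span.tail or "").strip():
        return False

    # Keep a synchronisation target: reuse the span's ID only if the
    # parent does not already have one.
    if parent.get("id") is None and span.get("id") is not None:
        parent.set("id", span.get("id"))

    # Hoist the span's text and children into the parent.
    tail = span.tail or ""
    parent.text = (parent.text or "") + (span.text or "")
    parent.remove(span)
    for child in list(span):
        parent.append(child)

    # Preserve the (whitespace-only) text that followed the span.
    if tail:
        if len(parent):
            parent[-1].tail = (parent[-1].tail or "") + tail
        else:
            parent.text += tail
    return True


if __name__ == "__main__":
    div = ET.fromstring(
        '<div id="page-v" class="page-front">'
        '<span id="st5-2" class="sentence">v</span></div>'
    )
    unwrap_sole_sentence(div)
    print(ET.tostring(div, encoding="unicode"))
```

In the page break example the div already has an ID, so the span's ID is simply discarded; whether that is acceptable depends on whether anything else references the sentence ID.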

martinpub commented 3 years ago

Ideally, this issue involves enhancements to avoid sentence span redundancy/stacking in addition to the more specific example mentioned.

kalaspuffar commented 3 years ago

Hi @martinpub

What is the blocking part of this issue for your production? Does the production software fail on empty tags, so we need to remove these? And if so, could the software crash or work badly with other tags, so we need to have a cleanup step for other empty tags too?

Best regards Daniel

martinpub commented 3 years ago

Hi @kalaspuffar. Currently, it seems the production software (at least Hindenburg, which is one of the two production tools in use) can handle this, but I think the developer had to make extra adjustments to support the output in the example above.

Splitting the synchronisation into separate lic elements is not supported, however. This is partly because the link element spans (hehe) all of the contents of the list item in this case.

<li>
  <span class="lic">
    <span id="st6-3" class="sentence"><strong>Preface to the First Edition</strong></span>
  </span>
  <span class="lic">
    <span id="st6-blahonga" class="sentence"><strong>xi</strong></span>
  </span>
</li>

works well, and I think

<li>
<a href="DTB38216-007-preface.xhtml#p-1">
  <span class="lic">
    <span id="st6-3" class="sentence"><strong>Preface to the First Edition</strong></span>
  </span>
  <span class="lic">
    <span id="st6-blahonga" class="sentence"><strong>xi</strong></span>
  </span>
</a>
</li>

would also be appropriate, even if the parent a element currently makes it impossible to synchronise at lic level in this case.

I can understand that the appropriate logic to apply is hard to pin down. Perhaps we should have more examples?

To sum up more precisely: the sentence detection does not currently break synchronisation/production, at least not in Hindenburg. I'm not sure about the other system. However, the current output is invalid according to the Nordic Guidelines ([nordic251]/[nordic09]), so that is an issue as well.

bertfrees commented 3 years ago

It is going to be complicated to find a general solution for this problem. My advice is to go for a simpler yet effective solution for the time being, namely what I suggested earlier: to limit the scope of sentences to the boundaries of certain elements. For the "lic" spans this absolutely makes sense. I would rename them to div; that is by far the simplest option. If you cannot rename them, we have to find another way.

martinpub commented 3 years ago

Thanks for your input @bertfrees. As long as the Nordic Guidelines require span class=lic, we need to support them.

I.e. we need Pipeline to work with the spans currently, even though we can strive to make the markup better in future versions of the Nordic Guidelines. div would indeed be more appropriate, since it is a block element, so I understand your concern @bertfrees.

martinpub commented 3 years ago

Hi @kalaspuffar, and thanks for your work on this one.

I have tested this and run into empty sentences being added, like this (sentence st5-4 is not a sentence):

          <a href="DTB38216-007-preface.xhtml#p-1">
            <span class="lic"><span id="st5-3" class="sentence"><strong>Preface to the First Edition</strong>
            </span></span><span id="st5-4" class="sentence"> </span><span class="lic"><span id="st5-5" class="sentence">
              <strong>xi</strong></span></span>
          </a>

The source was this:

<li>
  <a href="DTB38216-007-preface.xhtml#p-1">
    <span class="lic">
      <strong>Preface to the First Edition</strong>
    </span> <span class="lic">
      <strong>xi</strong>
    </span>
  </a>
</li>

The whitespace before the second lic should be preserved, but it would be good to get rid of it being detected as a sentence.

Other than that, I think this works. However, I remember spotting some empty strong elements last week, but I cannot find them now. So until we have been able to reproduce this, please disregard that comment.

martinpub commented 3 years ago

(By the way, from a production perspective, the above forces the narrator to always have to skip an empty synchronisation point while recording.)
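For reference, a whitespace-only sentence such as st5-4 above could be removed in a post-processing pass while keeping the whitespace itself. The Python sketch below is only a possible stopgap outside Pipeline, not how the sentence detection works; `unwrap_blank_sentences` is a hypothetical name, and namespaces are ignored for brevity.

```python
import xml.etree.ElementTree as ET


def unwrap_blank_sentences(root):
    """Unwrap sentence spans that contain only whitespace.

    The whitespace is kept in place; only the span element (and with it
    the empty synchronisation point) is removed. Returns the number of
    spans removed. Hypothetical helper, ignores XML namespaces.
    """
    removed = 0
    for parent in list(root.iter()):
        previous = None  # last sibling that was kept
        for child in list(parent):
            is_blank_sentence = (
                child.tag == "span"
                and child.get("class") == "sentence"
                and len(child) == 0
                and not (child.text or "").strip()
            )
            if not is_blank_sentence:
                previous = child
                continue
            # Re-attach the span's whitespace and its tail to whatever
            # precedes it, so the serialised text stays the same.
            kept = (child.text or "") + (child.tail or "")
            if previous is None:
                parent.text = (parent.text or "") + kept
            else:
                previous.tail = (previous.tail or "") + kept
            parent.remove(child)
            removed += 1
    return removed


if __name__ == "__main__":
    link = ET.fromstring(
        '<a href="DTB38216-007-preface.xhtml#p-1">'
        '<span class="lic"><span id="st5-3" class="sentence">'
        "<strong>Preface to the First Edition</strong></span></span>"
        '<span id="st5-4" class="sentence"> </span>'
        '<span class="lic"><span id="st5-5" class="sentence">'
        "<strong>xi</strong></span></span></a>"
    )
    unwrap_blank_sentences(link)
    print(ET.tostring(link, encoding="unicode"))
```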

bertfrees commented 3 years ago

I think you'll get better results if instead of doing

cannot-be-sentence-child="html:span[@class='lic']"

you exclude span[@class='lic'] from inline-tags. This will limit the scope of sentences right from the start so that no cleanup step is required afterwards. (The latter happens in the "px:reshape" step and that code is clearly not so great.)
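To illustrate why this limits the scope, here is a toy Python model of the idea. It is not the Pipeline code and does not use its options; `sentence_scopes` and the two predicates are hypothetical names. Text is collected across inline elements, and any element that is not considered inline closes the current scope, so a sentence can never straddle it.

```python
import xml.etree.ElementTree as ET


def sentence_scopes(element, is_inline):
    """Return the text runs within which sentences may be detected.

    Elements for which `is_inline` is false act as scope boundaries.
    Toy model only; the real sentence detection is more involved.
    """
    scopes = []
    current = []

    def flush():
        text = " ".join("".join(current).split())  # normalise whitespace
        if text:
            scopes.append(text)
        current.clear()

    def walk(node):
        current.append(node.text or "")
        for child in node:
            if is_inline(child):
                walk(child)
            else:
                flush()  # close the scope before the boundary element
                walk(child)
                flush()  # and again after it
            current.append(child.tail or "")

    walk(element)
    flush()
    return scopes


def inline_including_lic(el):
    # Every span, strong and a element counts as inline.
    return el.tag in {"a", "span", "strong"}


def inline_excluding_lic(el):
    # Same, but span[@class='lic'] is treated as a boundary.
    return inline_including_lic(el) and el.get("class") != "lic"


if __name__ == "__main__":
    link = ET.fromstring(
        '<a href="DTB38216-007-preface.xhtml#p-1">\n'
        '  <span class="lic">\n'
        "    <strong>Preface to the First Edition</strong>\n"
        '  </span> <span class="lic">\n'
        "    <strong>xi</strong>\n"
        "  </span>\n"
        "</a>"
    )
    print(sentence_scopes(link, inline_including_lic))
    print(sentence_scopes(link, inline_excluding_lic))
```

With lic treated as inline, the whole link is one scope, which matches the single sentence wrapped around both lic spans in the first example. With lic excluded, each lic is its own scope and the whitespace between them belongs to no scope at all.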