Closed matyaskopp closed 2 years ago
When multiple sentences occur within a hi, br, or cell element, the first sentence is included in the element but the rest end up outside it in the UDPipe tokenization:
Rough list of HTML elements in the source:
1 cente
1 centert
1 centrer
1 cetnter
1 cufon
1 cufontext
1 e
1 embed
1 nav
1 s
1 slot
1 tale
1 title
1 twitterwidget
1 wbr
2 cenetr
2 centr
2 fieldset
2 font
2 head
2 highl
2 left
2 n
2 wide
2 x
4 colgroup
5 header
7 body
8 hr
8 pre
10 centre
10 polyline
10 rect
11 html
12 sub
18 picture
19 sup
23 select
24 col
30 line
36 audio
37 u
39 aside
44 textarea
74 source
80 small
143 ol
360 section
375 form
388 button
464 i
481 h4
554 path
599 meta
749 g
751 label
783 input
823 h1
828 article
853 option
920 time
1037 caption
1046 thead
1183 tbody
1228 table
1348 cite
1577 b
1960 h3
2352 ul
2631 style
3187 link
3748 iframe
3994 <o:p>
8023 th
9346 figcaption
9386 figure
10940 center
12839 li
12856 em
15291 blockquote
16689 tr
21839 script
50529 h2
57007 br
60218 noscript
67212 strong
71197 img
78660 use
78902 svg
105654 td
169563 a
251652 span
328426 div
732865 p
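A tally like the one above can be reproduced with a few lines of Python. This is only a minimal sketch, not the script actually used for this issue: it assumes the source HTML is available as a string, and the names `TagCounter`, `count_elements` and `format_counts` are made up for illustration. It uses only the standard-library `html.parser` module, which is lenient enough to report misspelled tags such as `cetnter` instead of rejecting them.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts every opening tag seen in the markup fed to the parser.

    Hypothetical helper for illustration; not part of the project.
    """

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case, so <BR> and <br>
        # end up in the same bucket. Self-closing tags such as <br/>
        # are routed here as well by the default handle_startendtag.
        self.counts[tag] += 1


def count_elements(html_text):
    """Return a Counter mapping tag name -> number of opening tags."""
    parser = TagCounter()
    parser.feed(html_text)
    parser.close()
    return parser.counts


def format_counts(counts):
    """Format as 'COUNT TAG' lines, ascending by count, then by name."""
    ordered = sorted(counts.items(), key=lambda item: (item[1], item[0]))
    return [f"{n} {tag}" for tag, n in ordered]


if __name__ == "__main__":
    sample = (
        "<div><p>One. <strong>Two.</strong><br>Three.</p>"
        "<p>Four<cetnter>x</cetnter></p></div>"
    )
    for line in format_counts(count_elements(sample)):
        print(line)
```

Sorting by count first and tag name second puts the rare, mostly misspelled elements at the top, which makes it easy to spot the ones that need no dedicated encoding.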
invalid elements -> do not encode them as HTML features; we don't need them
<strong> -> <hi rend="bold">
<br> -> <lb/>[SPACE]
<ul>...