ufal / media-irozhlas


features to encode in TEI #1

Closed matyaskopp closed 2 years ago

matyaskopp commented 3 years ago

When multiple sentences occur inside a hi, br, or cell element, the udpipe tokenization keeps only the first sentence inside the element and places the remaining sentences outside it.

matyaskopp commented 3 years ago

Rough list of HTML elements in the source, with occurrence counts:

      1 cente
      1 centert
      1 centrer
      1 cetnter
      1 cufon
      1 cufontext
      1 e
      1 embed
      1 nav
      1 s
      1 slot
      1 tale
      1 title
      1 twitterwidget
      1 wbr
      2 cenetr
      2 centr
      2 fieldset
      2 font
      2 head
      2 highl
      2 left
      2 n
      2 wide
      2 x
      4 colgroup
      5 header
      7 body
      8 hr
      8 pre
     10 centre
     10 polyline
     10 rect
     11 html
     12 sub
     18 picture
     19 sup
     23 select
     24 col
     30 line
     36 audio
     37 u
     39 aside
     44 textarea
     74 source
     80 small
    143 ol
    360 section
    375 form
    388 button
    464 i
    481 h4
    554 path
    599 meta
    749 g
    751 label
    783 input
    823 h1
    828 article
    853 option
    920 time
   1037 caption
   1046 thead
   1183 tbody
   1228 table
   1348 cite
   1577 b
   1960 h3
   2352 ul
   2631 style
   3187 link
   3748 iframe
   3994 <o:p>
   8023 th
   9346 figcaption
   9386 figure
  10940 center
  12839 li
  12856 em
  15291 blockquote
  16689 tr
  21839 script
  50529 h2
  57007 br
  60218 noscript
  67212 strong
  71197 img
  78660 use
  78902 svg
 105654 td
 169563 a
 251652 span
 328426 div
 732865 p
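
The thread does not say how the counts above were produced. As a minimal sketch, assuming the source pages are available as HTML strings, a tally like this can be reproduced with Python's standard `html.parser`. The names `TagCounter`, `count_tags`, and `format_counts` are hypothetical and not part of this repository.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Tally every opening tag seen while parsing (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names; misspelled tags such as
        # "centre" are counted under whatever name appears in the markup.
        self.counts[tag] += 1


def count_tags(html_text):
    """Return a Counter mapping tag name to number of opening tags."""
    parser = TagCounter()
    parser.feed(html_text)
    parser.close()
    return parser.counts


def format_counts(counts):
    """Format counts in ascending order, right-aligned like the list above."""
    ordered = sorted(counts.items(), key=lambda item: (item[1], item[0]))
    return "\n".join(f"{n:7d} {tag}" for tag, n in ordered)


if __name__ == "__main__":
    sample = "<div><p>a<br>b</p><p>c</p><centre>x</centre></div>"
    print(format_counts(count_tags(sample)))
```

To cover a whole corpus, the per-document `Counter` objects can be summed before formatting. Only opening tags are counted, so an element is tallied once whether or not it has a closing tag.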

matyaskopp commented 2 years ago

invalid -> not encoding HTML features - we don't need them