projectEndings / staticSearch

A codebase to support a pure JSON search engine requiring no backend for any XHTML5 document collection
https://endings.uvic.ca/staticSearch/docs/index.html
Mozilla Public License 2.0

citations are indexed invisibly — `<cite>` is dropped from context field #305

Open sydb opened 1 month ago

sydb commented 1 month ago

[This may be a bug. At least, I do not think it is the result of a mistake I have made, but I have been wrong about that before. :-]

The content of <html:cite> is dropped from the context created for each search term.

To reproduce:

  1. Download & expand 1.4.7.
  2. Add file cite_me_not.html to the test/ directory (file can be found in the appendix of this post).
  3. Issue ant.
  4. Issue fgrep -h 'situational' test/ssTest/stems/* (or otherwise look at the results), and you will notice that the word “citation” does not occur in the "context": field of the output (it should be in the space before the comma).
  5. Issue cat test/ssTest/stems/citat*, and notice that the word “citation” has no context around it.
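
For anyone who would rather not eyeball the fgrep output, here is a minimal Python sketch of the check in steps 4 and 5. It assumes only what is stated above: that the files under test/ssTest/stems/ are JSON and that the snippets sit under a key named "context". The function names are hypothetical, and the walk is deliberately generic because the exact layout of the stem files is not shown in this thread.

```python
import json
from pathlib import Path


def iter_contexts(node):
    """Yield every string stored under a "context" key, at any depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "context" and isinstance(value, str):
                yield value
            else:
                yield from iter_contexts(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_contexts(item)


def contexts_containing(stems_dir, word):
    """Return (file name, context) pairs whose context mentions `word`."""
    root = Path(stems_dir)
    if not root.is_dir():
        return []
    hits = []
    for path in sorted(p for p in root.glob("*") if p.is_file()):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except ValueError:
            continue  # not JSON (or not UTF-8); skip it
        hits.extend((path.name, ctx) for ctx in iter_contexts(data) if word in ctx)
    return hits


if __name__ == "__main__":
    # Path taken from step 4 above; adjust if your build output lives elsewhere.
    for name, ctx in contexts_containing("test/ssTest/stems", "situational"):
        print(f"{name}: {ctx}")
```

If the behavior described in this report is present, the printed contexts should contain “situational” followed by a gap where “citation” would be expected.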

Appendix — cite_me_not.html

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
   <head>
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
      <title>Theft 000382</title>
      <meta name="article type" class="staticSearch_desc" content="test" />
      <meta name="date of publication" class="staticSearch_date" content="2024-05-20" />
      <meta name="volume" class="staticSearch_num" content="18" />
      <meta name="issue" class="staticSearch_num" content="5" />
      <meta name="docTitle" class="staticSearch_docTitle" content="Theft 000382" />
      <script type="text/javascript" src="../../../uvepss/ssHighlight.js"></script>
   </head>
   <body>
     <div id="mainContent">
       This is a division with one firm
       <a href="https://bauman.zapto.org/~syd/temp/pics/some_nice_shots_with_50_mm/index.html">anchor</a>,
       one situational <cite>citation</cite>, one empirically
       <em>emphatic</em> phrase, and <span>22.86 cm</span> worth
       of nonsense.
       <p>
         This is a paragraph with a second firm
         <a href="https://bauman.zapto.org/~syd/temp/pics/2024-04-11_car_fire_4_press/index.html">anchor</a>,
         another situational <cite>citation</cite>, an even more empirically
         <em>emphatic</em> phrase, and <span>⅛ fathom</span> worth
         of nonsense.
       </p>
     </div>
   </body>
</html>

joeytakeda commented 1 month ago

Hi @sydb, the test config file (configTest.xml) defines <cite> as its own context, so its content is excluded from the flow of the surrounding context. That should yield exactly the output you are seeing.

Does your configuration file also have //context[@match='cite'] (e.g. line 42 of the configTest file)?

https://github.com/projectEndings/staticSearch/blob/bdb07a858974079c3ecf866cdf5c32e0d2e94047/configTest.xml#L42

If so, does removing that resolve the issue?
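
To illustrate the effect described above, here is a toy Python model. It is not the staticSearch implementation (which is XSLT); it only assumes that an element declared as its own context is left out of the text gathered for its ancestor. The function names and the sample string are hypothetical, with the sample loosely based on the appendix file.

```python
import xml.etree.ElementTree as ET


def context_text(elem, context_names):
    """Collect elem's text, skipping descendants that are their own context."""
    parts = [elem.text or ""]
    for child in elem:
        local = child.tag.rsplit("}", 1)[-1]  # drop the XHTML namespace
        if local not in context_names:
            parts.append(context_text(child, context_names))
        # The tail text belongs to the parent, so it is kept either way.
        parts.append(child.tail or "")
    return "".join(parts)


def normalized(elem, context_names):
    """Same as context_text, with runs of whitespace collapsed."""
    return " ".join(context_text(elem, context_names).split())


SAMPLE = (
    '<div xmlns="http://www.w3.org/1999/xhtml">one situational '
    "<cite>citation</cite>, one empirically <em>emphatic</em> phrase</div>"
)

div = ET.fromstring(SAMPLE)
print(normalized(div, {"div", "p"}))
# one situational citation, one empirically emphatic phrase
print(normalized(div, {"div", "p", "cite"}))
# one situational , one empirically emphatic phrase
```

The second line of output shows the same gap before the comma that the original report describes.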

sydb commented 1 month ago

Thank you for the prompt reply, @joeytakeda. So (of course) you are right, that line 42 of configTest.xml causes the <cite> behavior above, and commenting it out “fixes” it. But what is mysterious (to me, at least) is that the config.xml I was using when I first encountered this does not have any <ss:context> that matches <html:cite>, at least not directly. (The only line that matches the string “cite” is <context label="works cited" match="div[ @id eq 'worksCited']"/>.) I will have to poke around a bit to see if there is any other context my cite elements might be matching. staticSearch only reads the one config file, right? (It doesn’t also read configTest.xml or something, does it? Are there any built-in default contexts?)
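
One way to do that poking around is to list every context rule in the config rather than grepping for “cite”. The following is a rough Python sketch: the function name is hypothetical, the wrapper elements in the example are an assumption, and only the “works cited” line is quoted from the real config.

```python
import xml.etree.ElementTree as ET


def context_rules(config_xml):
    """Return (match, label) for every element whose local name is "context"."""
    root = ET.fromstring(config_xml)
    rules = []
    for elem in root.iter():
        # Compare local names so the check works with or without a namespace.
        if isinstance(elem.tag, str) and elem.tag.rsplit("}", 1)[-1] == "context":
            rules.append((elem.get("match"), elem.get("label")))
    return rules


# Hypothetical fragment: the <config> and <contexts> wrappers are assumed.
EXAMPLE = """
<config>
  <contexts>
    <context label="works cited" match="div[ @id eq 'worksCited']"/>
    <context match="cite"/>
  </contexts>
</config>
"""

for match, label in context_rules(EXAMPLE):
    print(match, label)
```

For a real file, the string can be read with Path("config.xml").read_text(encoding="utf-8") from pathlib. Any match expression that could select a cite element (for example a bare * or a union that includes cite) would be worth a closer look.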

martindholmes commented 1 month ago

@sydb staticSearch only uses one config file, so if it's behaving as though it's reading the test config instead of your config, then it must be doing that, for some reason.

martindholmes commented 1 month ago

@sydb A bit more info on this:

https://endings.uvic.ca/staticSearch/docs/howDoIUseIt.html#specifyingContexts

There are default contexts built into the indexing process, based on the most common HTML block elements, which are listed there. But the list in the documentation is not actually complete, so I'm going to update it; the complete list, as it appears in xsl/tokenize.xsl right now, is this:

body | div | blockquote | p | li | section | article | nav | h1 | h2 | h3 | h4 | h5 | h6 | td | details | summary | table/caption

But it definitely doesn't include cite.