inidun / text_analytics

Text analytic tools

Word trends notebook fails to execute after corpus update #30

Open roger-mahler opened 2 years ago

roger-mahler commented 2 years ago

The word trends module fails to load the correct document index. The root cause is a conflict between the settings in the corpus config files (SSI and SSI-non-regional) and the corpus folder set by the notebook itself.

roger-mahler commented 2 years ago

A temporary fix is to replace the content of the first notebook cell with the code below. Note that this is only needed when you want to create a new DTM in the "...OR COMPUTE A NEW DOCUMENT-TERM MATRIX" tab. The "LOAD AN EXISTING DOCUMENT-TERM MATRIX" tab (i.e. the word trends viewer) is not affected by this bug.

Steps to create a new DTM:

  1. Replace the first code cell with the code listed below.
  2. Select the corpus you want to use with the NON_REGIONAL flag on the first line of code: change the first line to NON_REGIONAL = True for the non-regional corpus or NON_REGIONAL = False for the regional corpus.
  3. Run the cell.
  4. Do not change the corpus when creating the DTM, i.e. leave "Source corpus file" as is. The code below selects the correct corpus based on the value of the NON_REGIONAL flag.
  5. Create the DTM as before.

NON_REGIONAL = False

from bokeh.plotting import output_notebook
from IPython.display import display
from penelope.notebook.word_trends import main_gui
import ipywidgets as w

import __paths__  # pylint: disable=unused-import

output_notebook()

data_folder: str = "/data/inidun/shared"

if NON_REGIONAL:
    corpus_folder: str = "/data/inidun/shared/corpus/ssi_nonregional"
    config_tag: str = "SSI-nonregional"
else:
    corpus_folder: str = "/data/inidun/shared/corpus/ssi_regional"
    config_tag: str = "SSI"

gui = main_gui.create_to_dtm_gui(
    corpus_folder=corpus_folder,
    data_folder=data_folder,
    corpus_config=config_tag,
    resources_folder=__paths__.resources_folder,
)
display(gui)
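
If the cell above still picks up the wrong document index, it may help to confirm that the flag maps to the folder and config tag you expect, and that the folder actually exists on the server. The snippet below is only a sketch of such a check: `select_corpus` and `check_corpus_folder` are hypothetical helper names that are not part of penelope or the INIDUN notebooks, and the paths are simply those used in the cell above.

```python
from pathlib import Path, PurePosixPath
from typing import Tuple

# Assumption: same shared data folder as in the workaround cell.
DATA_FOLDER = "/data/inidun/shared"


def select_corpus(non_regional: bool, data_folder: str = DATA_FOLDER) -> Tuple[str, str]:
    """Hypothetical helper: return (corpus_folder, config_tag) for the flag value."""
    if non_regional:
        name, tag = "ssi_nonregional", "SSI-nonregional"
    else:
        name, tag = "ssi_regional", "SSI"
    # Build the path with POSIX separators, matching the server layout.
    return str(PurePosixPath(data_folder) / "corpus" / name), tag


def check_corpus_folder(corpus_folder: str) -> bool:
    """Hypothetical helper: True if the corpus folder exists on this machine."""
    return Path(corpus_folder).is_dir()


if __name__ == "__main__":
    folder, tag = select_corpus(non_regional=False)
    print(f"corpus_folder={folder} config_tag={tag}")
    if not check_corpus_folder(folder):
        print("Warning: corpus folder not found, check the path before creating the DTM.")
```

Running this in a separate cell before the workaround cell only prints the selected values and a warning if the folder is missing; it does not change how the GUI is created.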

This bug will be fixed in the next release of INIDUN notebooks.