kmpoon / hlta

Provides functions for hierarchical latent tree analysis on text data for hierarchical topic detection
GNU General Public License v3.0

(subroute1) text Convert fails due to small-scale input text #16

Closed. gitathrun closed this issue 4 years ago

gitathrun commented 4 years ago

I encountered this error when I tried to apply the command on a directory containing only a few .txt files with little content.

java -cp HLTA.jar:HLTA-deps.jar tm.hlta.HTD ./mydir testoutput

The error:

[main] INFO tm.hlta.HTD$ - Convert raw text/pdf to .sparse.txt format
[main] INFO tm.text.Convert$ - Reading documents
[main] INFO tm.text.DataConverter$ - Using the following word selector. Select tokens by TF-IDF. Min characters: 3, minDfFraction: 0.0, maxDfFraction: 0.25.
[main] INFO tm.text.DataConverter$ - Extracting words
[main] INFO tm.text.DataConverter$ - Counting n-grams (after 0 concatentations) in each document
[main] INFO tm.text.DataConverter$ - Building Dictionary
[main] INFO tm.text.DataConverter$ - Saving dictionary before selection
[main] INFO tm.text.DataConverter$ - Selecting words in dictionary
[main] INFO tm.text.DataConverter$ - Number of selected tokens is 0.
[main] INFO tm.text.DataConverter$ - Saving dictionary after selection
[main] INFO tm.text.DataConverter$ - Counting n-grams (after 1 concatentations) in each document
[main] INFO tm.text.DataConverter$ - Building Dictionary
[main] INFO tm.text.DataConverter$ - Saving dictionary before selection
[main] INFO tm.text.DataConverter$ - Selecting words in dictionary
[main] INFO tm.text.DataConverter$ - Number of selected tokens is 0.
[main] INFO tm.text.DataConverter$ - Saving dictionary after selection
[main] INFO tm.text.DataConverter$ - Replacing constituent tokens by n-grams after 1 concatenations
[main] INFO tm.text.DataConverter$ - Counting n-grams (after 2 concatentations) in each document
[main] INFO tm.text.DataConverter$ - Building Dictionary
[main] INFO tm.text.DataConverter$ - Saving dictionary before selection
[main] INFO tm.text.DataConverter$ - Selecting words in dictionary
[main] INFO tm.text.DataConverter$ - Number of selected tokens is 0.
[main] INFO tm.text.DataConverter$ - Saving dictionary after selection
[main] INFO tm.text.DataConverter$ - Replacing constituent tokens by n-grams after 2 concatenations
[main] INFO tm.text.DataConverter$ - Counting n-grams after replacing constituent tokens in each document
[main] INFO tm.text.DataConverter$ - Building Dictionary
[main] INFO tm.text.DataConverter$ - Saving dictionary before selection
[main] INFO tm.text.DataConverter$ - Selecting words in dictionary
[main] INFO tm.text.DataConverter$ - Number of selected tokens is 0.
[main] INFO tm.text.DataConverter$ - Saving dictionary after selection
[main] INFO tm.hlta.HTD$ - Output file reading order
[main] INFO tm.hlta.HTD$ - Building model
Exception in thread "main" java.lang.NullPointerException
        at clustering.StepwiseEMHLTA.BridgingIslands(StepwiseEMHLTA.java:1214)
        at clustering.StepwiseEMHLTA.FastHLTA_learn(StepwiseEMHLTA.java:520)
        at clustering.StepwiseEMHLTA.IntegratedLearn(StepwiseEMHLTA.java:423)
        at tm.hlta.HLTA$.apply(HLTA.scala:93)
        at tm.hlta.HTD$.main(HTD.scala:203)
        at tm.hlta.HTD.main(HTD.scala)

The stack trace suggests the error occurs during tree construction, but it is actually caused by the dictionary file not being generated correctly: the generated files testoutput.dict.csv and testoutput.sparse.txt are empty, which triggers the issue. Is there any argument that could ensure at least a certain number of words are added to the dictionary?

PS: I checked the source code of hlta/src/main/scala/tm/text/Convert.scala, and the variable minDocFraction seems to handle this ratio. Is it the --min-doc-fraction option in the argument list?

PS2: I have tried this argument with 0.1 and 0.2, but the xxx.sparse.txt and xxx.dict.csv files are still empty. Any idea why this happens?

(base) D:\my_research\document_topic_modelling\hltm_python_util\hltm_python_util\JARS>java -cp HLTA.jar;HLTA-deps.jar tm.text.Convert -h
Usage: tm.text.Convert [OPTION]... name source max-words concat
  -d, --debug                     Show debug message
      --input-encoding  <arg>     Input text file encoding, default UTF-8, see
                                  java.nio.charset.Charset for available
                                  encodings
  -i, --input-ext  <arg>...       Look for these extensions if a directory is
                                  given, default "txt pdf"
  -l, --language  <arg>           Language, default as English, can be {english,
                                  chinese, nonascii}
      --max-doc-fraction  <arg>   Maximum fraction of documents that a token can
                                  appear to be selected. Default: 0.25
  -m, --min-char  <arg>           Minimum number of characters of a word to be
                                  selected. English default as 3,
                                  Chinese/Nonascii default as 1
      --min-doc-fraction  <arg>   Minimum fraction of documents that a token can
                                  appear to be selected. Default: 0.0
      --output-arff               Additionally output arff format
  -o, --output-hlcm               Additionally output hlcm format
      --output-lda                Additionally output lda format
  -s, --seed-words  <arg>         File containing tokens to be included,
                                  regardless of other selection criteria.
      --show-log-time             Show time in log
      --stop-words  <arg>         File of stop words, default using built-in
                                  stopwords list
  -t, --testset-ratio  <arg>      Split into training and testing set by a user
                                  given ratio. Default is 0.0
  -h, --help                      Show help message

Many thanks!

kmpoon commented 4 years ago

It seems a bit strange that no words are included in the dictionary file. You may try setting --min-doc-fraction to 0 (which is possibly the default anyway). You may also check whether your document files can be fed to tm.text.Convert properly.

gitathrun commented 4 years ago

@kmpoon, thanks for the reply. I have tried the tm.text.Convert command with --debug on, but it still does not work.

The logs:

Execute Java CMD: ['java', '-cp', '.\\JARS\\HLTA.jar;.\\JARS\\HLTA-deps.jar', 'tm.text.Convert', '--debug', '--language', 'english', '--min-doc-fraction', '0.0', 'test', '.\\txt_files', '1000', '1']
[main] INFO tm.text.Convert$ - Finding files under .\txt_files
[main] INFO tm.text.Convert$ - Found 6 files
[main] INFO tm.text.Convert$ - Reading documents
[scala-execution-context-global-16] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[scala-execution-context-global-16] INFO edu.stanford.nlp.pipeline.TokenizerAnnotator - No tokenizer type provided. Defaulting to PTBTokenizer.
[scala-execution-context-global-16] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[scala-execution-context-global-16] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[scala-execution-context-global-16] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.7 sec].
[scala-execution-context-global-16] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[scala-execution-context-global-17] WARN edu.stanford.nlp.process.PTBLexer - Untokenizable: ☺ (U+1, decimal: 1)
[main] INFO tm.text.DataConverter$ - Using the following word selector. Select tokens by TF-IDF. Min characters: 3, minDfFraction: 0.0, maxDfFraction: 0.25.
[main] INFO tm.text.DataConverter$ - Extracting words
[main] INFO tm.text.DataConverter$ - Counting n-grams (after 0 concatentations) in each document
[main] INFO tm.text.DataConverter$ - Building Dictionary
[main] INFO tm.text.DataConverter$ - Saving dictionary before selection
[main] INFO tm.text.DataConverter$ - Selecting words in dictionary
[main] INFO tm.text.DataConverter$ - Number of selected tokens is 0.
[main] INFO tm.text.DataConverter$ - Saving dictionary after selection
[main] INFO tm.text.DataConverter$ - Counting n-grams (after 1 concatentations) in each document
[main] INFO tm.text.DataConverter$ - Building Dictionary
[main] INFO tm.text.DataConverter$ - Saving dictionary before selection
[main] INFO tm.text.DataConverter$ - Selecting words in dictionary
[main] INFO tm.text.DataConverter$ - Number of selected tokens is 0.
[main] INFO tm.text.DataConverter$ - Saving dictionary after selection
[main] INFO tm.text.DataConverter$ - Replacing constituent tokens by n-grams after 1 concatenations
[main] INFO tm.text.DataConverter$ - Counting n-grams after replacing constituent tokens in each document
[main] INFO tm.text.DataConverter$ - Building Dictionary
[main] INFO tm.text.DataConverter$ - Saving dictionary before selection
[main] INFO tm.text.DataConverter$ - Selecting words in dictionary
[main] INFO tm.text.DataConverter$ - Number of selected tokens is 0.
[main] INFO tm.text.DataConverter$ - Saving dictionary after selection
[main] INFO tm.text.Convert$ - Text to data conversion is done.
[main] INFO tm.text.Convert$ - Saving in sparse data format (binary data)

This line seems suspicious:

[scala-execution-context-global-17] WARN edu.stanford.nlp.process.PTBLexer - Untokenizable: ☺ (U+1, decimal: 1)

But I did more experiments afterwards: I added a few more txt files to the directory (9 files instead of 6, i.e. 3 new txt files plus the 6 existing ones) and ran the command again, and Convert worked as expected.

Let's say:

Set A of text files: 1.txt, 2.txt, ..., 6.txt
Set B of text files: set A plus 7.txt, 8.txt, 9.txt

This is really confusing. If there is something untokenizable in set A that breaks the Convert process, set B should break it as well, because the untokenizable token is still present in set B. So why does the conversion of set B work fine?

Besides, I checked the B.dict.csv file, and the keywords from set A are all correctly extracted. If set B can be correctly n-gram tokenised and includes the tokens from set A, why does Convert.scala fail when applied to set A alone?

kmpoon commented 4 years ago

From your log:

[main] INFO tm.text.DataConverter$ - Number of selected tokens is 0.

It seems that no tokens (words) could be selected from your set A of files, whereas tokens could be selected after adding the 3 additional files. Given this line in the log:

[main] INFO tm.text.DataConverter$ - Using the following word selector. Select tokens by TF-IDF. Min characters: 3, minDfFraction: 0.0, maxDfFraction: 0.25.

There are two possible reasons why no tokens could be found: (1) the words have fewer than 3 characters; or (2) all words appear in more than 25% of the documents. You may change these settings to see whether any words can be found in the set A files.
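For example, an illustrative invocation only (the positional arguments name, source, max-words and concat are taken from your earlier command, and the threshold values are just for testing):

java -cp HLTA.jar;HLTA-deps.jar tm.text.Convert --min-char 1 --min-doc-fraction 0.0 --max-doc-fraction 0.5 test .\txt_files 1000 1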

gitathrun commented 4 years ago

Hi, @kmpoon

Yes, it worked after I increased maxDfFraction to 0.50.

[main] INFO tm.text.DataConverter$ - Using the following word selector. Select tokens by TF-IDF. Min characters: 3, minDfFraction: 0.0, maxDfFraction: 0.5.
[main] INFO tm.text.DataConverter$ - Extracting words
[main] INFO tm.text.DataConverter$ - Counting n-grams (after 0 concatentations) in each document
[main] INFO tm.text.DataConverter$ - Building Dictionary
[main] INFO tm.text.DataConverter$ - Saving dictionary before selection
[main] INFO tm.text.DataConverter$ - Selecting words in dictionary
[main] INFO tm.text.DataConverter$ - Number of selected tokens is 355.
[main] INFO tm.text.DataConverter$ - Saving dictionary after selection
[main] INFO tm.text.DataConverter$ - Counting n-grams (after 1 concatentations) in each document
[main] INFO tm.text.DataConverter$ - Building Dictionary
[main] INFO tm.text.DataConverter$ - Saving dictionary before selection
[main] INFO tm.text.DataConverter$ - Selecting words in dictionary
[main] INFO tm.text.DataConverter$ - Number of selected tokens is 534.
[main] INFO tm.text.DataConverter$ - Saving dictionary after selection
[main] INFO tm.text.DataConverter$ - Replacing constituent tokens by n-grams after 1 concatenations
[main] INFO tm.text.DataConverter$ - Counting n-grams after replacing constituent tokens in each document
[main] INFO tm.text.DataConverter$ - Building Dictionary
[main] INFO tm.text.DataConverter$ - Saving dictionary before selection
[main] INFO tm.text.DataConverter$ - Selecting words in dictionary
[main] INFO tm.text.DataConverter$ - Number of selected tokens is 424.
[main] INFO tm.text.DataConverter$ - Saving dictionary after selection
[main] INFO tm.text.Convert$ - Text to data conversion is done.
[main] INFO tm.text.Convert$ - Saving in sparse data format (binary data)

I think the empty .dict.csv problem is due to the low default value of maxDfFraction. The higher the value, the more tokens you get from the Convert process.

Notice: for a small-scale document corpus, it is better to set maxDfFraction to a relatively high value; otherwise the WordSelector filters out too many tokens, or even all of them, as in my case. With the default maxDfFraction of 0.25 and my 6 documents, the WordSelector filters out any token occurring in more than 1.5 (6 × 0.25) documents, so a token appearing in more than 1 document is discarded. After I changed it to 0.5, the threshold increased from 1.5 to 3 (6 × 0.5), which dramatically increased the number of selected tokens and ensured the dictionary file .dict.csv is not empty for the next sub-routine.
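To make the arithmetic concrete, here is a simplified sketch of such a document-frequency cut-off (only an illustration, not the actual WordSelector implementation; the exact rounding and comparison used by HLTA may differ):

object DfThresholdSketch {
  // A token is kept only if its document frequency does not exceed numDocs * maxDfFraction.
  def keepToken(docFreq: Int, numDocs: Int, maxDfFraction: Double): Boolean =
    docFreq <= numDocs * maxDfFraction

  def main(args: Array[String]): Unit = {
    val numDocs = 6
    // Default maxDfFraction = 0.25: cut-off is 6 * 0.25 = 1.5, so a token in 2+ documents is dropped.
    println(keepToken(docFreq = 2, numDocs = numDocs, maxDfFraction = 0.25)) // false
    // maxDfFraction = 0.5: cut-off rises to 6 * 0.5 = 3, so tokens in up to 3 documents are kept.
    println(keepToken(docFreq = 3, numDocs = numDocs, maxDfFraction = 0.5))  // true
  }
}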

Thanks! @kmpoon great help!