kmpoon / hlta

Provides functions for hierarchical latent tree analysis on text data for hierarchical topic detection
GNU General Public License v3.0

(subroute1) text Convert fails due to small-scale input text #16

Closed. gitathrun closed this issue 4 years ago

gitathrun commented 4 years ago

I encountered this error when I tried to apply the command on a directory containing only a few .txt files with little content.

java -cp HLTA.jar:HLTA-deps.jar tm.hlta.HTD ./mydir testoutput

The error:

[main] INFO tm.hlta.HTD$ - Convert raw text/pdf to .sparse.txt format
[main] INFO tm.text.Convert$ - Reading documents
[main] INFO tm.text.DataConverter$ - Using the following word selector. Select tokens by TF-IDF. Min characters: 3, minDfFraction: 0.0, maxDfFraction: 0.25.
[main] INFO tm.text.DataConverter$ - Extracting words
[main] INFO tm.text.DataConverter$ - Counting n-grams (after 0 concatentations) in each document
[main] INFO tm.text.DataConverter$ - Building Dictionary
[main] INFO tm.text.DataConverter$ - Saving dictionary before selection
[main] INFO tm.text.DataConverter$ - Selecting words in dictionary
[main] INFO tm.text.DataConverter$ - Number of selected tokens is 0.
[main] INFO tm.text.DataConverter$ - Saving dictionary after selection
[main] INFO tm.text.DataConverter$ - Counting n-grams (after 1 concatentations) in each document
[main] INFO tm.text.DataConverter$ - Building Dictionary
[main] INFO tm.text.DataConverter$ - Saving dictionary before selection
[main] INFO tm.text.DataConverter$ - Selecting words in dictionary
[main] INFO tm.text.DataConverter$ - Number of selected tokens is 0.
[main] INFO tm.text.DataConverter$ - Saving dictionary after selection
[main] INFO tm.text.DataConverter$ - Replacing constituent tokens by n-grams after 1 concatenations
[main] INFO tm.text.DataConverter$ - Counting n-grams (after 2 concatentations) in each document
[main] INFO tm.text.DataConverter$ - Building Dictionary
[main] INFO tm.text.DataConverter$ - Saving dictionary before selection
[main] INFO tm.text.DataConverter$ - Selecting words in dictionary
[main] INFO tm.text.DataConverter$ - Number of selected tokens is 0.
[main] INFO tm.text.DataConverter$ - Saving dictionary after selection
[main] INFO tm.text.DataConverter$ - Replacing constituent tokens by n-grams after 2 concatenations
[main] INFO tm.text.DataConverter$ - Counting n-grams after replacing constituent tokens in each document
[main] INFO tm.text.DataConverter$ - Building Dictionary
[main] INFO tm.text.DataConverter$ - Saving dictionary before selection
[main] INFO tm.text.DataConverter$ - Selecting words in dictionary
[main] INFO tm.text.DataConverter$ - Number of selected tokens is 0.
[main] INFO tm.text.DataConverter$ - Saving dictionary after selection
[main] INFO tm.hlta.HTD$ - Output file reading order
[main] INFO tm.hlta.HTD$ - Building model
Exception in thread "main" java.lang.NullPointerException
        at clustering.StepwiseEMHLTA.BridgingIslands(StepwiseEMHLTA.java:1214)
        at clustering.StepwiseEMHLTA.FastHLTA_learn(StepwiseEMHLTA.java:520)
        at clustering.StepwiseEMHLTA.IntegratedLearn(StepwiseEMHLTA.java:423)
        at tm.hlta.HLTA$.apply(HLTA.scala:93)
        at tm.hlta.HTD$.main(HTD.scala:203)
        at tm.hlta.HTD.main(HTD.scala)

The stack trace suggests the error occurs during tree construction, but it is actually caused by the dictionary file not being generated correctly: the generated files testoutput.dict.csv and testoutput.sparse.txt are empty, which triggers the issue. Is there any argument that could ensure at least a certain number of words are added to the dictionary?

PS: I checked the source code of hlta/src/main/scala/tm/text/Convert.scala, and the variable minDocFraction seems to handle this ratio. Is it the --min-doc-fraction option in the argument list?

PS2: I have tried this argument with 0.1 and 0.2, but the xxx.sparse.txt and xxx.dict.csv files are still empty. Any idea why this happens?

(base) D:\my_research\document_topic_modelling\hltm_python_util\hltm_python_util\JARS>java -cp HLTA.jar;HLTA-deps.jar tm.text.Convert -h
Usage: tm.text.Convert [OPTION]... name source max-words concat
  -d, --debug                     Show debug message
      --input-encoding  <arg>     Input text file encoding, default UTF-8, see
                                  java.nio.charset.Charset for available
                                  encodings
  -i, --input-ext  <arg>...       Look for these extensions if a directory is
                                  given, default "txt pdf"
  -l, --language  <arg>           Language, default as English, can be {english,
                                  chinese, nonascii}
      --max-doc-fraction  <arg>   Maximum fraction of documents that a token can
                                  appear to be selected. Default: 0.25
  -m, --min-char  <arg>           Minimum number of characters of a word to be
                                  selected. English default as 3,
                                  Chinese/Nonascii default as 1
      --min-doc-fraction  <arg>   Minimum fraction of documents that a token can
                                  appear to be selected. Default: 0.0
      --output-arff               Additionally output arff format
  -o, --output-hlcm               Additionally output hlcm format
      --output-lda                Additionally output lda format
  -s, --seed-words  <arg>         File containing tokens to be included,
                                  regardless of other selection criteria.
      --show-log-time             Show time in log
      --stop-words  <arg>         File of stop words, default using built-in
                                  stopwords list
  -t, --testset-ratio  <arg>      Split into training and testing set by a user
                                  given ratio. Default is 0.0
  -h, --help                      Show help message

Many thanks!

kmpoon commented 4 years ago

It seems a bit strange that no words are included in the dictionary file. You may try setting --min-doc-fraction to 0 (which is possibly the default anyway). You may also check whether your document files can be fed to tm.text.Convert properly.

gitathrun commented 4 years ago

@kmpoon, thanks for the reply. I have tried the tm.text.Convert command with --debug on, but it still does not work.

The logs:

Execute Java CMD: ['java', '-cp', '.\\JARS\\HLTA.jar;.\\JARS\\HLTA-deps.jar', 'tm.text.Convert', '--debug', '--language', 'english', '--min-doc-fraction', '0.0', 'test', '.\\txt_files', '1000', '1']
[main] INFO tm.text.Convert$ - Finding files under .\txt_files
[main] INFO tm.text.Convert$ - Found 6 files
[main] INFO tm.text.Convert$ - Reading documents
[scala-execution-context-global-16] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[scala-execution-context-global-16] INFO edu.stanford.nlp.pipeline.TokenizerAnnotator - No tokenizer type provided. Defaulting to PTBTokenizer.
[scala-execution-context-global-16] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[scala-execution-context-global-16] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[scala-execution-context-global-16] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.7 sec].
[scala-execution-context-global-16] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[scala-execution-context-global-17] WARN edu.stanford.nlp.process.PTBLexer - Untokenizable: ☺ (U+1, decimal: 1)
[main] INFO tm.text.DataConverter$ - Using the following word selector. Select tokens by TF-IDF. Min characters: 3, minDfFraction: 0.0, maxDfFraction: 0.25.
[main] INFO tm.text.DataConverter$ - Extracting words
[main] INFO tm.text.DataConverter$ - Counting n-grams (after 0 concatentations) in each document
[main] INFO tm.text.DataConverter$ - Building Dictionary
[main] INFO tm.text.DataConverter$ - Saving dictionary before selection
[main] INFO tm.text.DataConverter$ - Selecting words in dictionary
[main] INFO tm.text.DataConverter$ - Number of selected tokens is 0.
[main] INFO tm.text.DataConverter$ - Saving dictionary after selection
[main] INFO tm.text.DataConverter$ - Counting n-grams (after 1 concatentations) in each document
[main] INFO tm.text.DataConverter$ - Building Dictionary
[main] INFO tm.text.DataConverter$ - Saving dictionary before selection
[main] INFO tm.text.DataConverter$ - Selecting words in dictionary
[main] INFO tm.text.DataConverter$ - Number of selected tokens is 0.
[main] INFO tm.text.DataConverter$ - Saving dictionary after selection
[main] INFO tm.text.DataConverter$ - Replacing constituent tokens by n-grams after 1 concatenations
[main] INFO tm.text.DataConverter$ - Counting n-grams after replacing constituent tokens in each document
[main] INFO tm.text.DataConverter$ - Building Dictionary
[main] INFO tm.text.DataConverter$ - Saving dictionary before selection
[main] INFO tm.text.DataConverter$ - Selecting words in dictionary
[main] INFO tm.text.DataConverter$ - Number of selected tokens is 0.
[main] INFO tm.text.DataConverter$ - Saving dictionary after selection
[main] INFO tm.text.Convert$ - Text to data conversion is done.
[main] INFO tm.text.Convert$ - Saving in sparse data format (binary data)

This line seems suspicious:

[scala-execution-context-global-17] WARN edu.stanford.nlp.process.PTBLexer - Untokenizable: ☺ (U+1, decimal: 1)

But I did more experiments afterwards: I added a few more txt files to the directory (9 files instead of 6, i.e. 3 new txt files plus the 6 existing ones) and ran the command again, and Convert worked as expected.

Let's say:

Set A of text files: 1.txt, 2.txt, ..., 6.txt
Set B of text files: set A plus 7.txt, 8.txt, 9.txt

This is really confusing. If there is something untokenizable in set A that breaks the Convert process, set B should break it as well, because the untokenizable token is still present in set B. So why does the conversion of set B work fine?

Besides, I checked the B.dict.csv file, and the keywords from set A are all correctly extracted. If set B can be correctly n-gram tokenised and includes the tokens from set A, why does Convert.scala fail when applied to set A alone?

kmpoon commented 4 years ago

From your log:

[main] INFO tm.text.DataConverter$ - Number of selected tokens is 0.

It seems that no tokens (words) could be selected from your set A of files, whereas tokens could be selected after adding the 3 additional files. Given this line in the log:

[main] INFO tm.text.DataConverter$ - Using the following word selector. Select tokens by TF-IDF. Min characters: 3, minDfFraction: 0.0, maxDfFraction: 0.25.

There are two possible reasons why no tokens could be found: (1) the words have fewer than 3 characters; or (2) all words appear in more than 25% of the documents. You may change these settings to see whether any words can be found in the set A files.
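For example, an illustrative invocation only (the positional arguments name, source, max-words and concat are taken from your earlier command, and the threshold values are just for testing):

java -cp HLTA.jar;HLTA-deps.jar tm.text.Convert --min-char 1 --min-doc-fraction 0.0 --max-doc-fraction 0.5 test .\txt_files 1000 1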

gitathrun commented 4 years ago

Hi, @kmpoon

Yes, it worked after I increased maxDfFraction to 0.50.

[main] INFO tm.text.DataConverter$ - Using the following word selector. Select tokens by TF-IDF. Min characters: 3, minDfFraction: 0.0, maxDfFraction: 0.5.
[main] INFO tm.text.DataConverter$ - Extracting words
[main] INFO tm.text.DataConverter$ - Counting n-grams (after 0 concatentations) in each document
[main] INFO tm.text.DataConverter$ - Building Dictionary
[main] INFO tm.text.DataConverter$ - Saving dictionary before selection
[main] INFO tm.text.DataConverter$ - Selecting words in dictionary
[main] INFO tm.text.DataConverter$ - Number of selected tokens is 355.
[main] INFO tm.text.DataConverter$ - Saving dictionary after selection
[main] INFO tm.text.DataConverter$ - Counting n-grams (after 1 concatentations) in each document
[main] INFO tm.text.DataConverter$ - Building Dictionary
[main] INFO tm.text.DataConverter$ - Saving dictionary before selection
[main] INFO tm.text.DataConverter$ - Selecting words in dictionary
[main] INFO tm.text.DataConverter$ - Number of selected tokens is 534.
[main] INFO tm.text.DataConverter$ - Saving dictionary after selection
[main] INFO tm.text.DataConverter$ - Replacing constituent tokens by n-grams after 1 concatenations
[main] INFO tm.text.DataConverter$ - Counting n-grams after replacing constituent tokens in each document
[main] INFO tm.text.DataConverter$ - Building Dictionary
[main] INFO tm.text.DataConverter$ - Saving dictionary before selection
[main] INFO tm.text.DataConverter$ - Selecting words in dictionary
[main] INFO tm.text.DataConverter$ - Number of selected tokens is 424.
[main] INFO tm.text.DataConverter$ - Saving dictionary after selection
[main] INFO tm.text.Convert$ - Text to data conversion is done.
[main] INFO tm.text.Convert$ - Saving in sparse data format (binary data)

I think the empty .dict.csv problem is due to the low default value of maxDfFraction. The higher the value, the more tokens you get from the Convert process.

Notice: for a small-scale document corpus, it is better to set maxDfFraction to a relatively high value; otherwise the WordSelector filters out too many tokens, or even all of them, as in my case. With the default maxDfFraction of 0.25 and my 6 documents, the WordSelector filters out any token occurring in more than 1.5 (6 × 0.25) documents, so a token appearing in more than 1 document is discarded. After I changed it to 0.5, the threshold increased from 1.5 to 3 (6 × 0.5), which dramatically increased the number of selected tokens and ensured the dictionary file .dict.csv is not empty for the next sub-routine.
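To make the arithmetic concrete, here is a simplified sketch of such a document-frequency cut-off (only an illustration, not the actual WordSelector implementation; the exact rounding and comparison used by HLTA may differ):

object DfThresholdSketch {
  // A token is kept only if its document frequency does not exceed numDocs * maxDfFraction.
  def keepToken(docFreq: Int, numDocs: Int, maxDfFraction: Double): Boolean =
    docFreq <= numDocs * maxDfFraction

  def main(args: Array[String]): Unit = {
    val numDocs = 6
    // Default maxDfFraction = 0.25: cut-off is 6 * 0.25 = 1.5, so a token in 2+ documents is dropped.
    println(keepToken(docFreq = 2, numDocs = numDocs, maxDfFraction = 0.25)) // false
    // maxDfFraction = 0.5: cut-off rises to 6 * 0.5 = 3, so tokens in up to 3 documents are kept.
    println(keepToken(docFreq = 3, numDocs = numDocs, maxDfFraction = 0.5))  // true
  }
}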

Thanks! @kmpoon great help!