It seems a bit strange that no words are included in the dictionary file. You may try to set the --min-doc-fraction to 0 (which is possibly the default anyway). On the other hand, you may also check whether you can feed your document files to tm.text.Convert properly.
@kmpoon, thanks for the reply. I have tried the tm.text.Convert command with --debug on, but it still does not work.
The logs:
Execute Java CMD: ['java', '-cp', '.\\JARS\\HLTA.jar;.\\JARS\\HLTA-deps.jar', 'tm.text.Convert', '--debug', '--language', 'english', '--min-doc-fraction', '0.0', 'test', '.\\txt_files', '1000', '1']
[main] INFO tm.text.Convert$ - Finding files under .\txt_files
[main] INFO tm.text.Convert$ - Found 6 files
[main] INFO tm.text.Convert$ - Reading documents
[scala-execution-context-global-16] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[scala-execution-context-global-16] INFO edu.stanford.nlp.pipeline.TokenizerAnnotator - No tokenizer type provided. Defaulting to PTBTokenizer.
[scala-execution-context-global-16] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[scala-execution-context-global-16] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[scala-execution-context-global-16] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.7 sec].
[scala-execution-context-global-16] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[scala-execution-context-global-17] WARN edu.stanford.nlp.process.PTBLexer - Untokenizable: ☺ (U+1, decimal: 1)
[main] INFO tm.text.DataConverter$ - Using the following word selector. Select tokens by TF-IDF. Min characters: 3, minDfFraction: 0.0, maxDfFraction: 0.25.
[main] INFO tm.text.DataConverter$ - Extracting words
[main] INFO tm.text.DataConverter$ - Counting n-grams (after 0 concatentations) in each document
[main] INFO tm.text.DataConverter$ - Building Dictionary
[main] INFO tm.text.DataConverter$ - Saving dictionary before selection
[main] INFO tm.text.DataConverter$ - Selecting words in dictionary
[main] INFO tm.text.DataConverter$ - Number of selected tokens is 0.
[main] INFO tm.text.DataConverter$ - Saving dictionary after selection
[main] INFO tm.text.DataConverter$ - Counting n-grams (after 1 concatentations) in each document
[main] INFO tm.text.DataConverter$ - Building Dictionary
[main] INFO tm.text.DataConverter$ - Saving dictionary before selection
[main] INFO tm.text.DataConverter$ - Selecting words in dictionary
[main] INFO tm.text.DataConverter$ - Number of selected tokens is 0.
[main] INFO tm.text.DataConverter$ - Saving dictionary after selection
[main] INFO tm.text.DataConverter$ - Replacing constituent tokens by n-grams after 1 concatenations
[main] INFO tm.text.DataConverter$ - Counting n-grams after replacing constituent tokens in each document
[main] INFO tm.text.DataConverter$ - Building Dictionary
[main] INFO tm.text.DataConverter$ - Saving dictionary before selection
[main] INFO tm.text.DataConverter$ - Selecting words in dictionary
[main] INFO tm.text.DataConverter$ - Number of selected tokens is 0.
[main] INFO tm.text.DataConverter$ - Saving dictionary after selection
[main] INFO tm.text.Convert$ - Text to data conversion is done.
[main] INFO tm.text.Convert$ - Saving in sparse data format (binary data)
This line seems suspicious:
[scala-execution-context-global-17] WARN edu.stanford.nlp.process.PTBLexer - Untokenizable: ☺ (U+1, decimal: 1)
But I did more experiments afterwards: I added a few more txt files to the directory (9 files instead of 6, i.e. 3 new txt files alongside the 6 existing ones) and ran the command again, and this time Convert worked as expected.
Let's say:
A text file set: 1.txt, 2.txt ... 6.txt
B text file set: the A set plus 7.txt, 8.txt, 9.txt
This is really confusing: if there is something untokenizable in the A set that breaks the Convert process, the B set should also break it, because the untokenizable token is still in the B set. So why does the data conversion work fine on the B set?
Besides, I checked the B .dict.csv file, and the keywords from the A set are all correctly extracted. So if the B set, which includes the tokens of the A set, can be correctly n-gram tokenised, why does Convert.scala fail when applied to the A set alone?
From your log:
[main] INFO tm.text.DataConverter$ - Number of selected tokens is 0.
It seems that no tokens (words) could be found in your A set of files, while tokens could be found after adding the 3 additional files. Given this line in the log:
[main] INFO tm.text.DataConverter$ - Using the following word selector. Select tokens by TF-IDF. Min characters: 3, minDfFraction: 0.0, maxDfFraction: 0.25.
There are two possible reasons why no tokens could be found: (1) the words have fewer than 3 characters; or (2) all words appeared in more than 25% of the documents. You may change these settings to see whether any words can be found in the A set of files.
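To illustrate, here is a minimal sketch of those two conditions (in Scala; TokenStats, select, and the sample data are illustrative only, not the actual HLTA WordSelector, which also ranks tokens by TF-IDF):

object WordSelectorSketch {
  // Illustrative stand-in for per-token statistics.
  case class TokenStats(token: String, docFreq: Int)

  // Keep a token only if it has at least minChars characters and
  // appears in at most maxDfFraction of the numDocs documents.
  def select(stats: Seq[TokenStats], numDocs: Int,
             minChars: Int = 3, maxDfFraction: Double = 0.25): Seq[TokenStats] =
    stats.filter(s =>
      s.token.length >= minChars &&
      s.docFreq <= maxDfFraction * numDocs)

  def main(args: Array[String]): Unit = {
    val stats = Seq(TokenStats("network", 2), TokenStats("ai", 1), TokenStats("topic", 1))
    // With 6 documents, maxDfFraction * numDocs = 1.5, so "network" (in 2 docs)
    // is dropped by condition (2) and "ai" is dropped by condition (1).
    println(select(stats, numDocs = 6)) // List(TokenStats(topic,1))
  }
}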
Hi, @kmpoon
Yes, it works after I increased maxDfFraction to 0.50:
[main] INFO tm.text.DataConverter$ - Using the following word selector. Select tokens by TF-IDF. Min characters: 3, minDfFraction: 0.0, maxDfFraction: 0.5.
[main] INFO tm.text.DataConverter$ - Extracting words
[main] INFO tm.text.DataConverter$ - Counting n-grams (after 0 concatentations) in each document
[main] INFO tm.text.DataConverter$ - Building Dictionary
[main] INFO tm.text.DataConverter$ - Saving dictionary before selection
[main] INFO tm.text.DataConverter$ - Selecting words in dictionary
[main] INFO tm.text.DataConverter$ - Number of selected tokens is 355.
[main] INFO tm.text.DataConverter$ - Saving dictionary after selection
[main] INFO tm.text.DataConverter$ - Counting n-grams (after 1 concatentations) in each document
[main] INFO tm.text.DataConverter$ - Building Dictionary
[main] INFO tm.text.DataConverter$ - Saving dictionary before selection
[main] INFO tm.text.DataConverter$ - Selecting words in dictionary
[main] INFO tm.text.DataConverter$ - Number of selected tokens is 534.
[main] INFO tm.text.DataConverter$ - Saving dictionary after selection
[main] INFO tm.text.DataConverter$ - Replacing constituent tokens by n-grams after 1 concatenations
[main] INFO tm.text.DataConverter$ - Counting n-grams after replacing constituent tokens in each document
[main] INFO tm.text.DataConverter$ - Building Dictionary
[main] INFO tm.text.DataConverter$ - Saving dictionary before selection
[main] INFO tm.text.DataConverter$ - Selecting words in dictionary
[main] INFO tm.text.DataConverter$ - Number of selected tokens is 424.
[main] INFO tm.text.DataConverter$ - Saving dictionary after selection
[main] INFO tm.text.Convert$ - Text to data conversion is done.
[main] INFO tm.text.Convert$ - Saving in sparse data format (binary data)
I think the empty .dict.csv problem is due to the low maxDfFraction threshold: the higher the value, the more tokens survive the Convert process.
Note: for a small document corpus, it is better to set maxDfFraction to a relatively high value; otherwise the WordSelector filters out too many tokens, or even all of them, as in my case. With the default maxDfFraction of 0.25 and my 6 documents, the WordSelector filters out any token that occurs in more than 1.5 documents (6 × 0.25), i.e. any token that occurs in more than 1 document is discarded. After I changed the value to 0.5, the threshold increased from 1.5 to 3 (6 × 0.5), which dramatically increased the number of selected tokens and ensured the token dictionary file .dict.csv is not empty for the next sub-routine.
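For reference, a throwaway snippet (the names are mine, not from HLTA) that checks this threshold arithmetic:

object DfThresholdCheck {
  // Maximum number of documents a token may appear in before it is filtered out.
  def maxDocCount(numDocs: Int, maxDfFraction: Double): Double =
    numDocs * maxDfFraction

  def main(args: Array[String]): Unit = {
    println(maxDocCount(6, 0.25)) // 1.5 -> tokens in 2+ of 6 documents are dropped
    println(maxDocCount(6, 0.5))  // 3.0 -> tokens may appear in up to 3 of 6 documents
  }
}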
Thanks @kmpoon, great help!
I encountered this error when I tried to apply the command to a directory containing only a few .txt files with little content.
The error:
The error message indicates a failure during tree construction, but the actual cause is that the dictionary file is not generated correctly: the generated files testoutput.dict.csv and testoutput.sparse.txt are empty. Is there any argument that could ensure at least a certain number of words is added to the dictionary?
PS: I checked the source code of hlta/src/main/scala/tm/text/Convert.scala, and the variable minDocFraction seems to handle the ratio. Is it the --min-doc-fraction in the argument list?
PS2: I have tried this argument with 0.1 and 0.2, but the xxx.sparse.txt and xxx.dict.csv files are still empty. Any idea why this happens?
Many thanks!