ulaval-rs / trombone

GNU General Public License v3.0

Find the tool that lists terms and determine how to use it on a large corpus #12

Closed: gacou54 closed 2 years ago

gacou54 commented 3 years ago

corpus.DocumentTerms is the tool to get the terms per document. corpus.CorpusTerms is the tool to get the terms of the entire corpus.

Here is an example of how to use corpus.DocumentTerms by running the jar:

 java -jar \
  ./target/trombone-5.2.1-SNAPSHOT-jar-with-dependencies.jar \
  storage=file \
  dataDirectory=./data/data_directory/ \
  tool=corpus.DocumentTerms \
  minRawFreq=100 \
  whiteList=de,the \
  file=./data/raw \
  outputFile=./data/results/output.json

Here is an example of how to use corpus.CorpusTerms by running the jar:

 java -jar \
  ./target/trombone-5.2.1-SNAPSHOT-jar-with-dependencies.jar \
  storage=file \
  dataDirectory=./data/data_directory/ \
  tool=corpus.CorpusTerms \
  minRawFreq=100 \
  whiteList=de,the \
  file=./data/raw \
  outputFile=./data/results/output.json
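
Both commands above write to the same outputFile, so running one after the other would overwrite the first result. A minimal sketch of a wrapper that runs either tool and gives each its own output file is shown below. The function name `run_tool`, the `DRY_RUN` switch and the per-tool output file names are assumptions made for illustration; every other parameter is copied unchanged from the commands above.

```shell
#!/usr/bin/env bash
# Sketch only: run_tool, DRY_RUN and the per-tool output names are hypothetical.
# The jar path and all key=value parameters come from the commands above.

JAR=./target/trombone-5.2.1-SNAPSHOT-jar-with-dependencies.jar

# Run one Trombone tool; only tool= and outputFile= differ between runs.
run_tool() {
  local tool="$1"
  # When DRY_RUN is set, "echo" is prepended, so the command is printed
  # instead of executed.
  ${DRY_RUN:+echo} java -jar "$JAR" \
    storage=file \
    dataDirectory=./data/data_directory/ \
    tool="$tool" \
    minRawFreq=100 \
    whiteList=de,the \
    file=./data/raw \
    outputFile="./data/results/${tool}.json"
}

# Preview both commands without launching Java.
for tool in corpus.DocumentTerms corpus.CorpusTerms; do
  DRY_RUN=1 run_tool "$tool"
done
```

Calling `run_tool corpus.DocumentTerms` without `DRY_RUN` set would launch the jar itself, assuming it has been built at the path shown.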