shangjingbo1226 / AutoPhrase

AutoPhrase: Automated Phrase Mining from Massive Text Corpora
Apache License 2.0

"POS file doesn't have enough POS tags" #66

Closed MTKnife closed 4 years ago

MTKnife commented 4 years ago

Running AutoPhrase in a Docker container, I'm having an issue similar to the one described in closed issue https://github.com/shangjingbo1226/AutoPhrase/issues/46.

I've checked tools/treetagger/download_parameter_files.sh in the Docker download zip file, and confirmed that it's now got the right address for downloading the English POS tagger, but I'm still getting the issue. Specifically, the tail end of my output looks like this:

POS file doesn't have enough POS tags
POS file doesn't have enough POS tags
POS file doesn't have enough POS tags
POS file doesn't have enough POS tags
POS file doesn't have enough POS tags
POS file doesn't have enough POS tags
POS file doesn't have enough POS tags
POS file doesn't have enough POS tags
POS file doesn't have enough POS tags
POS file doesn't have enough POS tags
POS file doesn't have enough POS tags
# of documents = 3696848
# of distinct POS tags = 1
Mining frequent phrases...
selected MAGIC = 1310293
# of frequent phrases = 2892252
Extracting features...
Constructing label pools...
        The size of the positive pool = 53349
        The size of the negative pool = 2828877
# truth patterns = 272853
Estimating Phrase Quality...
Segmenting...
Rectifying features...
Estimating Phrase Quality...
Segmenting...
Dumping results...
Done.

real    216m23.897s
user    132m23.490s
sys     48m58.630s
===Saving Model and Results===
===Generating Output===

Aside from the POS problem, the terminal output seems normal (I don't recall seeing the "ERROR: not a parameter file: ./lib/english-utf8.par!" message mentioned by the earlier reporter, but the POS warnings exceed my scroll buffer, so I can't double-check). However, after execution completes, the results folder is empty.

A few notes on specifics:

I'm running the latest version of Docker (2.1.0.4) on Windows 10. The provided make.sh script did not work correctly under Windows in a MinGW64 shell: it chokes on the wget for some reason, and in any case the filename is wrong (master.zip in the URL is correct, but the file is actually downloaded as Autophrase-master.zip, though using that name in the URL produces a 404). Fixing that was trivial.

However, I wasn't able to build the Docker container with the included Dockerfile. After several hours of trying to modify it to install OpenJDK (I need to use this in production, so Oracle Java isn't an option, and it wouldn't install anyway), I finally gave up and moved from the debian:jessie parent image to the openjdk:8 image provided by the OpenJDK people, which uses Debian Stretch. My Dockerfile thus skips the Java install but uses everything else in the original, including the g++ installation.

I don't know whether OpenJDK is causing the problem (it looks like other people are using it as well) or whether the transition from Jessie to Stretch is, but I'm at a loss.

Incidentally, when working with MinGW, the run command given in the README won't supply the proper paths for directory mounts. You need something like this instead, or you get a ";C" appended to the end of the working directory path:

winpty docker run -v "/${PWD}/data":/autophrase/data -v "/${PWD}/results":/autophrase/results -it -e RAW_TRAIN=data/corpus.txt -e FIRST_RUN=1 -e ENABLE_POS_TAGGING=1 -e MIN_SUP=30 -e THREAD=3 remenber1/autophrase
MTKnife commented 4 years ago

Running the script again, I've confirmed this is different from issue https://github.com/shangjingbo1226/AutoPhrase/issues/64: in my case, I didn't get an error about an incorrect English POS parameter file. However, I did get this error, 5 times in a row: cmd/tree-tagger-english: 20: cmd/tree-tagger-english: ./bin/tree-tagger: Permission denied. One obvious potential issue is that that one particular file, unlike all the others, is owned by root. I used chown and chgrp to change its ownership, but that didn't fix the problem. Also, I'm not sure why this one file has a different owner in the first place: maybe because it's created by the run script, which is, by default, run as root?

Next I tried setting permissions to 775 (755 should work just as well), thinking maybe the problem was the file wasn't executable. That got me past that error (in fact, after I restored ownership to root later, it still worked fine). However, then I got this:
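A minimal sketch of the permission fix described above, assuming a POSIX shell. The helper name ensure_executable is made up for illustration and is not part of AutoPhrase; the path in the usage comment is the one named in this thread.

```shell
#!/bin/sh
# ensure_executable: hypothetical helper (not part of AutoPhrase).
# Sets mode 755 on a file so that every user, not only the owner, can run it.
ensure_executable() {
    target="$1"
    if [ ! -f "$target" ]; then
        echo "missing file: $target" >&2
        return 1
    fi
    chmod 755 "$target"
}

# Example call, using the path named in this thread:
#   ensure_executable tools/treetagger/bin/tree-tagger
```

Mode 755 is used here rather than 775 because group write access is not needed just to execute the binary.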

===Saving Model and Results===
cp: cannot stat 'tmp/segmentation.model': No such file or directory
===Generating Output===
java.io.FileNotFoundException: tmp/final_quality_multi-words.txt (No such file or directory)
        at java.io.FileInputStream.open0(Native Method)
        at java.io.FileInputStream.open(FileInputStream.java:195)
        at java.io.FileInputStream.<init>(FileInputStream.java:138)
        at java.io.FileInputStream.<init>(FileInputStream.java:93)
        at Tokenizer.tokenizeText(Tokenizer.java:713)
        at Tokenizer.main(Tokenizer.java:864)
java.io.FileNotFoundException: tmp/final_quality_unigrams.txt (No such file or directory)
        at java.io.FileInputStream.open0(Native Method)
        at java.io.FileInputStream.open(FileInputStream.java:195)
        at java.io.FileInputStream.<init>(FileInputStream.java:138)
        at java.io.FileInputStream.<init>(FileInputStream.java:93)
        at Tokenizer.tokenizeText(Tokenizer.java:713)
        at Tokenizer.main(Tokenizer.java:864)
java.io.FileNotFoundException: tmp/final_quality_salient.txt (No such file or directory)
        at java.io.FileInputStream.open0(Native Method)
        at java.io.FileInputStream.open(FileInputStream.java:195)
        at java.io.FileInputStream.<init>(FileInputStream.java:138)
        at java.io.FileInputStream.<init>(FileInputStream.java:93)
        at Tokenizer.tokenizeText(Tokenizer.java:713)
        at Tokenizer.main(Tokenizer.java:864)

This appears to be the closed but not exactly resolved issue https://github.com/shangjingbo1226/AutoPhrase/issues/34. But then I noticed this in the output in the section preceding the one I quoted above:


=======
Loading data...
# of total tokens = 299743372
max word token id = 1310326
# of documents = 3696848
# of distinct POS tags = 57
Mining frequent phrases...
selected MAGIC = 1310327
[Warning] failed to open data/BAD_POS_TAGS.txt under parameters = r
./auto_phrase.sh: line 108:   384 Segmentation fault      ./bin/segphrase_train --pos_tag --thread $THREAD --pos_prune data/BAD_POS_TAGS.txt --label_method $LABEL_METHOD --label $LABEL_FILE --max_positives $MAX_POSITIVES --min_sup $MIN_SUP

real    3m58.528s
user    3m51.850s
sys     0m3.760s

Evidently, the later error was caused by the absence of a file that was supposed to be output by bin/segphrase_train. Weirdly, the warning and subsequent error disappeared next time I ran the script...a transient error of some sort? Or somehow caused by changing the ownership of tools/treetagger/bin/tree-tagger away from root? The latter seems unlikely--I think it was just a coincidence.
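The log above shows the warning about data/BAD_POS_TAGS.txt immediately before the segmentation fault, although the thread does not establish that one caused the other. As a sketch only, a guard like the following could turn a missing input file into a readable error before the training binary starts. The helper name require_readable is hypothetical and not part of AutoPhrase.

```shell
#!/bin/sh
# require_readable: hypothetical helper (not part of AutoPhrase).
# Returns non-zero with a clear message if an input file cannot be read.
require_readable() {
    if [ ! -r "$1" ]; then
        echo "required input not readable: $1" >&2
        return 1
    fi
}

# Example call before the training step, using the path from the log above:
#   require_readable data/BAD_POS_TAGS.txt || exit 1
```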

In that last run everything seemed to run just fine, but, despite a prominent green message alerting me to the fact that it was dumping the results, nothing was saved in results. After some poking around, I discovered they were instead saved in the directory named by the MODELS variable, which is, by default, models/DBLP. I suppose that's fine, but then why have a results directory in the first place? The obvious answer is that results is mapped from the host machine's file structure, making the results available outside the Docker container...which they aren't, because they're being saved in a directory internal to the container.
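One way to make the output visible on the host, sketched under the assumption that the model directory and the mounted results directory are the ones named above, is to copy the former into the latter at the end of the run. The helper name copy_results is hypothetical; whether the eventual fix works this way is not stated in the thread.

```shell
#!/bin/sh
# copy_results: hypothetical helper (not part of AutoPhrase).
# Copies everything in the model directory into the mounted results directory.
copy_results() {
    src="$1"
    dest="$2"
    if [ ! -d "$src" ]; then
        echo "no such directory: $src" >&2
        return 1
    fi
    mkdir -p "$dest"
    # "src/." copies the directory's contents, including dotfiles.
    cp -r "$src"/. "$dest"/
}

# Example call, using the default MODELS value and the mounted directory:
#   copy_results models/DBLP results
```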

In conclusion, we have the following issues:

  1. The permissions on tools/treetagger/bin/tree-tagger need to be changed.
  2. Either the results need to be saved in the results directory (or copied there), or that directory shouldn't be mounted by the docker run command.
  3. There may be a transient error involving data/BAD_POS_TAGS.txt.

I'm going to have a look and see if I can put in a pull request. Number 3 is trivial to fix; I'm not sure how to do number 1, but it can't be that hard. Of course, someone would have to be reading to approve a pull request...no one seems to have been monitoring the project since September.

MTKnife commented 4 years ago

Found it....compile.sh, which gets called by the Dockerfile in a Docker deployment (or manually in a plain Linux deployment), copies one of several OS-specific tree-tagger files to tools/treetagger/bin/tree-tagger, which explains why the file in question is owned by root. Why that causes a problem in Docker but not a normal Linux deployment, I don't know, but it's easy enough to fix in the Dockerfile by adding a chmod right after the call to compile.sh.

I've just made changes that address all these issues, and also provide tips for running Docker on Windows. I'll push the branch to my fork (https://github.com/MTKnife/AutoPhrase), then create a PR after I've tested it.

MTKnife commented 4 years ago

OK, after much frustrating futzing around, I've got a version that works. I issued a PR, but I'm not sure how long it'll take someone to approve it. In the meantime, you can get the new version at https://github.com/MTKnife/AutoPhrase.

Yevgnen commented 4 years ago

I'm running into the same issue outside Docker.

remenberl commented 4 years ago

We will investigate the problem caused by POS tagging. As a quick workaround, set ENABLE_POS_TAGGING to 0 in the script.
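A sketch of how a shell script could read such a flag from the environment, so that it can be switched off without editing the file. This illustrates one possible pattern and is not a description of how auto_phrase.sh actually handles ENABLE_POS_TAGGING; the function name pos_tagging_mode and the default of 0 are assumptions.

```shell
#!/bin/sh
# pos_tagging_mode: hypothetical helper (not part of AutoPhrase).
# Prints "enabled" only when ENABLE_POS_TAGGING is exactly 1; unset counts as 0.
pos_tagging_mode() {
    if [ "${ENABLE_POS_TAGGING:-0}" = "1" ]; then
        echo "enabled"
    else
        echo "disabled"
    fi
}

pos_tagging_mode
```

If the script inside the container honours the variable in a similar way, the docker run command quoted earlier in this thread could pass -e ENABLE_POS_TAGGING=0 in place of -e ENABLE_POS_TAGGING=1.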

YerongLi commented 4 years ago

Any update on this issue?

pauloamed commented 2 years ago

It was probably because the download links were outdated.