psal / JStylo-Anonymouth

Java-based Authorship Recognition Analysis and Evasion Tools
http://psal.cs.drexel.edu/

Takes an extremely long time to run #1

Open dan-blanchard opened 12 years ago

dan-blanchard commented 12 years ago

I'm trying to process a set of 10,000 files using JStylo (with 12 possible authors), and it takes an extremely long time to generate the features (using the WriteLimits set). I've had it running for over three weeks, and unfortunately I now have to restart the process because I realized there was some metadata in the files that I did not want there.

Anyway, is there any advice you can offer on how to speed things up?

According to top, it seems that only one thread was running for most of the past three weeks, so are you planning on making the feature-generation code more multithreaded in the future?

sheetal57 commented 12 years ago

I think you got an out-of-memory exception. Did you check your terminal? It shouldn't take that long.

dan-blanchard commented 12 years ago

There are no messages in the terminal, and I've also allocated 40 gigs of RAM for it. I can also tell that it's definitely still running with htop because the amount of CPU it's using changes over time, as well as the threads that are doing the work.

I did have to run it with -XX:-UseGCOverheadLimit to prevent "GC overhead limit exceeded" errors though.
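For context, HotSpot documents "GC overhead limit exceeded" as the error raised when the JVM spends nearly all of its time in garbage collection while reclaiming very little heap, so disabling the check hides the symptom without removing the memory pressure. A minimal sketch of how code driving the extraction through the Java API could report that pressure is shown below. The class `GcOverhead` and its methods are hypothetical names, not part of JStylo, and the numbers describe only the JVM the code runs in.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper (not part of JStylo): reports how much of the JVM's uptime went to GC. */
final class GcOverhead {

    /** Fraction of uptime spent collecting; 0.0 when uptime is not positive. */
    static double gcFraction(long gcMillis, long uptimeMillis) {
        if (uptimeMillis <= 0) {
            return 0.0;
        }
        return (double) gcMillis / uptimeMillis;
    }

    /** Sums the accumulated collection time reported by every collector in this JVM. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 when the collector does not report a time
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        long maxHeapMiB = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.printf("max heap: %d MiB, time in GC: %.1f%%%n",
                maxHeapMiB, 100.0 * gcFraction(totalGcMillis(), uptime));
    }
}
```

Logging this figure periodically during a long run would show whether the process is making progress or mostly collecting garbage.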

arielstolerman commented 12 years ago

It's weird that there are no messages in the terminal... It should write a log automatically as it runs. I suggest you first try a subset with far fewer files to see how it behaves, preferably with a small feature set like the basic-9. In addition, make sure that in the analysis tab the following options are UNCHECKED:

I suggest you try using the JStylo API in Java to get better control of the process. Concurrency is planned for future development; however, we're currently focused on adding more unique classifiers (like Writeprints).
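Until concurrency lands in the tool itself, per-document feature extraction is the kind of work that can be spread over a thread pool by whoever drives it from Java. The sketch below is an illustration under assumptions, not JStylo's API: `ParallelExtraction`, `extractAll`, and the stand-in feature `averageCharsPerWord` are hypothetical names, and the extractor passed in must be safe to call from several threads.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical helper (not part of JStylo): runs a per-document extractor on a thread pool. */
final class ParallelExtraction {

    /** Stand-in feature: average number of characters per whitespace-separated word. */
    static double averageCharsPerWord(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0.0;
        }
        String[] words = trimmed.split("\\s+");
        int chars = 0;
        for (String word : words) {
            chars += word.length();
        }
        return (double) chars / words.length;
    }

    /** Applies the extractor to every document; results are returned in input order. */
    static <T> List<T> extractAll(List<String> documents, Function<String, T> extractor, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<T>> tasks = new ArrayList<>();
            for (String doc : documents) {
                tasks.add(() -> extractor.apply(doc));
            }
            List<T> results = new ArrayList<>();
            // invokeAll blocks until every task has finished and preserves task order.
            for (Future<T> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> docs = Arrays.asList("one two three", "a bb", "");
        int threads = Runtime.getRuntime().availableProcessors();
        System.out.println(extractAll(docs, ParallelExtraction::averageCharsPerWord, threads));
    }
}
```

Features that depend on corpus-wide statistics (for example, "top" n-grams) would still need a merge step after the parallel pass, which this sketch leaves out.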

dan-blanchard commented 12 years ago

I'm sorry; I didn't mean to give the impression that there were absolutely no messages on the terminal, I just meant there was no "out of memory" error or anything like that. The last two things it prints after "Extracting features from trained corpus" are:

Loading default properties from trained tagger com/jgaap/resources/models/postagger/english-left3words-distsim.tagger
Reading POS tagger model from com/jgaap/resources/models/postagger/english-left3words-distsim.tagger ... done [1.6 sec].
Loading default properties from trained tagger com/jgaap/resources/models/postagger/english-left3words-distsim.tagger
Reading POS tagger model from com/jgaap/resources/models/postagger/english-left3words-distsim.tagger ... done [1.0 sec].

I have tried it with fewer files and features and it worked fine, which is what encouraged me to try our larger dataset.

I'll try it again with "output feature vectors" and "calculate information gain" turned off.

sheetal57 commented 12 years ago

Are there any messages after that? If not, then the problem is the POS tagger. Sometimes, if a document has special characters, the POS tagger gets stuck indefinitely. One workaround for now is to check whether your documents contain any unusual characters.
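One way to run that check over a whole corpus is to scan each file for code points that rarely belong in plain essay text. The sketch below is only an illustration: `SuspiciousChars` is a hypothetical name, the files are assumed to be UTF-8, and the choice of what counts as "suspicious" (control, format, private-use, unassigned, and lone-surrogate code points, plus the U+FFFD replacement character) is an assumption rather than a known trigger for the tagger.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of JStylo): lists unusual code points in a text. */
final class SuspiciousChars {

    /** True for code points that rarely belong in plain prose; tab, LF and CR are allowed. */
    static boolean isSuspicious(int codePoint) {
        if (codePoint == '\t' || codePoint == '\n' || codePoint == '\r') {
            return false;
        }
        if (codePoint == 0xFFFD) { // replacement character, typical of a decoding failure
            return true;
        }
        switch (Character.getType(codePoint)) {
            case Character.CONTROL:
            case Character.FORMAT:
            case Character.PRIVATE_USE:
            case Character.SURROGATE:
            case Character.UNASSIGNED:
                return true;
            default:
                return false;
        }
    }

    /** Returns one entry such as "U+0007 at index 2" per suspicious code point. */
    static List<String> find(String text) {
        List<String> hits = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int codePoint = text.codePointAt(i);
            if (isSuspicious(codePoint)) {
                hits.add(String.format("U+%04X at index %d", codePoint, i));
            }
            i += Character.charCount(codePoint);
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Each argument is a file path; malformed UTF-8 bytes decode to U+FFFD and get reported.
        for (String name : args) {
            String text = new String(Files.readAllBytes(Paths.get(name)), StandardCharsets.UTF_8);
            List<String> hits = find(text);
            if (!hits.isEmpty()) {
                System.out.println(name + ": " + hits);
            }
        }
    }
}
```

Files that produce hits are reasonable candidates to clean or to set aside before the next run.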

In the future we will add other POS taggers, or just abort the tagging process for words on which it gets stuck.
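Aborting a stuck tagging call can be sketched with a standard `Future` timeout. This is an assumption about how such a guard might look, not the project's implementation: `TimedTagging`, `tagWithTimeout`, and `simulatedTagger` are hypothetical names, and the simulated tagger merely sleeps to imitate a hang. Cancelling only interrupts the worker thread, so it helps only if the real tagger responds to interruption.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;

/** Hypothetical helper (not part of JStylo): gives up on a tagging call after a time limit. */
final class TimedTagging {

    /** Returns the tagger's result, or the fallback if no result arrives within the limit. */
    static <R> R tagWithTimeout(Function<String, R> tagger, String text, long timeoutMillis,
                                R fallback, ExecutorService pool)
            throws InterruptedException, ExecutionException {
        Callable<R> task = () -> tagger.apply(text);
        Future<R> future = pool.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker; a tagger that ignores interrupts keeps running
            return fallback;
        }
    }

    /** Stand-in tagger for the demo: hangs on the input "HANG", upper-cases anything else. */
    static String simulatedTagger(String text) {
        if (text.equals("HANG")) {
            try {
                Thread.sleep(60_000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return text.toUpperCase();
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            for (String doc : new String[] {"a normal sentence", "HANG"}) {
                String tagged = tagWithTimeout(TimedTagging::simulatedTagger, doc, 200, "<skipped>", pool);
                System.out.println(doc + " -> " + tagged);
            }
        } finally {
            pool.shutdownNow();
        }
    }
}
```

A cached pool is used so that one stuck worker does not block later documents; logging which inputs hit the fallback would also identify the documents that trigger the hang.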

dan-blanchard commented 12 years ago

There are no messages after that, so you're probably right that it's the tagger. However, from the file name of the model that you're loading, it sounds like you're just using the Stanford Tagger, and I've previously tagged all of these essays with the same model without issue.

Thanks for being so responsive and helpful about this, by the way.

sheetal57 commented 12 years ago

It happened to me before when I was processing some files with the Stanford tagger. I haven't pinpointed exactly why and when it gets stuck, though. I'll let you know if I figure that out.

dan-blanchard commented 12 years ago

I've started things up again and removed all of the POS-related features from the set. I'm using:

> Character count
> Average characters per word
> Letters
> Top Letter bigrams
> Top Letter trigrams
> Digits Percentage
> Letters Percentage
> Uppercase Letters Percentage
> Digits
> Two Digit Numbers
> Three Digit Numbers
> Word Lengths
> Special Characters
> Function Words
> Punctuation
> Words
> Word Bigrams
> Word Trigrams
> Misspelled Words

and with the same set of 11,000 essays it has been running for two days so far. Is that to be expected? The last thing the log says is "Extracting features from training corpus...".

sheetal57 commented 12 years ago

Sorry, I don't have any immediate answer to that. We need to look into this issue.

arielstolerman commented 12 years ago

What about the command-line log? Is it identical, or has an exception perhaps been thrown? I would leave that run going, as it is a very large corpus, and in the meantime construct a sub-corpus (by simply removing most of the documents from your problem set) and run a similar experiment over it, to see whether it runs relatively fast on a small problem. Please let us know how it goes.
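Picking the sub-corpus can be automated so that every author stays represented. The sketch below assumes the corpus is already available as a map from author to document paths; `SubCorpus` and `sample` are hypothetical names, and writing the result back into a JStylo problem-set file is deliberately left out because that format is not shown in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Hypothetical helper (not part of JStylo): keeps a fixed number of documents per author. */
final class SubCorpus {

    /** Picks up to perAuthor documents for each author; the seed makes the choice repeatable. */
    static Map<String, List<String>> sample(Map<String, List<String>> byAuthor, int perAuthor, long seed) {
        Random rng = new Random(seed);
        Map<String, List<String>> subset = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : byAuthor.entrySet()) {
            List<String> docs = new ArrayList<>(entry.getValue()); // copy so the input is not shuffled
            Collections.shuffle(docs, rng);
            int keep = Math.min(perAuthor, docs.size());
            subset.put(entry.getKey(), new ArrayList<>(docs.subList(0, keep)));
        }
        return subset;
    }

    public static void main(String[] args) {
        Map<String, List<String>> corpus = new LinkedHashMap<>();
        corpus.put("authorA", Arrays.asList("a1.txt", "a2.txt", "a3.txt", "a4.txt"));
        corpus.put("authorB", Arrays.asList("b1.txt", "b2.txt"));
        System.out.println(sample(corpus, 3, 42L));
    }
}
```

Growing `perAuthor` step by step and timing each run would show how extraction time scales with corpus size before committing to the full set.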

dan-blanchard commented 12 years ago

Here's the entire command line log:

Look-and-Feel error!
11-22-18: Reading CumulativeFeatureDriver from /private/scratch/dblanchard/jsan/jsan_resources/feature_sets/9_features.xml
11-22-19: Reading CumulativeFeatureDriver from /private/scratch/dblanchard/jsan/jsan_resources/feature_sets/writeprints_feature_set.xml
11-22-19: Reading CumulativeFeatureDriver from /private/scratch/dblanchard/jsan/jsan_resources/feature_sets/writeprints_feature_set_limited.xml
11-22-19: Populating event drivers...
11-22-19: adding event drivers under Basic
11-22-19: adding event drivers under Part-Of-Speech
11-22-19: adding event drivers under Grams
11-22-19: adding event drivers under Dictionary
11-22-19: adding event drivers under Counters
11-22-19: adding event drivers under Readability Metrics
11-22-19: adding event drivers under Misc.
11-22-19: done!
11-22-19: Populating canonicizers...
11-22-19: done!
11-22-19: Populating event cullers...
11-22-19: done!
11-22-24: 'Load Problem Set' button clicked on the documents tab
11-22-33: Trying to load problem set from /private/scratch/dblanchard/NLI/toefl-public-jstylo.xml
11-22-34: GUI Update: update documents tab with current problem set started
11-22-44: 'Remove Author(s)' button clicked under the 'Training Corpus' section on the documents tab.
11-22-46: Removed authors:
                > English
11-22-51: Preset feature set selected in the features tab.
11-22-51: loaded preset feature set: WritePrints
11-22-53: Feature selected in the features tab: POS Tags
11-22-57: 'Remove' feature button clicked in the features tab.
11-22-58: Feature selected in the features tab: null
11-22-58: Removed feature POS Tags
11-22-59: Feature selected in the features tab: POS Bigrams
11-23-00: 'Remove' feature button clicked in the features tab.
11-23-02: Feature selected in the features tab: null
11-23-02: Removed feature POS Bigrams
11-23-03: Feature selected in the features tab: POS Trigrams
11-23-04: 'Remove' feature button clicked in the features tab.
11-23-05: Feature selected in the features tab: null
11-23-05: Removed feature POS Trigrams
11-23-09: Classifier selected in the available classifiers tree in the classifiers tab: SMO
11-23-10: 'Add' button clicked in the analysis tab.
11-23-10: Classifier tree unselected in the classifiers tab.
11-23-14: Calculate InfoGain checkbox was clicked on the analysis tab.
11-23-14: Calculate InfoGain option - unselected
11-23-16: 'Run Analysis' button clicked in the analysis tab.
11-23-17: >>> Run Analysis thread started.
11-23-17: Extracting features from training corpus...

I've done it with a very small subset of my corpus (15 documents from 3 authors) and it finishes very quickly, so I guess the issue is that the tool isn't really intended for corpora of the size we're dealing with. The 10,000 document set is already a small subset of the actual 100,000 document corpus I was originally planning on processing.