Closed (scday closed this issue 6 years ago)
Please check.
-- Anderson
Prof. Dr. Anderson Rocha
Associate Director, Institute of Computing
UNIVERSITY OF CAMPINAS, SP - BRAZIL
Digital Forensics and Machine Intelligence
http://www.ic.unicamp.br/~rocha
On Nov 15, 2017, 3:07 AM -0200, SCDay notifications@github.com, wrote:
I've had great success running everything up until the classification point. I've run the n-grams script and now when I try to complete any of the classifiers I get this error:

Traceback (most recent call last):
  File "pmsvm_pca_classifier.py", line 253, in <module>
    authors_list = filter_authors(args.source_dir_data, args.min_tweets)
  File "pmsvm_pca_classifier.py", line 114, in filter_authors
    if threshold <= int(os.path.basename(filename).split('_')[0]):
ValueError: invalid literal for int() with base 10:

This happens with each classifier. Any help would be greatly appreciated.
Hi @scday, I believe the directory you're passing to the --source-dir-data option is a directory of tweets used in previous steps, not the one generated by the --dest-dir option of ngrams_generator.py. Please double-check whether the directory passed to the --source-dir-data option of the classifiers contains subdirectories with the name pattern \
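A quick way to run that check is sketched below. It is not part of the project's scripts: the function name `find_bad_author_dirs` and the underscore separator are assumptions made for illustration. It lists every subdirectory whose name does not start with an integer before the separator, which is the part of the name the classifiers hand to int() according to the traceback.

```python
import os


def find_bad_author_dirs(data_dir, sep="_"):
    """Return subdirectory names whose prefix before `sep` is not an integer.

    Hypothetical helper: `sep` is assumed to be an underscore.
    """
    bad = []
    for name in sorted(os.listdir(data_dir)):
        # Only author folders matter here; skip plain files.
        if not os.path.isdir(os.path.join(data_dir, name)):
            continue
        prefix = name.split(sep)[0]
        try:
            int(prefix)
        except ValueError:
            # This folder name would trigger the ValueError from the traceback.
            bad.append(name)
    return bad
```

Calling `find_bad_author_dirs("my_ngrams_dir")` on the directory given to --source-dir-data should return an empty list if every author folder has a numeric prefix; any names it returns are the ones the classifiers would fail on.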
I used this command for the n-grams: ngrams_generator.py --source-dir my_input_dir --dest-dir my_output_dir --features all --debug . I changed the --dest-dir to my_ngrams_dir, which created a folder for each author (I'm using your dataset). Each author's folder contains about 11 files (see attached).
Is that not the correct item to pass in for the classification? Thanks again for your assistance.
Hi @scday . In our full pipeline (collecting data from Twitter and pre-processing the tweets), these author folders are generated with the pattern \
I believe you got this dataset from us (skipping some of the pipeline steps). We were obligated by Twitter's terms to anonymize the data, and I think this is the problem you are facing: the folders must have the pattern \
I hope this has helped you.
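To see why a folder name without a numeric prefix produces this exact error, the expression from the traceback can be tried in isolation. This is a minimal sketch, not the project's code: `leading_count` is a hypothetical helper, both folder names are invented, and the underscore separator is an assumption.

```python
def leading_count(folder_name, sep="_"):
    """Mimic the traceback's expression: the integer before the first separator.

    Hypothetical helper; `sep` is assumed to be an underscore.
    """
    return int(folder_name.split(sep)[0])


# Both folder names are made up for illustration.
for name in ["250_someuser", "f3a9c1"]:
    try:
        print(name, "->", leading_count(name))
    except ValueError as err:
        # A name with no numeric prefix raises the same kind of error
        # that the classifiers report.
        print(name, "->", err)
```

The first name parses to an integer, while the second raises a ValueError with the "invalid literal for int() with base 10" message, matching what the classifiers report.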
You have helped tremendously. The only step I did not follow was using filter_language_by_tweet. Question: if I go back, begin the process again, and include that step, will that resolve everything, or do you think it's better to just write a script that renames the files?
I suggest you begin the process again with that step involved.
Good luck.
Ok. Thank you.