theocjr / social-media-forensics

Forensic Tools for Social Media

Classification #1

Closed scday closed 6 years ago

scday commented 6 years ago

I've had great success running everything up until the classification point. I've run the n-grams script, and now when I try to run any of the classifiers I get this error: "Traceback (most recent call last): File "pmsvm_pca_classifier.py", line 253, in <module> authors_list = filter_authors(args.source_dir_data, args.min_tweets) File "pmsvm_pca_classifier.py", line 114, in filter_authors if threshold <= int(os.path.basename(filename).split('_')[0]): ValueError: invalid literal for int() with base 10:"

This happens with each classifier. Any help would be greatly appreciated.

arrocha commented 6 years ago

Please check.

-- Anderson


Prof. Dr. Anderson Rocha Associate Director, Institute of Computing UNIVERSITY OF CAMPINAS, SP - BRAZIL Digital Forensics and Machine Intelligence http://www.ic.unicamp.br/~rocha



theocjr commented 6 years ago

Hi @scday, I believe you're passing to the --source-dir-data option a directory of tweets used in previous steps, not the one generated by the --dest-dir option of ngrams_generator.py. Please double-check that the directory passed to the --source-dir-data option of the classifiers contains subdirectories with the name pattern <number of tweets>_<user id> (like 03152_1929050916).
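The check suggested above can be sketched in a few lines of Python. This is only an illustration, not part of the repository: the function name `find_badly_named_dirs` and the regular expression are assumptions based on the example folder name 03152_1929050916 and on the `int(... .split('_')[0])` call in the traceback.

```python
import os
import re

# Assumed naming pattern: <number of tweets>_<user id>, e.g. 03152_1929050916.
# Only the part before the first underscore has to be an integer.
AUTHOR_DIR_PATTERN = re.compile(r"^(\d+)_(.+)$")


def find_badly_named_dirs(source_dir_data):
    """Return the subdirectory names that do not start with '<digits>_'."""
    bad = []
    for name in sorted(os.listdir(source_dir_data)):
        path = os.path.join(source_dir_data, name)
        if os.path.isdir(path) and AUTHOR_DIR_PATTERN.match(name) is None:
            bad.append(name)
    return bad
```

If the returned list is not empty, those are the folders on which the classifiers would likely raise the ValueError shown in the first comment.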

scday commented 6 years ago

I used this command for the n-grams: ngrams_generator.py --source-dir my_input_dir --dest-dir my_output_dir --features all --debug. I changed --dest-dir to my_ngrams_dir, which created a folder for each author (I'm using your dataset). Each author folder contains about 11 files (see attached).

screenshot 2017-11-15 09 45 26

Is that not the correct item to pass in for the classification? Thanks again for your assistance.

theocjr commented 6 years ago

Hi @scday. In our full pipeline (collecting data from Twitter and pre-processing the tweets), these author folders are generated with the pattern <number of tweets>_<user id>. This "number of tweets" accounting is done in the first pre-processing step, in the filter_language_by_tweet.py code, and is used throughout the classification pipeline to filter out authors with too few messages.

I believe you got this dataset from us (skipping some of the pipeline steps). We were obligated by Twitter's terms to anonymize the data, and I think this is the problem you are facing: the folders must have the pattern <number of tweets>_<user id>, but the ones you have at hand only present an opaque text as the user id. You could write a script that renames each of these folders to this pattern. You can get the number of tweets by opening any .pkl file and taking the length of the array inside.
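A minimal sketch of such a renaming script follows. It is not code from the repository, and it rests on several assumptions: that each .pkl file can be read with Python's standard `pickle` module, that the object inside supports `len()`, and that zero-padding the count to five digits (as in the example 03152_1929050916) is acceptable. The names `count_tweets` and `rename_author_dirs` are hypothetical.

```python
import os
import pickle


def count_tweets(author_dir):
    """Return the length of the array stored in the first .pkl file found."""
    for name in sorted(os.listdir(author_dir)):
        if name.endswith(".pkl"):
            with open(os.path.join(author_dir, name), "rb") as handle:
                return len(pickle.load(handle))
    raise FileNotFoundError("no .pkl file in " + author_dir)


def rename_author_dirs(root_dir, dry_run=True):
    """Rename each author folder to '<number of tweets>_<original name>'.

    Returns the list of (old_name, new_name) pairs. With dry_run=True
    nothing is renamed, so the plan can be inspected first.
    """
    planned = []
    for name in sorted(os.listdir(root_dir)):
        old_path = os.path.join(root_dir, name)
        if not os.path.isdir(old_path):
            continue
        # Five-digit zero padding is an assumption taken from the example name.
        new_name = "{:05d}_{}".format(count_tweets(old_path), name)
        planned.append((name, new_name))
        if not dry_run:
            os.rename(old_path, os.path.join(root_dir, new_name))
    return planned
```

Calling `rename_author_dirs("my_ngrams_dir")` first with the default `dry_run=True` shows the planned names without touching anything. The script should be run only once per directory, since a second run would add another prefix. Files pickled under Python 2 may need `pickle.load(handle, encoding="latin1")`, and only pickle files from a trusted source should be loaded.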

I hope this helps.

scday commented 6 years ago

You have helped tremendously. The only step I did not follow was running filter_language_by_tweet. Question: if I go back and begin the process again with that step included, will that resolve everything, or do you think it's better to just write a script that renames the files?

theocjr commented 6 years ago

I suggest you begin the process again with that step involved.

Good luck.

scday commented 6 years ago

Ok. Thank you.