Open meehaw1337 opened 6 years ago
This code was written for Python 2, so make sure you are using the same version.
Modified email_preprocess.py for Python 3:
`#!/usr/bin/python
import pickle
import numpy

from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.model_selection import train_test_split


def preprocess(words_file="../tools/word_data.pkl", authors_file="../tools/email_authors.pkl"):
    """
        this function takes a pre-made list of email texts (by default word_data.pkl)
        and the corresponding authors (by default email_authors.pkl) and performs
        a number of preprocessing steps:
            -- splits into training/testing sets (10% testing)
            -- vectorizes into tfidf matrix
            -- selects/keeps most helpful features

        after this, the features and labels are put into numpy arrays, which play nice with sklearn functions

        4 objects are returned:
            -- training/testing features
            -- training/testing labels
    """

    ### the words (features) and authors (labels), already largely preprocessed
    ### this preprocessing will be repeated in the text learning mini-project
    ### (pickle files must be opened in binary mode under Python 3)
    with open(authors_file, "rb") as authors_file_handler:
        authors = pickle.load(authors_file_handler)

    with open(words_file, "rb") as words_file_handler:
        word_data = pickle.load(words_file_handler)

    ### test_size is the percentage of events assigned to the test set
    ### (remainder go into training)
    features_train, features_test, labels_train, labels_test = train_test_split(
        word_data, authors, test_size=0.1, random_state=42)

    ### text vectorization--go from strings to lists of numbers
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                                 stop_words='english')
    features_train_transformed = vectorizer.fit_transform(features_train)
    features_test_transformed = vectorizer.transform(features_test)

    ### feature selection, because text is super high dimensional and
    ### can be really computationally chewy as a result
    selector = SelectPercentile(f_classif, percentile=10)
    selector.fit(features_train_transformed, labels_train)
    features_train_transformed = selector.transform(features_train_transformed).toarray()
    features_test_transformed = selector.transform(features_test_transformed).toarray()

    ### info on the data
    print("no. of Chris training emails:", sum(labels_train))
    print("no. of Sara training emails:", len(labels_train) - sum(labels_train))

    return features_train_transformed, features_test_transformed, labels_train, labels_test`
Hello @Mathanraj-Sharma, I tried using your code for email_preprocess.py. However, I am still getting error(s).
Hello @Sarita19, it worked for me when I modified the code like this:
def preprocess(words_file = "./tools/word_data.pkl", authors_file="./tools/email_authors.pkl"):
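Note that this change only swaps one relative path for another, so it still depends on the directory the script is launched from. Below is a minimal sketch of a way to locate the pickle files regardless of the working directory. The helper name `find_tools_file` and the assumption that the files live in a directory called `tools` somewhere at or above the script's directory are mine, not part of the course code.

```python
from pathlib import Path


def find_tools_file(filename, start, tools_dirname="tools"):
    """Search `start` and each of its parent directories for <tools_dirname>/<filename>.

    Hypothetical helper: it assumes the data files sit in a directory named
    `tools` located at or above `start`.
    """
    start = Path(start).resolve()
    for directory in [start, *start.parents]:
        candidate = directory / tools_dirname / filename
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"{tools_dirname}/{filename} not found at or above {start}")


# Possible use inside email_preprocess.py (left as a comment, since it needs the real files):
#   here = Path(__file__).parent
#   preprocess(words_file=find_tools_file("word_data.pkl", here),
#              authors_file=find_tools_file("email_authors.pkl", here))
```

With this, both the "../tools/..." and "./tools/..." layouts resolve to the same file, because the search starts from the script's own location rather than from wherever Python was started.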
Hi,
The issue I keep having is this:
I tried the original code of email_preprocess.py, and when I run that file on its own I don't get any errors (there were just a few, and I fixed them). I also debugged it and found no issues. To be on the safe side, I also replaced it with the Python 3 version suggested earlier, and that worked with no problems whatsoever as well.
The real issue occurs when I try to run it from the nb_author_id file. Someone suggested keeping both email_preprocess and nb_author_id in the same folder, and I did. IT STILL DOESN'T WORK!
Honestly, I know that the source code is written in Python 2; however, I don't think it's smart to install Python 2 at all. It conflicts with other code projects and other libraries.
I have been trying to solve this issue for the past 3 days and I am getting really tired of it. Has anyone actually managed to make it work?
Thanks!
Hi @zuhaldanyildiz!
Looks like this repository is not maintained anymore.
Feel free to check out my fork of ud120. I refactored and ported most of the code from this repo to Python 3 and Jupyter notebooks.
Hi @trsvchn,
Thanks for the help, I'll go ahead and check it! I'm also glad it's in Jupyter Notebook. I can't even entirely interpret the data I'm dealing with in PyCharm.
Thanks again!
I am currently experiencing some difficulties using Atom to run my Python code, which otherwise works when launched through the command prompt. For those unfamiliar with Udacity's Introduction to Machine Learning, the "email preprocess" module is located in the "...\naive_bayes\tools" directory.
Code:
Whenever I run the nb_author_id.py file through the command prompt with the following command:
python2 nb_author_id.py
in the D:\Misiek\Pulpit\python\ud120-projects-master\naive_bayes directory, it works fine. But if I want to run the nb_author_id.py file through Atom (using atom-runner), I get the error message:
Any ideas why it works through the command prompt, but not through Atom?
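One possible cause, offered as a guess since the error message itself is not shown: an editor plugin may start Python from a different working directory than the command prompt does, so any relative path used to find the tools module stops resolving. The sketch below builds that path from the script's own location instead. The function name `add_tools_to_sys_path` and the default `"../tools"` layout are assumptions for illustration; adjust the relative path to wherever the tools directory actually is.

```python
import os
import sys
from pathlib import Path


def add_tools_to_sys_path(script_file, relative_tools="../tools"):
    """Add the tools directory to sys.path as an absolute path.

    The path is derived from the location of `script_file`, not from the
    current working directory, so it is the same no matter where Python
    was launched. `relative_tools` is an assumed layout, not a known one.
    """
    tools_dir = (Path(script_file).resolve().parent / relative_tools).resolve()
    tools_str = str(tools_dir)
    if tools_str not in sys.path:
        sys.path.insert(0, tools_str)
    return tools_dir


# Possible use at the top of nb_author_id.py (left as comments):
#   print("working directory:", os.getcwd())   # compare Atom vs. command prompt
#   add_tools_to_sys_path(__file__)
#   from email_preprocess import preprocess
```

Printing `os.getcwd()` in both environments is a quick way to check whether the working directory really differs before changing anything else.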