Open louismartin opened 7 years ago
Yes, true that. We could do 2 things:
This kind of email represents half of the dataset: 20860. Problem: we don't have the email addresses, only shorthand names for them. Example: mid 881 (extract):
-----Original Message-----From: Stanton, Lon Sent: Thursday, July 12, 2001 12:49 PMTo: Miller, Mary Kay; Nelson, Michel; Neubauer, Dave; McGowan, Mike W.; Fossum, Drew; Talcott, Jim; Porter, Gregory J.; Gilbert, Tom; Poock, Brian; team.waterloo plant@enron.com; Dushinske, John; Brennan, Lorna; Lehan, Tom; Shepherd, Richard; Jensen, BethSubject:
The same kind of issue applies to forwarded mails. We should remove that too! (And maybe consider using the email addresses stated there to lower their probabilities.)
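A rough sketch of how the removal could work, assuming the quoted header always looks like the mid 881 extract (newlines lost, fields in the order From / Sent / To / Subject). The names `ORIGINAL_MARKER`, `HEADER_RE` and `split_original_message` are made up for illustration, and dropping everything after the marker is an assumption, not something we have agreed on:

```python
import re

# Marker that opens the quoted header in the mid 881 extract.
ORIGINAL_MARKER = "-----Original Message-----"

# Assumed layout (newlines already lost, as in the extract):
# From: <name> Sent: <date>To: <names separated by ';'>Subject:
HEADER_RE = re.compile(
    r"From:\s*(?P<sender>.*?)\s*Sent:\s*(?P<sent>.*?)\s*To:\s*(?P<to>.*?)\s*Subject:",
    re.DOTALL,
)


def split_original_message(body):
    """Return (text before the marker, sender name, list of recipient names)."""
    before, marker, after = body.partition(ORIGINAL_MARKER)
    if not marker:
        # No quoted message: keep the body untouched.
        return body, None, []
    match = HEADER_RE.search(after)
    if match is None:
        # Marker present but header not in the assumed layout.
        return before, None, []
    recipients = [
        name.strip() for name in match.group("to").split(";") if name.strip()
    ]
    return before, match.group("sender"), recipients


if __name__ == "__main__":
    sample = (
        "Thanks."
        "-----Original Message-----From: Stanton, Lon "
        "Sent: Thursday, July 12, 2001 12:49 PM"
        "To: Miller, Mary Kay; Nelson, Michel; Jensen, Beth"
        "Subject: "
    )
    print(split_original_message(sample))
```

The recipients come back as "Last, First" strings, so we would still need a separate step to map them to real addresses.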
Some mails contain forwarded information, which is both a curse and a blessing. It is a curse because the bag of words is completely polluted with all the information, but a blessing because we have all the email addresses of the previous recipients!
Example: mid: 51172
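To collect the addresses from the forwarded part, something like the sketch below might be enough. The regex is a deliberately loose approximation of an email address (not a full RFC parser), `extract_addresses` is a made-up helper name, and the sample string is invented for the demo, it is not the content of mid 51172:

```python
import re

# Loose pattern for an email address; good enough for a first pass,
# but it will miss or truncate unusual addresses.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def extract_addresses(body):
    """Return the unique addresses found in body, lowercased, in order of appearance."""
    seen = set()
    addresses = []
    for raw in EMAIL_RE.findall(body):
        address = raw.lower()
        if address not in seen:
            seen.add(address)
            addresses.append(address)
    return addresses


if __name__ == "__main__":
    # Invented forwarded snippet, for illustration only.
    sample = (
        "Forwarded by someone To: john.doe@enron.com, "
        "Jane.Roe@enron.com cc: john.doe@enron.com"
    )
    print(extract_addresses(sample))
```

Note that an address containing a space, like `team.waterloo plant@enron.com` in the mid 881 extract, would come out truncated as `plant@enron.com` with this pattern.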