ptwobrussell / Mining-the-Social-Web-2nd-Edition

The official online compendium for Mining the Social Web, 2nd Edition (O'Reilly, 2013)
http://bit.ly/135dHfs

Chapter 6 ex2 #316

Open jlm2239 opened 7 years ago

jlm2239 commented 7 years ago

Hi, I am trying to convert the Enron data to a JSON file so that I can feed it to another program I already have working, which uses k-means to cluster the emails. I tried the code from 6.2, but I get the error "unknown string format" on the line _date = asctime(parse(_date).timetuple()). When I print _date, I get the entire message, so maybe the re.search pattern is not correct? I am not very knowledgeable about regex, but as far as I can tell I copied the code exactly. The mbox file is created, but it is empty. Any suggestions? I saw a similar issue posted about this example, but it wasn't resolved.

import re
import email
from time import asctime
import os
import sys
from dateutil.parser import parse

MAILDIR = r'C:\Users\John\Documents\Sales Management Analytics\Enron Data\enron_mail\maildir'

MBOX = r'C:\Users\John\Documents\Sales Management Analytics\Enron Data\enron.mbox'

mbox = open(MBOX,'w')

for (root, dirs, file_names) in os.walk(MAILDIR):
    if root.split(os.sep)[-1].lower() != 'inbox':
        continue

    # process each message in "inbox"
    for file_name in file_names:
        file_path = os.path.join(root, file_name)
        message_text = open(file_path).read()

        # Compute fields from the From_ line in a traditional mbox message

        _from = re.search(r"From: ([^\r]+)", message_text).groups()[0]
        _date = re.search(r"Date: ([^\r]+)", message_text).groups()[0]

        # Convert _date to the asctime representation for the From_ line
        print 'Here is the date', _date

        _date = asctime(parse(_date).timetuple())

        msg = email.message_from_string(message_text)

mxli417 commented 6 years ago

Hi,

I think I have found a workable solution. The code seems to be incomplete at a crucial point: with just \r in the character class, I get the same error as you. The pattern [^\r]+ only stops at a carriage return, so if the message files contain bare \n line endings, the match runs on past the end of the header line, which would explain why printing _date shows you the entire message. After a short look at the documentation and some googling, it turns out that replacing \r with \r\n in both patterns does the trick. I tested it, and now it's happily munching away at the data.
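To make the difference concrete, here is a minimal sketch comparing the two patterns. The sample message is made up for illustration and assumes Unix-style (\n only) line endings; it is not taken from the Enron corpus.

```python
import re

# Hypothetical sample message with LF-only line endings (no \r anywhere)
sample = (
    "Date: Mon, 14 May 2001 16:39:00 -0700\n"
    "From: alice@example.com\n"
    "Subject: hi\n"
    "\n"
    "Body text\n"
)

# [^\r]+ never meets a \r, so it swallows everything up to the end of the text
loose = re.search(r"Date: ([^\r]+)", sample).group(1)

# [^\r\n]+ stops at the first line break of either kind
strict = re.search(r"Date: ([^\r\n]+)", sample).group(1)

print(repr(loose))
print(repr(strict))
```

With the first pattern, the captured "date" contains the remaining headers and the body, which dateutil cannot parse; with the second, it is just the header value.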

Best, M.

P.S.: My tweaked version of Matthew A. Russell's glorious script:


mbox = open(MBOX, 'w')

# Walk the directories and process any folder named 'inbox'
# Set up a counter to keep track of progress
mycount = 0

for (root, dirs, file_names) in os.walk(MAILDIR):

    if root.split(os.sep)[-1].lower() != 'inbox':
        continue

    # Process each message in 'inbox'

    for file_name in file_names:
        file_path = os.path.join(root, file_name)
        message_text = open(file_path).read()

        # Compute fields for the From_ line in a traditional mbox message

        _from = re.search(r"From: ([^\r\n]+)", message_text).groups()[0]
        _date = re.search(r"Date: ([^\r\n]+)", message_text).groups()[0]

        # Convert _date to the asctime representation for the From_ line
        mycount += 1
        print("Doing message: " + str(mycount))
        _date = asctime(parse(_date).timetuple())

        msg = email.message_from_string(message_text)
        msg.set_unixfrom('From %s %s' % (_from, _date))

        mbox.write(msg.as_string(unixfrom=True) + "\n\n")

mbox.close()
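As a possible alternative that sidesteps the regex altogether, the headers can be read from the parsed message instead. This is only a sketch, not the book's approach: the helper name unix_from_line is hypothetical, it assumes Python 3.3 or later, and it swaps dateutil's parse for the standard library's email.utils.parsedate_to_datetime.

```python
import email
from email.utils import parsedate_to_datetime
from time import asctime


def unix_from_line(message_text):
    """Build an mbox-style 'From ' line from a message's own headers.

    Hypothetical helper: reads the From and Date headers via the email
    package, so line endings in the raw text no longer matter.
    """
    msg = email.message_from_string(message_text)
    sender = msg["From"]
    # Parse the RFC 2822 style date header into a datetime
    sent = parsedate_to_datetime(msg["Date"])
    return "From %s %s" % (sender, asctime(sent.timetuple()))


# Made-up sample in the style of an Enron message, LF-only line endings
sample = (
    "Date: Mon, 14 May 2001 16:39:00 -0700\n"
    "From: phillip.allen@enron.com\n"
    "Subject: test\n"
    "\n"
    "Body text\n"
)

print(unix_from_line(sample))
```

The returned string could be passed to msg.set_unixfrom() in place of the regex-derived values; messages that lack a Date or From header would still need their own handling.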