nalbanders commented 9 years ago

Hi, I found this parser very useful. I am doing a school project to analyze chat data. Thank you for contributing.

One thing I'd like to do is see the difference in response time between the root and their contact (much in the same way you calculate messages, characters, etc.).

Is there a way you would recommend adding this functionality?

nmoya commented 9 years ago

Hello @nalbanders !

Thanks for your interest. I don't have a log file right now. Could you please refresh my memory and check whether each line of a log file contains a timestamp for when the message was sent or received?

If so, a first step should be to parse this string into a timestamp structure in Python. Python provides several libraries for working with dates and times.

You can get a glimpse of what can be done in another file of mine here: https://github.com/nmoya/glaucobot/blob/master/glaucobot/datelib.py

My best suggestion is not to perform manual calculations on timestamps. Always use a well-tested library to work with dates and times.
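
A minimal sketch of that advice, assuming the log uses the `8/2/14, 12:59:24 PM` timestamp format that appears later in this thread (exports vary by locale and app version, so the format string may need adjusting). The helper names `parse_timestamp` and `seconds_between` are illustrative and not part of the parser:

```python
from datetime import datetime

# Assumed WhatsApp export format, e.g. "8/2/14, 12:59:24 PM".
# Exports differ by locale and app version, so adjust as needed.
TIMESTAMP_FORMAT = "%m/%d/%y, %I:%M:%S %p"


def parse_timestamp(text):
    """Parse one WhatsApp-style timestamp string into a datetime."""
    return datetime.strptime(text, TIMESTAMP_FORMAT)


def seconds_between(earlier, later):
    """Elapsed seconds between two timestamp strings."""
    delta = parse_timestamp(later) - parse_timestamp(earlier)
    # total_seconds() includes whole days; delta.seconds alone would not.
    return delta.total_seconds()


sent = "8/2/14, 11:59:50 PM"
reply = "8/3/14, 12:00:20 AM"
elapsed = seconds_between(sent, reply)  # the reply crosses midnight
```

Subtracting two `datetime` objects yields a `timedelta`, so midnight and month boundaries are handled by the library rather than by hand.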

I am interested in working together to add this feature if you like.

Cheers,

PS. Also, if you are getting started with computing, check this video: https://www.youtube.com/watch?v=-5wpm-gesOY

nalbanders commented 9 years ago

Hi Nikolas,

Thanks for the reply. I will try to work it in as you suggest.

It's actually a fun project I am working on, and your script is helping a lot. I am a student at MIT. I'd be happy to have a call to discuss the project and see if you'd have any interest in working together. I am doing two different studies interpreting relationships based on conversation data. I am always looking to connect with people who are interested and as skilled as yourself.

Here is my LinkedIn profile https://www.linkedin.com/pub/armen-nalband/13/65/656


nmoya commented 9 years ago

Hello @nalbanders ,

Sure! Let's schedule a call and discuss the project further. Are you available on Wednesday? My Skype/Hangout is nikolasmoya.

I also sent you a connection invitation on LinkedIn.

nalbanders commented 9 years ago

Great, how about Wednesday at 5:30 EST? I am in Boston. What city are you in?

nmoya commented 9 years ago

I am in Curitiba (BRT). Let's try a little later, after work. How about sometime in [17, ..., 21h] EST?

nalbanders commented 9 years ago

Sorry, I meant 5:30PM (17:30). I believe you are one hour ahead so that would be 18:30 your time. Does that work?

If not, we can do 6:30 PM EST (18:30).


nmoya commented 9 years ago

Oh, alright then. 5:30 PM EST is great for me.

nalbanders commented 9 years ago

Ok, I will call you then tomorrow on Skype. I just sent you a contact request.

Looking forward to it, Armen


nalbanders commented 9 years ago

Wednesday*


nalbanders commented 9 years ago

Hey, some context for our call tomorrow.

Attached is the main.py file that I modified.

My goal is to understand the user's relationships based on communication data (to be able to predict who they care about most/least). Here are some graphs I generated with the script. Attached is an output CSV I am building that I will use for regression analysis (logistic, CART, Random Forest) in R. [image: Inline image 3] [image: Inline image 1]
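
One way the output CSV described above could be built, as a sketch only. It is written for Python 3, the column names are taken from the `fieldnames` list in the attached script, and `proportions_to_row`, `write_rows`, and the sample counts are hypothetical. It assumes a nested `{metric: {sender: count}}` dict like the one `message_proportions()` returns:

```python
import csv
import io

FIELDNAMES = ["msgs_root", "msgs_contact", "chars_root", "chars_contact",
              "qmarks_root", "qmarks_contact"]


def proportions_to_row(counter, root="ROOT", contact="CONTACT"):
    """Flatten a nested {metric: {sender: count}} dict into one CSV row."""
    row = {}
    for metric, prefix in (("messages", "msgs"), ("chars", "chars"),
                           ("qmarks", "qmarks")):
        row[prefix + "_root"] = counter[metric].get(root, 0)
        row[prefix + "_contact"] = counter[metric].get(contact, 0)
    return row


def write_rows(handle, rows):
    """Write a header plus one line per chat to an open text file."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


# Made-up counts for a single chat, for illustration only.
sample = {
    "messages": {"ROOT": 120, "CONTACT": 95},
    "chars": {"ROOT": 4800, "CONTACT": 3100},
    "qmarks": {"ROOT": 14, "CONTACT": 22},
}
buffer = io.StringIO()
write_rows(buffer, [proportions_to_row(sample)])
csv_text = buffer.getvalue()
```

One row per chat keeps the file easy to load as a data frame in R. To write to disk instead of a buffer, pass a file opened with `open(path, "w", newline="")`.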

On a separate note, I have other development projects going on, always open for skilled people like yourself to get involved if you find yourself interested.

Here is a wireframe of an app I am creating. We can chat about it separately. https://www.justinmind.com/usernote/tests/14265484/14740364/14740366/index.html


nalbanders commented 9 years ago

Attachment


from __future__ import division
from datetime import datetime
import codecs
import date  # local helper module from the whatsapp-parser repository
import re
import operator
import sys
import json
import csv

import numpy

from pprint import pprint


class Chat():

    def __init__(self, filename):
        self.filename = filename
        self.raw_messages = []

        self.datelist = []
        self.timelist = []
        self.senderlist = []
        self.messagelist = []
        self.chatTimeList = []
        self.rootResponseTimeList = []
        self.contactResponseTimeList = []
        self.rootBurstList = []
        self.contactBurstList = []

    def open_file(self):
        arq = codecs.open(self.filename, "r", "utf-8-sig")
        content = arq.read()
        arq.close()
        lines = content.split("\n")
        # Drop lines that hold a single character (e.g. a lone "\r").
        lines = [l for l in lines if len(l) != 1]
        for l in lines:
            self.raw_messages.append(l.encode("utf-8"))

    def feed_lists(self):
        for l in self.raw_messages:
            msg_date, sep, msg = l.partition(": ")
            raw_date, sep, time = msg_date.partition(" ")
            sender, sep, message = msg.partition(": ")
            raw_date = raw_date.replace(",", "")
            if message:
                self.datelist.append(raw_date)
                self.timelist.append(time)  # raw time string, e.g. "12:59:24 PM"
                # Everything before the third colon is the date and time,
                # e.g. "8/2/14, 12:59:24 PM".
                colonIndex = [x.start() for x in re.finditer(':', l)]
                chatTimeString = l[0:colonIndex[2]]
                chatTime = datetime.strptime(chatTimeString, "%m/%d/%y, %I:%M:%S %p")
                self.chatTimeList.append(chatTime)
                self.senderlist.append(sender)
                self.messagelist.append(message)
            elif self.messagelist:
                # Continuation of a multi-line message: keep it with the
                # previous message so all lists stay aligned by index.
                self.messagelist[-1] += "\n" + l

        t0 = self.chatTimeList[0]
        senderIndex = 1  # index of the message whose timestamp is t1
        burstCount = 1   # number of messages in a row sent by the same sender

        rootName = "ROOT"
        contactName = "CONTACT"

        # Operations that depend on consecutive messages (response time, bursts).
        for t1 in self.chatTimeList[1:]:
            dt = t1 - t0
            sender = self.senderlist[senderIndex]
            previous = self.senderlist[senderIndex - 1]
            if sender != previous:
                # Sender changed: the response time belongs to the new sender,
                # the burst that just ended belongs to the previous one.
                responseTime = int(dt.total_seconds())  # dt.seconds drops whole days
                print "sender changed: %s" % sender
                print "response time: %d\n" % responseTime
                if sender == rootName:
                    self.rootResponseTimeList.append(responseTime)
                    self.contactBurstList.append(burstCount)
                elif sender == contactName:
                    self.contactResponseTimeList.append(responseTime)
                    self.rootBurstList.append(burstCount)
                else:
                    sys.exit("ERROR CHANGE NAMES IN CHAT TO ROOT AND CONTACT\n")
                burstCount = 1
            else:
                burstCount += 1  # accumulate the number of messages sent in a row
                print "repeat sender: %d %s\n" % (burstCount, sender)

            t0 = t1
            senderIndex += 1

    def print_history(self, end=0):
        if end == 0:
            end = len(self.messagelist)
        for i in range(len(self.messagelist[:end])):
            print self.datelist[i], self.timelist[i],\
                self.senderlist[i], self.messagelist[i]

    def get_senders(self):
        senders_set = set(self.senderlist)
        return [e for e in senders_set]

    def count_messages_per_weekday(self):
        counter = dict()
        for i in range(len(self.datelist)):
            month, day, year = self.datelist[i].split("/")  # AN edited date order
            parsed_date = "%s-%s-%s" % (year, month, day)
            weekday = date.date_to_weekday(parsed_date)
            if weekday not in counter:
                counter[weekday] = 1
            else:
                counter[weekday] += 1
        return counter

    def count_messages_per_shift(self):
        shifts = {
            "latenight": 0,
            "morning": 0,
            "afternoon": 0,
            "evening": 0
        }
        for i in range(len(self.chatTimeList)):
            # Use the parsed 24-hour value; the raw string is 12-hour (AM/PM).
            hour = self.chatTimeList[i].hour
            if hour >= 0 and hour <= 6:
                shifts["latenight"] += 1

            elif hour > 6 and hour <= 11:
                shifts["morning"] += 1

            elif hour > 11 and hour <= 17:
                shifts["afternoon"] += 1

            elif hour > 17 and hour <= 23:
                shifts["evening"] += 1
        return shifts

    def count_messages_pattern(self, patternlist):
        counters = dict()
        pattern_dict = dict()
        senders = self.get_senders()
        for pattern in patternlist:
            counters[pattern] = dict()
            for s in senders:
                counters[pattern][s] = 0
            # Compile each pattern as a literal, case-insensitive regex.
            pattern_dict[pattern] = re.compile(re.escape(pattern), re.I)
        for i in range(len(self.messagelist)):
            for pattern in patternlist:
                search_result = pattern_dict[pattern].\
                    findall(self.messagelist[i])
                length = len(search_result)
                if length > 0:
                    if pattern not in counters:
                        counters[pattern][self.senderlist[i]] = length
                    else:
                        counters[pattern][self.senderlist[i]] += length
        return counters

    def print_patterns_dict(self, pattern_dict):
        for pattern in pattern_dict:
            print pattern
            for s in pattern_dict[pattern]:
                print s, ": ", pattern_dict[pattern][s]
            print ""

    def message_proportions(self):
        senders = self.get_senders()
        counter = dict()
        for i in ["messages", "words", "chars", "qmarks", "media"]:
            counter[i] = dict()
            for s in senders:
                counter[i][s] = 0
        for i in range(len(self.senderlist)):
            counter["messages"][self.senderlist[i]] += 1
            counter["words"][self.senderlist[i]] += \
                len(self.messagelist[i].split(" "))
            counter["chars"][self.senderlist[i]] += len(self.messagelist[i])
            counter["qmarks"][self.senderlist[i]] += self.messagelist[i].count('?')
            counter["media"][self.senderlist[i]] += (
                self.messagelist[i].count('<media omitted>') +
                self.messagelist[i].count('<image omitted>') +
                self.messagelist[i].count('<audio omitted>'))
        counter["total_messages"] = 0
        counter["total_words"] = 0
        counter["total_chars"] = 0
        counter["total_qmarks"] = 0
        counter["total_media"] = 0

        for s in senders:
            counter["total_messages"] += counter["messages"][s]
            counter["total_words"] += counter["words"][s]
            counter["total_chars"] += counter["chars"][s]
            counter["total_qmarks"] += counter["qmarks"][s]
            counter["total_media"] += counter["media"][s]
        return counter

    def average_message_length(self):
        msg_prop = self.message_proportions()
        counter = dict()
        for s in self.get_senders():
            counter[s] = msg_prop["words"][s] / msg_prop["messages"][s]
        return counter

    def most_used_words(self, top=10, threshold=3):
        words = dict()
        for i in range(len(self.messagelist)):
            message_word = self.messagelist[i].split(" ")
            for w in message_word:
                if len(w) > threshold:
                    w = w.decode("utf8")
                    w = w.replace("\r", "")
                    w = w.lower()
                    if w not in words:
                        words[w] = 1
                    else:
                        words[w] += 1
        sorted_words = sorted(words.iteritems(), key=operator.itemgetter(1),
                              reverse=True)
        output = sorted_words[:top]
        return output


def printDict(dic, parent, depth):
    tup = sorted(dic.iteritems(), key=operator.itemgetter(1))
    isLeaf = True
    for key in tup:
        if isinstance(dic[key[0]], dict):
            isLeaf = False
    if isLeaf and depth != 0:
        print " " * (depth - 1) * 2, parent
    for key in tup:
        if isinstance(dic[key[0]], dict):
            printDict(dic[key[0]], key[0], depth + 1)
        else:
            print " " * depth * 2, str(key[0]), "->", dic[key[0]]


def main():
    if len(sys.argv) < 2:
        print "Run: python main.py [regex. patterns]"
        sys.exit(1)
    c = Chat(sys.argv[1])
    c.open_file()
    c.feed_lists()
    output = dict()

    print "\n--PROPORTIONS"
    output["proportions"] = c.message_proportions()
    printDict(output["proportions"], "proportions", 0)

    print "\n--SHIFTS"
    output["shifts"] = c.count_messages_per_shift()
    printDict(output["shifts"], "shifts", 0)

    print "\n--WEEKDAY"
    output["weekdays"] = c.count_messages_per_weekday()
    printDict(output["weekdays"], "weekday", 0)

    print "\n--AVERAGE MESSAGE LENGTH"
    output["lengths"] = c.average_message_length()
    printDict(output["lengths"], "lengths", 0)

    print "\n--PATTERNS"
    output["patterns"] = c.count_messages_pattern(sys.argv[2:])
    printDict(output["patterns"], "patterns", 0)

    print "\n--TOP 15 MOST USED WORDS (length > 3)"
    output["most_used_words"] = c.most_used_words(top=15, threshold=3)
    output["most_used_words"] = sorted(output["most_used_words"],
                                       key=operator.itemgetter(1), reverse=True)

    print "TIMESTAMPS\n %s\n\n" % c.chatTimeList[0:4]
    print "Root Response time sample \n %s...\n" % c.rootResponseTimeList[0:4]
    print "Contact Response time sample \n %s...\n" % c.contactResponseTimeList[0:4]
    print "Root bursts \n %s\n" % c.rootBurstList
    print "Contact bursts \n %s\n" % c.contactBurstList

    # Chat has no responseTimeList attribute, so pool both per-sender lists.
    allResponseTimes = c.rootResponseTimeList + c.contactResponseTimeList
    print "Median response time =%s\n\n" % (numpy.median(allResponseTimes))

    output["senders"] = c.get_senders()
    nameTest = sys.argv[1]
    arq = open("C:/Python27/" + nameTest + ".json", "w")
    arq.write(json.dumps(output))
    pprint(output)
    arq.close()

    # In Python 2 the csv module expects the file opened in binary mode.
    with open('names.csv', 'wb') as csvfile:
        fieldnames = ['msgs_root', 'msgs_contact', 'chars_root',
                      'chars_contact', 'qmarks_root', 'qmarks_contact']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        # writer.writerow({'msgs_root': c.message_proportions , 'last_name': 'Beans'})


if __name__ == "__main__":
    main()
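
The response-time and burst bookkeeping in the script above can also be sketched as a standalone function, which is easier to test in isolation. This is an illustration rather than the parser's own API: `response_times_and_bursts` and `median` are hypothetical names, and the input is assumed to be a chronological list of `(sender, datetime)` pairs. A response time is credited to the sender who replies, and a burst to the sender who wrote it.

```python
from datetime import datetime


def response_times_and_bursts(messages):
    """messages: list of (sender, datetime) tuples in chronological order.

    Returns (response_times, bursts), two dicts keyed by sender.
    A response time is the gap in seconds between the last message of
    one sender and the first reply of the other; it is credited to the
    replier. A burst is a run of consecutive messages by one sender.
    """
    response_times = {}
    bursts = {}
    if not messages:
        return response_times, bursts

    prev_sender, prev_time = messages[0]
    burst_count = 1
    for sender, sent_at in messages[1:]:
        if sender != prev_sender:
            # The finished burst belongs to the previous sender.
            bursts.setdefault(prev_sender, []).append(burst_count)
            gap = (sent_at - prev_time).total_seconds()
            response_times.setdefault(sender, []).append(gap)
            burst_count = 1
        else:
            burst_count += 1
        prev_sender, prev_time = sender, sent_at
    # Record the burst that is still open when the log ends.
    bursts.setdefault(prev_sender, []).append(burst_count)
    return response_times, bursts


def median(values):
    """Median of a non-empty list of numbers."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2.0


log = [
    ("ROOT", datetime(2014, 8, 2, 12, 0, 0)),
    ("ROOT", datetime(2014, 8, 2, 12, 0, 10)),
    ("CONTACT", datetime(2014, 8, 2, 12, 1, 10)),
    ("ROOT", datetime(2014, 8, 2, 12, 1, 40)),
]
times, bursts = response_times_and_bursts(log)
```

Keying the results by sender name avoids hard-coding ROOT and CONTACT, so the same function would work on chats that have not been renamed.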

nmoya commented 9 years ago

Hello @nalbanders !

I will add you as a contributor to the repository so that you have write access. Could you please commit your adapted main file under a different name?

Also, something came up tomorrow at 6:30 PM EST. Do you mind moving our call to 4:30 PM EST or 5 PM EST? If that is not possible, it's alright, but I will need to leave at 6:15 PM EST, and we can schedule another call if 45 minutes are not enough. I will be on Skype tomorrow afternoon, so if you arrive earlier we can start earlier; otherwise we keep the original schedule :-)

Also, your graphs did not show up. I was looking forward to seeing them! :( Great job on the modifications to the main file!

nalbanders commented 9 years ago

Ok, will try to call at 5 instead.

Will push to git tomorrow.

Thanks, A

nmoya / whatsapp-parser

Is there a way to calculate response times #2
