tmozgach / ent_ob

Entrepreneur’s online behavior

1. Data gathering: crawl data from Reddit. #1

Open tmozgach opened 6 years ago

tmozgach commented 6 years ago

*

neowangkkk commented 6 years ago

Glad to form a work team here. The first task is to crawl more data from the target community: https://www.reddit.com/r/Entrepreneur/

The time period we want to crawl is 2012-01-01 to 2017-12-31.

The required variables are shown in the attachment; if you have any questions, please let me know. (Screenshot: screen shot 2018-01-09 at 1 22 38 pm)

Sample.xlsx

tmozgach commented 6 years ago

@neowangkkk I haven't done this before, so to save time and stay consistent, could you please share the script or the method/web link you used before? Is it something like this? https://www.labnol.org/internet/web-scraping-reddit/28369/

neowangkkk commented 6 years ago

@tmozgach Last time I paid $150 to hire a part-time programmer to crawl the data. He told me he wrote the program in C++. I don't think he will give me the code :-(

I am not sure about the difficulty of web crawling on reddit.com. Can you please search and check whether any Python package or something else can do it? If it is still a problem after 20 work hours, we may go to the part-time programmer again. I understand some websites use a lot of tricks to prevent people from crawling their content; it may be a huge task that only people with years of crawling experience can handle. But it is worth learning and trying while our time still allows.

The Google Sheets method in your link may have some flaws. It says a subreddit can only show 1,000 posts, but our last crawl got over 25,000 threads for 18 months.

If you have any questions, please feel free to let me know.

tmozgach commented 6 years ago

PRAW API

Install pip and PRAW without root on Linux: https://gist.github.com/saurabhshri/46e4069164b87a708b39d947e4527298

curl -L https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python get-pip.py --user
python -m pip install --user praw

For macOS:

https://gist.github.com/haircut/14705555d58432a5f01f9188006a04ed

Reddit video tutorials: https://www.youtube.com/watch?v=NRgfgtzIhBQ

Documentation: http://praw.readthedocs.io/en/latest/getting_started/quick_start.html

The user_agent parameter

According to Reddit, the user agent is required to identify the script accessing Reddit: its name, version, and author (that is what should go in there, e.g. "prawtutorial v1.0 by /u/sentdex" or similar). Here it is simply set to 'Archival Bot':

r = praw.Reddit(client_id = '***',
                client_secret = '***',
                user_agent = 'Archival Bot')
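For reference, here is a minimal sketch of how this client could be used to pull submissions and their comments from r/Entrepreneur; the collected fields are only meant to mirror the variables in the sample sheet, and this is an illustration rather than the final crawler (note that Reddit listings are capped at roughly 1,000 items, so this alone cannot cover 2012-2017):

import datetime

for submission in r.subreddit('Entrepreneur').new(limit=None):
    created = datetime.datetime.utcfromtimestamp(submission.created_utc)
    row = {
        'Title': submission.title,
        'Post': submission.selftext,
        'Sender': str(submission.author),
        'Date': created.isoformat(),
        'Score': submission.score,
    }
    # flatten "load more comments" placeholders, then collect the comment bodies
    submission.comments.replace_more(limit=0)
    comments = [c.body for c in submission.comments.list()]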

Things to try:

tmozgach commented 6 years ago

Apparently, all Reddit comments are stored in BigQuery, and we can pull them with SQL. Example: Using BigQuery with Reddit Data https://pushshift.io/using-bigquery-with-reddit-data/ https://bigquery.cloud.google.com/table/fh-bigquery:reddit_posts.2017_12

Retrieving Data From Google BigQuery (Reddit Relevant XKCD) http://blog.reddolution.com/Life/index.php/2017/05/retrieving-data-from-google-bigquery-reddit-relevant-xkcd/ https://stackoverflow.com/questions/18493533/how-to-download-all-data-in-a-google-bigquery-dataset

Also, it seems we can do NLP analysis using the Google NLP API. Example: Machine learning, NLP Google APIs https://hackernoon.com/how-people-talk-about-marijuana-on-reddit-a-natural-language-analysis-a8d595882a7a
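For reference, a minimal sketch of running such a query from Python and getting a DataFrame back; it assumes the google-cloud-bigquery and pandas packages and a configured GCP project, and the column names follow the public fh-bigquery Reddit tables, so they should be double-checked against the table schema:

from google.cloud import bigquery

client = bigquery.Client()  # uses the default GCP project and credentials

sql = """
SELECT created_utc, author, title, selftext, score, num_comments
FROM `fh-bigquery.reddit_posts.2017_12`
WHERE subreddit = 'Entrepreneur'
"""

df = client.query(sql).to_dataframe()  # result as a pandas DataFrame
df.to_csv("entrepreneur_2017_12.csv", index=False)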

neowangkkk commented 6 years ago

Wow, that's great! Very promising. Can we output the query result into an R or Python data format?

tmozgach commented 6 years ago

@neowangkkk There probably is a way, but I ran into another issue: BigQuery stores comments and POSTS for 2016 and 2017, but there are NO posts for 2012-2015, only comments! That is, there are no main posts or titles (from the sender), only the comments (replies). For now I have crawled the data for 2012-2015, without karma and other user information, using the following script: https://github.com/peoplma/subredditarchive/blob/master/subredditarchive.py

tmozgach commented 6 years ago

@neowangkkk Could you provide the attributes that you need JUST for Topic Modeling?

tmozgach commented 6 years ago

Parse JSON to CSV

Pretty-print the JSON file:

python -m json.tool my_json.json

Useful links (weren't used): http://blog.appliedinformaticsinc.com/how-to-parse-and-convert-json-to-csv-using-python/ https://www.dataquest.io/blog/python-json-tutorial/ https://medium.com/@gis10kwo/converting-nested-json-data-to-csv-using-python-pandas-dc6eddc69175

The parser is ready (it uses regular expressions): https://github.com/tmozgach/ent_ob/blob/master/jsonTOcsvTopicModel.py
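For illustration, a minimal sketch of the same JSON-to-CSV step without regular expressions; it assumes each crawled thread is a JSON object with 'title', 'selftext', and a list of comments each carrying a 'body' field, which may differ from the actual structure produced by subredditarchive.py:

import csv
import json

with open("my_json.json") as f:
    threads = json.load(f)

with open("threads.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["Title", "Post", "Comments"])
    for t in threads:
        comments = " ".join(c.get("body", "") for c in t.get("comments", []))
        writer.writerow([t.get("title", ""), t.get("selftext", ""), comments])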

tmozgach commented 6 years ago

Raw data for 2012-2017: https://drive.google.com/open?id=1S8iAnnyjH4NxZpXmkX0ppN9c_Hm6XJLQ

tmozgach commented 6 years ago

@neowangkkk Files:
TPostComRaw.csv: uncleaned titles, main posts, and comments for 2012-2017.
TPostRaw.csv: uncleaned titles and main posts WITHOUT comments for 2012-2017.
https://drive.google.com/drive/folders/1S8iAnnyjH4NxZpXmkX0ppN9c_Hm6XJLQ?usp=sharing

tmozgach commented 6 years ago

Should I join the comments, the post, and the title into ONE paragraph?

neowangkkk commented 6 years ago

Yes. As discussed last time, combining all the texts in one thread may produce a better outcome in clustering/topic modelling. Please go ahead and try it.

In addition, I checked the old karma data. It is a fixed value for each individual at the time of crawling; we can't get karma as it changes over time. So if you can later get the karma data for the participants in your investigated period, that would be fine.


tmozgach commented 6 years ago

Every row of that CSV is one thread (title, post, comments).

tmozgach commented 6 years ago

Raw, not formatted: 2009_2011data.csv.zip

tmozgach commented 6 years ago

All data for 2009-2017:

https://www.dropbox.com/s/tlq6gfnnlnqvumx/data0312.csv?dl=0

tmozgach commented 6 years ago

Data for Topic Modeling 2009-2017: https://drive.google.com/drive/folders/1S8iAnnyjH4NxZpXmkX0ppN9c_Hm6XJLQ?usp=sharing newRawAllData.csv

tmozgach commented 6 years ago

Replace NA in Title with the previous Title (R):

library(tidyverse)
library(zoo)
library(dplyr)

# read the full data set and drop rows whose Date does not parse
myDataf = read_delim("/home/tatyana/Downloads/data_full.csv", delim = ',' )
myDataff = myDataf[!is.na(strptime(myDataf$Date, format = "%Y-%m-%d %H:%M:%S")),]

# Some titles duplicate other ones; titles are not unique
myDataff$Title <- make.unique(as.character(myDataff$Title), sep = "___-___")

# make.unique also turns NA into unique values by adding a number; transform them back to NA
myDataff$Title <- gsub("NA__+", NA, myDataff$Title)

# replace NA titles with the previous Title (last observation carried forward)
myDataff$Title2 <- (data.frame(col1 = myDataff$Title, col2 = myDataff$Conversation) %>%
  do(na.locf(.)))$col1
write_csv(myDataff, "data_full_title2.csv")

newDff = data.frame(col1 = myDataff$Title, col2 = myDataff$Conversation) %>%
  do(na.locf(.))
write_csv(newDff, "dataForPyth.csv")

Merge all comments and the title into one document/row.

Python:

import csv
import pandas as pd

# Collapse consecutive rows that share the same title into one document:
# the title plus all of its comments concatenated into a single string.
tit = ""
com = ""
rows_list = []
title_list = []
with open("/home/tatyana/dataForPyth.csv", "rt") as f:
    reader = csv.reader(f)
    for i, line in enumerate(reader):
        if i == 0:
            # skip the header row
            continue
        if i == 1:
            # start the first document
            title_list.append(line[0])
            tit = line[0]
            com = line[0] + " " + line[1]
            continue

        if line[0] == tit:
            # same thread: append the comment
            com = com + " " + line[1]
        else:
            # new thread: store the finished document and start the next one
            rows_list.append(com)
            tit = line[0]
            title_list.append(line[0])
            com = line[0] + " " + line[1]

# store the last document
rows_list.append(com)

df = pd.DataFrame(rows_list)
se = pd.Series(title_list)
df['Topic'] = se.values

df.to_csv("newRawAllData.csv", index=False, header=False)

Topic modeling and labeling.
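(No topic-modeling code appears in this thread; below is a minimal sketch of one way to get a per-thread topic label with gensim LDA from newRawAllData.csv. The package choice, number of topics, and output columns are assumptions; the actual run that produced LabeledTopic.csv may have been done differently.)

import pandas as pd
from gensim import corpora, models
from gensim.utils import simple_preprocess

# newRawAllData.csv was written without a header: column 0 = merged text, column 1 = title
data = pd.read_csv("newRawAllData.csv", header=None, names=["text", "title"])

texts = [simple_preprocess(str(doc)) for doc in data["text"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=10,
                      passes=5, random_state=1)

# keep the most probable topic and its probability for each thread
main = []
for bow in corpus:
    topics = lda.get_document_topics(bow)
    main.append(max(topics, key=lambda tp: tp[1]) if topics else (None, 0.0))
data["Main Topic"] = [t for t, p in main]
data["Main Probability"] = [p for t, p in main]
data.to_csv("LabeledTopic_sketch.csv", index=False)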

Merge the labeling output with data_full_title2.csv (R):


myLabDataf = read_delim("/home/tatyana/nlp/LabeledTopic.csv", delim = ',' )

# 8 threads had some issues and weren't merged
newm = merge(myDataff, myLabDataf, by.x = 'Title2', by.y = 'title')

fin = select(newm, Date, Sender, Title2, Replier, Conversation,
             `Points from this question`, `Post Karma`, `Comment Karma`,
             `Date joining the forum`, `Category Label`,
             `Topic/Probability`, `Main Topic`, `Main Probability`)

write_csv(fin, "final.csv")

tmozgach commented 6 years ago

Latest data: https://www.dropbox.com/s/50vkf5makcojd5w/data_full.csv?dl=0