tmozgach opened this issue 6 years ago
Glad to form a work team here. The first task is to crawl more data from the target community: https://www.reddit.com/r/Entrepreneur/
The time period we want to crawl is 2012-01-01 to 2017-12-31.
The required variables are shown in the attachment. If you have any questions, please let me know.
@neowangkkk I haven't done this before, so to save time and stay consistent, could you please share the script or a method/web link describing how you did it before? Is it something like this? https://www.labnol.org/internet/web-scraping-reddit/28369/
@tmozgach Last time I paid $150 to hire a part-time programmer to crawl the data. He told me he wrote the program in C++. I don't believe he will give me the code :-(
I am not sure about the difficulty level of crawling reddit.com. Can you please search and check whether any Python package or something else can do it? If it is still a problem after 20 work hours, we may go to the part-time programmer again. I understand some websites use a lot of tricks to prevent people from crawling their content. It may be a huge task that only people with years of crawling experience can handle. But it is worth learning and trying while our time still allows.
The Google Sheets method in your link may have some flaws. It says a subreddit can only show its latest 1,000 posts, but our last crawl got over 25,000 threads for 18 months.
If you have any questions, please feel free to let me know.
PRAW API
Install pip and PRAW without root.
For Linux: https://gist.github.com/saurabhshri/46e4069164b87a708b39d947e4527298
curl -L https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python get-pip.py --user
python -m pip install --user praw
For MAC:
https://gist.github.com/haircut/14705555d58432a5f01f9188006a04ed
Reddit video tutorials: https://www.youtube.com/watch?v=NRgfgtzIhBQ
Documentation: http://praw.readthedocs.io/en/latest/getting_started/quick_start.html
What goes in the user_agent parameter? 'Archival Bot'.
According to Reddit, the user agent is required to give basic information about the script accessing Reddit, such as its name, version, and author (that is what is supposed to go in there, e.g. "prawtutorial v1.0 by /u/sentdex" or similar).
import praw

r = praw.Reddit(client_id='***',
                client_secret='***',
                user_agent='Archival Bot')
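For reference, a minimal sketch of crawling r/Entrepreneur submissions and their comments, reusing the r instance above; the field choices are assumptions based on the variables discussed in this thread, and the ~1,000-item listing limit mentioned earlier still applies here:

# Sketch: pull recent submissions and flatten their comment trees.
rows = []
for submission in r.subreddit('Entrepreneur').new(limit=1000):
    # Replace the "load more comments" stubs so .list() returns every comment.
    submission.comments.replace_more(limit=0)
    rows.append({
        'date': submission.created_utc,   # Unix timestamp (UTC)
        'sender': str(submission.author),
        'title': submission.title,
        'post': submission.selftext,
        'comments': [c.body for c in submission.comments.list()],
    })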
Things to try:
Apparently, all Reddit comments are available in BigQuery, and we can pull them with SQL.
Example:
Using BigQuery with Reddit Data
https://pushshift.io/using-bigquery-with-reddit-data/
https://bigquery.cloud.google.com/table/fh-bigquery:reddit_posts.2017_12
Retrieving Data From Google BigQuery (Reddit Relevant XKCD): http://blog.reddolution.com/Life/index.php/2017/05/retrieving-data-from-google-bigquery-reddit-relevant-xkcd/
https://stackoverflow.com/questions/18493533/how-to-download-all-data-in-a-google-bigquery-dataset
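A minimal sketch of pulling r/Entrepreneur posts from the table linked above with the google-cloud-bigquery client; the column names follow the pushshift/fh-bigquery schema and should be verified in the BigQuery console, and the output file name is just an example:

from google.cloud import bigquery  # requires google-cloud-bigquery and GCP credentials

client = bigquery.Client()

# Assumed column names; check them against the table schema before running.
query = """
SELECT created_utc, author, title, selftext, score
FROM `fh-bigquery.reddit_posts.2017_12`
WHERE subreddit = 'Entrepreneur'
"""

df = client.query(query).to_dataframe()  # needs pandas installed
df.to_csv("entrepreneur_2017_12.csv", index=False)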
Also, it seems we can do NLP analysis using the Google NLP API. Example (machine learning, NLP, Google APIs): https://hackernoon.com/how-people-talk-about-marijuana-on-reddit-a-natural-language-analysis-a8d595882a7a
Wow, that's great! Very promising. Can we output the query result into an R or Python data format?
@neowangkkk Probably there is a way, but I ran into another issue: BigQuery stores comments and POSTS for 2016 and 2017, but there are NO posts for 2012-2015, only comments! I mean, there are no main posts/titles (from the sender), only the comments (replies). For now I have crawled the data for 2012-2015, without karma and other user information, using the following script: https://github.com/peoplma/subredditarchive/blob/master/subredditarchive.py
@neowangkkk Could you provide the attributes that you need JUST for Topic Modeling?
Parse JSON to CSV.
Pretty-print the JSON file:
python -m json.tool my_json.json
Useful links (weren't used):
http://blog.appliedinformaticsinc.com/how-to-parse-and-convert-json-to-csv-using-python/
https://www.dataquest.io/blog/python-json-tutorial/
https://medium.com/@gis10kwo/converting-nested-json-data-to-csv-using-python-pandas-dc6eddc69175
Parser is ready (using regular expression): https://github.com/tmozgach/ent_ob/blob/master/jsonTOcsvTopicModel.py
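For reference, a generic pandas sketch of the JSON-to-CSV flattening described in those links (the regex parser linked above is what was actually used; the field structure here is hypothetical):

import json
import pandas as pd

# Hypothetical input; adjust to the actual structure of my_json.json.
with open("my_json.json") as f:
    data = json.load(f)

# json_normalize flattens nested JSON records into a flat table.
df = pd.json_normalize(data, sep="_")
df.to_csv("my_json.csv", index=False)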
Raw data for 2012-2017: https://drive.google.com/open?id=1S8iAnnyjH4NxZpXmkX0ppN9c_Hm6XJLQ
@neowangkkk Files:
TPostComRaw.csv: uncleaned titles, main posts, and comments for 2012-2017.
TPostRaw.csv: uncleaned titles and main posts WITHOUT comments for 2012-2017.
https://drive.google.com/drive/folders/1S8iAnnyjH4NxZpXmkX0ppN9c_Hm6XJLQ?usp=sharing
Probably I need to join the comments, post, and title into ONE paragraph?
Yes. As discussed last time, combining all the texts in one thread may generate a better outcome in clustering/topic modelling. Please go ahead and try it.
In addition, I checked the old karma data. It is a fixed value for each individual at the point of crawling; we can't get the changing karma through time. So later, if you can get the karma data for the participants in your investigated period, that would be fine.
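A minimal PRAW sketch for pulling each participant's karma at crawl time; it returns the current values, not historical ones, and the username list here is hypothetical:

import praw

reddit = praw.Reddit(client_id='***', client_secret='***',
                     user_agent='Archival Bot')

# Hypothetical list of participants collected during the crawl.
usernames = ['example_user1', 'example_user2']

for name in usernames:
    redditor = reddit.redditor(name)
    # Current karma values and account creation date; deleted or suspended
    # accounts will raise an exception, so wrap this in try/except for real runs.
    print(name, redditor.link_karma, redditor.comment_karma, redditor.created_utc)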
Every row of that csv is one thread (title, post, comments).
Raw, not formatted: 2009_2011data.csv.zip
All data for 2009-2017:
Data for Topic Modeling 2009-2017 (newRawAllData.csv): https://drive.google.com/drive/folders/1S8iAnnyjH4NxZpXmkX0ppN9c_Hm6XJLQ?usp=sharing
Replace NA in Title with the previous Title (R):
library(tidyverse)
library(zoo)
library(dplyr)
myDataf = read_delim("/home/tatyana/Downloads/data_full.csv", delim = ',' )
myDataff = myDataf[!is.na(strptime(myDataf$Date,format="%Y-%m-%d %H:%M:%S")),]
# Some titles duplicate other titles; titles are not unique.
myDataff$Title <- make.unique(as.character(myDataff$Title), sep = "___-___")
# make.unique also makes NA values unique by appending a number; turn those back into NA
myDataff$Title <- gsub("NA__+", NA, myDataff$Title)
# replace NA with the previous Title
myDataff['Title2'] = data.frame(col1 = myDataff$Title, col2 = myDataff$Conversation) %>%
do(na.locf(.))
write_csv(myDataff, "data_full_title2.csv")
newDff = data.frame(col1 = myDataff$Title, col2 = myDataff$Conversation) %>%
do(na.locf(.))
write_csv(newDff, "dataForPyth.csv")
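The same NA-filling step could also be done in pandas with a forward fill, if it is easier to stay in Python; this is just the na.locf part (it skips the make.unique handling of duplicate titles above), it assumes the Title and Conversation columns of data_full.csv, and the output file name is hypothetical:

import pandas as pd

df = pd.read_csv("data_full.csv")
# Carry the last non-missing Title forward over the rows of the same thread
# (the pandas equivalent of zoo::na.locf above).
df["Title"] = df["Title"].ffill()
df[["Title", "Conversation"]].to_csv("dataForPyth_pandas.csv", index=False)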
Merge all comments and the title into one document/row.
Python:
import csv
import pandas as pd
import numpy as np

newDF = pd.DataFrame()
tit = ""
com = ""
rows_list = []   # one merged document per thread
title_list = []  # thread titles, in the same order

with open("/home/tatyana/dataForPyth.csv", "rt") as f:
    reader = csv.reader(f)
    for i, line in enumerate(reader):
        # print('line[{}] = {}'.format(i, line))
        if i == 0:
            # skip the header row
            continue
        if i == 1:
            # first data row: start the first document
            title_list.append(line[0])
            tit = line[0]
            com = line[0] + " " + line[1]
            continue
        if line[0] == tit:
            # same thread: append the comment to the current document
            com = com + " " + line[1]
        else:
            # new thread: store the finished document and start a new one
            rows_list.append(com)
            tit = line[0]
            title_list.append(line[0])
            com = line[0] + " " + line[1]
    # store the last document
    rows_list.append(com)

df = pd.DataFrame(rows_list)
se = pd.Series(title_list)
df['Topic'] = se.values
# print(title_list[84627])
# print(rows_list[84627])
df.to_csv("newRawAllData.csv", index=False, header=False)
Topic modeling and labeling (see the sketch after the merge code below);
Merge the labels with data_full_title2.csv (R):
myLabDataf = read_delim("/home/tatyana/nlp/LabeledTopic.csv", delim = ',' )
# 8 threads had issues and weren't merged
newm = merge(myDataff,myLabDataf, by.x = 'Title2', by.y = 'title')
fin = select(newm, Date, Sender, Title2, Replier, Conversation,`Points from this question`, `Post Karma`, `Comment Karma`, `Date joining the forum;Category Label`, `Topic/Probability`, `Main Topic`, `Main Probability`)
write_csv(fin, "final.csv")
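For the topic-modeling step itself, which is only referenced above, a minimal scikit-learn LDA sketch over the merged documents; the number of topics, the column layout of newRawAllData.csv, and the output file name are assumptions:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# newRawAllData.csv was written without a header: merged document text, then the title.
docs = pd.read_csv("newRawAllData.csv", header=None, names=["Document", "Topic"])

# Bag-of-words matrix with English stop words removed.
vectorizer = CountVectorizer(stop_words="english", max_df=0.95, min_df=5)
X = vectorizer.fit_transform(docs["Document"].astype(str))

# 20 topics is an arbitrary starting point; tune it against the data.
lda = LatentDirichletAllocation(n_components=20, random_state=0)
doc_topics = lda.fit_transform(X)

# Main topic and its probability for each thread, mirroring the columns merged above.
docs["Main Topic"] = doc_topics.argmax(axis=1)
docs["Main Probability"] = doc_topics.max(axis=1)
docs.to_csv("LabeledTopic_sketch.csv", index=False)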