add ingestion function for reddit sharing topic

mediacloud / backend

Media Cloud is an open source, open data platform that allows researchers to answer quantitative questions about the content of online media.

http://www.mediacloud.org

GNU Affero General Public License v3.0


add ingestion function for reddit sharing topic #598

Closed hroberts closed 4 years ago

hroberts commented 5 years ago

We would like to use reddit as our first non-twitter social sharing topic. This task is to add a function that queries the pushshift reddit api to return data specific to a topic in a form that can be plugged into a social web topic.

I am working on the surrounding infrastructure to stick generic social media post data into the database. That surrounding code will rely on a set of functions, one for each platform, that returns some set of posts that match a query and a date range. Let's put these platform querying functions under mediacloud/mediawords/social_query. The module should have the api and the platform in it, so for this case probably 'pushshift_reddit'. That module should have a single function ('query_reddit()?') that returns the matching posts.

The calling code will handle the work of knowing when to call that function to seed a topic based on the rows in topic_seed_queries associated with a given topic. The query function might also be used eventually by the front end api to return a sample list of results.

The query function should accept four arguments:

query - str form of the boolean query to run against the api, eg. 'trump' to search for all posts mentioning trump
start_date - date to start query, as a str in '2019-06-19' format.
end_date - date to end query, as a str in '2019-06-19' format. the results should include all results through this day
sample - limit the returned rows to this many. ideally return as quickly as possible with a random sample of rows of this size.
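
A minimal sketch of how the inclusive end_date described above could be turned into query bounds. The helper name date_range_to_epoch is hypothetical, and treating the dates as UTC is an assumption, since the issue does not name a time zone.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple


def date_range_to_epoch(start_date: str, end_date: str) -> Tuple[int, int]:
    """Convert 'YYYY-MM-DD' strings into (start, end_exclusive) epoch seconds.

    end_date is inclusive in the spec, so the upper bound is midnight of the
    following day and should be compared with a strict "less than".
    Assumption: both dates are interpreted as UTC.
    """
    start = datetime.strptime(start_date, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    end = datetime.strptime(end_date, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    if end < start:
        raise ValueError("end_date is before start_date")
    end_exclusive = end + timedelta(days=1)
    return int(start.timestamp()), int(end_exclusive.timestamp())
```

Passing the same day for both arguments then covers exactly 24 hours, which matches "include all results through this day".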

The query function should return a list of dicts, each with the following fields:

post_id - str containing the platform specific unique id for the post (eg. the tweet id for a tweet)
content - str containing the text content of the post (eg. the tweet text for a tweet)
publish_date - the date that the post was published, as a str in '2019-06-19 12:34:56' format
author - the author of the post (eg. the twitter user for a tweet)
channel - the channel or forum in which the post was published (eg. the subreddit for reddit)
data - a dict including the full, raw data returned by the underlying api (eg. the decoded json returned by the twitter api for a tweet); this will be stored as a jsonb field
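
As an illustration of the return format above, here is a rough sketch that maps one raw Reddit submission to the requested dict. The function name submission_to_post is hypothetical, and the raw field names (id, title, author, subreddit, created_utc) are assumptions about what the underlying api returns. The content field is filled from the title only here; how to combine it with selftext is settled later in this thread.

```python
from datetime import datetime, timezone
from typing import Any, Dict


def submission_to_post(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Map a raw submission dict to the generic post format from the spec.

    Assumption: created_utc holds epoch seconds in UTC.
    """
    published = datetime.fromtimestamp(int(raw["created_utc"]), tz=timezone.utc)
    return {
        "post_id": str(raw["id"]),
        "content": raw.get("title", ""),
        "publish_date": published.strftime("%Y-%m-%d %H:%M:%S"),
        "author": raw.get("author"),
        "channel": raw.get("subreddit"),
        "data": raw,  # full raw record, kept for the jsonb field
    }
```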

pushshift commented 5 years ago

query - str form of the boolean query to run against the api, eg. 'trump' to search for all posts mentioning trump

For searching posts, there are a lot of ways to search them but the two main ones are searching the title text and/or searching the selftext (which is sort of a description written when making a post). The description is not always present but the title always is. Should this search both fields or just the title?

hroberts commented 5 years ago

both fields

-hal

rahulbot commented 5 years ago

This also needs to support selecting one or more subreddits by name, as per the initial mid-level planning.

hroberts commented 5 years ago

I'm not sure how to specify that as a generic 'social media post' thing to search for. Maybe allow a list of text 'channels', since that corresponds to the channel return field?

One alternative is just to allow the subreddit to be included in the text of the query, as we can do in the solr searches. Another alternative is to allow the query to be a dict with a universal 'text' field but also other fields, such as a 'subreddit' field, specific to given platform.

My intuition is that just adding an explicit 'channels' argument to the query_reddit() function is the best.

rahulbot commented 5 years ago

Yes, that channel idea should scale to other platforms - an array of strings I assume. That'd work for Reddit subs, 4chan "boards", Facebook groups, WhatsApp groups, and lots of other data sources people ask us for but we don't know if we can get ;-)

pushshift commented 5 years ago

I'm a few days out from completing the initial code that we can review, but I have some followup questions. For right now, the important one is how many channels we anticipate allowing in a single query. As far as Elasticsearch is concerned, we could easily allow hundreds or thousands to be passed. However, a GET request has a URL length limit (~2,048 characters typically, though it varies).

If we anticipate hundreds of channels being passed, we could break through that limit and have to work with POST requests instead. If you don't think we'll get into the upper hundreds, this shouldn't be a problem. Currently, results can be filtered by subreddit by passing a comma-delimited list in the subreddit parameter.

I just want to be mindful of how these channels will be passed (from Mediacloud front-end) and make sure the call is successful if a large number are sent.
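
To make the size concern concrete, a small sketch that builds the GET URL with a comma-delimited subreddit parameter and checks it against the rough 2,048 figure. The base URL and the q parameter name are placeholders, not the real api; only the subreddit parameter comes from the comment above.

```python
from typing import Dict, List
from urllib.parse import urlencode

MAX_URL_LENGTH = 2048  # rough figure quoted above; real limits vary by client and server


def build_search_url(base_url: str, query: str, channels: List[str]) -> str:
    """Build a GET URL with the channels joined into one comma-delimited value."""
    params: Dict[str, str] = {"q": query}  # "q" is an assumed parameter name
    if channels:
        params["subreddit"] = ",".join(channels)
    return base_url + "?" + urlencode(params)


def fits_in_get(url: str, limit: int = MAX_URL_LENGTH) -> bool:
    """Return True if the URL is short enough to send as a plain GET."""
    return len(url) <= limit
```

Note that urlencode escapes each comma as %2C, so every extra channel costs the length of its name plus three characters.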

hroberts commented 5 years ago

does your api support POSTed requests? if so, we should do the POST just to future proof.

pushshift commented 5 years ago

Actually, Elasticsearch supports sending body data with GET requests. It's part of the security model (I'm assuming) since POST operations can change state within the cluster. What I've done with the code is to design it around elasticsearch so it is flexible and avoids issues with having a limited size for the search query, etc.

https://github.com/berkmancenter/mediacloud/blob/pushshift_social_query/mediacloud/mediawords/social_query/pushshift/reddit/
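
For readers unfamiliar with request bodies in Elasticsearch, a hedged sketch of what such a body might look like for this task. It is not taken from the linked branch. The function name build_es_body and the index field names (title, selftext, subreddit, created_utc) are assumptions; the query clauses themselves (bool, simple_query_string, terms, range) are standard Elasticsearch query DSL.

```python
import json
from typing import Any, Dict, List, Optional


def build_es_body(query: str, start_epoch: int, end_epoch_exclusive: int,
                  channels: Optional[List[str]] = None, size: int = 100) -> Dict[str, Any]:
    """Build a search body that matches title or selftext within a date range."""
    filters: List[Dict[str, Any]] = [
        {"range": {"created_utc": {"gte": start_epoch, "lt": end_epoch_exclusive}}}
    ]
    if channels:
        # The channel list travels in the body, so URL length no longer matters.
        filters.append({"terms": {"subreddit": list(channels)}})
    return {
        "size": size,
        "query": {
            "bool": {
                "must": [{
                    "simple_query_string": {
                        "query": query,
                        "fields": ["title", "selftext"],
                        "default_operator": "and",
                    }
                }],
                "filter": filters,
            }
        },
    }


body = build_es_body("trump", 1546300800, 1546387200, channels=["politics"])
payload = json.dumps(body)  # this string would be sent as the request body
```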

pushshift commented 5 years ago

The content field for each object returned gets assigned the title since the title is guaranteed to always be available (for Reddit submissions). However, the selftext also sometimes has more information, so I wasn't sure if we should append selftext to the title and put them both into the content field if selftext is available or just use title.

Also, this brings up a broader question. Reddit has two main divisions for submissions -- submissions that link to external things (news stories, websites, etc.) and submissions that are self posts that don't link to anything external. Do we only want to search the former? I don't know how helpful self posts would be to researchers (besides data that shows a topic may be trending for a given time range). If we choose to only search submissions that link to external things, that also solves having to deal with selftext since the selftext doesn't exist for submissions that link to external things.

hroberts commented 5 years ago

I think we should just prepend the title to the selftext and use the result as the content. I'm ambivalent about whether it is worthwhile to store the title as a separate field. I'm working from an initial model of twitter, where a title obviously doesn't make sense. What analytical value would we get from treating the title separately?

Can't self posts have urls in the text of the post itself? If so, we should treat all submissions the same and return both the proper url submission, if it exists, and any urls that are in the text itself.

-hal

pushshift commented 5 years ago

Yes, someone could put a url in the selftext. I don't think there's any real value in treating title and selftext as distinct, unless the researcher needed to make that distinction. I will go ahead and change the code to append the selftext after the title if it exists.

Right now the code searches all submissions so we're good there.
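
A minimal sketch of the behavior agreed on above: the selftext is appended after the title when present, and urls are collected from both the submission link and the text. The names build_content and collect_urls, the blank-line separator, and the simple url pattern are all illustrative assumptions, not the code in the branch.

```python
import re
from typing import List, Optional

# Deliberately simple pattern; it may keep trailing punctuation such as a final period.
URL_RE = re.compile(r"https?://[^\s<>\"')\]]+")


def build_content(title: str, selftext: Optional[str]) -> str:
    """Return the title, followed by the selftext when there is any."""
    body = (selftext or "").strip()
    return f"{title}\n\n{body}" if body else title


def collect_urls(submission_url: Optional[str], content: str) -> List[str]:
    """Return the submission's own url (if any) plus urls found in the text."""
    urls: List[str] = []
    if submission_url:
        urls.append(submission_url)
    for found in URL_RE.findall(content):
        if found not in urls:
            urls.append(found)
    return urls
```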

hroberts commented 5 years ago

are you just searching submissions, or comments as well? we definitely want the ability to search everything, maybe with a way of just searching submissions as well.

-hal

pushshift commented 5 years ago

The current code only searches submissions, but adding code to search the comments would be trivial. That said, if we still want to produce randomized samples over large date ranges, comments are an order of magnitude more numerous than submissions, so I would have to look at the feasibility of supporting randomized samples -- it really depends on how many queries we expect per day (if it's 10-50k, I don't see a large problem).

Calls requesting a date range for the entirety of Reddit's history on a very common word might even time out if a random sample is requested (unless we're willing to fudge what we call a random sample so that every document doesn't have to be visited within Lucene).

Do you want to add comment search as a separate method call?

hroberts commented 5 years ago

adding comment search as a separate call is probably the cleanest. the other way would be to add a flag in the query itself, but that seems messier.

the queries hitting this call for now will be very rare. we will basically just be using it either to create topics or to preview before creating a topic. down the road, we might plug it in to the explorer interface to allow things like attention comparison over time for various platforms. even then, our daily usage is in the small thousands of hits.

the random sample thing is not vital. the idea is just to be able to provide to the user a preview of the results. an easy fix would be to just select a random sample out of the first 100k results, or something like that.

-hal
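
The "random sample out of the first 100k results" fallback suggested above could look roughly like this. The function name sample_from_head and the seed argument are illustrative, and the result is only random within the capped head of the result stream, not across all matches.

```python
import random
from itertools import islice
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def sample_from_head(results: Iterable[T], sample: int,
                     cap: int = 100_000, seed: Optional[int] = None) -> List[T]:
    """Read at most `cap` results, then pick `sample` of them at random."""
    head = list(islice(results, cap))
    if sample >= len(head):
        return head  # fewer matches than requested: return them all
    return random.Random(seed).sample(head, sample)
```

Because only the first cap results are read, the call returns quickly even for a very common word, at the cost of a preview that is biased toward whatever order the api returns results in.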

rahulbot commented 4 years ago

I think reddit integration is working and deployed already.