pistocop / subreddit-comments-dl

Download subreddit comments
https://www.pistocop.dev/posts/subreddit_downloader/
GNU General Public License v3.0

can't scrape past a certain date #8

Open gtrane opened 1 year ago

gtrane commented 1 year ago

Hi, first of all, thank you for this great program! I have used your code successfully to scrape a subreddit for specific UTC date ranges. However, I have encountered a problem where I can't scrape anything past the UTC: 1670743183

My input to the terminal: python src/subreddit_downloader.py --reddit-id --reddit-secret --reddit-username --debug --batch-size 500 --utc-after 1670743183

The error is below. I have no idea why this is occurring, any advice would be greatly appreciated! Thank you.

subreddit_downloader.py 308 typer.run(main)

main.py 859 run app()

main.py 214 call return get_command(self)(*args, **kwargs)

core.py 829 call return self.main(*args, **kwargs)

core.py 782 main rv = self.invoke(ctx)

core.py 1066 invoke return ctx.invoke(self.callback, **ctx.params)

core.py 610 invoke return callback(*args, **kwargs)

main.py 497 wrapper return callback(**use_params) # type: ignore

contextlib.py 79 inner return func(*args, **kwds)

subreddit_downloader.py 299 main assert utc_lower_bound < utc_upper_bound, f"utc_lower_bound '{utc_lower_bound}' should be " \

TypeError: '<' not supported between instances of 'NoneType' and 'str'

pistocop commented 1 year ago

Hi @gtrane,

Before going into details: you wrote "I can't scrape anything past the UTC: 1670743183", but your command uses "--utc-after 1670743183". I think you should use the program's --utc-before argument instead.

Anyway, I think the program is returning an empty iterator here: https://github.com/pistocop/subreddit-comments-dl/blob/a9f02a0a041be3b1f425b4cce1e57e658a737754/src/subreddit_downloader.py#L281
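For reference, the TypeError in the traceback is what Python raises when None is compared with a string. The snippet below is only a minimal sketch of that failure and of a more explicit check. `check_bounds` is a hypothetical helper, not part of subreddit-comments-dl, and the idea that a bound stays None when no results come back is an assumption.

```python
from typing import Optional


def check_bounds(lower: Optional[str], upper: Optional[str]) -> None:
    """Hypothetical helper: validate two UTC bounds before comparing them."""
    if lower is None or upper is None:
        # Assumption: a bound that is still None means nothing was fetched.
        raise ValueError(
            f"missing UTC bound (lower={lower!r}, upper={upper!r}); "
            "the API may have returned no results"
        )
    # int() avoids comparing the timestamps lexicographically as strings.
    if int(lower) >= int(upper):
        raise ValueError(f"lower bound {lower} should be below upper bound {upper}")


# Reproduce the error from the traceback: None compared with a str.
try:
    _ = None < "1670743183"
except TypeError as err:
    raw_error = str(err)

# The explicit check fails with a message that names the likely cause.
try:
    check_bounds(None, "1670743183")
except ValueError as err:
    guarded_error = str(err)

print(raw_error)
print(guarded_error)
```

A check like this does not fix an empty response, it only makes the failure easier to read.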

I think you should explore what pushshift is returning, e.g. if you execute the following code, what's the value of the variable submissions_generator?

# pushshift_api, subreddit and batch_size are assumed to be defined already,
# as they are in subreddit_downloader.py around the line linked above.
direction = "before"
utc_lower_bound = "1670743183"
submissions_generator = pushshift_api.search_submissions(
    subreddit=subreddit,
    limit=batch_size,
    sort='desc' if direction == "before" else 'asc',
    sort_type='created_utc',
    after=utc_upper_bound if direction == "after" else None,
    before=utc_lower_bound if direction == "before" else None,
)

gtrane commented 1 year ago

Where do I execute this code? Within subreddit_downloader.py or in my terminal? Thank you!

smukherjee30 commented 1 year ago

Hello,

Thanks for this code. It has been really helpful for students like me. Unfortunately, I have the same problem.

Whenever I try to extract data for any date after 11 December 2022, it returns empty files. I am not sure why this is happening. Do you have any idea?

Looking forward to your response! Thank you so much for all your efforts and this brilliant piece of code.

smukherjee30 commented 1 year ago

Below is a piece of code I have written to convert the timestamps in the output files to proper date formats. Attaching it here, in case this comes in handy for anyone who wants to verify the dates post extraction.

import pandas as pd

# Load the files produced by the downloader
df = pd.read_csv('submissions.csv')
df1 = pd.read_csv('comments.csv')

# Convert the Unix timestamps (in seconds) to datetimes; the values are in UTC
df['created_utc'] = pd.to_datetime(df['created_utc'], unit='s')
df1['created_utc'] = pd.to_datetime(df1['created_utc'], unit='s')

# Write the converted tables next to the originals
df.to_csv('submissions_date_converted.csv')
df1.to_csv('comments_date_converted.csv')
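
To sanity-check a single value, such as the cutoff mentioned in this thread, the standard library alone is enough. This is a minimal sketch. `epoch_to_utc` and `utc_to_epoch` are hypothetical helper names, not part of subreddit-comments-dl.

```python
from datetime import datetime, timezone


def epoch_to_utc(ts: int) -> datetime:
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)


def utc_to_epoch(year: int, month: int, day: int,
                 hour: int = 0, minute: int = 0, second: int = 0) -> int:
    """Convert a UTC calendar date and time to a Unix timestamp in seconds."""
    moment = datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc)
    return int(moment.timestamp())


# The value reported at the top of this issue
print(epoch_to_utc(1670743183).isoformat())  # 2022-12-11T07:19:43+00:00

# Midnight UTC on 11 December 2022
print(utc_to_epoch(2022, 12, 11))  # 1670716800
```

Passing tz=timezone.utc keeps the result independent of the machine's local time zone, which matters when comparing dates with other people in the thread.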

pistocop commented 1 year ago

Hi @gtrane @smukherjee30,

I have tested and investigated the issue. It looks like there is a major change underway in the pushshift system [1] that is causing malfunctions [2] in some "edge" cases. There is also a tracked issue similar to yours, although with different date ranges [3].

What I will do is wait until the pushshift migration is complete, then check whether any APIs have changed and patch the code if needed. Meanwhile, if you need the data now and cannot wait, there is a post on Reddit where people discuss possible alternatives.

Hope this info could be helpful!

[1] https://www.reddit.com/r/pushshift/comments/zkggt0/update_on_colo_switchover_bug_fixes_reindexing/
[2] https://github.com/pushshift/api/issues
[3] https://github.com/pushshift/api/issues/132