icy / google-group-crawler

[Deprecated] Get (almost) original messages from google group archives. Your data is yours.
Export content between date #25

Closed gagarine closed 4 years ago

gagarine commented 6 years ago

I want to export soc.culture.soviet, but it's big... In fact, I'm only interested in a five-month period. I didn't see a way to export only a specific period.

icy commented 6 years ago

Hi @gagarine,

Do you know the page range? Google Groups pagination works on numbers (e.g., 30 posts per page) and has nothing to do with dates. It would help if you navigated the group contents to find an approximate number, e.g.,

https://groups.google.com/forum/?_escaped_fragment_=forum/archlinuxvn%5B21-40%5D
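As a rough illustration of how such a numbered range maps to a URL, here is a minimal Python sketch. It is not part of the crawler script; the function name `page_range_url` and the constant `BASE` are made up for this example, and it simply reproduces the URL shape of the link above.

```python
from urllib.parse import quote

# URL prefix copied from the link above.
BASE = "https://groups.google.com/forum/?_escaped_fragment_=forum/"


def page_range_url(group: str, first: int, last: int) -> str:
    """Build the listing URL for positions first..last of a group.

    Hypothetical helper: it only mirrors the format of the example link.
    """
    if first < 0 or last < first:
        raise ValueError("expected 0 <= first <= last")
    # quote() percent-encodes the square brackets ("[" -> %5B, "]" -> %5D).
    return BASE + quote(f"{group}[{first}-{last}]")


print(page_range_url("archlinuxvn", 21, 40))
```

Called with `archlinuxvn`, 21 and 40, this prints the same link as above.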

However, this doesn't always work. Users can post to very old threads, and what the link above shows is the date of the last post in each thread, not the date of the first post. If you really want to work this way, I can write a simple patch.

The group soc.culture.soviet that you mentioned has about 28k topics (the number of messages is bigger, of course), and fetching all 28k topics would take a few hours (assuming Google doesn't apply any throttling). I think that's reasonable...

gagarine commented 6 years ago

Mmmmh, I understand. Using the Google Groups web interface, I filtered on "first post" and looked from January 1991 to December 1991. My primary interest is around August, so that would be

https://groups.google.com/forum/?_escaped_fragment_=forum/soc.culture.soviet[27630-27720]

Yeah, I saw the import was faster than I thought. I tried a full import, but it seems that Google killed my connection.

I will play a bit more to see if I can do a full import. Perhaps it's easier. Mainly, I don't want to miss a message because someone posted on the thread later on.

icy commented 6 years ago

It's interesting to hear that Google killed your connection :) I am not sure if adding some sleep to the hook can help (https://github.com/icy/google-group-crawler#the-hook)

I will try to add range support to the script, so that you can specify 27630-27720 as input.
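To make the idea concrete, here is a minimal Python sketch of what parsing such a range input and splitting it into page-sized windows might look like. It is not the script's actual code: `parse_range`, `windows`, and the default window size of 20 are assumptions made for illustration only.

```python
from typing import Iterator, Tuple


def parse_range(text: str) -> Tuple[int, int]:
    """Parse input such as "27630-27720" into (first, last). Hypothetical helper."""
    first_s, sep, last_s = text.strip().partition("-")
    if not sep:
        raise ValueError(f"expected FIRST-LAST, got {text!r}")
    first, last = int(first_s), int(last_s)
    if first > last:
        raise ValueError("first must not be greater than last")
    return first, last


def windows(first: int, last: int, size: int = 20) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (start, end) windows covering first..last.

    The window size is an assumed value, not a documented Google limit.
    """
    start = first
    while start <= last:
        end = min(start + size - 1, last)
        yield start, end
        start = end + 1


first, last = parse_range("27630-27720")
for start, end in windows(first, last):
    print(f"{start}-{end}")
```

Each printed window could then be requested on its own, with a pause between requests if the connection gets dropped.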

icy commented 4 years ago

Someone was able to download 350k messages from a group (https://github.com/icy/google-group-crawler/issues/32), so maybe this isn't an issue so far. As I don't intend to add pagination support, I will close this ticket. Feel free to reopen it if there is a better idea for supporting this feature.

Thanks a lot.