icy / google-group-crawler

[Deprecated] Get (almost) original messages from google group archives. Your data is yours.

Scraper running forever, not generating tables #18

Closed spacewaffle closed 4 years ago

spacewaffle commented 7 years ago

So I've got the cookie set up, and I've successfully used the scraper before. I've been trying to replicate what I did then, but I'm running into some issues. The scraper runs forever, with logs like the ones below.

2016-12-21 12:26:09 (1.27 MB/s) - written to stdout [60251]

:: Creating './workbarbiz//threads/t.129' with 'forum/workbarbiz'
:: Fetching data from 'https://groups.google.com/forum/?_escaped_fragment_=forum/workbarbiz'...
--2016-12-21 12:26:09--  https://groups.google.com/forum/?_escaped_fragment_=forum/workbarbiz
Resolving groups.google.com... 74.125.192.139, 74.125.192.101, 74.125.192.138, ...
Connecting to groups.google.com|74.125.192.139|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://accounts.google.com/ServiceLogin?service=groups2&passive=1209600&continue=https://groups.google.com/forum/?_escaped_fragment_%3Dforum/workbarbiz&followup=https://groups.google.com/forum/?_escaped_fragment_%3Dforum/workbarbiz&authuser=1 [following]
--2016-12-21 12:26:09--  https://accounts.google.com/ServiceLogin?service=groups2&passive=1209600&continue=https://groups.google.com/forum/?_escaped_fragment_%3Dforum/workbarbiz&followup=https://groups.google.com/forum/?_escaped_fragment_%3Dforum/workbarbiz&authuser=1
Resolving accounts.google.com... 216.58.219.237, 2607:f8b0:4006:80b::200d
Connecting to accounts.google.com|216.58.219.237|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘STDOUT’

-                                       [ <=>                                                             ]  58.68K  --.-KB/s    in 0.04s   

2016-12-21 12:26:09 (1.39 MB/s) - written to stdout [60091]

:: Creating './workbarbiz//threads/t.130' with 'forum/workbarbiz'
:: Fetching data from 'https://groups.google.com/forum/?_escaped_fragment_=forum/workbarbiz'...
--2016-12-21 12:26:09--  https://groups.google.com/forum/?_escaped_fragment_=forum/workbarbiz
Resolving groups.google.com... 74.125.192.139, 74.125.192.101, 74.125.192.138, ...
Connecting to groups.google.com|74.125.192.139|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://accounts.google.com/ServiceLogin?service=groups2&passive=1209600&continue=https://groups.google.com/forum/?_escaped_fragment_%3Dforum/workbarbiz&followup=https://groups.google.com/forum/?_escaped_fragment_%3Dforum/workbarbiz&authuser=1 [following]
--2016-12-21 12:26:09--  https://accounts.google.com/ServiceLogin?service=groups2&passive=1209600&continue=https://groups.google.com/forum/?_escaped_fragment_%3Dforum/workbarbiz&followup=https://groups.google.com/forum/?_escaped_fragment_%3Dforum/workbarbiz&authuser=1
Resolving accounts.google.com... 216.58.219.237, 2607:f8b0:4006:80b::200d
Connecting to accounts.google.com|216.58.219.237|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘STDOUT’

I eventually killed the process because it was taking multiple hours, even though our Google Groups forum doesn't have that many posts. I found that thousands of thread files had been generated, but there was nothing in msgs, nothing in mbox, and no db file after scraping. Every thread file had the same single line of text:

https://groups.google.com/forum/?_escaped_fragment_=forum/workbarbiz

Any idea what's going on here? Also, not sure if this changes things, but the cookie I'm using is pretty old.
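
For anyone hitting the same symptom, a quick check along these lines can confirm that every thread file really holds the same content before spending more time on the crawl. This is only a rough sketch and not part of the crawler: the function names and the `min_files` threshold are made up for illustration, and the directory to pass in is whatever `threads` directory the run produced.

```python
from collections import Counter
from pathlib import Path


def summarize_thread_files(threads_dir):
    """Count how many files in threads_dir share each distinct content."""
    counts = Counter()
    for path in Path(threads_dir).iterdir():
        if path.is_file():
            # Strip surrounding whitespace so a trailing newline does not
            # make otherwise identical files look different.
            counts[path.read_text(errors="replace").strip()] += 1
    return counts


def looks_stuck(threads_dir, min_files=10):
    """Return True if there are many thread files and all are identical.

    min_files is an arbitrary illustrative threshold, not a crawler setting.
    """
    counts = summarize_thread_files(threads_dir)
    total = sum(counts.values())
    return total >= min_files and len(counts) == 1
```

Calling `looks_stuck` on the group's `threads` directory returning `True` would match what is described above: lots of files, one distinct body.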

icy commented 7 years ago

It seems the service requires login and your cookies don't work. Let me figure it out with a local test.
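
The redirect is visible in the `Location:` lines of the wget output above, which point at `accounts.google.com/ServiceLogin`. As a rough illustration (not code from the crawler; the helper names are hypothetical), a saved log could be scanned for that pattern like this:

```python
from urllib.parse import urlsplit


def is_login_redirect(url):
    """Return True if url points at the Google sign-in page seen in the log."""
    parts = urlsplit(url)
    return (
        parts.hostname == "accounts.google.com"
        and parts.path.startswith("/ServiceLogin")
    )


def login_redirects_in_log(log_text):
    """Collect redirect targets from wget 'Location:' lines that go to login."""
    prefix = "Location: "
    hits = []
    for line in log_text.splitlines():
        if line.startswith(prefix):
            # wget appends ' [following]' after the URL; keep only the URL.
            target = line[len(prefix):].split(" ", 1)[0]
            if is_login_redirect(target):
                hits.append(target)
    return hits
```

If `login_redirects_in_log` returns a non-empty list for a run, the requests were bounced to the sign-in page, which suggests the cookies were not accepted.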

icy commented 7 years ago

I've tested with a private group and I don't see a similar problem. Maybe your cookies are expired. Could you please confirm that your cookies are still valid? Thx

rhukster commented 6 years ago

BTW, I have the same issue. Perhaps it's because both of these are groups on Google Apps (Business) accounts and not simply private groups.

icy commented 6 years ago

@rhukster @spacewaffle It seems there is a problem with the cookie file generated by the browser extension. See also https://github.com/icy/google-group-crawler/issues/24#issuecomment-375856663. I have updated README.md accordingly.

Thanks a lot.

icy commented 4 years ago

The problem was probably that the script couldn't detect the loop when an invalid cookie was provided. That should be fixed now. Moreover, the new version uses curl with better cookie string settings instead of a Netscape cookie file with wget. Please try it out.

Thanks a lot.
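
To illustrate the idea of loop detection mentioned above, here is a minimal sketch. It is not the script's actual fix (the crawler itself is a shell script), and `crawl_pages`, `fetch_page`, and `max_pages` are hypothetical names. It assumes that an invalid cookie makes every request return the same login page, so a repeated page body is treated as a sign of a loop.

```python
import hashlib


def crawl_pages(fetch_page, max_pages=1000):
    """Fetch pages until fetch_page returns None, aborting on repeated content.

    fetch_page is a caller-supplied function (hypothetical) that takes a page
    index and returns the page body as a string, or None when no pages remain.
    """
    seen = set()
    pages = []
    for index in range(max_pages):
        body = fetch_page(index)
        if body is None:
            break
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if digest in seen:
            # Identical content on a new page usually means every request is
            # being answered with the same page, e.g. a sign-in page.
            raise RuntimeError(
                f"page {index} repeats earlier content; cookies may be invalid"
            )
        seen.add(digest)
        pages.append(body)
    return pages
```

With a check like this, a run against a group that keeps returning the sign-in page would stop at the second request instead of creating thousands of identical thread files.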