icy / google-group-crawler

[Deprecated] Get (almost) original messages from google group archives. Your data is yours.

Cookies don't seem to be working.. #24

Closed rhukster closed 5 years ago

rhukster commented 6 years ago

I'm trying to grab the contents of a private Google group we've been using as a group inbox, and create an mbox file so we can import the messages back into an IMAP account.

I've followed the instructions, but no matter how I grab the cookies (Firefox with a cookie exporter, Chrome with the cookies.txt plugin) and set my wget options, I always get the same response from wget:

: Creating './devs//threads/t.3' with 'forum/devs'
:: Fetching data from 'https://groups.google.com/a/mycompany.com/d/__FRAGMENT__?_escaped_fragment_=forum/devs'...
--2018-03-22 19:40:27--  https://groups.google.com/a/mycompany.com/d/__FRAGMENT__?_escaped_fragment_=forum/devs
Resolving groups.google.com (groups.google.com)... 108.177.112.113, 108.177.112.139, 108.177.112.102, ...
Connecting to groups.google.com (groups.google.com)|108.177.112.113|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://accounts.google.com/AccountChooser?continue=https://groups.google.com/a/mycompany.com/d/__FRAGMENT__?_escaped_fragment_%3Dforum/devs&hl=en&service=groups2&hd=mycompany.com [following]
...

It gets stuck in this loop because it's not authenticating and keeps getting redirected to the AccountChooser page.

I can access the https://groups.google.com/a/mycompany.com/d/__FRAGMENT__?_escaped_fragment_=forum/devs URL in my browser, but I can't with wget, even when running it directly on the command line (same error).

Any ideas would be appreciated!

rhukster commented 6 years ago

BTW used this Firefox extension: https://addons.mozilla.org/en-US/firefox/addon/cookies-txt/?src=search

And this Chrome one: https://chrome.google.com/webstore/detail/cookiestxt/njabckikapfpffapmjgojcnbfjonfjfg and this one: http://www.editthiscookie.com/

No dice...

icy commented 6 years ago

Hi @rhukster ,

I'm sorry for any inconvenience. Did you use the _GROUP variable to specify your company information (e.g., export _GROUP=mycompany.com)?

I will run some tests with a private group in an organization today.

Thanks

rhukster commented 6 years ago

No, I used _ORG for that:

export _GROUP="devs"
export _ORG="mycompany.com"
export _WGET_OPTIONS="--load-cookies /my/path/to/cookies.txt --keep-session-cookies --verbose"

icy commented 6 years ago

@rhukster You're right. Please make sure _ORG's value is in lowercase. (See also #22.)

I'm having some trouble setting up the business plan for my org, which is required for the test. Stay tuned.

Thanks

icy commented 6 years ago

I can reproduce the problem now (with _ORG's value in lowercase). I am taking a further look at this issue. Thanks for your patience.

icy commented 6 years ago

I'm pretty sure that the script will not work with (new) organization groups: they are written in a new web framework (single-page application). This is similar to the issue reported in #14. Let me see if there is any workaround.

icy commented 6 years ago

Related issue: https://productforums.google.com/forum/#!topic/apps/wqY5t0D70qw

icy commented 6 years ago

@rhukster Good news for you. The add-on https://addons.mozilla.org/en-US/firefox/addon/cookies-txt/?src=search generates some weird output. You can fix it as follows:

  1. Generate the cookie file using that add-on (cookies-txt)
  2. Open the file and remove all #HttpOnly_ strings
  3. Remove any temporary directories (the script would have created a devs directory in your working directory), and try again.

I have tested this and it's working very well on my side. Hope this also helps you :)

icy commented 6 years ago

Changes:

Feel free to reopen the ticket if there is any looping issue. Thanks a lot.

jpellman commented 6 years ago

I seem to be encountering this issue as well as of this morning, which is strange since I was able to get this to work without error on 10/2/18.

jpellman commented 6 years ago

It seems like this was just an issue with my cookies.txt file: I was missing the groupsloginpref cookie for some reason, and that seemed to be the source of my issue (which was more or less identical to the first code block in this issue).

It might be worth mentioning in the README the exact cookies that are needed for private group scraping to work. According to here, these are: SID, HSID, SSID, and groupsloginpref.
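
A quick way to check a cookie file for those four names is sketched below. It assumes the usual Netscape cookies.txt layout (seven tab-separated fields, with the cookie name in the sixth) and counts a line carrying the #HttpOnly_ prefix as a real cookie line. missing_cookies is a hypothetical helper, not something shipped with google-group-crawler.

```shell
#!/bin/sh
# missing_cookies FILE
# Print the name of each cookie mentioned in this thread that is absent from
# FILE. Hypothetical helper; assumes Netscape cookies.txt format, where the
# cookie name is the 6th tab-separated field.
missing_cookies() {
  file="$1"
  for name in SID HSID SSID groupsloginpref; do
    # Strip the "#HttpOnly_" prefix first so those lines count as cookies,
    # then ignore any remaining comment lines and compare field 6.
    if ! sed -e 's/^#HttpOnly_//' "$file" \
        | awk -F '\t' -v n="$name" '!/^#/ && $6 == n { found = 1 } END { exit !found }'
    then
      printf '%s\n' "$name"
    fi
  done
}
```

Running missing_cookies /my/path/to/cookies.txt prints one missing name per line and prints nothing when all four are present.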

icy commented 6 years ago

Thanks a lot for your very useful feedback @jpellman . I will update README accordingly.

jpellman commented 6 years ago

Ach- I finally figured out what this was. Basically, my issue was that I wasn't reading the instructions properly. I somehow misconstrued "When you have the file, please open it and remove all #HttpOnly strings." in the README to mean "remove all lines starting with #HttpOnly" when it meant "find all instances of #HttpOnly_ and replace them with an empty string". It might be worth adding a sed command under there to reinforce that you're doing string replacement and not line removal. Maybe something like:

sed -i -e 's/#HttpOnly_//g' cookies.txt

Sorry for any noise.
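
To make the difference between the two readings concrete, here is a small sketch that runs both on a two-line sample file. The cookie values are made up for illustration, and the sample assumes the tab-separated Netscape cookies.txt layout.

```shell
#!/bin/sh
# Contrast line deletion with prefix removal on a tiny sample cookie file.
# The cookie values below are invented for illustration only.
sample=$(mktemp)
printf '.google.com\tTRUE\t/\tTRUE\t0\tSID\tabc\n' > "$sample"
printf '#HttpOnly_.google.com\tTRUE\t/\tTRUE\t0\tHSID\tdef\n' >> "$sample"

# Misreading: delete every line that starts with #HttpOnly_ (HSID is lost).
deleted=$(sed -e '/^#HttpOnly_/d' "$sample" | wc -l | tr -d ' ')

# Intended: strip only the prefix, so the HSID line survives as a cookie.
stripped=$(sed -e 's/#HttpOnly_//g' "$sample" | wc -l | tr -d ' ')

echo "line deletion keeps $deleted cookie(s); prefix removal keeps $stripped"
rm -f "$sample"
```

With this sample, line deletion leaves one cookie line and prefix removal leaves two, which is why the misreading can silently drop cookies such as HSID.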

icy commented 6 years ago

Never mind @jpellman . English is not my primary language and I may always confuse anyone ;) I've updated README as you suggested :) Thx again.

icy commented 5 years ago

Cookies don't seem to be working... Google now denies access to the crawler, lolz