icy / google-group-crawler

[Deprecated] Get (almost) original messages from google group archives. Your data is yours.

How to determine how many messages were pulled in? #41

Closed (ricks03 closed this issue 3 years ago)

ricks03 commented 3 years ago

My number of files in mbox, and my number of messages in the google group, aren't the same. What's the best way to determine why?

icy commented 3 years ago

> My number of files in mbox, and my number of messages in the google group, aren't the same. What's the best way to determine why?

Could you share the two numbers and the difference between them? How many lines did you get in the output script? My sample script is below:

#!/usr/bin/env bash

export _ORG="${_ORG:-}"
export _GROUP="${_GROUP:-bbedit}"
export _D_OUTPUT="${_D_OUTPUT:-./bbedit/}"
export _USER_AGENT="${_USER_AGENT:-Mozilla/5.0 (X11; Linux x86_64; rv:74.0) Gecko/20100101 Firefox/74.0}"
export _CURL_OPTIONS="${_CURL_OPTIONS:-}"

__curl_hook () 
{ 
    :
}
__curl__ () 
{ 
    if [[ ! -f "$1" ]]; then
        echo ":: Downloading '$1'..." 1>&2;
        curl -Ls -A "$_USER_AGENT" $_CURL_OPTIONS "$2" -o "$1";
        __curl_hook "$1" "$2";
    else
        echo ":: Skipping '$1'..." 1>&2;
    fi
}
__curl__ "./bbedit//mbox/m.00ZxvsSgSx0.6kyK1BoUizkJ" "https://groups.google.com/forum/message/raw?msg=bbedit/00ZxvsSgSx0/6kyK1BoUizkJ"
__curl__ "./bbedit//mbox/m.00ZxvsSgSx0.fiOWi-cJqykJ" "https://groups.google.com/forum/message/raw?msg=bbedit/00ZxvsSgSx0/fiOWi-cJqykJ"
# ...

If there is a mismatch (the number of __curl__ calls in the output script vs. the number of messages in the Google group), I'd suggest rerunning the whole process, i.e., deleting all local (cache) files before you start.

I haven't seen that issue so far. The best thing is to run the script in verbose mode: for example, rerun the whole process and try bash -x output-script.sh.
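
To compare the two numbers directly, something like the sketch below may help. It counts the __curl__ calls in the generated script (one call per message) and the files that actually exist in a folder. The function names count_curl_calls and count_files are made up for this example, and the paths in the usage comment are assumptions taken from the sample script above.

```shell
# Count the __curl__ calls in a generated script.
# The pattern requires a quoted first argument, so the function
# definition line "__curl__ ()" is not counted.
count_curl_calls() {
    grep -c '^__curl__ "' "$1"
}

# Count the regular files under a directory (for example the mbox/ folder).
count_files() {
    find "$1" -type f | wc -l | tr -d ' '
}

# Example usage (file and folder names are assumptions):
# echo "planned:    $(count_curl_calls output-script.sh)"
# echo "downloaded: $(count_files ./bbedit/mbox)"
```

If the two numbers match, every planned download has a file on disk and the difference has to come from how the messages are counted elsewhere.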

ricks03 commented 3 years ago

My curl file shows 17625 lines all told. My google group shows 5861 messages. I have 17593 files in the folder on the server (which is about right for the number of lines in the curl file).

My best guess is that each curl line is a message, but the google group shows the number as threads.
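
A small gap like the one between 17625 script lines and 17593 files could come from lines that are not downloads at all, or from downloads that never produced a file. A rough sketch for checking the second case: it reads the first quoted argument of every __curl__ call and prints the ones with no file on disk. The function name list_missing_downloads is hypothetical, and the sketch assumes the script uses the same call format as the sample above and is checked from the directory it was generated in, since the paths are relative.

```shell
# Print every download target named in the script that does not exist yet.
# $1: the generated script (e.g. output-script.sh)
list_missing_downloads() {
    # Keep only the first quoted argument of each __curl__ call.
    sed -n 's/^__curl__ "\([^"]*\)" .*/\1/p' "$1" |
    while IFS= read -r path; do
        [ -f "$path" ] || printf '%s\n' "$path"
    done
}

# Example usage (script name is an assumption):
# list_missing_downloads output-script.sh
```

An empty result would suggest that all planned messages were downloaded.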

icy commented 3 years ago

Right, the output script contains curl commands to download messages (emails). Each thread (topic?) in your Google group may contain multiple messages. I wrote down what I know about Google Groups in the code too:

https://github.com/icy/google-group-crawler/blob/c183ffd17f9e871a79a0429a841dd6b87829bc4b/crawler.sh#L28

Hope this helps to explain your issue.

icy commented 3 years ago

> My best guess is that each curl line is a message, but the google group shows the number as threads.

Oh, how many files did you see in the threads/ folder? Basically there are three folders (threads, msgs, mbox). Maybe one of them matches your expected number...
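
A quick way to get all three numbers at once might look like the sketch below. The function name count_group_files is made up for this example, and it assumes the three folders sit directly under the output directory (the _D_OUTPUT value, ./bbedit/ in the sample script).

```shell
# Print the number of regular files in threads/, msgs/ and mbox/.
# $1: the crawler output directory (e.g. ./bbedit/)
count_group_files() {
    for sub in threads msgs mbox; do
        if [ -d "$1/$sub" ]; then
            n=$(find "$1/$sub" -type f | wc -l | tr -d ' ')
        else
            n=0   # a missing folder is reported as empty
        fi
        printf '%s %s\n' "$sub" "$n"
    done
}

# Example usage (directory name is an assumption):
# count_group_files ./bbedit
```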

ricks03 commented 3 years ago

Threads is 293. mbox is 17593. msgs is 5880. Aha! So that's where it is, in messages. Thx.

icy commented 3 years ago

Great, you've found it ;)

threads is an attempt to find the IDs of all threads (topics). So if you count the lines in all files in threads, you get almost the right number.
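
As a sketch of that line count (the function name count_thread_lines is hypothetical, and the result is only an approximation of the thread count, as noted above):

```shell
# Total number of lines across all files in the threads/ folder.
# $1: path to the threads/ directory (e.g. ./bbedit/threads)
# Note: wc -l counts newline characters, so a file whose last line
# has no trailing newline is undercounted by one.
count_thread_lines() {
    find "$1" -type f -exec cat {} + | wc -l | tr -d ' '
}

# Example usage (path is an assumption):
# count_thread_lines ./bbedit/threads
```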

Each file in msgs contains links to all messages within one thread. It comes with pagination, so the number may vary, and it's often greater than the number of threads you have. (5861 threads with pagination --> 5880, I guess.)

The last one, mbox, contains every individual email/message in the whole group, and that's often a lot.