icy / google-group-crawler

[Deprecated] Get (almost) original messages from google group archives. Your data is yours.

Mbox format #35

Closed ghost closed 4 years ago

ghost commented 4 years ago

Hi,

First off, thanks for these scripts; they were a huge help. This question is possibly out of scope for the project. I downloaded a group with around 90k messages with no issues, after slightly adapting the generated wget.sh script with the modification provided in #32. All the messages are now in $GROUP/mbox, formatted as RFC 822.

I am looking to convert these into a single, actual mbox file, but I can't get the format right. I have tried simply joining the individual files together, but that does not create a valid mbox file.

find "$GROUP/mbox/" -type f | while IFS= read -r f; do cat "$f" >> tmp.mbox; done

I have also tried to format it using procmail's formail.

for f in $GROUP/mbox/*; do formail -b < "$f" >> test2.mbox; done

While this command does work, it puts the current time in the From line instead of the posted date, so opening the file in, say, mutt shows the wrong date.

for f in $GROUP/mbox/*; do formail -a "Date:" < "$f" >> test2.mbox; done

This command creates an invalid mbox file:

mutt -f test2.mbox
Invalid mbox format

Any ideas how I can get this to a valid mbox format?

pmwheatley commented 4 years ago

Unfortunately, I've found that it's not quite as simple as just concatenating the messages. mbox seems to depend on the first line of every message being the From: line. The crawler currently just downloads the raw message (which may or may not have the From: line as the first line).

Something similar to this bash command (echo '/^[Ff][Rr][Oo][Mm]:/m0'; echo w; echo q ) | ed $file will move the first matching From line to the top of the message in preparation for concatenating them into an mbox.

I'll maybe submit a PR for this if I get to a point where it seems to be working consistently.

pmwheatley commented 4 years ago

There are also potential issues around escaping other occurrences of From to look at: http://fileformats.archiveteam.org/wiki/Mbox
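For reference, here is a minimal sketch of how the two points above (a leading From line per message, plus escaping of other From lines) could be combined. It assumes the common mbox convention that each message starts with a separator line of the form "From <sender> <date>" (the word From followed by a space, not the From: header) and that body lines starting with "From " are prefixed with ">". The function name raw_to_mbox is hypothetical and not part of the crawler. It also assumes one message per file, LF line endings, unfolded From: and Date: headers, and GNU date for parsing the Date: header; if the date cannot be parsed, it falls back to an epoch placeholder.

```shell
#!/usr/bin/env bash
# raw_to_mbox DIR OUT
# Hypothetical helper (not part of google-group-crawler): joins the
# one-message-per-file RFC 822 files in DIR into a single mbox file OUT.
raw_to_mbox() {
  local dir=$1 out=$2 f sender date_hdr stamp
  : > "$out"
  for f in "$dir"/*; do
    [ -f "$f" ] || continue

    # Sender address from the first From: header. The header block ends
    # at the first blank line, so sed quits there.
    sender=$(sed -n '/^$/q; s/^[Ff][Rr][Oo][Mm]:[[:space:]]*//p' "$f" \
               | head -n 1 | sed 's/.*<\([^>]*\)>.*/\1/')
    sender=${sender%% *}               # the separator needs a single word
    sender=${sender:-MAILER-DAEMON}    # placeholder when no address is found

    # Posted date from the Date: header, converted to asctime style in UTC.
    # Assumption: GNU date; anything unparsable gets an epoch placeholder.
    date_hdr=$(sed -n '/^$/q; s/^[Dd][Aa][Tt][Ee]:[[:space:]]*//p' "$f" | head -n 1)
    if [ -z "$date_hdr" ] || \
       ! stamp=$(LC_ALL=C date -u -d "$date_hdr" '+%a %b %e %H:%M:%S %Y' 2>/dev/null); then
      stamp='Thu Jan  1 00:00:00 1970'
    fi

    {
      printf 'From %s %s\n' "$sender" "$stamp"
      # Drop a separator line the file may already start with, then add
      # one ">" to every line that would otherwise look like a separator.
      sed -e '1{/^From /d;}' -e 's/^>*From />&/' "$f"
      printf '\n'                      # blank line between messages
    } >> "$out"
  done
}
```

Something like raw_to_mbox "$GROUP/mbox" "$GROUP.mbox" followed by mutt -f "$GROUP.mbox" would be one way to try it. Messages are appended in glob order, not sorted by date.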

icy commented 4 years ago

Hi @beardyjay and @pmwheatley, I'll have a look at how I worked with the mbox files. I didn't notice there was an issue open in my project. I shot myself in the foot with my notification settings :(

icy commented 4 years ago

I think I followed the same approach that @pmwheatley suggested: https://github.com/icy/google-group-crawler/issues/15#issuecomment-221018338 . I believe I did that with a little Ruby script, which I have since lost among my files.

icy commented 4 years ago

Oh, I just found that I also have a small conversion script: https://github.com/icy/bashy/blob/master/libs/raw2mbox.sh . I haven't used it for a long time, but it should be good as a reference.