Unfortunately, I've found it's not quite as simple as just concatenating the messages. mbox seems to depend on the first line of every message being the `From:` line. The crawler currently just downloads the raw message, which may or may not have the `From:` line as its first line.
Something similar to this bash command: `(echo '/^[Ff][Rr][Oo][Mm]:/m0'; echo w; echo q) | ed $file`
will move the first matching `From:` line to the top of the message, in preparation for concatenating them into an mbox.
I may submit a PR for this once I get to a point where it seems to be working consistently.
There are also potential issues around escaping other occurrences of `From` to look at: http://fileformats.archiveteam.org/wiki/Mbox
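For the escaping part, here is a minimal sketch in Python, assuming mboxrd-style quoting is what the reader expects: any body line that starts with `From ` (optionally preceded by `>` characters) gets one extra `>`, so it can't be mistaken for a message separator. The function name `quote_from_lines` is made up for illustration and is not part of the crawler.

```python
import re

# Matches body lines that an mbox reader could mistake for a separator:
# "From " at the start of a line, possibly already quoted with ">".
_FROM_LINE = re.compile(rb"^(>*From )", re.MULTILINE)


def quote_from_lines(body: bytes) -> bytes:
    """Prefix every "From " / ">From " line with one extra ">" (mboxrd style)."""
    return _FROM_LINE.sub(rb">\1", body)
```

Note that the `From:` header itself is left alone, since only `From` followed by a space is matched.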
Hi @beardyjay and @pmwheatley, I'll have a look at how I played with the mbox files. I didn't notice there was an issue open in my project. I shot myself in the foot with my notification settings :(
I think I also followed the same approach that @pmwheatley suggested in https://github.com/icy/google-group-crawler/issues/15#issuecomment-221018338 . I believe I did that with a little Ruby script, which is now lost somewhere among my files.
Oh, I just found that I also have a small conversion script, https://github.com/icy/bashy/blob/master/libs/raw2mbox.sh , but I haven't used it for a long time. It should be good as a reference.
Hi,
First off, thanks for these scripts, they were a huge help. This question is possibly out of scope for the project. I downloaded a group with around 90k messages with no issues; I slightly adapted the script generated by wget.sh, using the modification provided in #32. All the messages are now in `$GROUP/mbox`, formatted as RFC 822.
I am looking to convert this into an actual single mbox file, but I can't get the format right. I have tried just joining the individual files together, but that does not create a valid mbox.
I have also tried to format it using procmail's formail.
While this command does work, it adds the current time to the From line instead of using the posted date. So when you open the file in, say, mutt, it shows the wrong date.
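One possible way around the date problem, sketched here in Python rather than formail, is to build the `From ` separator line (the one with no colon) from each message's own `From:` and `Date:` headers using the standard-library `mailbox` module. This is only a sketch under assumptions: the helper names `to_mbox_message` and `build_mbox` are made up, every file in the source directory is assumed to be one raw RFC 822 message, and messages are added in filename order, not date order.

```python
import mailbox
from email.utils import parseaddr, parsedate_to_datetime
from pathlib import Path


def to_mbox_message(raw: bytes) -> mailbox.mboxMessage:
    """Wrap one raw message and derive its "From " line from its headers."""
    msg = mailbox.mboxMessage(raw)
    # Envelope sender: the address from the From: header, if there is one.
    sender = parseaddr(str(msg.get("From", "")))[1] or "MAILER-DAEMON"
    try:
        # Use the posted date (converted to UTC) instead of the current time.
        when = parsedate_to_datetime(msg["Date"]).utctimetuple()
    except (TypeError, ValueError):
        when = True  # no usable Date: header, fall back to the current time
    msg.set_from(sender, when)
    return msg


def build_mbox(src_dir: str, out_path: str) -> int:
    """Append every file in src_dir to the mbox at out_path; return the count."""
    box = mailbox.mbox(out_path)  # created if it does not exist yet
    count = 0
    try:
        for path in sorted(Path(src_dir).iterdir()):
            if path.is_file():
                # mailbox.mbox also quotes body lines starting with "From ".
                box.add(to_mbox_message(path.read_bytes()))
                count += 1
    finally:
        box.close()  # flushes pending writes
    return count
```

Something like `build_mbox("mygroup/mbox", "mygroup.mbox")` (both paths hypothetical) followed by `mutt -f mygroup.mbox` should then show the posted dates, provided the messages carry parseable `Date:` headers.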
This command creates an invalid mbox file:
Any ideas how I can get this to a valid mbox format?