icy / google-group-crawler

[Deprecated] Get (almost) original messages from google group archives. Your data is yours.

Can not get all the mbox #21

Closed demostars closed 6 years ago

demostars commented 6 years ago

Thank you for your work.

I tried to archive a Google group named fanfou-digest, which has 1373 topics, but with this crawler there are only 21 files in the mbox folder: the 21 newest topics. Is this a bug, or did I do something wrong? Can I get all 1373 files with this crawler?

icy commented 6 years ago

Hi @demostars,

Is it this public group: https://groups.google.com/forum/#!forum/fanfou-digest ? I will give it a try and see whether something goes wrong.

demostars commented 6 years ago

Yes, it is a public group. According to the file in the threads folder, it stopped at a point where there is another page to continue: 1510736561 1

Also, I tried running the crawler with a cookie and got the same problem. Does any other group have the same problem? Doesn't the crawler handle paging?

icy commented 6 years ago

@demostars The script can work with paging: as long as it sees something like _escaped_fragment_ in the output, it will try to follow that link. If that doesn't work, maybe there is an encoding issue.

I haven't tested the script with any encoding other than UTF-8. Maybe you can force the script to use it:

$ LANG=en_US.UTF-8 ./crawler.sh -sh

As far as I understand, this will not harm any content. My test script is still running against your group; I will let you know if I hit the same issue.
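The paging behaviour described above can be illustrated with a rough sketch. This is not the crawler's actual code: the helper name `extract_next_links` and the sample HTML are hypothetical, and it assumes a `grep` that supports `-o` (GNU and BSD grep do). It only shows the idea of scanning fetched output for links that contain `_escaped_fragment_`.

```shell
#!/bin/sh
# Hypothetical helper, not taken from the crawler itself:
# read HTML on stdin and print every href value that contains
# the "_escaped_fragment_" marker, one URL per line.
extract_next_links() {
  grep -o 'href="[^"]*_escaped_fragment_[^"]*"' \
    | sed -e 's/^href="//' -e 's/"$//'
}

# Sample input, invented for illustration only.
sample='<a href="https://example.com/forum/?_escaped_fragment_=forum/demo">More topics</a>'

printf '%s\n' "$sample" | extract_next_links
```

If a sketch like this prints nothing for a page that should have a continuation link, the fetched output probably does not contain the marker at all, which would explain a crawl that stops after the first page.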

demostars commented 6 years ago

Well, that makes sense: I am running Ubuntu in Chinese. It still does not work, though, so I will try running the crawler on another Ubuntu system set to English.

Thank you!
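For anyone checking the locale suggestion, here is a small sketch of how a `LANG=... command` prefix behaves: the value applies only to that one command and does not change the rest of the shell session. The helper name `show_lang_for_command` is hypothetical and only serves the illustration.

```shell
#!/bin/sh
# Hypothetical helper: run a child shell with LANG set only for that
# command, and print the value the child actually sees.
show_lang_for_command() {
  LANG="$1" sh -c 'printf "%s\n" "$LANG"'
}

show_lang_for_command en_US.UTF-8
```

Running `locale` in the same terminal shows the settings the session itself uses, which can differ from the one-off value passed this way.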

icy commented 6 years ago

@demostars On an Ubuntu system with US locales, the script works well (so far). The generated script is here: https://gist.github.com/icy/6f8683ded1ce0860dad387f8d845aa64 . You can use it to download all (almost) original mbox files.

Hope this helps.

demostars commented 6 years ago

Thank you. I tried running it on an Ubuntu system with US locales and got the same issue. So strange.

Thank you so much!

icy commented 6 years ago

Weird. Maybe there is a problem with your network? Could you send me the output for further debugging? Or you can try the script above: it is the complete list of all messages that you need to download from the current archive.

demostars commented 6 years ago

Yep, with your script above I have downloaded the whole archive. Which output do you need: the terminal output or the files in the folder?

icy commented 6 years ago

@demostars How are things going? Are you able to use the script?

I have added some tests, and the results can be found on Travis-CI, for example https://travis-ci.org/icy/google-group-crawler/jobs/348543280

demostars commented 6 years ago

Thanks, that works for me.

I think you can close this issue.