intro-to-ml-with-kubeflow / intro-to-ml-with-kubeflow-examples

[WIP] Examples for the Intro to ML with Kubeflow book
Apache License 2.0

Chapter 5 data extraction code does not work as Apache mailing list has changed its interface #143

Open oonisim opened 2 years ago

oonisim commented 2 years ago

The Apache mailing list archive has changed its interface and is no longer served by mod_mbox of Apache HTTP. As a result, a URL like http://mail-archives.apache.org/mod_mbox/spark-dev/201911.mbox/ajax/thread?0 now causes an error because of the /ajax part.


Removing /ajax gives the URL http://mail-archives.apache.org/mod_mbox/spark-dev/201911.mbox/thread?0, which redirects to the new interface (dev@spark.apache.org, November 2019). That interface does not provide an MBOX format listing, so the code cannot extract MBOX format elements such as FROM, TO, and SUBJECT.

The thread ID pattern is now different too, e.g. https://lists.apache.org/thread/hg85hhvt270of8fdrmb62kfvm7rpl96p.
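To make the new pattern concrete, here is a minimal sketch that pulls the thread ID out of a URL of the form shown above. The helper name `thread_id` is hypothetical, and the assumption that the ID is always the single path segment after `/thread/` comes only from the one example URL in this issue.

```python
from urllib.parse import urlparse


def thread_id(url: str) -> str:
    """Return the ID from a URL shaped like https://lists.apache.org/thread/<id>.

    Hypothetical helper: the URL shape is inferred from the single example
    in this issue, not from any documented contract.
    """
    # Split the path into its non-empty segments, e.g. ["thread", "<id>"].
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) == 2 and parts[0] == "thread":
        return parts[1]
    raise ValueError(f"not a new-style thread URL: {url}")


print(thread_id("https://lists.apache.org/thread/hg85hhvt270of8fdrmb62kfvm7rpl96p"))
```

An old mod_mbox URL has a different path shape, so this helper raises `ValueError` for it rather than returning a wrong ID.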

webmakaka commented 2 years ago

@oonisim Hi!

Does this mean we cannot continue working through the book, or is there still a chance to get the data?

webmakaka commented 2 years ago

Maybe this is the mailing list we need: https://lists.apache.org/list.html?dev@spark.apache.org

But how do we get the data in XML format?

We can also download the mails as an mbox archive (I do not know anything about this format).
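For anyone else unfamiliar with mbox: it is a plain-text file in which each message starts with a line beginning with `From `, followed by the usual mail headers and the body. Python's standard library can read it through the `mailbox` module. Below is a minimal sketch, assuming a monthly archive has already been downloaded by hand to a local file. The file name `201911.mbox`, the two sample messages, and the helper name `extract_headers` are made up for illustration, and this is not the book's Chapter 5 code.

```python
import mailbox
import os
import tempfile

# Made-up sample archive with two messages, standing in for a real download.
SAMPLE_MBOX = (
    "From alice@example.org Fri Nov  1 10:00:00 2019\n"
    "From: Alice <alice@example.org>\n"
    "To: dev@spark.apache.org\n"
    "Subject: [DISCUSS] Example thread\n"
    "Date: Fri, 01 Nov 2019 10:00:00 +0000\n"
    "\n"
    "Hello list.\n"
    "\n"
    "From bob@example.org Fri Nov  1 11:00:00 2019\n"
    "From: Bob <bob@example.org>\n"
    "To: dev@spark.apache.org\n"
    "Subject: Re: [DISCUSS] Example thread\n"
    "Date: Fri, 01 Nov 2019 11:00:00 +0000\n"
    "\n"
    "Reply body.\n"
)


def extract_headers(mbox_path):
    """Return a list of dicts with the From, To, and Subject of each message."""
    # create=False makes a missing file an error instead of creating an empty one.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        return [
            {"from": msg["From"], "to": msg["To"], "subject": msg["Subject"]}
            for msg in box
        ]
    finally:
        box.close()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "201911.mbox")  # hypothetical local file name
    with open(path, "wb") as f:
        f.write(SAMPLE_MBOX.encode("ascii"))
    for row in extract_headers(path):
        print(row["from"], "|", row["to"], "|", row["subject"])
```

With a real archive, only the path passed to `extract_headers` would change. Whether the downloaded files match what the book's pipeline expects would still need checking.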