python / psf-salt

PSF infrastructure configuration
MIT License

Generating a static archive of bugs.python.org and bugs.jython.org #370

Open ewdurbin opened 1 month ago

ewdurbin commented 1 month ago

Currently bugs.python.org and bugs.jython.org have both been deprecated in favor of GitHub Issues.

The messages/files/urls need to remain online in perpetuity as a reference (we don't want to break old links!).

We should investigate methods of creating a static archive of the sites for this purpose, to avoid the need to maintain the installations forever.

cc @rouilj

ewdurbin commented 1 month ago

@JacobCoffee, @rouilj and I had a conversation on IRC a while back and John was tracking a request from another user regarding static archives of roundup instances.

rouilj commented 1 month ago

@ewdurbin, I have not had any luck tracking down the user from the prior discussion.

However, some things to consider:

How much of the msgs/files/urls do you want to keep? Do the user### URLs matter? Do URLs for any other class matter?

As a first pass, consider scraping all the issue### URLs. The issue URLs include the sender, date, and body of the msgs. Do you need to support the msg2345 URL if issue123 displays msg2345?

If not, this becomes easier. If you do need to keep the msgs (e.g. for the recipient list, or exact formatting, or because you expect links to msgs to be used), you could scrape all the msg#### URLs as well.
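As a rough sketch of the first pass, the set of pages to scrape could be enumerated up front; the base URL, ID range, and output layout below are all assumptions for illustration:

```python
# Sketch: enumerate (url, output_path) pairs for a static mirror of the
# issue### pages. The upper bound and archive/ layout are assumptions.
def issue_urls(first, last, base="https://bugs.python.org"):
    """Yield (url, output_path) pairs for each issue page to scrape."""
    for n in range(first, last + 1):
        yield f"{base}/issue{n}", f"archive/issue{n}.html"
```

The same generator pattern would extend to msg#### and file#### URLs if those need to be kept.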

For files, how much of the metadata do you need? Scraping the files#### URL and then placing the actual files in a subdirectory as files####/filename might work. I don't remember the exact structure of the download links that we would have to replicate on disk.

One issue might be setting the MIME type for the attached files. If you can live with all files having the application/octet-stream MIME type, we could get away without having to munge the download links on the generated HTML pages (files##, msgs##, issue##) to include a type extension (.pdf, .jpg, ...).

To preserve internal links (e.g. issue123 references issue456), we would need to make the URL b.p.o/issue23 resolve to issue23.html. That lets the web server serve the page with the correct MIME type. I think that with rewrite rules, either Apache or nginx would be able to resolve the right file on the back end.
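For nginx, a rewrite of this kind might look something like the fragment below; the location pattern and on-disk layout are assumptions, not a tested config:

```nginx
# Hypothetical: serve extensionless roundup URLs like /issue123 from the
# corresponding static .html file, so nginx emits the text/html MIME type.
location ~ ^/(issue|msg|file)(\d+)$ {
    try_files /$1$2.html =404;
}
```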

If this isn't possible, we would need to automate munging the HTML in the scraped files, changing href="/issue23" to href="/issue23.html".
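That munging step is mechanical; a minimal sketch, assuming the internal links all look like href="/issue23" (the class names are from this thread, the exact markup is an assumption):

```python
import re

# Rewrite internal roundup links (issue/msg/file classes) to point at the
# corresponding static .html files. External links are left untouched.
LINK = re.compile(r'href="/(issue|msg|file)(\d+)"')

def munge(html):
    """Return html with href="/issue23" rewritten to href="/issue23.html"."""
    return LINK.sub(r'href="/\1\2.html"', html)
```

This would run once over each scraped page before the archive is published.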

Also, I don't see a reasonable way to generate an index page. How useful would a series of /issue-1-1000.html pages listing issues numbered 1-1000 be for finding an issue? People could jump to a range easily enough by specifying issue-5001-6000. But would this be useful?
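Generating those range pages would at least be cheap. A sketch, assuming the issue-START-END.html naming above (the step size and filenames are assumptions):

```python
# Sketch: partition the issue numbers into fixed-size ranges, one static
# index page per range, e.g. issue-1-1000.html, issue-1001-2000.html, ...
def index_pages(max_issue, step=1000):
    """Yield (filename, issue_numbers) for each index page to generate."""
    for start in range(1, max_issue + 1, step):
        end = min(start + step - 1, max_issue)
        yield f"issue-{start}-{end}.html", range(start, end + 1)
```

Each page would then just be a list of links into the static issue### files.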

This ties in with searching the site. Roundup provides faceted searching (status, message text, title, assignedto...). I am not sure whether faceted searching is needed once it is a static site. If you expect this to be used by direct link (somebody on the internet references b.p.o/issue2346) and a standard Google index of the static site is sufficient, we can dispense with this issue. If you need to retain the ability to find all issues assigned to a particular user, and "assigned NEAR edurbin" (I think that still works in Google) isn't sufficient, we may need something else. For example Elasticsearch, or something based on SQLite FTS5 search (with different facets in different columns).
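To make the FTS5 idea concrete, a minimal sketch with facets as separate columns; the schema and sample row are assumptions, not roundup's actual data model:

```python
import sqlite3

# Sketch: an FTS5 index over issues, with facets (status, assignedto)
# as separate columns so queries can filter on them individually.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE VIRTUAL TABLE issues USING fts5"
    "(issue_id, title, body, status, assignedto)"
)
# Hypothetical row standing in for a scraped issue page.
con.execute(
    "INSERT INTO issues VALUES "
    "('issue123', 'crash in parser', 'traceback attached', 'open', 'edurbin')"
)
# Full-text match on the body/title plus a column filter on the facet.
rows = con.execute(
    "SELECT issue_id FROM issues "
    "WHERE issues MATCH 'parser AND assignedto:edurbin'"
).fetchall()
```

The index itself could be built once from the scraped pages; serving queries would still need some small dynamic endpoint, which cuts against the fully-static goal.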

That's a few things to consider off the top of my head.