tc39 / archives

Archival material from old web properties.
https://tc39.github.io/archives/

Archive TC39's GitHub presence #4

Closed littledan closed 4 months ago

littledan commented 6 years ago

TC39 has historically had a script used to periodically checkpoint its GitHub presence for storage on Ecma's servers. This needs to be recreated now, as the current version is no longer working. cc @ecmageneva @keithamus

keithamus commented 6 years ago

Can we please get a list of requirements for what is needed? If regular source code snapshots are all that's needed, then a plain git clone will suffice. If we need access to issue data, pull request data, and the source for each pull request, we can make use of the GraphQL API to get to all of that data relatively easily. If we need more, such as the source code of forks and the issues and pull requests on those forks, that may be a more complicated endeavour.

Could someone please enumerate what the existing archival script recorded? This might be a useful starting point.
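
For illustration, here is a minimal sketch of the kind of GraphQL query that could pull issue and PR comment threads for a single repo. It assumes a personal access token in a GITHUB_TOKEN environment variable; the repo name, page sizes, and selected fields are placeholders, not what any eventual script would necessarily use.

```python
# Minimal sketch: fetch issue and PR comment threads for one repo via the
# GraphQL API. Token, repo name, and field selection are illustrative.
import os
import requests

QUERY = """
query($owner: String!, $name: String!) {
  repository(owner: $owner, name: $name) {
    issues(first: 50, states: [OPEN, CLOSED]) {
      nodes { number title body comments(first: 50) { nodes { author { login } body } } }
    }
    pullRequests(first: 50, states: [OPEN, CLOSED, MERGED]) {
      nodes { number title body comments(first: 50) { nodes { author { login } body } } }
    }
  }
}
"""

resp = requests.post(
    "https://api.github.com/graphql",
    json={"query": QUERY, "variables": {"owner": "tc39", "name": "ecma262"}},
    headers={"Authorization": f"bearer {os.environ['GITHUB_TOKEN']}"},
)
resp.raise_for_status()
data = resp.json()["data"]["repository"]
print(len(data["issues"]["nodes"]), "issues;", len(data["pullRequests"]["nodes"]), "PRs")
```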

allenwb commented 6 years ago

The biggest concern is capturing metadata that isn't in git, in particular issues and PR comment threads.

The last time I tried, the current script worked but was slow, and occasionally failed because of rate limits or timeout issues.

Note it is a raw dump, and there are currently no provisions for extracting data from the dump, but that should be possible.

So here is a scenario that Ecma needs to enable:

Assume it is the year 2070. GitHub (and git) are long-abandoned services and technologies. A history-of-technology PhD student wants to research the evolution of ES class features from ES2015 to ES2025. What does Ecma need to archive today to ensure that the raw material for that research will be available?

allenwb commented 6 years ago

The current backup script is in https://github.com/tc39/Ecma-secretariat

IgnoredAmbience commented 6 years ago

> The current backup script is in https://github.com/tc39/Ecma-secretariat

This repository doesn't appear to be visible to delegates.

allenwb commented 6 years ago

Fixed

IgnoredAmbience commented 6 years ago

Summary of that repository: the existing backup script is a wrapper around https://github.com/josegonzalez/python-github-backup

keithamus commented 6 years ago

I'm still unable to see the repo. Can someone please tell me how python-github-backup is invoked - specifically does it include the -F flag?

IgnoredAmbience commented 6 years ago

The line in question is: github-backup tc39 -o $BKUPDIR --all -O -P -t $TOKEN

The script was last modified on 11 Apr 2016, so it will correspond to version 0.9.0 of the python-github-backup tool.
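
For context only, here is a rough sketch of how that invocation could be driven from Python with the flag meanings spelled out and a simple retry, one way to soften the rate-limit and timeout failures mentioned above. It is not the actual Ecma-secretariat script; the paths, the GITHUB_TOKEN variable name, the retry policy, and the flag descriptions (from python-github-backup's help as I remember it) are all assumptions worth double-checking.

```python
# Rough sketch, not the real wrapper script: run github-backup with the same
# flags as the Ecma-secretariat invocation and retry on failure.
import os
import subprocess
import time

BKUPDIR = os.environ.get("BKUPDIR", "/backups/tc39")   # illustrative path
TOKEN = os.environ["GITHUB_TOKEN"]                      # assumed env var name

cmd = [
    "github-backup",
    "tc39",          # account to back up
    "-o", BKUPDIR,   # output directory
    "--all",         # everything: repos, issues, pulls, wikis, forks, ...
    "-O",            # treat the account as an organization
    "-P",            # include private repositories
    "-t", TOKEN,     # personal access token
]

# Retry a few times, since runs occasionally die on rate limits or timeouts.
for attempt in range(1, 6):
    if subprocess.run(cmd).returncode == 0:
        break
    time.sleep(60 * attempt)
else:
    raise SystemExit("github-backup failed after 5 attempts")
```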

IgnoredAmbience commented 6 years ago

Possibly the breaking change for running this script for Ecma was the dropping of Python 2 support?

allenwb commented 6 years ago

@keithamus you should now have access to the repo

allenwb commented 6 years ago

@IgnoredAmbience note that the script updates its copy of github-backup each time it is run.

keithamus commented 6 years ago

GitHub has started to offer a user migration API. With this API you can make a REST request to GitHub's migration endpoint, and it will kick off a process of .tar.gz-ing the selected repositories along with their issues, pull requests, comments, and related metadata.

You can then query for when the tar becomes available, and when it does, you will be given a URL to download the .tar.gz in its entirety.

In other words, GitHub now has a blessed route to do (almost) everything that the python script does.

I say almost, because the big question we still have is: how important are forks? The backup script currently passes --all, which implies -F and downloads every fork in the tc39 network. With every proposal repo having hundreds to thousands of forks, and the spec at 6k forks and counting, you're talking north of 20-30,000 additional repository downloads. Forks in general offer very low signal to noise: many forks have zero changes, some have one or more changes that are already available in the central repo's PR data, and a small percentage have commit data that was never pull-requested.

In summary (TL;DR): if we can forgo the requirement of downloading forks - which adds significant burden to the process - then GitHub has a recent, built-in, turnkey solution to this.
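
As a rough sketch of what that flow could look like against the REST migrations endpoints (shown here for an org migration; the user migration endpoints are analogous), assuming a suitably privileged token in GITHUB_TOKEN and an illustrative repository list:

```python
# Sketch only: start a migration, poll until the archive is exported, then
# download the .tar.gz. Token scope, repo list, and polling interval are
# assumptions for illustration.
import os
import time
import requests

API = "https://api.github.com"
HEADERS = {
    "Authorization": f"token {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github+json",
}

# Kick off the migration.
start = requests.post(
    f"{API}/orgs/tc39/migrations",
    headers=HEADERS,
    json={"repositories": ["tc39/ecma262", "tc39/tc39-notes"], "lock_repositories": False},
)
start.raise_for_status()
migration_id = start.json()["id"]

# Poll until GitHub reports the archive has been exported.
while True:
    status = requests.get(f"{API}/orgs/tc39/migrations/{migration_id}", headers=HEADERS)
    status.raise_for_status()
    if status.json()["state"] == "exported":
        break
    time.sleep(60)

# The archive endpoint redirects to a short-lived download URL.
archive = requests.get(f"{API}/orgs/tc39/migrations/{migration_id}/archive", headers=HEADERS)
archive.raise_for_status()
with open("tc39-migration.tar.gz", "wb") as f:
    f.write(archive.content)
```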

littledan commented 6 years ago

@keithamus Will this technique download forks that have PRs against the main repositories?

ljharb commented 6 years ago

If it includes the PR, does it need the full fork?

allenwb commented 6 years ago

We certainly don't need to archive thousands of working forks that never make their way back into the TC39 process. But here is the kind of use case we need to think about: an alternative proposal is developed in a fork outside the tc39 org and is presented to the committee as part of its deliberations.

The content of such a forked repository should be archived as part of the TC39 deliberative record. I don't think we should accomplish this by trying to archive all forks, but we should have a documented process that the developer of such an alternative proposal can follow to make sure that it does get archived.

ljharb commented 6 years ago

Those forks should be transferred to TC39 in that case, I think (or forked to TC39, to achieve the same result).

allenwb commented 6 years ago

> Those forks should be transferred to TC39 in that case, I think (or forked to TC39, to achieve the same result).

I agree. Actually my main point is that we need to have a documented process and expectations to ensure this happens.

keithamus commented 6 years ago

I think if we can agree on a protocol that any fork presented in a meeting should be PR'd or transferred to the tc39 org, it would be vastly preferable to attempting to archive 20-30k+ forks. I don't want to sound like a broken record, but it pretty much hinges on this requirement, and downloading literal gigabytes of 99%-duplicate data seems like a waste of time.

keithamus commented 6 years ago

An alternative could be to iterate through every fork, check whether it has commits that don't feature in the source repo's history or PRs, and download only those forks. However, this would require a non-trivial amount of engineering effort and would still run into the same rate-limiting issues. So it seems like less of a tech problem and more of a people problem: adding rules and discipline solves this more easily (and would be more procedurally correct).
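
For the record, a sketch of roughly what that iteration could look like using the REST compare endpoint. It only checks whether a fork's default branch is ahead of upstream (the PR-deduplication step, pagination, and rate-limit handling are left out), and the repo name is just an example:

```python
# Sketch: list forks of a repo and keep those whose default branch contains
# commits not on the upstream default branch. Pagination, PR deduplication,
# and rate-limit handling are omitted; the repo name is illustrative.
import os
import requests

API = "https://api.github.com"
HEADERS = {"Authorization": f"token {os.environ['GITHUB_TOKEN']}"}

upstream = "tc39/ecma262"
base = requests.get(f"{API}/repos/{upstream}", headers=HEADERS).json()["default_branch"]

forks = requests.get(
    f"{API}/repos/{upstream}/forks", headers=HEADERS, params={"per_page": 100}
).json()

interesting = []
for fork in forks:
    head = f"{fork['owner']['login']}:{fork['default_branch']}"
    cmp = requests.get(
        f"{API}/repos/{upstream}/compare/{base}...{head}", headers=HEADERS
    ).json()
    # ahead_by > 0 means the fork has commits the upstream branch does not.
    if cmp.get("ahead_by", 0) > 0:
        interesting.append(fork["full_name"])

print(interesting)
```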

littledan commented 6 years ago

It's true that there are some forks in use like what @allenwb describes, for example https://github.com/valtech-nyc/proposal-fsharp-pipelines . Let's encourage maintainers of these forks to establish a WIP PR against the main repository to opt into archiving.

For me, the high order bit is that we've been missing effective archives for some time, and getting started again will be a big benefit, even if it's not perfect in terms of coverage the first time.

jorydotcom commented 6 years ago

So I've been poking around at this a bit tonight & giving the endpoint @keithamus shared a shot. Unfortunately I've run into two issues:

1) It's not really supported yet; the Octokit SDK doesn't have any documentation on this, so I kinda pieced it together and was able to get the expected responses.
2) It doesn't seem to work for org repos. I was able to use the API to generate test archives of my own repos, but failed when I tried initiating an archive for small tc39 repos that I have admin rights to (like tc39/tc39-notes). It seems like the Orgs Migration endpoint will work, but you have to be an org owner, and I'm not sure who that is, to be honest. @littledan?

The github-backup script ran fine for me earlier until I was rate-limited; I may try it again overnight and see if it happens to work.

@allenwb @ecmageneva a nice feature might be to adopt something like what the WHATWG is doing, wherein they take snapshots as part of their build & deploy steps so you can go to a snapshot version at any given commit (they used to have a warning banner to make it more obvious you were looking at an outdated version of the spec; not sure what happened to that). It doesn't solve the document archival problem, but it does give access to a historical view at a given moment in time.

Here's their build script for reference.
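
For a sense of the general idea (not WHATWG's actual script, and with made-up paths), a tiny sketch of a deploy step that publishes each build under its commit hash so historical snapshots stay reachable:

```python
# Tiny sketch of snapshot-per-commit deploys: copy the built spec into a
# directory named after the commit, plus a "latest" copy. Paths are made up.
import shutil
import subprocess
from pathlib import Path

commit = subprocess.run(
    ["git", "rev-parse", "--short", "HEAD"], capture_output=True, text=True, check=True
).stdout.strip()

build_dir = Path("out")                                  # wherever the build lands
snapshot_dir = Path("deploy/commit-snapshots") / commit  # one folder per commit

snapshot_dir.parent.mkdir(parents=True, exist_ok=True)
shutil.copytree(build_dir, snapshot_dir, dirs_exist_ok=True)
shutil.copytree(build_dir, Path("deploy/latest"), dirs_exist_ok=True)
print(f"published snapshot for {commit}")
```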

littledan commented 6 years ago

Thanks for the great work here Jory!

Looks like a lot of people have owner access, including @jugglinmike @rwaldron @leobalter. I wouldn't mind giving @jorydotcom access if others are OK with it, cc @bterlson.

Adapting WHATWG's script sounds good to me, if we can make it work for us. Seems like really useful functionality.

jorydotcom commented 6 years ago

Good news, @ecmageneva! I ran Allen's script overnight and it definitely worked (took about 4 hours according to the timestamps). I'm betting the issue you had is related to @IgnoredAmbience's post & we need to update Python + pip on your computer.

Bad news: the resulting zip file is 484MB, so I can't just email it to Patrick. I'll email you both to see if you're able to use Dropbox, or if I can just write to the NAS & y'all tell me where to put it.

@littledan RE adding some build functionality; do you think that would be something Yulia would be interested in discussing too?

littledan commented 6 years ago

I know it was raised in the TC39 meeting, but I honestly don't see much of a close dependency between archiving and the website or groups. If we just keep using GitHub, then this archiving strategy should "just work". Worth verifying of course. Cc @codehag

codehag commented 6 years ago

I agree that archiving should be treated separately, as if we try to work it into the website project we will lose focus and not do either well.

Regarding using github for archiving, do we have a document outlining how this will work? I know that there is a crawler that Keith is working on, and that other people are working on getting the old archived material back into a browsable form. It would be great to have a meeting regarding this to understand where we are with everything.

keithamus commented 6 years ago

Apologies: I've been on a bit of a vacation recently, so I haven't been able to keep up to date with these things.

@jorydotcom I'd be happy to work with you on resolving the issues you have with GitHub's migrations feature. Using this would likely be preferable to the backup script - especially if it is getting continuously rate limited. The APIs are likely behind an ACL that means only owners can migrate. I would recommend we have owners try the migration endpoints.

jorydotcom commented 6 years ago

no worries, @keithamus - I totally agree the GH API would be preferable to the script & would be happy to work with you on the crawler &/or API issues. This is your domain, after all!

There's a small ad hoc history group forming with @allenwb & @ecmageneva that this work is specifically pertinent to. Would be great to have the archiving conversation tied in with that group if everyone agrees it makes sense.

littledan commented 6 years ago

@jorydotcom GitHub archiving seems like a very important history task. @IgnoredAmbience has been doing some great archiving work; maybe he would also be interested in this group.

IgnoredAmbience commented 6 years ago

I'm running on the assumption that all my archival work will go into this repository and thus be picked up by the tc39 org archive for ECMA when it is taken.

The one potential interface with the website would be to make the tc39.github.io/archives site fit better with the overall website design; I'd briefly discussed this with @codehag over IRC a couple of weeks back.

jorydotcom commented 6 years ago

@IgnoredAmbience would you be interested in joining an ad hoc discussion we're trying to arrange for the second week of September RE history? https://github.com/tc39/Reflector/issues/165

ctcpip commented 4 months ago

resolved via https://github.com/tc39/archive

ecmageneva commented 4 months ago

Not sure where we are on this issue now... Just for those who are new here:

For several years Ecma tried to capture all TC39-related entries of the TC39 GitHub pages. We never had a well-functioning solution where we could also search and find all the TC39 information again. Apparently GitHub did not offer a feature/service (I do not know what to call it) that would allow capturing all TC39-relevant information into the Ecma private server (it was a Synology server, now I do not know...) under the TC39 directory.

At the beginning Allen (and myself, while the script still worked from Europe... later some timer issue did not allow it...) and Jory gathered everything under the TC39 GitHub with a script. But we always had a big problem: we only captured the files and never had effective software to search and present them. I complained about this in many TC39 meetings. So we put everything, roughly yearly (later even less frequently...), into a huge ZIP file that got an Ecma file number (like TC39/2018/xyz). So, to be honest, in my opinion it was not a terribly useful exercise.

Then, for the last couple of years (I do not remember when Jory did the last script run...), we have taken from GitHub (practically via file duplication onto the Ecma private file server) the information we regarded as key for long-term Ecma storage (e.g. the contribution slides, GitHub drafts of the specs, a copy of the Technical Notes, etc.). We selected the documents we regarded as relevant for Ecma's long-term storage; this is required by the WTO guidelines on how SDOs should work. So it is a manual process, and in my opinion the Ecma long-term storage now has all the relevant information. The hand-picking of those documents is not too elegant, but it is OK; collecting the data into a ZIP file after each meeting takes about an hour of Secretariat work. We have been doing this for several years now, and the last one we did immediately after the April 2024 meeting. So it works. Please note that according to Ecma rules we have to finish the meeting minutes and the ZIP file in less than 3 weeks, and we have always been able to meet that so far. We update the ZIP even after the 3-week deadline once the Technical Notes become available.

In parallel, GitHub (the MS company) also has a long-term storage project to save all the GitHub information for all GitHub users forever; I think it is somewhere in the Arctic region in Norway. But what is concretely behind it, what the plan is, and where they are with it, I just do not know...

So, this is the current situation. All in all we have a working solution, but of course, as with everything, this can also be improved...

keithamus commented 4 months ago

> In parallel, GitHub (the MS company) also has a long-term storage project to save all the GitHub information for all GitHub users forever; I think it is somewhere in the Arctic region in Norway. But what is concretely behind it, what the plan is, and where they are with it, I just do not know...

This program is called the "Arctic Code Vault" and is situated in the Arctic World Archive in Svalbard, Norway. As to what is inside:

https://github.com/github/archive-program/blob/master/GUIDE.md#whats-inside

The snapshot consists of the HEAD of the default branch of each repository, minus any binaries larger than 100KB in size. (Repos with 250+ stars retained their binaries.) Each was packaged as a single TAR file.

ljharb commented 4 months ago

Sadly there's no way to see what's in there :-/ There's a bunch of valuable since-deleted content in there that'd be great to recover.