psf / gh-migration

This repo is used to manage the migration from bugs.python.org to GitHub.
42 stars 8 forks source link

Some GH PR references were not ported correctly #22

Open JustAnotherArchivist opened 2 years ago

JustAnotherArchivist commented 2 years ago

https://github.com/python/cpython/issues/90908#issuecomment-1093944756 has two incorrect references. The first one goes to GH-75453, which is bpo-31270. It should be GH-31270 (https://bugs.python.org/issue46752#msg413260). Note that the reference is correct in the following 'New changeset' comment.

This is the only case I've seen so far, but I haven't systematically looked for more.

JustAnotherArchivist commented 2 years ago

Found two more:

https://github.com/python/cpython/issues/91294#issuecomment-1093947224 references GH-76305 instead of GH-32124. https://github.com/python/cpython/issues/91302#issuecomment-1093947297 references GH-76242 instead of GH-32061.

ezio-melotti commented 2 years ago

It seems like GH-* were interpreted as issue number and converted automatically to the new migrated issue during the transfer. This automatic conversion was supposed to be disabled during the transfer, but apparently it wasn't.

vstinner commented 2 years ago

Another example. In https://bugs.python.org/issue47075 I explicitly wrote I proposed GH-32112 for that. to get a link to my PR https://github.com/python/cpython/pull/32112 but the converted issue https://github.com/python/cpython/issues/91231 points to https://github.com/python/cpython/issues/76293 i don't understand how GH-32112 was transformed to #76293 which is an unrelated issue (not a PR)

ezio-melotti commented 2 years ago

i don't understand how GH-32112 was transformed to #76293 which is an unrelated issue (not a PR)

During the transfer, https://bugs.python.org/issue32112 was transferred to https://github.com/python/cpython/issues/76293. The transfer tool thought that your GH-32112 was a reference to bpo-32112 (rather than referring to a GitHub PR), and since bpo-32112 is now GH-76293, it converted GH-32112 into GH-76293.

gvanrossum commented 2 years ago

The transfer tool thought that your GH-32112 was a reference to bpo-32112 (rather than referring to a GitHub PR)

How could that happen? That seems terrible. :-(

JustAnotherArchivist commented 2 years ago

Any update on whether/how/when this could hopefully get fixed?

vstinner commented 2 years ago

The transfer tool thought that your GH-32112 was a reference to bpo-32112

Would it be possible to process all data once more time to detect the issue and fix it? This issue is quite annoying.

The current workaround is to go to the original issue, find the message to look for the correct GH issue/PR number.

serhiy-storchaka commented 2 years ago

I agree that it is very serious issue. For now we have incorrect references and the reader can not even know that they are incorrect.

Can we repeat translation which correctly detects GitHub references locally, compare it with the old incorrect translation, and apply the diff? The diff should be orders smaller that the whole data (but it still can be thousands messages).

ezio-melotti commented 2 years ago

Since this happened at transfer time, one of the test repos that I kept around might still have the correct references and it might be used to fix the links. Doing it from bpo is a bit trickier because there is quite a bit of processing that happened at import time (even though it should be possible to extract only the references).

The main issue is that this requires a mass-rewrite which will have to go through thousands of issues and comments and edit them, and this is not trivial to write and test. At this point we will also have to use the public API, which is rate limited. If it is done, the same infrastructure could also be used to replace bpo links with GH links in the first message of the PRs, and to @mention again nosy list members in the first message of issues. I think this will also edit the "last updated" field of all the affected issues/PRs, making it more difficult to find old/stale issues/PRs.

serhiy-storchaka commented 2 years ago

How many issues are affected? 100, 1000 or 10000? If it is only few 100s, we can survive editing the "last updated" field. But in long term it would be better to find a way to edit messages without changing the "last updated" field. We may need it to fix references to old Subversion and Mercurial revisions.

ezio-melotti commented 2 years ago

I don't have a number, and when we discussed different options to edit the messages after the fact and they all changed the "last updated" field (technically GitHub could rewrite the messages directly in the DB, but it's not something they would like to do and we would need their help).

JustAnotherArchivist commented 2 years ago

I can get an estimate of this shortly.

JustAnotherArchivist commented 2 years ago

There are about 19270 lines in messages on bpo (as of late March, shortly before the migration) that contain a link with the text GH-*. This number includes the 'New changeset' messages, which were migrated correctly, and I'm not sure how to filter them out (without a complete export of the GitHub issues to compare it directly). I don't see any differences between the two in bpo's HTML. For example, the links mentioned in the original issue are:

In addition, there are about 5993 lines in bpo messages that contain other links to GH PRs. Roughly half of these are PR # links (example), the others are mostly just plain URLs (example). A brief sampling seems to indicate that those were transferred correctly.

JustAnotherArchivist commented 2 years ago

One thing I suspected regarding those correctly migrated links is that perhaps messages that first contain a bpo link and then a GH PR link worked. This does not appear to be the case: exhibit A, exhibit B.

I also found one issue with a weird link on GitHub: this comment was rewritten as python/issues-test-cpython#30631. It should clearly point to https://github.com/python/cpython/pull/30631.

ezio-melotti commented 2 years ago

A few things that might help you:

This number includes the 'New changeset' messages, which were migrated correctly, and I'm not sure how to filter them out (without a complete export of the GitHub issues to compare it directly).

Maybe they were migrated correctly because the regex used by GitHub to find links/references didn't match when the reference was surrounded by (...), like in the "New changesets" messages. All those message have the same format, so it shouldn't be difficult to filter them out (a msg.startswith('New changeset') is probably enough).

If you want to see the pre-transfer messages, I can give you access to one of the test repos. If you have other questions let me know.

JustAnotherArchivist commented 2 years ago

Hmm, your regex would not match GH-1234 with a hyphen at all. This would suggest that such references were all rewritten by GitHub...?

I know that one limitation of the tool is that it could only convert issues that had already been migrated.

This makes sense. However, it should not affect PR references since all those PRs were already on GitHub beforehand.

Maybe they were migrated correctly because the regex used by GitHub to find links/references didn't match when the reference was surrounded by (...), like in the "New changesets" messages. All those message have the same format, so it shouldn't be difficult to filter them out (a msg.startswith('New changeset') is probably enough).

I can't easily filter for New changeset (since it's generally on a different line than the PR reference in the HTML served by bpo that I'm processing for this), but the hint about brackets is a good one! Firstly, I found several more of those python/issues-test-cpython examples: https://bugs.python.org/msg404903, https://bugs.python.org/msg364827, https://bugs.python.org/msg410766. I also found this example of the form (GH-# and GH-#) where the former link is correct but the latter goes to that test repo: https://bugs.python.org/msg331261https://github.com/python/cpython/issues/53775#issuecomment-1093515121. Overall, there are about 777 GH-# links that are not preceded by (.

ezio-melotti commented 2 years ago

Hmm, your regex would not match GH-1234 with a hyphen at all. This would suggest that such references were all rewritten by GitHub...?

You are right: I didn't rewrite the GH-* references because those are already recognized by GitHub, but other refs like GH 1234 or GH1234 or PR-1234 aren't, so I used the regex to rewrite them. Instead of converting them into GH-1234 I converted them into MarkDown links in order to preserve the original text.

This makes sense. However, it should not affect PR references since all those PRs were already on GitHub beforehand.

Note that there is quite a bit of overlapping between the PRs ids and bpo ids, so a bpo message that linked to the PR GH-1234 could have been mistaken by the tool for a link to an issue with id 1234 (since issues and PRs share the same namespace on GitHub), and after migrating bpo-1234 to GH-4321 all the references to PR GH-1234 could have been rewritten to GH-4321.

I can't easily filter for New changeset (since it's generally on a different line than the PR reference in the HTML served by bpo that I'm processing for this)

Instead of scraping the HTML directly (if this is what you are doing), you could use the XMLRPC interface to fetch all the bpo messages. Even if you are scraping HTML, it shouldn't be difficult to isolate the individual messages. If you need more info about this let me know, however it might be better to look at the test repo to see how the messages look like after the import (I gave you read access to one of them [^1]).

but the hint about brackets is a good one!

Glad that helped! I noticed similar problems while testing before the migration.

Firstly, I found several more of those python/issues-test-cpython examples: https://bugs.python.org/msg404903

This might be a separate bug in the transfer tool:

So my export tool kept references to GH-* unchanged and converted other references (e.g. GH *, PR *, etc.) to MarkDown links. Then, the transfer tool:

Can you confirm that your observations match these statements?

[^1]: I actually realized that the one I gave you access to (issues-test-cypthon-2) was used to test the transfer, so it had the same issues as the python/cpython repo. I now revoked your permissions to that repo and added you to a different repo (issues-test-2) that had the imported messages before they were transferred.

vstinner commented 2 years ago

when we discussed different options to edit the messages after the fact and they all changed the "last updated" field (technically GitHub could rewrite the messages directly in the DB, but it's not something they would like to do and we would need their help).

I don't see the "last updated" change as an issue. It's a good thing that changes are traced and public, no?

ezio-melotti commented 2 years ago

Technically yes, however if we mass-update the issues you won't be able to find old or stale issues anymore since they will all show as recently updated. Before the migration we decided it would be better to preserve the original "last updated" date, but we could revisit that decision. Even if we do, we would still need a script to perform the mass update.