ietf-ribose / svn-github-convert

Repository for `reposurgeon` svn => git migration scripts

Add Trac exports via `tractive` #18

Closed ronaldtse closed 2 years ago

ronaldtse commented 3 years ago

xml2rfc

xml2rfc-website

xml2rfc-bibxml

vocabulary_design_team_2013_2017

The Trac tickets for the "vocabulary_design_team_2013_2017" repo can be seen here https://trac.ietf.org/trac/xml2rfc/query

Screenshot 2021-09-25 at 9 46 30 AM

(originally from https://github.com/ietf-ribose/svn-github-convert/issues/7#issuecomment-926992804)

ietfdb (Datatracker)

Screenshot 2021-09-23 at 4 12 47 PM

ietfdb (Mailarchive)

Screenshot 2021-09-23 at 4 16 29 PM

Mailreplay

No Trac, no need to migrate issues.

Postconfirm

No Trac, no need to migrate issues.

Postfind

No Trac, no need to migrate issues.

dmarc

Trac: trac-svn-db/trac/dmarc. SVN: none.

emailcore

Trac: trac-svn-db/trac/emailcore. SVN: none.

tsvwg

Trac: trac-svn-db/trac/tsvwg. SVN: none.

ronaldtse commented 3 years ago

Remaining: mailreplay, postconfirm, postfind, dmarc, emailcore, tsvwg. (Updated in OP)

ronaldtse commented 3 years ago

@rjsparks could you help confirm the following?

  1. For the new ietfdb repository, all Trac tickets that relate to components other than "MailArchive: *" should be migrated there (see screenshot).

(https://trac.ietf.org/trac/ietfdb/query)

Screenshot 2021-09-28 at 9 39 13 AM

  2. For the new xml2rfc-website and xml2rfc-bibxml, how do we select the relevant tickets by component (see screenshot)? Or perhaps no tickets should be migrated into these split-off repositories?

Screenshot 2021-09-28 at 9 41 41 AM

ronaldtse commented 3 years ago

I have added a new GitHub organization called ietf-svn-conversion (https://github.com/ietf-svn-conversion) for the testing of migrating issues. All users have full admin rights there so that you can delete/re-create the repos there when a re-run is necessary.

HassanAkbar commented 3 years ago

@ronaldtse What should the filters be for xml2rfc-website and xml2rfc-bibxml?

Screen Shot 2021-10-11 at 2 46 26 PM

ronaldtse commented 3 years ago

@HassanAkbar For both of these repos, we do not need to import tickets/issues. Thanks!

HassanAkbar commented 3 years ago

@ronaldtse Where can I get the email to GitHub username mapping for all these Trac instances? Do we have a separate repository for that?

ronaldtse commented 3 years ago

The Trac/SVN username to GitHub email mappings are provided in this repo, under this pattern: {reponame}/{reponame}.map.

HassanAkbar commented 3 years ago

> The Trac/SVN username to GitHub email mappings are provided in this repo, under this pattern: {reponame}/{reponame}.map.

@ronaldtse we need to provide a GitHub username to the GitHub API. The API does not accept a GitHub email as input.

ronaldtse commented 3 years ago

Ah, then we need to do that manually. Maybe we should just define a YAML mapping?

ronaldtse commented 3 years ago

I've provided a structure here: https://github.com/ietf-ribose/svn-github-convert/blob/main/github-users-email.yaml .

We have to do this manually because some emails are not "public GitHub emails" and so cannot be searched on GitHub, even though they show up in Git commit history.

ronaldtse commented 3 years ago

I've made a new issue for this: https://github.com/ietf-ribose/svn-github-convert/issues/35 . Tractive needs to utilize this new mapping format for GitHub email => username.

HassanAkbar commented 3 years ago

> I've made a new issue for this: #35 . Tractive needs to utilize this new mapping format for GitHub email => username.

Sure thing. I will take this on after getting the GitHub workflow finalized.

HassanAkbar commented 3 years ago

> I've made a new issue for this: #35 . Tractive needs to utilize this new mapping format for GitHub email => username.

Just to clarify, are we changing the format of the user mapping provided to Tractive in the config.yaml file from

adam@nostrum.com: adamroach

to

mapping:
  - name: Adam Roach
    email: adam@nostrum.com
    username: adamroach

@ronaldtse I just wanted to know if there is a special reason for that, because a hash gives us O(1) retrieval while the new format (an array) takes O(N) retrieval.

ronaldtse commented 3 years ago

Yes, the reason is we want to keep the name, email and username together for information purposes.

There should be no performance difference here because we read the entire file once at the beginning and can then convert it into a hash internally. A big-O difference only arises if we use the file's structure directly for lookups, but we don't (i.e. file data representation != data representation in Tractive).
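The distinction between the file format and the in-memory representation can be sketched as follows. This is a minimal illustration in Python, not Tractive's actual code: the `build_lookup` helper is hypothetical, the list literal stands in for the parsed `mapping:` section of the YAML file, and the second entry is a made-up example.

```python
from typing import Dict, List

# Stand-in for the parsed `mapping:` list from the YAML file.
# The second entry is a made-up example, not a real user.
mapping: List[Dict[str, str]] = [
    {"name": "Adam Roach", "email": "adam@nostrum.com", "username": "adamroach"},
    {"name": "Example User", "email": "user@example.com", "username": "example-user"},
]


def build_lookup(entries: List[Dict[str, str]]) -> Dict[str, str]:
    """Walk the list once (O(N)) and return an email -> username dict."""
    return {entry["email"].lower(): entry["username"] for entry in entries}


lookup = build_lookup(mapping)

# Every later lookup is a constant-time dict access.
print(lookup.get("adam@nostrum.com"))    # adamroach
print(lookup.get("nobody@example.org"))  # None
```

The list form keeps name, email and username together for human readers, while the dict built at load time serves the lookups.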

HassanAkbar commented 3 years ago

@ronaldtse I encountered a few issues while creating the Trac issues on GitHub:

  1. We cannot assign an issue to someone unless they are a collaborator on the repo or have commented on that issue (https://github.blog/2019-06-25-assign-issues-to-issue-commenters/).
  2. The usernames in commits are not actual GitHub usernames. For example, tom111.taylor is listed in a commit, but there is no GitHub user with that username. Adding an image for reference:
Screenshot 2021-10-14 at 3 42 35 PM

This is because any username can be attached to a commit pushed to GitHub, but not to an issue. On issues we can only use valid GitHub usernames of users who are either collaborators or have commented on that issue.

It feels like reposurgeon is creating usernames for commits by stripping everything in the email address after the @ symbol. So in the above case, for tom111.taylor@bell.net, reposurgeon makes the commit under the username tom111.taylor.

Screenshot 2021-10-14 at 3 49 18 PM

I think we will have to add these users as collaborators to our final repository if we want to assign issues to them. P.S. I have tested this workflow: unless a user accepts the invitation to become a GitHub collaborator, they cannot be added as an assignee. What do you suggest we do here?

ronaldtse commented 3 years ago

@HassanAkbar there is a misunderstanding:

> It feels like reposurgeon is creating usernames for commits by stripping everything in the email address after the @ symbol.

Perhaps the list in the screenshot is outdated? In the current repo, we no longer have the reposurgeon names. We have changed them all to real names: https://github.com/ietf-ribose/svn-github-convert/search?q=tom111

Screenshot 2021-10-14 at 7 41 06 PM

We obtained all these emails from an IETF internal system, so the emails are correct. However, an email is not necessarily registered with GitHub, and the user may not have a GitHub account at all.

This is the list of emails not yet mapped to a GitHub account:

So in order to make the mapping happen we really need to finalise the SVN user to GitHub email/username mapping.

HassanAkbar commented 3 years ago

@ronaldtse

Thanks for the clarification. You are right that the emails have been updated in the file. I need to finalize the SVN user -> GitHub email/username mapping. I am working on this.

I just wanted to highlight this 2nd issue:

> We cannot assign an issue to someone unless they are a collaborator on the repo or have commented on that issue (https://github.blog/2019-06-25-assign-issues-to-issue-commenters/).

I think we will have to add these users as collaborators to our final repository if we want to assign issues to them. I have tested this workflow: unless a user accepts the invitation to become a GitHub collaborator, they cannot be added as an assignee.

Thanks.

ronaldtse commented 3 years ago

Indeed, I forgot about that. Perhaps we cannot migrate the assignments until all these users accept the invitation.

@rjsparks do we need to migrate the ticket assignments? If so, could we ask all users to accept the GitHub invitation in the final repository for migration?

rjsparks commented 3 years ago

No - it's fine (maybe a feature) if ticket assignment gets lost. @rpcross, @jennifer-richards, @kesara: heads up...

ronaldtse commented 3 years ago

@rjsparks so just to confirm: we "won't migrate ticket assignments" but will keep them in the issue comments as text.

By any chance, is @rpcross the elusive Ryan Cross we've been trying to pin down? Would he be able to add his amsl.com email to his GitHub account to enable linking?

HassanAkbar commented 3 years ago

@ronaldtse As we will be creating all issues from one GitHub account, we will need the GitHub Personal Access Token of the user we want to use for the final migration.

HassanAkbar commented 3 years ago

@ronaldtse Do we need 2 separate repos for ietfdb (Mailarchive) and ietfdb (Datatracker)?

rjsparks commented 3 years ago

Yes, the mailarchive should end up in its own repository separate from the datatracker.

HassanAkbar commented 3 years ago

There are 3389 tickets in total in ietfdb.

For MailArchive, there are only 327 tickets and for Datatracker there are only 205 tickets.

Do we need to migrate the remaining tickets to a separate repository?

ronaldtse commented 3 years ago

Thanks for checking. I don’t remember what the other components are other than Datatracker. Do you have a list?

ronaldtse commented 3 years ago

On https://trac.ietf.org/trac/ietfdb/query there are many components seemingly not related to datatracker, e.g. Projects, noncom, drafts etc.

@rjsparks to preserve history I suspect we may want to store those tickets somewhere. Where should those go?

HassanAkbar commented 3 years ago

> Thanks for checking. I don’t remember what the other components are other than Datatracker. Do you have a list?

@ronaldtse Here is the list of all the components.

P.S. There are some tickets with an empty or null component in the database.

ronaldtse commented 3 years ago

@HassanAkbar “MailArchive: *” tickets will go into the “mailarchive” repo.

Will let @rjsparks answer the question about where the remaining tickets should go.

HassanAkbar commented 3 years ago

I ran a test migration and the issues are now migrated to the following repos:

rjsparks commented 3 years ago

All of the tickets in ietfdb that have a component that doesn't start with MailArchive belong to the datatracker (even the ones that have no declared component). The components in the long list above (minus the two MailArchive components) are parts of the datatracker, or were datatracker-oriented projects.
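The routing rule described here is small enough to sketch. This is a minimal Python illustration, not the migration tool's configuration: `target_repo` is a hypothetical helper, the component names in the usage lines are made up, and the repository names `mailarchive` and `datatracker` are taken from the thread.

```python
from typing import Optional


def target_repo(component: Optional[str]) -> str:
    """Pick the destination repository for a ticket from its Trac component.

    Components starting with "MailArchive" go to the mailarchive repo.
    Everything else, including an empty or missing component, goes to
    the datatracker repo.
    """
    if component and component.startswith("MailArchive"):
        return "mailarchive"
    return "datatracker"


# Made-up component names, for illustration only.
print(target_repo("MailArchive: Example"))  # mailarchive
print(target_repo("drafts"))                # datatracker
print(target_repo(None))                    # datatracker
```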

ronaldtse commented 3 years ago

@rjsparks got it. Then can we confirm that there should only be two repositories as target outputs for ietfdb, mailarchive and datatracker (i.e. the ietfdb name will go away)?

rjsparks commented 3 years ago

that is correct

ronaldtse commented 3 years ago

Thanks @rjsparks . @HassanAkbar can you help update the config so we are exporting to those two repositories without the ietfdb- prefix? Thanks.

HassanAkbar commented 3 years ago

@ronaldtse In the tsvwg database there is an email, draft-ietf-tsvwg-source-quench@tools.ietf.org, which when combined to form a label like owner:draft-ietf-tsvwg-source-quench@tools.ietf.org exceeds the maximum length allowed for label creation, which is 50 characters.

What should we do in this case?

Update:

Right now I have changed the email from draft-ietf-tsvwg-source-quench@tools.ietf.org to draft-ietf-tsvwgsource-quench@tools.ietf.org to bypass the label length limitation.
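The arithmetic behind this workaround can be checked with a short Python sketch. The 50-character limit is the figure reported in this comment; `owner_label` and `fits` are hypothetical helper names, not part of Tractive.

```python
MAX_LABEL_LENGTH = 50  # limit reported in the comment above


def owner_label(email: str) -> str:
    """Build the owner label for an email address."""
    return f"owner:{email}"


def fits(label: str) -> bool:
    """Return True if the label is within the reported length limit."""
    return len(label) <= MAX_LABEL_LENGTH


original = owner_label("draft-ietf-tsvwg-source-quench@tools.ietf.org")
shortened = owner_label("draft-ietf-tsvwgsource-quench@tools.ietf.org")

print(len(original), fits(original))    # 51 False
print(len(shortened), fits(shortened))  # 50 True
```

The original label is one character over the limit, which is why dropping a single hyphen is enough to make it fit.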

HassanAkbar commented 3 years ago

@ronaldtse Ran another round of test migrations; here are the results:

HassanAkbar commented 3 years ago

@ronaldtse Also we should finalize the label colors before final migration to make sure they are the same across all the repositories. Any suggestions are welcome for this.

ronaldtse commented 3 years ago

@HassanAkbar The label colors look good now!

The current "owner" label is a bit awkward looking. But since we cannot expect everyone to (a) have a GitHub account; (b) be added to the repos; there is no better alternative.

Screenshot 2021-10-26 at 9 45 02 AM

On the other hand, when we migrate tickets, it is possible to convert the emails into GitHub usernames, given that we have the mapping (if there is no mapping we just show the email):

Screenshot 2021-10-26 at 9 46 11 AM

@rjsparks could we ask you for feedback on the migrated tickets (see above comment), and let us know if in the comments we should also tag the GitHub user?

rjsparks commented 3 years ago

In the final migration, will "opened on date by Hassan Akbar" be replaced by "opened on date by Real Author" when the user's GitHub account is actually known?

I really hope so, but if not, how am I supposed to find all of the tickets opened by a given person?

If so, then we wouldn't need to tag Jay above - he'd just be the user that opened the ticket, and for those users who don't have GitHub accounts that we know about, having it be by is fine.

rjsparks commented 3 years ago

What's the current plan for when we will be able to check that references to subversion commits in these tickets are mapped correctly to github commits? (And that commit messages in svn that contain references to trac tickets end up with git commit messages that reference the right issue?).

The [19412] in the last comment at https://github.com/ietf-svn-conversion/datatracker/issues/3424 will eventually need to be a link to some github commit.

rjsparks commented 3 years ago

Tell me more about the solution for component - I think it's just a bit of text marked as code in the first comment? For us to continue to use the components to manage the project, that adds a bit of arcana that anyone working with the system would have to remember to add? Or do I misunderstand what's happening with that?

HassanAkbar commented 3 years ago

> In the final migration, will "opened on date by Hassan Akbar" be replaced by "opened on date by Real Author" when the user's GitHub account is actually known? I really hope so, but if not, how am I supposed to find all of the tickets opened by a given person? If so, then we wouldn't need to tag Jay above - he'd just be the user that opened the ticket, and for those users who don't have GitHub accounts that we know about, having it be by is fine.

Unfortunately, the "opened by" user will not be the real author; it will be the account used to perform the migration. This is a restriction of the GitHub API.

In the above case, I am using my account to perform the migrations, which is why it says "opened on date by Hassan Akbar".

If the username is provided in the config file, a label with the format owner:<github username> is added, which can be used to search for tickets opened by a given person.

> What's the current plan for when we will be able to check that references to subversion commits in these tickets are mapped correctly to github commits? (And that commit messages in svn that contain references to trac tickets end up with git commit messages that reference the right issue?). The [19412] in the last comment at ietf-svn-conversion/datatracker#3424 will eventually need to be a link to some github commit.

A PR has been opened to fix this.

I will leave the last question for @ronaldtse to answer.

ronaldtse commented 3 years ago

> Tell me more about the solution for component - I think it's just a bit of text marked as code in the first comment?

Right now we map "component" to labels.

> For us to continue to use the components to manage the project, that adds a bit of arcana that anyone working with the system would have to remember to add? Or do I misunderstand what's happening with that?

You are correct. The difference is that on GitHub the component label is not mandatory for creating an issue.

To aid this we could potentially name the component labels as "component:foobar" for a component "foobar" to make them clear. Would that help?

ronaldtse commented 3 years ago

> In the final migration, will "opened on date by Hassan Akbar" be replaced by "opened on date by Real Author" when the user's GitHub account is actually known?

This is an unfortunate fact that accompanies the migration -- neither of the GitHub Issues APIs (bulk and v3) permits creating issues on behalf of other users. It is clear from a security standpoint why they wouldn't want that. At least we are able to set the original "date"!

That's why tagging the user's GitHub handle is important in the migrated issue.

rjsparks commented 3 years ago

So then the final migration should be done by an identity that makes it very clear what happened. An account created for just this purpose with a name of 'ietfsvnmigration' or something that conveys the message but is shorter.

This is really sad in that it will not reflect all the contribution (in terms of issues) on users' activity graphs.

See @larseggert for example:

image

I'll need to explain that carefully and fully to the community.

rjsparks commented 3 years ago

mapping the components to labels "component:foobar" would be better - it would make it more intuitive for people reporting to do the right thing.

ronaldtse commented 3 years ago

> This is really sad in that it will not reflect all the contribution (in terms of issues) on users' activity graphs.

Indeed. At least the commits will show on their contribution log. This is a known caveat with migrating to GitHub, for now.

ronaldtse commented 3 years ago

> mapping the components to labels "component:foobar" would be better - it would make it more intuitive for people reporting to do the right thing.

Got it. @HassanAkbar can you help make the corresponding change? Thanks.

HassanAkbar commented 3 years ago

> > mapping the components to labels "component:foobar" would be better - it would make it more intuitive for people reporting to do the right thing.
>
> Got it. @HassanAkbar can you help make the corresponding change? Thanks.

Sure thing.

HassanAkbar commented 3 years ago

@ronaldtse

  1. To change SVN revisions to Git SHAs in the commit messages, we need to create new commits, because commits are immutable in Git. The only way to change an existing commit message is to remove and redo the commit.
  2. Each commit will have a new SHA after the remove and redo, so we will have to update the revmap file after updating each commit message.
  3. We cannot do this in reposurgeon because it creates a fast-import stream for Git to create the repo. Before importing that file to GitHub, we don't have the SHA hashes. So, we can only edit commit messages after the GitHub repo has been imported.
  4. I am investigating whether the git-filter-repo tool can edit commit messages without affecting the commit tree.

Let me know if you have any other thoughts here.

UPDATE: Found an example in the documentation of git-filter-repo:

git-filter-repo --message-callback '
  if b"Signed-off-by:" not in message:
    message += b"\nSigned-off-by: Me My <self@and.eye>"
  return re.sub(b"[Ee]-?[Mm][Aa][Ii][Ll]", b"email", message)'

We could do something like this and read the SHA hashes from the revmap file. But this works only with Python. Should we go for this approach?
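One way the substitution itself might look, written as a plain Python function that could be adapted into a `--message-callback` body. This is a sketch under assumptions: the revmap is assumed to be already loaded into a dict from SVN revision number to Git SHA (the revmap file format is not shown in this thread), only the `[19412]` style of reference is handled, and the SHA below is a placeholder. It does not address point 2 above, namely that rewritten commits receive new SHAs.

```python
import re
from typing import Dict

# Hypothetical, already-loaded revmap: SVN revision number -> Git commit SHA.
# The SHA is a placeholder, not a real commit.
revmap: Dict[int, bytes] = {
    19412: b"0123456789abcdef0123456789abcdef01234567",
}

# Trac-style changeset reference such as [19412].
REVISION_REF = re.compile(rb"\[(\d+)\]")


def rewrite_message(message: bytes, revmap: Dict[int, bytes]) -> bytes:
    """Replace [N] with the mapped SHA; leave unknown revisions untouched."""

    def replace(match: "re.Match[bytes]") -> bytes:
        sha = revmap.get(int(match.group(1)))
        return sha if sha is not None else match.group(0)

    return REVISION_REF.sub(replace, message)


print(rewrite_message(b"Merged in [19412] from a branch.", revmap))
```

As in the documentation example quoted above, the message is handled as bytes, so the pattern and the replacement values are bytes as well.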

ronaldtse commented 3 years ago

@HassanAkbar thank you for researching the approach.

  1. I think first of all let's find out whether the IETF repos contain those revision references. If the commit messages do not contain revision references, then we don't need to migrate them.
  2. If there are revision references, we will want to consider the volume of them.
  3. In addition I think it may be possible to do it "during" reposurgeon conversion in the '.lift' file.

Can you help check step 1? Thanks!
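Steps 1 and 2 could be checked with a small scan over the commit messages. The sketch below is an assumption-laden illustration: it looks only for the two Trac changeset link styles `[19412]` and `r19412`, which the IETF commits may or may not use, and the sample messages are made up. In practice the messages would come from the converted repository's log.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed reference styles: Trac changeset links like [19412] and r19412.
PATTERNS = {
    "bracket": re.compile(r"\[(\d+)\]"),
    "r-prefix": re.compile(r"\br(\d+)\b"),
}


def count_revision_refs(messages: Iterable[str]) -> Counter:
    """Count how many references of each style appear across all messages."""
    counts: Counter = Counter()
    for message in messages:
        for name, pattern in PATTERNS.items():
            counts[name] += len(pattern.findall(message))
    return counts


# Made-up commit messages, for illustration only.
sample = [
    "Merged in [19412] from a branch.",
    "Follow-up to r19410 and r19411.",
    "Fix typo.",
]
counts = count_revision_refs(sample)
print(counts["bracket"], counts["r-prefix"])  # 1 2
```

A count of zero for every style would settle step 1; otherwise the totals give the volume asked about in step 2.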