src-d / ghsync

GitHub API v3 > PostgreSQL
https://sourced.tech
GNU General Public License v3.0
9 stars 8 forks source link

Support repository rename and issue migration #26

Open dpordomingo opened 5 years ago

dpordomingo commented 5 years ago

If I'm not wrong, the PR reviews are currently tied to its causing PR by its owner/repo/number, instead of using github repository.id.

Since a repository can be renamed, the current schema does not let yo to obtain info about comments per PR with this kind of queries:

SELECT pull_request_number, repository_owner, repository_name, count(*) as count
  FROM public.pull_request_reviews
  GROUP BY pull_request_number, repository_owner, repository_name
  ORDER BY count desc

It happens the same with issue comments, and also with issues. Since an issue can be also migrated to a different repo, or it can be renamed, sorting repos by the amount of created issues would not be also possible:

SELECT count(*) as count, repository_owner, repository_name
  FROM issues GROUP BY repository_owner, repository_name
  ORDER BY count desc;

See below :point_down: for more details and proposals to solve the described problem.

dpordomingo commented 5 years ago

When can this problem happen?

Alternatives

(A) To use provider ids as foreign keys

It could also speed up some queries/charts if it were also added fk: repo_id into issue_comments, pull_request_reviews and pull_requests_comments, so it would be easy to query for these entities per repo.

(B) listen for repo renames and issue migration

If every time it happens, the whole dataset is updated, it would be fully reliable the group: owner/repo/number. (But I think it could be a bit expensive in big orgs and big repos)