refactoring-ai / predicting-refactoring-ml

Refactoring recommendation via ML
MIT License

Database Duplicates #63

Closed. jan-gerling closed this issue 4 years ago.

jan-gerling commented 4 years ago

We create many duplicate entries in the normalized tables, e.g. CommitMetaData (screenshot: https://user-images.githubusercontent.com/29139613/74825455-40ca4e80-530a-11ea-9775-c98eaf886c03.png).

This happens because we use a custom generated id to link the tables. A new id is generated for every new instance, even when the CommitMetaData is the same object.

Can we change the id to the commitId in order to reduce the database size and allow a unique mapping? Furthermore, this leads to another issue: #62.
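To illustrate the proposal, here is a minimal sketch in plain Java, not the project's actual Hibernate mapping. It keys rows by commitId instead of a freshly generated id, so saving the same commit twice yields one row. The class names `CommitMetaDataRow` and `CommitMetaDataTable` and the `message` field are hypothetical and stand in for the real entity and table.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for the project's CommitMetaData entity.
final class CommitMetaDataRow {
    final String commitId; // natural key: the commit hash
    final String message;  // illustrative payload field (assumption)

    CommitMetaDataRow(String commitId, String message) {
        this.commitId = commitId;
        this.message = message;
    }
}

// Hypothetical in-memory "table" keyed by commitId rather than a generated id.
final class CommitMetaDataTable {
    private final Map<String, CommitMetaDataRow> rows = new LinkedHashMap<>();

    // Stores the row only if no row with the same commitId exists yet;
    // returns the row that ends up representing this commit.
    CommitMetaDataRow saveOrGet(CommitMetaDataRow row) {
        CommitMetaDataRow existing = rows.putIfAbsent(row.commitId, row);
        return existing != null ? existing : row;
    }

    int size() {
        return rows.size();
    }
}
```

With a generated id, each call would add a new row. With commitId as the key, every instance that refers to the same commit maps to the same row, which is the unique mapping asked for above.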

mauricioaniche commented 4 years ago

This is indeed a bug. I can help you with Hibernate magic once I'm back!

jan-gerling commented 4 years ago

I noticed that the ProcessMetricsCollector generates many duplicates in the function codeMetrics: https://github.com/mauricioaniche/predicting-refactoring-ml/blob/1ed8d6f0527992d49c29f1af69c48680067cf476/data-collection/src/main/java/refactoringml/ProcessMetricsCollector.java#L234

We can avoid these duplicates by storing only StableCommits that are as complete as possible in the database. Related to issue #75.

mauricioaniche commented 4 years ago

@jan-gerling, add this info to the monitoring, but for now it seems fine.

jan-gerling commented 4 years ago

We have no duplicate commitmetadata, refactoringinstances, stablecommitinstances, or projects entries in the db.
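For reference, a check like this can be sketched in plain Java. This is an assumption about how one might verify it, not the query that was actually run: given the key of every row in a table (e.g. the commitId values), report each key that occurs more than once. `DuplicateCheck` and `duplicates` are hypothetical names.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

final class DuplicateCheck {
    // Returns each key that occurs more than once, mapped to its occurrence count.
    // An empty result means the table has no duplicates for that key.
    static Map<String, Long> duplicates(List<String> keys) {
        return keys.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
                .entrySet().stream()
                .filter(e -> e.getValue() > 1)
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }
}
```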

mauricioaniche commented 4 years ago

Great news!