ishepard / pydriller

Python Framework to analyse Git repositories
http://pydriller.readthedocs.io/en/latest/
Apache License 2.0

Pydriller 2.0 - Ideas #159

Closed ishepard closed 3 years ago

ishepard commented 3 years ago

This is a thread to discuss what could be included in the next version of Pydriller. Feel free to drop your ideas here, and we will discuss them together. There is already a branch (pydriller2) where you can work and open a PR if you feel like contributing to the project!

xtuchyna commented 3 years ago

Hello! I've recently started to dig into the wilderness of MSR and found this lib, which looks really cool! It seems that currently you only support Commit aggregation. Are you also considering expanding it to other entities like Issues, PullRequests, Comments, Contributors, etc.?

ishepard commented 3 years ago

Hi @xtuchyna! Happy mining! 😄 Issues, PRs, Comments, etc. are not Git concepts. They are GitHub concepts. The git repository doesn't have any information about them. Since Pydriller is a git framework, we will not expand it to those entities. However, if you need to look into them, I'd suggest looking at my colleague's project GHTorrent. Hope it helps!

JulianVolodia commented 3 years ago
  • Renaming Modifications: I am not sure what's the English word that best describes "modified files in a commit", but I kind of dislike modification. I took this name from the parent project Repodriller, but I think it's now time to change it. Maybe modified_files is good enough? Ideas on this @marco-c @mauricioaniche ? So that the user can do:

Hey @ishepard, could I do something for you? E.g. write some test cases or other work? :)

ishepard commented 3 years ago

Hi @JulianVolodia! Thanks for the help! Feel free to pick up any issue currently opened, for example #154 should be pretty easy to solve. Feel free to write any test you want as well 😃

mauricioaniche commented 3 years ago

I can't remember why I called it Modification. Maybe this was the term used in JGit? Too long ago, can't remember. But I agree it's not the best name ever.

xtuchyna commented 3 years ago

Hello, I'm here again 👋 I've recently stumbled upon an article describing the bugspots idea (counting, for each file, the number of bug-fixing commits that touched it, where "bug-fixing" is decided from the log message). I wonder if that could be part of the Process Metrics.
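To make the idea concrete, here is a minimal sketch of that kind of count. It is only an illustration: the regular expression is an assumed heuristic for "bug-fixing" messages, the `history` data is made up, and `count_bugfix_touches` is a hypothetical helper, not part of Pydriller. It also does a plain count, with no weighting by commit age.

```python
import re
from collections import Counter

# Assumed heuristic: a commit counts as "bug-fixing" if its message mentions
# fix/fixes/fixed or close/closes/closed as a whole word.
BUGFIX_RE = re.compile(r"\b(fix(es|ed)?|close[sd]?)\b", re.IGNORECASE)


def count_bugfix_touches(commits):
    """Count, per file, how many bug-fixing commits touched it.

    `commits` is an iterable of (message, list_of_file_paths) pairs.
    """
    counts = Counter()
    for message, files in commits:
        if BUGFIX_RE.search(message):
            # set() so a file listed twice in one commit is counted once.
            counts.update(set(files))
    return counts


# Made-up history, purely for illustration.
history = [
    ("Fix crash when repo is empty", ["git.py", "commit.py"]),
    ("Add docs", ["README.md"]),
    ("Fixes #12: wrong diff parsing", ["commit.py"]),
]
hotspots = count_bugfix_touches(history)
```

In a real setting the (message, files) pairs would come from traversing the repository's commits.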

HelgeCPH commented 3 years ago

Hej, thank you for the really nice and super useful Python library! Concerning possible 2.0 features, I was thinking that it could be practical for certain analyses if the Commit and Modification classes were hashable. Currently, one cannot use them as keys in dictionaries (TypeError: unhashable type: 'Modification'). Additionally, sometimes it would be practical if one could navigate back from a Modification object to the corresponding Commit object. Currently, I have some weird code around that could be simplified if I did not have to keep track of these relations, always starting from a commit object.

Both of the features above are only nice to have things and they might unnecessarily increase internal complexity of Pydriller. I will continue to use your cool library also without them :) Thanks again for it.
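For illustration, a minimal sketch of what "hashable" could mean here. `CommitSketch` and `ModificationSketch` are hypothetical stand-ins, not Pydriller's classes, and the choice of identity (the commit hash, and the commit hash plus file path) is an assumption. In Python, a class that defines `__eq__` without `__hash__` becomes unhashable, so both methods are defined together.

```python
class CommitSketch:
    """Hypothetical stand-in for a commit, identified by its hash."""

    def __init__(self, sha):
        self.sha = sha

    def __eq__(self, other):
        if not isinstance(other, CommitSketch):
            return NotImplemented
        return self.sha == other.sha

    def __hash__(self):
        return hash(self.sha)


class ModificationSketch:
    """Hypothetical stand-in for a modified file within one commit."""

    def __init__(self, commit_sha, path):
        self.commit_sha = commit_sha
        self.path = path

    def _key(self):
        # Assumed identity: same commit and same path means same modification.
        return (self.commit_sha, self.path)

    def __eq__(self, other):
        if not isinstance(other, ModificationSketch):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        return hash(self._key())


a = ModificationSketch("abc123", "git.py")
b = ModificationSketch("abc123", "git.py")

churn = {a: 10}
churn[b] = churn.get(b, 0) + 5  # `b` equals `a`, so this updates the same entry
```

The attributes used for hashing should not change after the object has been put in a dict or set, otherwise lookups will silently miss.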

ishepard commented 3 years ago

Hi @HelgeCPH! Interesting ideas! Thank you! I think a hashable Commit and Modification object would be a very good addition. People would be able to put them in dicts, sets, and other hash-based structures. As for the Modification object having a link to the commit: it's a very simple addition, though I am wondering whether it would be enough to have the hash of the commit (instead of the Commit object). Mainly because if we pass the Commit object we create a circular dependency: Commit -> Modification -> Commit. I'll think about it! Thanks again for the good ideas!

PS: if you feel like contributing, feel free to apply these changes to the branch pydriller2.0 and open a PR 😄

luwoldy commented 3 years ago

@ishepard Right now only an author/committer's full name and email are available in the Developer object. Is it possible to expand this to include their GitHub username? That way I can use something like Selenium to go to github.com/ to extract additional data like follower count, commits, etc.

Happy to attempt this PR with guidance

HelgeCPH commented 3 years ago

Hi @ishepard, I sent a pull request with the hashable Commit and Modification classes.

Concerning the circular dependency: yes, I see the problem, but I was wondering whether it really is one. Wouldn't Python's garbage collector be able to find unreferenced cycles? I mean, if people were made aware of it in the documentation, I think it could work. Only storing the hash for the backwards reference is (at least for my use cases) likely not enough, since one would still have to write a lot of loops or keep extra "housekeeping" dictionaries around to quickly get back from a modification to its commit. That said, I understand your concern and I can understand if it will not come into PyDriller :)

I was also thinking about the naming of the Modification class. Would not the closest concept in Git terms be Blob? So it could become DrillerBlob or something similar to indicate that it is not an actual Git Blob.

marco-c commented 3 years ago

Circular dependencies can be problematic with Pickle, from https://docs.python.org/3/library/pickle.html#what-can-be-pickled-and-unpickled:

Trying to pickle a highly recursive data structure may exceed the maximum recursion depth, a RecursionError will be raised in this case.

I recently hit this in another project where (to simplify) we had Commits and Tests. Initially there was only a link from a Commit to the Tests associated with it. Once we added a link from Test back to Commit, we started hitting the RecursionError when pickling many Commit objects.
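A small sketch of the behaviour the documentation describes, using plain dicts as hypothetical stand-ins for Commit and Test objects. It suggests that a single back-reference cycle is not the problem by itself, since pickle remembers objects it has already seen; what breaks is the depth of the object graph the pickler has to walk. The depth of 200,000 is an arbitrary value chosen to be far above typical recursion limits.

```python
import pickle

# A simple cycle: commit -> test -> commit. pickle memoizes objects it has
# already visited, so this round-trips without trouble.
commit = {"sha": "abc123", "tests": []}
test = {"name": "test_diff", "commit": commit}
commit["tests"].append(test)

restored = pickle.loads(pickle.dumps(commit))
cycle_preserved = restored["tests"][0]["commit"] is restored

# A very deep chain of references: each object is nested inside the next.
# The pickler descends one level per link and eventually gives up.
node = None
for _ in range(200_000):
    node = {"parent": node}

try:
    pickle.dumps(node)
    hit_recursion_error = False
except RecursionError:
    hit_recursion_error = True
```

Adding back-links tends to connect objects that were previously pickled separately, which is one way such deep traversals can appear.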

ishepard commented 3 years ago

Agree with Marco, circular dependencies aren't fun. We should avoid them. In terms of just putting the hash, I think it should be ok right? It's not much more code, since one can just say git.get_commit(hash) and that's it.
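A rough sketch of the hash-only back-reference, to show what the extra lookup step looks like. The classes and the `index` dict are hypothetical stand-ins; in Pydriller the lookup would be the `git.get_commit(hash)` call mentioned above rather than a dict.

```python
class CommitSketch:
    """Hypothetical stand-in: a commit that owns its modifications."""

    def __init__(self, sha):
        self.sha = sha
        self.modifications = []


class ModificationSketch:
    """Hypothetical stand-in: stores only the commit's hash, not the object."""

    def __init__(self, path, commit_sha):
        self.path = path
        self.commit_sha = commit_sha  # plain string, so no object cycle


def build_index(commits):
    """Map each hash to its commit; plays the role of a get_commit lookup."""
    return {c.sha: c for c in commits}


c = CommitSketch("abc123")
c.modifications.append(ModificationSketch("git.py", c.sha))

index = build_index([c])
mod = c.modifications[0]
owner = index[mod.commit_sha]  # navigate back from modification to commit
```

The trade-off is the one raised earlier in the thread: no cycle to worry about, at the cost of one lookup whenever the commit is needed.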

david-siqi-liu commented 3 years ago

Hi everyone! Just a quick idea - in the past, I have also computed entropy based on this paper. It is a measure of dispersion in lines changed across all modified files. How would you feel about adding this?
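As a rough illustration of such a dispersion measure, here is a sketch based on Shannon entropy over the share of changed lines per file. This is an assumption about the general shape of the metric; the exact definition in the paper (for example any normalization or time windowing) may differ, and `change_entropy` is a hypothetical helper.

```python
import math


def change_entropy(lines_changed_per_file):
    """Shannon entropy (in bits) of how changed lines spread across files.

    `lines_changed_per_file` maps a file path to the number of lines changed.
    0.0 means all changes hit one file; higher values mean more dispersion.
    """
    total = sum(lines_changed_per_file.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for changed in lines_changed_per_file.values():
        if changed > 0:  # files with no changed lines contribute nothing
            share = changed / total
            entropy -= share * math.log2(share)
    return entropy


focused = change_entropy({"git.py": 40})
spread = change_entropy({"git.py": 10, "commit.py": 10})
```

With changes split evenly across n files the value is log2(n), so two evenly changed files give one bit.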

stefanodallapalma commented 3 years ago

Hi @david-siqi-liu,

Pydriller already implements that metric. You can find it at pydriller.metrics.process.history_complexity. Because you have also computed it in the past, it would be great if you could further test it.

ishepard commented 3 years ago

Sorry I missed some comments on this thread! I'll reply here: @xtuchyna I often thought about bug prediction in Pydriller, but I refrained from the idea mainly because it's not something directly related to Git. Pydriller is a Git framework, and it can be used to build bug prediction tools. But I wouldn't turn Pydriller itself into a bug prediction tool.

@luwoldy GitHub username and Git username are two different things. There is no way to get the GitHub username from a commit, only the Git username. Since the two can be set to different values, I don't think there is a way to solve this problem unless we query GitHub.

stefanodallapalma commented 3 years ago

@xtuchyna I agree with @ishepard. However, I have been implementing a tool (https://github.com/radon-h2020/radon-repository-miner), based on PyDriller, that mines software repositories to identify failure data and ease the creation of datasets that can be used to build machine learning models for bug prediction. If you'd like to know more, feel free to get in touch.

ishepard commented 3 years ago

I'm gonna release Pydriller 2.0 soon, thanks everyone for the help! I'll make a post on Twitter with the changes!