Closed ishepard closed 3 years ago
Hello! I've recently started to dig into the wilderness of MSR and found this lib, which looks really cool!
It seems that currently you only support Commit aggregation. Are you also considering expanding it to other entities such as Issues, PullRequests, Comments, Contributors, etc.?
Hi @xtuchyna! Happy mining! 😄 Issues, PRs, Comments, etc. are not Git concepts; they are GitHub concepts. The git repository doesn't have any information about them. Since Pydriller is a Git framework, we will not expand it to those entities. However, if you need to look into them, I'd suggest you look at my colleague's project GHTorrent. Hope it helps!
Hey @ishepard, could I do something for you? E.g. write some test cases or other work? :)
Hi @JulianVolodia! Thanks for the help! Feel free to pick up any issue that is currently open; for example, #154 should be pretty easy to solve. Feel free to write any test you want as well 😃
I can't remember why I called it Modification. Maybe this was the term used in JGit? Too long ago, can't remember. But I agree it's not the best name ever.
Hello, I'm here again :wave: I've recently stumbled upon an article describing the bugspots idea (counting, for each file, the number of bug-fixing commits that touched it, where bug-fixing commits are identified from the log message). I wonder if that could be part of the Process Metrics.
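The counting part of the bugspots idea described above could be sketched roughly as follows. This is only an illustration and not part of PyDriller: the function name `count_bugfix_commits`, the keyword pattern, and the `(message, paths)` input shape are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumed heuristic: a commit counts as "bug-fixing" if its message contains
# one of these words. Real projects may need a different pattern.
BUGFIX_PATTERN = re.compile(r"\b(fix(es|ed)?|close(s|d)?)\b", re.IGNORECASE)


def count_bugfix_commits(commits):
    """Count, per file path, how many bug-fixing commits touched it.

    `commits` is an iterable of (message, paths) pairs, where `paths` lists
    the files modified by that commit.
    """
    counts = Counter()
    for message, paths in commits:
        if BUGFIX_PATTERN.search(message):
            # set() so a file is counted at most once per commit
            counts.update(set(paths))
    return counts


history = [
    ("Fix crash in parser", ["parser.py", "utils.py"]),
    ("Add docs", ["README.md"]),
    ("fixes #12: off-by-one", ["parser.py"]),
    ("Prefix all names", ["names.py"]),  # "Prefix" does not match \bfix
]
hotspots = count_bugfix_commits(history)
print(hotspots.most_common())
```

With a Git mining library, the `(message, paths)` pairs would come from iterating over the commits of a repository instead of a hand-written list.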
Hej, thank you for the really nice and super useful Python library!
Concerning possible 2.0 features, I was thinking that it could be practical for certain analyses if the Commit and Modification classes were hashable. Currently, one cannot use them as keys in dictionaries (TypeError: unhashable type: 'Modification').
Additionally, sometimes it would be practical if one could navigate back from a Modification object to the corresponding Commit object. Currently, I have some weird code around that could be simplified if I did not have to keep track of these relations, always starting from a commit object.
Both of the features above are only nice to have things and they might unnecessarily increase internal complexity of Pydriller. I will continue to use your cool library also without them :) Thanks again for it.
Hi @HelgeCPH! Interesting ideas! Thank you! I think hashable Commit and Modification objects would be a very good addition. People would be able to put them in dicts, sets, and other hash-based structures. As for the Modification object having a link to the commit: it's a very simple addition, though I am wondering whether it would be enough to have the hash of the commit (instead of the Commit object). Mainly because if we pass the Commit object we create a circular dependency: Commit -> Modification -> Commit. I'll think about it! Thanks again for the good ideas!
PS: if you feel like contributing, feel free to apply these changes to the branch pydriller2.0 and open a PR 😄
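For readers wondering what "hashable" would involve: in Python, a class that defines `__eq__` without also defining `__hash__` becomes unhashable, which is one common way to end up with the TypeError quoted earlier. A minimal sketch, assuming a commit is identified by its hash and a modification by its commit hash plus file path (`CommitSketch` and `ModificationSketch` are hypothetical stand-ins, not PyDriller's actual classes):

```python
class CommitSketch:
    """Hypothetical stand-in for a commit, identified by its hash."""

    def __init__(self, commit_hash, msg=""):
        self.hash = commit_hash
        self.msg = msg

    def __eq__(self, other):
        if not isinstance(other, CommitSketch):
            return NotImplemented
        return self.hash == other.hash

    def __hash__(self):
        # Must be consistent with __eq__: equal objects get equal hashes.
        return hash(self.hash)


class ModificationSketch:
    """Hypothetical stand-in for a modified file within one commit."""

    def __init__(self, commit_hash, path):
        self.commit_hash = commit_hash
        self.path = path

    def __eq__(self, other):
        if not isinstance(other, ModificationSketch):
            return NotImplemented
        return (self.commit_hash, self.path) == (other.commit_hash, other.path)

    def __hash__(self):
        return hash((self.commit_hash, self.path))


# Both kinds of object can now be used as dict keys or set members.
churn = {ModificationSketch("abc123", "src/app.py"): 42}
seen = {CommitSketch("abc123"), CommitSketch("abc123", msg="same commit")}
print(churn[ModificationSketch("abc123", "src/app.py")], len(seen))
```

The identifying attributes should not be mutated after an object has been placed in a dict or set, otherwise lookups can silently fail.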
@ishepard Right now only an author/committer's full name and email are available in the Developer object. Is it possible to expand this to include their GitHub username? That way I can use something like Selenium to go to github.com/ to extract additional data like follower count, commits, etc.
Happy to attempt this PR with guidance
Hi @ishepard, I sent a pull request with the hashable Commit and Modification classes.
Concerning the circular dependency: yes, I see the problem, but I was wondering whether it really is a problem. Would Python's garbage collector not be able to find unreferenced cycles? I mean, if one made people aware of it in the documentation, I think it could work. Only storing the hash for the backwards reference is, at least for my use cases, likely not enough, since one would still have to write a lot of loops or keep extra "housekeeping" dictionaries around to quickly find a commit from a modification. That said, I understand your concern and I can understand if it will not come into PyDriller :)
I was also thinking about the naming of the Modification class. Would the closest concept in Git terms not be Blob? So it could become DrillerBlob or something similar to indicate that it is not an actual Git Blob.
Circular dependencies can be problematic with Pickle, from https://docs.python.org/3/library/pickle.html#what-can-be-pickled-and-unpickled:
Trying to pickle a highly recursive data structure may exceed the maximum recursion depth, a RecursionError will be raised in this case.
I recently hit this in another project where (to simplify) we had Commits and Tests. Initially there was only a link from a Commit to the Tests associated with it. Once we added a link from Test back to Commit, we started hitting the RecursionError when pickling many Commit objects.
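The effect described above can be reproduced with plain dicts and lists. In this sketch every test is shared by two consecutive commits (a hypothetical shape chosen for illustration, not taken from the project mentioned). Without back links each commit pickles at a shallow depth; with back links the pickler follows commit -> test -> next commit -> next test and so on, so the nesting depth grows with the number of commits.

```python
import pickle


def build_history(n, back_links):
    """Build n commits; test i is attached to commits i and i + 1.

    Commits and tests are plain dicts (hypothetical stand-ins for real
    objects). If back_links is True, each test also points back at the
    commits it belongs to.
    """
    commits = [{"sha": f"c{i}", "tests": []} for i in range(n)]
    for i in range(n - 1):
        test = {"name": f"t{i}", "commits": []}
        for commit in (commits[i], commits[i + 1]):
            commit["tests"].append(test)       # forward link: commit -> test
            if back_links:
                test["commits"].append(commit)  # back link: test -> commit
    return commits


def can_pickle(obj):
    """Return True if obj pickles, False if pickling hits RecursionError."""
    try:
        pickle.dumps(obj)
        return True
    except RecursionError:
        return False


print(can_pickle(build_history(50_000, back_links=False)))
print(can_pickle(build_history(50_000, back_links=True)))
```

The cycle itself is not what breaks pickling, since pickle tracks objects it has already seen; the problem is that the back links turn many shallow objects into one very deep chain.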
Agree with Marco, circular dependencies aren't fun. We should avoid them.
In terms of just putting the hash, I think it should be ok, right? It's not much more code, since one can just call git.get_commit(hash) and that's it.
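The hash-only back reference suggested above could look roughly like this. All names here (`CommitSketch`, `ModificationSketch`, `RepoSketch`) are hypothetical; `RepoSketch.get_commit` merely stands in for the git.get_commit(hash) call mentioned in the comment and is backed by a plain dict rather than a real repository.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ModificationSketch:
    path: str
    commit_hash: str  # back reference by hash only, no object cycle


@dataclass
class CommitSketch:
    hash: str
    msg: str
    modifications: List[ModificationSketch] = field(default_factory=list)


class RepoSketch:
    """Stand-in for a repository wrapper that can look up commits by hash."""

    def __init__(self, commits: List[CommitSketch]) -> None:
        self._by_hash: Dict[str, CommitSketch] = {c.hash: c for c in commits}

    def get_commit(self, commit_hash: str) -> CommitSketch:
        return self._by_hash[commit_hash]


commit = CommitSketch("abc123", "Fix parser")
commit.modifications.append(ModificationSketch("parser.py", commit.hash))
repo = RepoSketch([commit])

# Navigating back from a modification costs one extra lookup.
mod = commit.modifications[0]
print(repo.get_commit(mod.commit_hash).msg)
```

The trade-off is that the caller needs access to the repository object wherever the back navigation happens, which is the extra bookkeeping raised as a concern earlier in the thread.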
Hi everyone! Just a quick idea - in the past, I have also computed entropy based on this paper. It is a measure of dispersion in lines changed across all modified files. How would you feel about adding this?
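One common reading of "dispersion in lines changed across all modified files" is the Shannon entropy of the per-file share of changed lines. The sketch below assumes that reading; the function name `change_entropy` and its input shape are invented for the example, and the exact definition in the paper (or in any PyDriller metric) may differ, for instance in normalization.

```python
import math


def change_entropy(lines_changed):
    """Shannon entropy (in bits) of changed lines spread across files.

    `lines_changed` maps a file path to the number of lines changed in it
    during the period of interest. Returns 0.0 when nothing changed.
    """
    total = sum(lines_changed.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in lines_changed.values():
        if count > 0:
            share = count / total
            entropy -= share * math.log2(share)
    return entropy


# All changes in one file: no dispersion.
print(change_entropy({"a.py": 30}))
# Changes spread evenly over four files: maximal dispersion for four files.
print(change_entropy({"a.py": 5, "b.py": 5, "c.py": 5, "d.py": 5}))
```

Higher values mean the changes were scattered over many files; lower values mean they were concentrated in a few.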
Hi @david-siqi-liu,
Pydriller already implements that metric. You can find it at pydriller.metrics.process.history_complexity. Because you have also computed it in the past, it would be great if you could further test it.
Sorry I missed some comments on this thread! I'll reply here: @xtuchyna I have often thought about bug prediction in Pydriller, but I refrained from the idea mainly because it's not something directly related to Git. Pydriller is a Git framework, and it can be used to build bug prediction tools. But I wouldn't turn Pydriller itself into a bug prediction tool.
@luwoldy GitHub username and Git username are two different things. There is no way to get the GitHub username from a commit, only the Git username. Since the two usernames can be set up to differ, I don't think there is a way to solve this problem unless we query GitHub.
@xtuchyna I agree with @ishepard. However, I have been implementing a tool (https://github.com/radon-h2020/radon-repository-miner), based on PyDriller, to mine software repositories, identify failure data, and ease the creation of datasets that can be used to build machine learning models for bug prediction. If you'd like to know more, feel free to get in touch.
I'm gonna release Pydriller 2.0 soon, thanks everyone for the help! I'll make a post on Twitter with the changes!
This is a thread to discuss what could be included in the next version of Pydriller. Feel free to drop your ideas here, and we will discuss them together. There is already a branch (pydriller2) where you can work and create a PR if you feel like contributing to the project!
[x] Renaming Modifications: I am not sure what's the English word that best describes "modified files in a commit", but I kind of dislike modification. I took this name from the parent project Repodriller, but I think it's now time to change it. Maybe modified_files is good enough? Ideas on this @marco-c @mauricioaniche? So that the user can do:
commit.added -> commit.added_lines
commit.removed -> commit.deleted_lines