mauricioaniche / repodriller

a tool to support researchers on mining software repositories studies
174 stars 39 forks source link

Filter diffs based on user-defined criteria #104

Closed ayaankazerouni closed 6 years ago

ayaankazerouni commented 7 years ago

(edited for more detail)

The problem

Considered solutions

'Reuse' a CommitFilter

If OnlyModificationsWithFileTypes or other file based commit filters are used, we can filter DiffEntries based on those specified extensions. Clearly the user only cares about those files.

Examine the mime-type of the file, according to the standard

This is relatively inexpensive by my tests. Using the Java core method Files.probeContentType is apparently pretty unreliable. It just returns null for my binary files (presumably because it doesn't recognise the .dat extension). We could also use tika-core. I've tried this and it can accurately tell the mime-type of a file quickly (non-human-readable map to application/octet-stream). The con here is that it adds a dependency to RepoDriller.

Where to implement the solution

When we have a List<DiffEntry> for a commit, we can only convert the DiffEntry to a modification if it passes a set of selection criteria. This is similar to commit filters, with the only difference being it works within commits.

Thoughts?

davisjam commented 7 years ago

Option 1: A filter If I understand correctly, you want a way to control which Modifications RepoDriller attempts to extract and store in a Commit. It sounds like you're willing to do this on per-file basis.

I think JGit has a way to do this, see here.

The question is, how much information do you need to apply your filter? Without performing a checkout, I suspect the only relevant information we have is the name of the affected file.

Option 2: Don't get the diff of binary files Standard command-line git just says "binary files differ" if you ask it. You have to deliberately interpret the binary file as text in order to obtain a diff. I can't imagine a good reason for a RepoDriller user to really complete diffs of binary files. Can we modify the code that extracts diffs to skip binary files?

I suspect that the second option will be less invasive and more stable than a filter based on file names.

ayaankazerouni commented 7 years ago

Initial searching says that Git does this by inspecting the contents of the file. We don't want to do that, so we're back to option 1, where we really only have the old and new file names to work with (getOldPath and getNewPath). And this is fine for my use case.

mauricioaniche commented 7 years ago

Nice feature request. I also agree that diffs of binary files are useless. I was gonna suggest a simpler solution: the user passes something like .filesThatICareAbout("java", "jsp", "css", "html") (name to discuss) and these are the files we bring to the Modification object. Maybe this can even be more fine-grained, like filesThatIWannaSeeInModification and filesThatNeedDiff. Your solution seems more robust when it comes to define binary files, however less flexible (as it would not allow me to ignore *.css files, for example). Maybe we should mix both!

This is definitely related to #41.

Thoughts?

davisjam commented 7 years ago

I'm concerned about a whitelist-only approach. The repositories I own often contain scripts with no extension at all, e.g. extractNames which is a perl/python/node/etc. script. You can't determine the type from the name, you need to see the shebang at the top of the file.

If we offer the user a generic filter that takes a file name, he can build his own whitelist or blacklist. We could also offer specific blacklist and whitelist implementations of this filter that he can instantiate with just a list of strings.

ayaankazerouni commented 7 years ago

We could also offer specific blacklist and whitelist implementations

I agree, this would make things easier on the user.

mauricioaniche commented 6 years ago

Did you manage to work on this one, @ayaankazerouni ?

ayaankazerouni commented 6 years ago

I added a filter to only look at Java files on a branch in my fork. See 40bb6f2ac1728a4d32f6a157758c634bb1dd97d0. But this is not really usable outside of my own needs.

I can work on the more general whitelist/blacklist solution for completeness' sake in the coming weeks (though that might be delayed because of end of semester work).

mauricioaniche commented 6 years ago

Fixed by #121