Closed ayaankazerouni closed 6 years ago
Option 1: A filter
If I understand correctly, you want a way to control which Modification
s RepoDriller attempts to extract and store in a Commit
. It sounds like you're willing to do this on per-file basis.
I think JGit has a way to do this, see here.
The question is, how much information do you need to apply your filter? Without performing a checkout, I suspect the only relevant information we have is the name of the affected file.
Option 2: Don't get the diff of binary files Standard command-line git just says "binary files differ" if you ask it. You have to deliberately interpret the binary file as text in order to obtain a diff. I can't imagine a good reason for a RepoDriller user to really complete diffs of binary files. Can we modify the code that extracts diffs to skip binary files?
I suspect that the second option will be less invasive and more stable than a filter based on file names.
Initial searching says that Git does this by inspecting the contents of the file. We don't want to do that, so we're back to option 1, where we really only have the old and new file names to work with (getOldPath
and getNewPath
). And this is fine for my use case.
Nice feature request. I also agree that diffs of binary files are useless. I was gonna suggest a simpler solution: the user passes something like .filesThatICareAbout("java", "jsp", "css", "html")
(name to discuss) and these are the files we bring to the Modification
object. Maybe this can even be more fine-grained, like filesThatIWannaSeeInModification
and filesThatNeedDiff
. Your solution seems more robust when it comes to define binary files, however less flexible (as it would not allow me to ignore *.css files, for example). Maybe we should mix both!
This is definitely related to #41.
Thoughts?
I'm concerned about a whitelist-only approach. The repositories I own often contain scripts with no extension at all, e.g. extractNames
which is a perl/python/node/etc. script. You can't determine the type from the name, you need to see the shebang at the top of the file.
If we offer the user a generic filter that takes a file name, he can build his own whitelist or blacklist. We could also offer specific blacklist and whitelist implementations of this filter that he can instantiate with just a list of strings.
We could also offer specific blacklist and whitelist implementations
I agree, this would make things easier on the user.
Did you manage to work on this one, @ayaankazerouni ?
I added a filter to only look at Java files on a branch in my fork. See 40bb6f2ac1728a4d32f6a157758c634bb1dd97d0. But this is not really usable outside of my own needs.
I can work on the more general whitelist/blacklist solution for completeness' sake in the coming weeks (though that might be delayed because of end of semester work).
Fixed by #121
(edited for more detail)
The problem
Modifications
is the identified bottleneck in thegetCommit
method.Considered solutions
'Reuse' a CommitFilter
If
OnlyModificationsWithFileTypes
or other file based commit filters are used, we can filterDiffEntries
based on those specified extensions. Clearly the user only cares about those files.Examine the mime-type of the file, according to the standard
This is relatively inexpensive by my tests. Using the Java core method
Files.probeContentType
is apparently pretty unreliable. It just returns null for my binary files (presumably because it doesn't recognise the.dat
extension). We could also use tika-core. I've tried this and it can accurately tell the mime-type of a file quickly (non-human-readable map toapplication/octet-stream
). The con here is that it adds a dependency to RepoDriller.Where to implement the solution
When we have a
List<DiffEntry>
for a commit, we can only convert theDiffEntry
to a modification if it passes a set of selection criteria. This is similar to commit filters, with the only difference being it works within commits.Thoughts?