vcu-swim-lab / KnowHows

A Slack search engine for your repos.
http://knowho.ws
MIT License

use SrcML for parsing diffs #20

Closed damevski closed 6 years ago

damevski commented 6 years ago

Try parsing diffs with srcML first. If that doesn't work, we should probably parse entire files with srcML and then correlate the result with the diff (or figure out what to do with diffs in general).

AlexAplin commented 6 years ago

SrcML.NET is out there, but it's not maintained and it uses an old version of srcML that relies on two executables. I think we can accomplish what we need with a few custom external calls to the current version.
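For illustration, an external call of this kind might look roughly like the sketch below. It is written in Python rather than the project's own language, and the helper names `build_srcml_command` and `run_srcml` are hypothetical. It assumes the `srcml` executable is on your PATH and writes its XML to standard output when no output file is given.

```python
import shutil
import subprocess
import xml.etree.ElementTree as ET


def build_srcml_command(source_path, with_position=False):
    """Build the argument list for one srcml call on a single source file."""
    cmd = ["srcml"]
    if with_position:
        # --position asks srcml to annotate elements with source positions.
        cmd.append("--position")
    cmd.append(source_path)
    return cmd


def run_srcml(source_path, with_position=False):
    """Run srcml on a file and return the root element of its XML output.

    Assumes srcml prints the srcML document to stdout when no output
    file is specified.
    """
    if shutil.which("srcml") is None:
        raise RuntimeError("srcml executable not found on PATH")
    result = subprocess.run(
        build_srcml_command(source_path, with_position),
        capture_output=True,  # keep stdout as bytes so the XML parser
        check=True,           # can honour the declared encoding
    )
    return ET.fromstring(result.stdout)
```

Keeping the command construction separate from the process call makes it possible to check the arguments without srcML installed.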

damevski commented 6 years ago

I agree. Let's use the latest version of srcML.

AlexAplin commented 6 years ago

We can now generate srcML documents of type XDocument using the functions in c05d7b652bc302fd3199a9069c7883496fdd4037. srcML must be in your PATH. TODO:

AlexAplin commented 6 years ago

After trying different diffs, I think we will run into problems trying to parse them in isolation. You can't count on the context a diff provides. We should instead parse the patched file for each commit with srcML using the --position flag, which makes it possible to correlate nodes with the diffs by line and column number. Some pseudocode:

For each file in commit_files
    keyword_list = []
    srcMLdoc = file at raw_url parsed with srcML --position
    Filter srcMLdoc to only include useful nodes
    For each @@ ---- @@ diff block (hunk) in patch
        added_lines = line numbers of the additions in the hunk
        For each line in added_lines
            Find nodes in srcMLdoc whose pos:line matches line
            Add them to keyword_list
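
The lookup step of the pseudocode above could be sketched in Python as follows. This is only an illustration under stated assumptions: the position namespace URI, the `pos:line` attribute (the exact position attributes depend on the srcML version), the choice of `name` elements as the "useful nodes", and the helper name `keywords_on_lines` are all assumptions, not the project's actual code.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for srcML position attributes.
POS_NS = "http://www.srcML.org/srcML/position"
LINE_ATTR = "{%s}line" % POS_NS


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def keywords_on_lines(srcml_xml, added_lines, useful_tags=("name",)):
    """Collect the text of useful nodes that start on one of the added lines."""
    root = ET.fromstring(srcml_xml)
    wanted = set(added_lines)
    keywords = []
    for node in root.iter():  # document order
        if local_name(node.tag) not in useful_tags:
            continue
        line = node.get(LINE_ATTR)
        text = (node.text or "").strip()
        # Skip nodes without a position or without direct text
        # (for example, a compound name that only wraps other names).
        if line is not None and int(line) in wanted and text:
            keywords.append(text)
    return keywords


# Hand-written sample in the assumed shape, not real srcML output.
SAMPLE = (
    '<unit xmlns="http://www.srcML.org/srcML/src" '
    'xmlns:pos="http://www.srcML.org/srcML/position">'
    '<decl_stmt><decl><type><name pos:line="1" pos:column="1">int</name></type> '
    '<name pos:line="1" pos:column="5">count</name></decl>;</decl_stmt>\n'
    '<expr_stmt><expr><call><name pos:line="2" pos:column="1">update</name>'
    '<argument_list>(<argument><expr>'
    '<name pos:line="2" pos:column="8">count</name>'
    '</expr></argument>)</argument_list></call></expr>;</expr_stmt>'
    '</unit>'
)

print(keywords_on_lines(SAMPLE, [2]))  # ['update', 'count']
```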

AlexAplin commented 6 years ago

We can now process hunk blocks in unified diffs to get the line additions for files. When processing files, we should check the status of each file in the commit and only process modified and created files, as removed files are irrelevant. The next step is correlating line numbers with the full files parsed by srcML to pick out the values we want.
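
As a rough illustration of the hunk processing and status check described above, here is a minimal Python sketch. It is not the project's implementation. The helper names are hypothetical, and it assumes each file entry is a dict with `status` and `patch` keys in the style of GitHub's commit API, where `"added"` and `"modified"` are assumed to be the status values for created and modified files.

```python
import re

# Matches a unified diff hunk header such as "@@ -10,7 +12,9 @@"
# and captures the starting line number on the new-file side.
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def added_line_numbers(patch):
    """Return new-file line numbers of every added line in a per-file patch.

    Assumes the patch holds only hunks, with no '---' / '+++' file headers
    after the first hunk header.
    """
    added = []
    new_line = None
    for line in patch.splitlines():
        header = HUNK_HEADER.match(line)
        if header:
            new_line = int(header.group(1))
            continue
        if new_line is None:
            continue  # anything before the first hunk header
        if line.startswith("+"):
            added.append(new_line)
            new_line += 1
        elif line.startswith("-") or line.startswith("\\"):
            continue  # removed line, or "\ No newline at end of file"
        else:
            new_line += 1  # context line, present in the new file
    return added


def files_to_process(commit_files, wanted=("added", "modified")):
    """Keep only files whose status is worth parsing and that carry a patch."""
    return [f for f in commit_files
            if f.get("status") in wanted and f.get("patch")]
```

The line numbers returned by `added_line_numbers` refer to the patched file, so they can be matched directly against positions in a srcML document generated from that same version of the file.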