use vulture to automatically remove dead code?

jendrikseipp commented 7 years ago

Originally reported by: Jendrik Seipp (Bitbucket: jendrikseipp, GitHub: jendrikseipp)

I'm adding this issue since the topic came up in a discussion around GSoC proposals for www.coala.io with @sils1297. The idea was to write a program that uses vullture to find and remove dead code.

After thinking some more about this, I think that the benefits of such a program are very small. For a program that changes code to be useful, most of the changes it proposes have to make sense. This is the case, e.g., for tools like autopep8. However, vulture (due to Python's dynamic nature) produces too many false positives. Even if it detects dead code, the underlying issue is not that code should be removed, but that e.g. a variable name is misspelled and therefore seems to be unused. Vulture is useful for hinting at possible programming errors, but I think almost all of them will have to be double-checked by the programmer.

Bitbucket: https://bitbucket.org/jendrikseipp/vulture/issue/25

jendrikseipp commented 7 years ago

Original comment by Lasse Schuirmann (Bitbucket: sils1297, GitHub: sils1297):

there will always be a coala mentor, don't worry. If there's none listed yt we're just not sure yet who it's going to be.

jendrikseipp commented 7 years ago

Original comment by Jendrik Seipp (Bitbucket: jendrikseipp, GitHub: jendrikseipp):

The link to my GH profile is correct.

The PR looks good to me, except that as I wrote there, it would probably be helpful if there was a mentor from the coala project.

jendrikseipp commented 7 years ago

Original comment by Max Scholz (Bitbucket: m0hawk, GitHub: Unknown):

Could you take a look at the project PR here: https://github.com/coala/projects/pull/190

jendrikseipp commented 7 years ago

Original comment by Max Scholz (Bitbucket: m0hawk, GitHub: Unknown):

@jendrikseipp is this your GH profile? https://github.com/jendrikseipp We link the GH profiles of mentors on our projects page (http://projecty.coala.io) thats why.

jendrikseipp commented 7 years ago

Original comment by Jendrik Seipp (Bitbucket: jendrikseipp, GitHub: jendrikseipp):

Sounds good to me too. I won't have time to work on the GSoC proposal, but I'd be happy to mentor a student. So if you like, feel free to turn the draft into a proposal, adding me (and others?) as mentors.

jendrikseipp commented 7 years ago

Original comment by John Vandenberg (Bitbucket: jayvdb, GitHub: jayvdb):

That sounds like a good sized project concept.

jendrikseipp commented 7 years ago

Original comment by Max Scholz (Bitbucket: m0hawk, GitHub: Unknown):

Maybe 1. could be done by giving the range until the next node starts. I have no experience with the ast but maybe somehow this could be done.

Anyway. We would love to have a vulture related project in our gsoc with you. As the application deadline is near, we'd have to get this done in the next few days.

So as a rough draft for a project, what do you think about this:

make vulture a lib
make the vulture bear use the new lib
spend a week or so researching if either the ast or maybe pyflakes enhanced ast can offer the source range
if last step is successfull implement the dead code removal in the coala bear
if not add more detection cases to vulture
either way, add confidence value to results.

The project would be 3 month of full time work by a student. If you are interested to mentor such a project, we'd ideally get it into a similar form as the other projects at http://projects.coala.io over the weekend.

jendrikseipp commented 7 years ago

Original comment by Lasse Schuirmann (Bitbucket: sils1297, GitHub: sils1297):

FWIW a GSoC project may cover anything you want and isn't limited to point one. Sounds like the GSoC project should cover 2, 3 and/or 4.

jendrikseipp commented 7 years ago

Original comment by Jendrik Seipp (Bitbucket: jendrikseipp, GitHub: jendrikseipp):

Thanks for the additional info! Collecting the things that were mentioned, I guess these are the items one could work on to allow for automatic dead code removal:

vulture should report ranges of dead code instead of just the beginnings of dead code
vulture could provide confidence values for reported dead code
ideally, vulture would provide the above infos as a Python library
optionally, vulture could detect more unreachable code (à la if False)

Feel free to tackle any of these issues :-) Depending on how complicated they make the codebase, I'd be open to integrate the resulting code into the master repository. In any case, I'll continue working on detecting more dead code (item 4).

Obviously, item 1 is critical for automatic code removal. I find it hard to judge whether it is feasible to accomplish in a non-hacky way since the Python ast module only provides the line numbers of beginnings of nodes. Before this is sorted out, I think it's too risky to offer this as a GSoC project, since it might be set to fail from the start.

jendrikseipp commented 7 years ago

Original comment by Lasse Schuirmann (Bitbucket: sils1297, GitHub: sils1297):

Hey @jayvdb pointed out another thing that I forgot to mention: in coala we have a confidence value for every result. ATM it doesn't have functional effect because not a lot of analysis supports it but it can be a useful tool especially when dealing with some situations where we can be more sure about correctness than in others. ATM it's just a value but in coala we can easily connect logic to it, filtering by it e.g.

jendrikseipp commented 7 years ago

Original comment by John Vandenberg (Bitbucket: jayvdb, GitHub: jayvdb):

As Lasse has said, how we share this problem domain between vulture and coala can be negotiaged with vulture choosing how far it wants to go, and coala can do the remainder (provided vulture can provide metadata about the range of code that is probably dead).

To look at some real use cases, the most common actionable case is very likely to be if False: and if True: ... ; else: ... blocks. Some programmers feel this is acceptable as a quick fix, and sometimes it may be appropriate, but these should be highlighted during pre-commit workflows, during code review, and able to be reconsidered again as part of maintenance reviews.

There will be some less clear cut cases, and many types of dead code detection which would need man-years of work to arrive at an algorithm that has enough signal to noise.

I would be happy if the initial implemention in vulture only caught if False: style problems. Once vulture has that 'fixer', the remainder of the initial project is easily filled with making that 'fixer' integrated into coala, and other UI design amd build in vulfure and coala.

Moreover, and where it would become especially interesting for vulture, by removing unmistakably dead code, additional variables, imports, functions etc, may be trimmed where they were only used by dead code. That is where the gold is. Each if False: block on its own may be justifiable, but if repeated many times there may be larger chunks of code being maintained only for use by multiple sets of unreachable code.

jendrikseipp commented 7 years ago

Original comment by Lasse Schuirmann (Bitbucket: sils1297, GitHub: sils1297):

Hi, thanks for bringing this up.

From our point of view having a potential patch is better than having no patch. The "coala way of accepting patches" is usually interactive, i.e. coala shows an issue, offers to edit the file, offers showing a potential patch and applying it - or just to put a generated ignore comment into your source code. Generally no patches are applied without the users consent. When working this way it does make sense to offer patches even if they're potentially not what the developer wants.

I guess actually performing the removal would probably be rather easy if vulture returns proper ranges of source code exactly identifying every piece of unused code.

jack17529 commented 7 years ago

@jendrikseipp FLAKE8 can be integrated I think - see this video https://www.youtube.com/watch?v=TDUf93vqq3g

jendrikseipp commented 7 years ago

@jack17529 Can you please write what your concrete suggestion is?

RJ722 commented 7 years ago

Hi! @jendrikseipp

I tripped on this:

vulture could provide confidence values for reported dead code

When we are talking about adding a confidence value to the results, on what basis and in what context could we provide such a data?

Can you please help me out? Thanks!

jendrikseipp commented 7 years ago

We will probably need to decide about this on a case-by-case basis, taking into account how Python functions, classes, imports, etc. can be declared and used. E.g., unused imports can be reported with 100% confidence, but unused functions can often be false-positives. I think a first step could be to separate the cases for which we can guarantee 100% certainty and the ones where we can't be sure.

RJ722 commented 7 years ago

Thank you @jendrikseipp That's great to start with! :smile: So, we can perhaps begin with these values:

Construct	Confidence
`import`	100%
`from foo import *`	0%
variables	< 100% (misspelled variable names, as you suggested)
functions	--
`class`	--
`if False` statement	--

Please correct me, if I am wrong!

Also, we can produce a patch for containing correctly-spelled variable name for the same, we can check for similarity in the var_name with all the variables name defined (using difflib - reference: here), if above a set threshold, we can adjust it in the patch.

RJ722 commented 7 years ago

Also, regarding source range acquisition, is it possible to implement this by fetching complete code between the current and the next node?

A simple google search also lead to ast_utils.py - I couldn't make a lot of sense out of it, but I guess what they do is pretty much what we need!

jendrikseipp commented 7 years ago

The idea of checking for misspelled names is interesting, but probably out of vulture's scope.

I don't see how we can use ast_utils.py for determining the range of a node. Instead of get_last_child(), we would need get_next_sibling(), I think.

From my point of view the confidence values are something like:

Construct	Confidence
`import`	100%
`from foo import *`	0%
variable	< 100%
function	< 100%
`class`	< 100%
`if False` statement	100%

I'm not sure how much value there is in creating finer grained distinctions than 100% vs. <100%.

Pacu2 commented 7 years ago

In couple of projects I've seen "unused" imports, which were used by templating engines. For example, you can import Django filters into Jinja2 filters file, and access them directly from the template, to do so, you don't need to list them anywhere else. So for me import is still < 100%.

RJ722 commented 7 years ago

Hello @Pacu2 I am quite unfamiliar with using Jinja2. But from what I understand, if we import some django filter :

from django import some_filter
import jinja2

loader = jinja2.FileSystemLoader('/tmp')
env = jinja2.Environment(autoescape=True, loader=loader)

env.filters['some_filter'] = some_filter
temp = env.get_template('template.html')
temp.render(name="testing")

Then we would still need to pass on the some_filter. So, this won't be an unused import after all. Reference - This stack overflow question - I may be completely wrong.

Can you please post an example! Thank You! :smile:

jendrikseipp commented 7 years ago

Automatic dead code removal is now a feature in coala (thanks to @RJ722), so this issue can be closed.

twmr commented 7 years ago

@jendrikseipp, you mean that the automatic dead code removal was introduced in https://github.com/coala/coala-bears/pull/2000?

jendrikseipp commented 7 years ago

Yes, exactly.

jendrikseipp / vulture

use vulture to automatically remove dead code? #25