Suggest replacing `Tuple[X]` with `Tuple[X, ...]` or `Sequence[X]`

dhgarrette commented 2 years ago

Current problem

When Tuple[X] is used as a type annotation, it means "a tuple of length 1, whose (sole) element is of type X". However, since there are not very many situations in which one would genuinely want to specify something as being a 1-tuple, the vast majority of the time that someone writes Tuple[X], they are doing so out of a mistaken belief that Tuple[X] is analogous to List[X] ("a list of Xs"). The syntax for "a tuple of Xs of any length" is Tuple[X, ...].

Desired solution

It would be great if pylint were to flag uses of Tuple[X] and inform the user of the likely misunderstanding. In most cases, the user will likely be trying to express the notion "an (immutable) sequence of Xs", so the most appropriate suggestion would probably be Sequence[X]. However, the pylint message should also inform the user that if they do, indeed, want to be specific about the object being a tuple (of unknown length), then the correct syntax would be Tuple[X, ...].

Pierre-Sassoulas commented 2 years ago

I think this is a good idea, I already made the mistake myself.

DanielNoord commented 2 years ago

I might be totally wrong here but wouldn't Sequence be incorrect for this:

def fun(param: str, paramm: int) -> Tuple[str, int]:
    ret = tuple(str, int)
    return ret

I have been told that return value typing should be restrictive whereas argument typing should be as "open" as possible.

dhgarrette commented 2 years ago

@DanielNoord: A small clarification: The lint error I'm suggesting wouldn't trigger for your example since it doesn't contain any length-1 tuple type annotations (i.e., Tuple[str, int] is a perfectly appropriate use of Tuple, whereas if someone writes Tuple[str], that would likely indicate an error).

But the larger point you're making is totally valid: the suggested fix might be different depending on the context of the length-1 tuple annotation.

So, for example, given this input:

def sum_of_squares(xs: Tuple[int]) -> int:
  return sum(x**2 for x in xs)

it would be helpful for the linter inform the user that Tuple[int] means "a length-1 tuple containing an int", and suggest that they change it to Sequence[int].

On the other hand, given this input:

def evens(x: int) -> Tuple[int]:
  return tuple(range(0, x, 2))

perhaps it would be most appropriate to suggest that they change Tuple[int] to Tuple[int, ...].

Likewise, given this input:

xs: Tuple[int] = (1, 1, 2, 3, 5)

the best suggestion might also be to change Tuple[int] to Tuple[int, ...].

DanielNoord commented 2 years ago

But what about:

def func(string: str) -> Tuple[int]:
    return (int(string),)

For example when I know that later on I want to concatenate the return value of func onto another tuple. Can't there be cases where the return value actually is Tuple[int]? We could discuss whether pylint should suggest -> int and do the cast to tuple somewhere else, but I can see this happening

dhgarrette commented 2 years ago

Yes, that case is certainly possible, but I think that such intentional cases are really quite rare in comparison to uses of Tuple[X] that are due to a misunderstanding of its meaning. Anecdotally, in investigating this issue, I did a fair bit of grepping through code to see how widespread this problem is, and found a huge number of cases where people wrote Tuple[X] and then treated it as if it were variable length, and, in the sample of results I looked through, didn't see any cases where Tuple[X] was used in a way that really seemed to be intentional. Thus, I think that the lint error would be helpful in nearly all cases, and for those folks who already understand the semantics of Tuple[X] and want to do things like you're describing for some specific reason, they can always disable the lint error.

DanielNoord commented 2 years ago

While I totally agree that this is probably often misunderstood and it would be good to "fix" that I don't think we should add a checker of which we know there are unfixable false positives.

See some discussion in https://github.com/PyCQA/pylint/issues/746 about preferring false negatives over false positives. We really want to avoid false positives over false negatives. If we can come up with a good way to exclude the false positives without requiring users to add disable's I think this would be a very useful check, otherwise I'm against adding this.

Pierre-Sassoulas commented 2 years ago

I looked through, didn't see any cases where Tuple[X] was used in a way that really seemed to be intentional.

It's not surprising to me.

I don't think we should add a checker of which we know there are unfixable false positives.

I agree with the premise, but then we would need to detect the type of the return. First this is complicated, second It would be stepping on mypy's toes.

Having said that, I think returning a tuple of one element makes little sense and should never happen. If you're going to make the code evolve later, then maybe you ain't gonna need it, or you should keep it simple and stupid, or you should do it in one sweeping refactor ? If you can handle a variable length tuple already, then why you don't you do type it properly right now ? A one element tuple IS a multiple element tuple, so the updated typing would not be wrong. ... and it would prevent disastrous refactor if someone start to handle only one element and not multiples like the intent was.

when I know that later on I want to concatenate the return value of func onto another tuple.

More to your point, If you need to concatenate the tuple later then I think this:

def func(string: str) -> Tuple[int]:
    return (int(string),)

func(1) + (2,)

Is less coherent than:

def func(string: str) -> int:
    return int(string)

(func("1"),) + (2,)

If you're adding to a tuple only once why do you create a function for this one time ? The function is easier to reuse if you return one element. It's also less coherent than this:

def func(string: str, int_tuple: Tuple[int, ...]) -> Tuple[int, ...]:
    return (int(string),) + int_tuple

func("1", (2,))

Because if you need to add tuple together multiple time, why would you duplicate the addition ?

So imo, a check that would say "return a tuple of variable length or a single element" will always be right, if we consider that returning a one element tuple is problematic in term of design. We might call that a refactoring checks, but imo it will be massively useful as a hidden convention or warning check with "false positive" (that are in fact refactor message) because I think the number of voluntary one element Tuple will be very low indeed.

jacobtylerwalls commented 2 years ago

I agree that this check has value. This is another are-you-sure message like method-cache-max-size-none where it's possible the usage is totally fine, but there's a "gotcha" or "easy misuse" to be aware of that you probably didn't intend. But the warning makes enough sense for enough people that we don't need to demote it all the way to an extension like while-used.

Pierre-Sassoulas commented 1 year ago

Here's how ChatGPT would implement this feature using the original description:

from astroid import MANAGER, FromAstroidStrategy
from pylint.checkers import BaseChecker
from pylint.interfaces import IAstroidChecker

class TupleAnnotationChecker(BaseChecker):
    __implements__ = IAstroidChecker

    name = 'tuple-annotation'
    priority = -1
    msgs = {
        'W9902': (
            'Tuple[X] is likely a mistake, use Sequence[X] instead',
            'tuple-annotation',
            'Tuple[X] is used to represent a 1-tuple, but is often used by mistake as an analog to List[X]. Consider using Sequence[X] instead.'
        ),
    }

    def visit_call(self, node):
        if node.func.as_string() == 'Tuple':
            if len(node.args) == 1:
                self.add_message('tuple-annotation', node=node)

def register(linter):
    MANAGER.register_transform(astroid.Call, astroid.extract_node, FromAstroidStrategy)
    linter.register_checker(TupleAnnotationChecker(linter))

I told it that : "__implements__ is an old PEP that was rejected in 2006, please don't use that. You need to inherit from the right class.". and got a new version:

from pylint.checkers import BaseChecker
from pylint.interfaces import IAstroidChecker
from astroid import ALL_OPTIONAL_NODES

class TupleAnnotationChecker(BaseChecker):
    __implements__ = IAstroidChecker

    name = 'tuple-annotation-checker'
    msgs = {
        'W0001': ('Tuple[X] is likely a mistake, use Sequence[X] instead',
                  'tuple-annotation',
                  'Used when Tuple[X] is used as a type annotation.'),
    }

    priority = -1
    options = ()

    def visit_annassign(self, node):
        if (node.annotation and
            node.annotation.as_string().startswith('Tuple[') and
            not node.annotation.as_string().endswith('...]')):
            self.add_message('tuple-annotation', node=node)

def register(linter):
    linter.register_checker(TupleAnnotationChecker(linter))

And I asked for functional tests but I'll have to make do with that unit test looking proposition:

def test_tuple_annotation_checker(lint_output):
    code = '''
def foo(a: Tuple[int]):
    pass
'''
    messages = lint_output(code)
    assert len(messages) == 1
    assert messages[0].msg_id == 'tuple-annotation'
    assert 'Tuple[X] is likely a mistake' in messages[0].message

def test_tuple_annotation_checker_correct_usage(lint_output):
    code = '''
def foo(a: Tuple[int, ...]):
    pass
'''
    messages = lint_output(code)
    assert len(messages) == 0

The lint_output function is a fixture that should run the code through Pylint and return the list of messages generated by the linter. The test asserts that the plugin generates the correct number of messages for the correct code, and that the messages contain the correct text.

Anyway ; that was fun. But I think we'll still have to implement this. (Although it sound like the kind of fixture it invented would actually be nice to have. )

jacobtylerwalls commented 1 year ago

@Pierre-Sassoulas can you get ChatGPT to fix our namespace package handling or add some control flow? lol.

Pierre-Sassoulas commented 1 year ago

Q:

Do you have any pointers about how to fix namespace handling or control flow in pylint's codebase ?

A:

(...) Here are some general steps you could follow to modify Pylint's behavior: Familiarize yourself with the codebase: (...) Modify the code (...) Test your changes (...) It's important to note that modifying Pylint's behavior can be a complex task, and it's recommended that you have a good understanding of both Python and the ast module before attempting to make changes to Pylint's codebase.

Alas, we'll still need a world leading domain expert engineer like you for that, heh 😛

pylint-dev / pylint

Suggest replacing `Tuple[X]` with `Tuple[X, ...]` or `Sequence[X]` #5578

Current problem

Desired solution