opsmill / infrahub-sdk-python

Apache License 2.0
3 stars 1 forks source link

task: Redesign or remove InfrahubTransform.collect_data() #54

Open ogenstad opened 3 weeks ago

ogenstad commented 3 weeks ago

Component

Python SDK, infrahubctl CLI

Task Description

Looking at the current transforms we have the below snippet as part of the relevant code. By looking at this we can observe that when calling the .run() method it will run .collect_data() if no data was provided so that the transform can get the data it needs.

class InfrahubTransform:
    name: Optional[str] = None
    query: str
    timeout: int = 10

    async def collect_data(self) -> dict:
        """Query the result of the GraphQL Query defined in self.query and return the result"""

        return await self.client.query_gql_query(name=self.query, branch_name=self.branch_name)

    async def run(self, data: Optional[dict] = None) -> Any:
        """Execute the transformation after collecting the data from the GraphQL query.
        The result of the check is determined based on the presence or not of ERROR log messages."""

        if not data:
            data = await self.collect_data()
        unpacked = data.get("data") or data

        if asyncio.iscoroutinefunction(self.transform):
            return await self.transform(data=unpacked)

        return self.transform(data=unpacked)

On the surface this looks quite fine, however in this state it's broken. The reason is that practically every time we run a transform we want to add input variables to the GraphQL query.

In many cases a transform would take a device name as input but it could also be something very simple like this:

query TagsQuery($tag: String!) {
  BuiltinTag(name__value: $tag) {
    edges {
      node {
        name {
          value
        }
        description {
          value
        }
      }
    }
  }
}

With this in mind if we go back to the transform code we can see that the .collect_data() method doesn't work as it only uses the query as is but never sends in the required variables. Today .collect_data() can never be used for the indented purpose. If we compare with the generators we see that the generators version of .collect_data() actually uses the params/variables:

    async def collect_data(self) -> dict:
        """Query the result of the GraphQL Query defined in self.query and return the result"""

        data = await self._init_client.query_gql_query(
            name=self.query,
            branch_name=self.branch_name,
            variables=self.params,
            update_group=True,
            subscribers=self.subscribers,
        )
        unpacked = data.get("data") or data
        await self.process_nodes(data=unpacked)
        return data

The reason why the transforms work today is that the data is always provided by an outside source. When running in the git-agent the data is sent to the .run() method: https://github.com/opsmill/infrahub/blob/infrahub-v0.16.2/backend/infrahub/git/integrator.py#L1221

The infrahubctl transform command does the same, i.e. it's querying for the data before using the InfrahubTransform.run() method.

As a note a reason for why we might want to be able to provide the data to .run() is that for example with the artifact generation we are running the query in a way that updates the GraphQLQueryGroup: https://github.com/opsmill/infrahub/blob/infrahub-v0.16.2/backend/infrahub/git/integrator.py#L1254-L1262

The Python checks also work closer to what the generators do today.

We need to reconsider how this should work and ensure that the behaviour is consistent across all types i.e. Transforms, Generators and Checks.

PhillSimonds commented 1 week ago

@ogenstad I have another comment to add here which may fold into this task. It looks like data can be collected via a stored query path (.query_gql_query on the InfrahubClient object) or with a query string (.execute_graphql on the InfrahubClient object). The existing methods assume a stored graphQL query path will exist. If we end up collecting data from the API using the transform run method, it would likely need to support both as the stored query path can't be guaranteed to exist.