mathesar-foundation / mathesar

Web application providing an intuitive user experience to databases.
https://mathesar.org/
GNU General Public License v3.0
2.32k stars 320 forks source link

Improve type inference tree #2371

Open dmos62 opened 1 year ago

dmos62 commented 1 year ago

Meta issue: #2347

By type inference tree, we're referring to this: https://github.com/centerofci/mathesar/blob/41995b09416886d18f9a13b3306b0fe146d3b2b7/db/columns/operations/infer_types.py#L17

Two ways to improve it:

shrey27tri01 commented 1 year ago
  • add end conditions:
    • an empty element, like PostgresType.BOOLEAN: [], breaks the loop, when a cast to BOOLEAN is successful;

@dmos62 a small doubt here. When a cast to BOOLEAN is successful, it means that the recursive call to infer_column_type inside the loop returns PostgresType.BOOLEAN: [], which reflects the best type for the column. Isn't this ideal behavior? In that case, what do you mean by adding "end conditions"? Let me know if I am missing something.

dmos62 commented 1 year ago

@shrey27tri01 I think you're right, I misspoke in the description. Unless a type you've successfully cast to is a key in the TYPE_INFERENCE_DAG dict (or its value in the dict is an empty list), that will end the inference. Well done examining this. This also means that we can remove all the TYPE_INFERENCE_DAG dict key-value pairs where the value is an empty list, without changing the algorithm's behaviour.

This doesn't necessarily change the task much: fundamentally we want to make TYPE_INFERENCE_DAG more optimal by manipulating recursion and end-conditions (presuming it's not optimal already).

@mathemancer can you sanity check me here?

shrey27tri01 commented 1 year ago

@dmos62 I'm assuming by adding end-conditions, you mean defining a key with an empty list for the types for which it hasn't already been done in the TYPE_INFERENCE_DAG, for example, adding PostgresType.TIMESTAMP_WITHOUT_TIME_ZONE: [] and MathesarCustomType.URI: [] to the DAG. But doing so would be pointless, since like you said, we can remove all the TYPE_INFERENCE_DAG dict key-value pairs where the value is an empty list, without changing the algorithm's behaviour. Do let me know if I'm missing something here.

Regarding the second way you mentioned (reducing the breadth of search), we can add some checks to reduce the number of computations, like so:

def can_cast(from_type, to_type:):
    """
    Determines if a cast from one data type to another is valid.
    """
    if from_type == to_type:
        return True
    if to_type in TYPE_INFERENCE_DAG[from_type]:
        return True
    if not TYPE_INFERENCE_DAG[from_type]:
        return False
    for intermediate_type in TYPE_INFERENCE_DAG[from_type]:
        if can_cast(intermediate_type, to_type):
            return True
    return False

Let me know if this is what you had in mind, or otherwise.

Also, there might be other performance improvements, like using a more efficient data structure (like networkx), pruning unnecessary edges, caching, or grouping similar types together, but since the DAG is not very complicated and there are a limited number of possible nodes, I believe pursuing them would not bring much to the table.

dmos62 commented 1 year ago

@shrey27tri01 the predicate can_cast seems like a reformulation of what the code around TYPE_INFERENCE_DAG already does, unless I'm misimagining how you would use it. I'm not against clearer logic, but the goal (unstated goal, I now realise) of this issue is to improve the performance of type inference by manipulating the type inference tree (and/or supporting logic).

I don't think that a different data structure would help with performance much. Overwhelming majority of time is spent during casting.

You seem to have a good grasp of this. I encourage you to try and prepare a solution.

dmos62 commented 1 year ago

Possibly relevant changes in #2681 and #2671.

github-actions[bot] commented 11 months ago

This issue has not been updated in 90 days and is being marked as stale.