src-d / gitbase

SQL interface to git repositories, written in Go. https://docs.sourced.tech/gitbase
Apache License 2.0

Idea: per-partition joins #716

Open · erizocosmico opened 5 years ago

erizocosmico commented 5 years ago

Right now, since one partition means one repository, we know that joins (by repository) can only match rows within the same partition. Instead, we iterate over all of the partitions and try to join them together.

Imagine we have 3 partitions.

[Table from the original issue: rows returned by each partition on the left side of the join, 100 rows in total.]

[Table from the original issue: rows returned by each partition on the right side of the join, 100 rows in total.]

We are joining 100 rows with 100 rows, which produces 10,000 rows that are then filtered by the join conditions (but we still perform those 10k iterations).

Instead, if we did this per partition, only rows from the same partition would be combined (and then filtered by the conditions):

[Table from the original issue: rows produced by joining each partition's left side with its right side.]

The total number of rows produced is 4,050, which is about 40% of the number of rows generated before. The gap grows enormously as the number of partitions and rows grows.
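In general (notation mine, not from the issue): if partition i returns l_i rows on the left side and r_i rows on the right, the two strategies materialize, before filtering,

```latex
\underbrace{\Big(\sum_i l_i\Big)\Big(\sum_i r_i\Big)}_{\text{all partitions at once}}
\qquad\text{vs.}\qquad
\underbrace{\sum_i l_i \, r_i}_{\text{per partition}}
```

With the example's totals that is 100 × 100 = 10,000 versus the 4,050 of the per-partition breakdown, and the left-hand side always dominates, since it contains every cross-partition term that the right-hand side drops.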

What could we do?

A rule that runs at the end of the analysis and transforms joins (the ones left after the squash) into something like:

```
Concat
 |- InnerJoin
    |- PartitionTable(TableA)
    |- PartitionTable(TableB)
```

PartitionTable is a table that only returns the rows of a single partition. Concat is a node that iterates over all partitions and, for each one, transforms all of its Table children into PartitionTables; the rows of every partition are then put together and returned to the user. This will also happen in parallel. Essentially, Concat is like an Exchange; the only way it differs is that it can handle binary nodes, not only unary nodes. This is something that cannot be done in go-mysql-server itself, but can be done here, since we know for certain that the partitions are the same for every table.
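To make the shape concrete, here is a minimal, self-contained sketch of that plan. The interfaces are simplified stand-ins I made up for illustration (Row, partitionNode, RowsFor, and so on); they are not go-mysql-server's actual sql.Node API:

```go
package main

import "fmt"

// Row is a stand-in for a table row; Repo is the partition key,
// since in gitbase one partition means one repository.
type Row struct {
	Repo string
	Val  string
}

// PartitionedTable holds a table's rows grouped by partition.
type PartitionedTable map[string][]Row

// partitionNode is a plan node that can produce its rows for a
// single partition at a time.
type partitionNode interface {
	RowsFor(partition string) []Row
}

// PartitionTable returns only the rows of one partition of a table.
type PartitionTable struct{ Table PartitionedTable }

func (t PartitionTable) RowsFor(p string) []Row { return t.Table[p] }

// InnerJoin combines the rows of its two children. Because both
// children are PartitionTables evaluated for the same partition,
// rows from different repositories are never compared.
type InnerJoin struct {
	Left, Right partitionNode
	Cond        func(l, r Row) bool
}

func (j InnerJoin) RowsFor(p string) []Row {
	var out []Row
	for _, l := range j.Left.RowsFor(p) {
		for _, r := range j.Right.RowsFor(p) {
			if j.Cond(l, r) {
				out = append(out, Row{l.Repo, l.Val + "+" + r.Val})
			}
		}
	}
	return out
}

// Concat evaluates its child once per partition and concatenates the
// results. Done sequentially here for clarity; the real node would
// run partitions in parallel, like an Exchange.
type Concat struct {
	Partitions []string
	Child      partitionNode
}

func (c Concat) Rows() []Row {
	var out []Row
	for _, p := range c.Partitions {
		out = append(out, c.Child.RowsFor(p)...)
	}
	return out
}

func main() {
	a := PartitionedTable{"repo1": {{"repo1", "a1"}}, "repo2": {{"repo2", "a2"}}}
	b := PartitionedTable{"repo1": {{"repo1", "b1"}}, "repo2": {{"repo2", "b2"}}}

	plan := Concat{
		Partitions: []string{"repo1", "repo2"},
		Child: InnerJoin{
			Left:  PartitionTable{a},
			Right: PartitionTable{b},
			Cond:  func(l, r Row) bool { return l.Repo == r.Repo },
		},
	}
	fmt.Println(plan.Rows())
}
```

Since the InnerJoin's children are both restricted to the same partition, the nested loops never compare rows from different repositories, which is exactly where the row-count savings come from.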

I called it Concat, but the name is pretty lame, so we should think of a better one: PartitionExchange, BinaryExchange, or something like that.

This should make (non-squashed) joins much, much faster, and in real-life applications you will have many of them, because the leaves will be subqueries.

ajnavarro commented 5 years ago

If I understood correctly, this will only work that way if the join conditions are relations that can only hold inside one repository, right?

In that case, what is the difference between this parallelized join and squashed tables?

erizocosmico commented 5 years ago

@ajnavarro yup

This is tailored for the joins left after squashing. There are cases in which you will have joins for sure: for example, when you are joining the results of two subqueries, which is a pretty common thing to have in any real-world application.

Consider the following query:

```sql
SELECT uast_extract(
    uast(blob_content, 'PHP', "//uast:String"),
    '@pos') AS positions,
    repository_id,
    file_path
FROM (
    SELECT f.repository_id,
        f.file_path,
        b.blob_content
    FROM (
        SELECT *
        FROM refs r
        NATURAL JOIN commit_blobs cb
        NATURAL JOIN blobs
        WHERE r.ref_name = 'HEAD'
            AND NOT IS_BINARY(blob_content)
    ) b
    INNER JOIN (
        SELECT repository_id, file_path, blob_hash
        FROM refs r
        NATURAL JOIN commit_files cf
        WHERE r.ref_name = 'HEAD'
    ) f
    ON b.blob_hash = f.blob_hash
        AND b.repository_id = f.repository_id
    WHERE language(f.file_path, b.blob_content) = 'PHP'
) t
WHERE positions IS NOT NULL
```

This cannot be squashed. Because of that, you're performing a massive cross join over a lot of data if you run it on a big dataset. If you do the joins per partition, you generate far less data and thus make it much faster. It also has another advantage: since we know that only rows from the same partition can match, in-memory joins become more feasible. You don't need to keep all rows of one side in memory, only all rows of one partition; then you move on to the next partition, and so on.
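As an illustration of that last point, here is a hypothetical per-partition hash join, reusing the Row and PartitionedTable types from the sketch above (joining on Val, which stands in for something like blob_hash):

```go
// joinPerPartition builds a hash table over the right side of a single
// partition, probes it with that partition's left-side rows, then
// discards it before moving on. Peak memory is bounded by the largest
// partition, not by the whole right side of the join.
func joinPerPartition(partitions []string, left, right PartitionedTable, emit func(l, r Row)) {
	for _, p := range partitions {
		// Build phase: index only this partition's right-side rows
		// by the join key.
		index := make(map[string][]Row)
		for _, r := range right[p] {
			index[r.Val] = append(index[r.Val], r)
		}
		// Probe phase: stream this partition's left-side rows.
		for _, l := range left[p] {
			for _, r := range index[l.Val] {
				emit(l, r)
			}
		}
		// index becomes garbage here; the next partition starts fresh.
	}
}
```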