src-d / gitbase

SQL interface to git repositories, written in Go. https://docs.sourced.tech/gitbase
Apache License 2.0
2.06k stars 124 forks source link

Natural join seems to eliminate rows which it shouldn't #977

Open alexpdp7 opened 4 years ago

alexpdp7 commented 4 years ago
MySQL [gitbase]> select blob_hash, repository_id from blobs natural join repositories where blob_hash in ('93ec5b4525363844ddb1981adf1586ebddbc21c1', 'aad34590345310fe813fd1d9eff868afc4cea10c', 'ed82eb69daf806e521840f4320ea80d4fe0af435');
+------------------------------------------+-------------------------------------+
| blob_hash                                | repository_id                       |
+------------------------------------------+-------------------------------------+
| aad34590345310fe813fd1d9eff868afc4cea10c | github.com/bblfsh/javascript-driver |
| ed82eb69daf806e521840f4320ea80d4fe0af435 | github.com/src-d/enry               |
| aad34590345310fe813fd1d9eff868afc4cea10c | github.com/bblfsh/python-driver     |
| 93ec5b4525363844ddb1981adf1586ebddbc21c1 | github.com/src-d/go-mysql-server    |
| aad34590345310fe813fd1d9eff868afc4cea10c | github.com/bblfsh/ruby-driver       |
| ed82eb69daf806e521840f4320ea80d4fe0af435 | github.com/src-d/gitbase            |
+------------------------------------------+-------------------------------------+
6 rows in set (14.90 sec)

MySQL [gitbase]> select blob_hash, repository_id from blobs where blob_hash in ('93ec5b4525363844ddb1981adf1586ebddbc21c1', 'aad34590345310fe813fd1d9eff868afc4cea10c', 'ed82eb69daf806e521840f4320ea80d4fe0af435');
+------------------------------------------+-------------------------------------+
| blob_hash                                | repository_id                       |
+------------------------------------------+-------------------------------------+
| aad34590345310fe813fd1d9eff868afc4cea10c | github.com/bblfsh/python-driver     |
| aad34590345310fe813fd1d9eff868afc4cea10c | github.com/bblfsh/javascript-driver |
| ed82eb69daf806e521840f4320ea80d4fe0af435 | github.com/src-d/enry               |
| aad34590345310fe813fd1d9eff868afc4cea10c | github.com/bblfsh/ruby-driver       |
| 93ec5b4525363844ddb1981adf1586ebddbc21c1 | github.com/src-d/gitbase            |
| ed82eb69daf806e521840f4320ea80d4fe0af435 | github.com/src-d/gitbase            |
| 93ec5b4525363844ddb1981adf1586ebddbc21c1 | github.com/src-d/go-mysql-server    |
| ed82eb69daf806e521840f4320ea80d4fe0af435 | github.com/src-d/go-mysql-server    |
+------------------------------------------+-------------------------------------+
8 rows in set (0.13 sec)

also note that removing the natural join makes things go much faster- it was my understanding that normally we want to join with repositories to benefit from some specific optimizations (although I'm guessing that filtering with blob_hash makes those optimizations moot).

erizocosmico commented 4 years ago

Normally we don't want to join with repositories unless there are already joins involved. When querying a single table like blobs, they usually have other optimizations in place.

For example, blobs with a filter like blob_hash IN list only reads the given blobs in each repository. That's why no join it's faster.

As with everything: it depends on the query, depending on what you want, some optimizations may be better than others for performance.

In any case, I reproduced the bug and there's actually an issue. It seems to not return the repeated rows for some reason.

alexpdp7 commented 4 years ago

Yeah, I suspected something like that. Anyway, for my use case lack of duplicated rows is not an issue, so for my this is not high priority.

erizocosmico commented 4 years ago

This bug is really weird. The natural join is the one returning the correct result. If you remove the optimization in blobs table it returns the same. So, there something going on because repo.BlobObjects() doesn't return these blobs, but accessing them directly does

erizocosmico commented 4 years ago

@alexpdp7 are you using siva files got from gitcollector?

I tried with regular repositories and it didn't happen.

alexpdp7 commented 4 years ago

Yup, it's using siva

erizocosmico commented 4 years ago

Narrowed it down to a siva issue and reported it to go-borges: https://github.com/src-d/go-borges/issues/90, so leaving this as blocked until it's solved on their side.