Open alexpdp7 opened 5 years ago
Normally we don't want to join with repositories unless there are already joins involved. When querying a single table like blobs, the table usually has its own optimizations in place.
For example, blobs with a filter like blob_hash IN list only reads the given blobs in each repository. That's why the query without the join is faster.
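To make the difference concrete, here is a toy sketch. It is not gitbase's actual code: the in-memory `repos` mapping and the functions `scan_blobs` and `lookup_blobs` are hypothetical names made up for illustration. It only assumes that looking up a blob by hash is a direct access, while an unfiltered read has to walk every blob in every repository.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for the storage: repository id -> {blob hash -> content}.
Repos = Dict[str, Dict[str, bytes]]
Row = Tuple[str, str]  # (repository_id, blob_hash)


def scan_blobs(repos: Repos, wanted: List[str]) -> Tuple[List[Row], int]:
    """Unoptimized path: read every blob in every repository, then filter."""
    reads = 0
    rows: List[Row] = []
    for repo_id, blobs in repos.items():
        for blob_hash in blobs:
            reads += 1  # every blob is read, wanted or not
            if blob_hash in wanted:
                rows.append((repo_id, blob_hash))
    return rows, reads


def lookup_blobs(repos: Repos, wanted: List[str]) -> Tuple[List[Row], int]:
    """Optimized path: only try the requested hashes in each repository."""
    reads = 0
    rows: List[Row] = []
    for repo_id, blobs in repos.items():
        for blob_hash in wanted:
            reads += 1  # one direct access per requested hash
            if blob_hash in blobs:
                rows.append((repo_id, blob_hash))
    return rows, reads


if __name__ == "__main__":
    repos = {
        "repo1": {"aaa": b"x", "bbb": b"y", "ccc": b"z"},
        "repo2": {"aaa": b"x", "ddd": b"w"},
    }
    print(scan_blobs(repos, ["aaa"]))
    print(lookup_blobs(repos, ["aaa"]))
```

Both paths should return the same rows; only the number of reads differs, and the gap grows with the number of blobs per repository.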
As with everything, it depends on the query: depending on what you want, some optimizations may perform better than others.
In any case, I reproduced the bug and there's actually an issue. It seems to not return the repeated rows for some reason.
Yeah, I suspected something like that. Anyway, for my use case the lack of duplicated rows is not an issue, so for me this is not high priority.
This bug is really weird. The natural join is the one returning the correct result. If you remove the optimization in the blobs table, it returns the same. So there's something going on, because repo.BlobObjects() doesn't return these blobs, but accessing them directly does.
@alexpdp7 are you using siva files obtained from gitcollector?
I tried with regular repositories and it didn't happen.
Yup, it's using siva
Narrowed it down to a siva issue and reported it to go-borges: https://github.com/src-d/go-borges/issues/90, so leaving this as blocked until it's solved on their side.
Also note that removing the natural join makes things go much faster. It was my understanding that normally we want to join with repositories to benefit from some specific optimizations (although I'm guessing that filtering with blob_hash makes those optimizations moot).