src-d / datasets

source{d} datasets ("big code") for source code analysis and machine learning on source code

borges-indexer fails to run with database schema from latest borges version #48

Closed vmarkovtsev closed 6 years ago

vmarkovtsev commented 6 years ago

I run borges consumer and it writes several siva files and records to the DB successfully.

Then I run borges-indexer and get

FATA[0004] unable to get result set                      err="pq: column __repository._references does not exist"

vmarkovtsev commented 6 years ago

After updating core-retrieval to master, I get

INFO[0004] start processing repos                        workers=32
WARN[0004] empty repository                              repo=0162bb0b-5d2a-5a9c-62cf-5a81779e5db9
WARN[0004] empty repository                              repo=0162bb0b-5d28-7a05-3aca-46f5d0c88c1f
WARN[0004] empty repository                              repo=0162bb0b-5d2d-89ae-6355-d442830057ee
WARN[0004] empty repository                              repo=0162bb0b-5d2c-cd4a-dab7-2c92e8fa4043
WARN[0004] empty repository                              repo=0162bb0b-5d2e-9274-f90c-7063fb2ee658
INFO[0004] finished processing all repositories          failed=0 processed=5 total=5

erizocosmico commented 6 years ago

Borges schema changed, that's why it's failing. (Sent by email in reply to Vadim Markovtsev's comment of Thu, 12 Apr 2018 at 23:18.)

vmarkovtsev commented 6 years ago

I need to find the proper commit where everything works. Is that possible in theory, @erizocosmico, or has borges diverged too far from the indexer?

vmarkovtsev commented 6 years ago

p.dbRepo.References foreign key does not work for some reason. The schema seems to be in order...

vmarkovtsev commented 6 years ago

I looked through the code, everything looks fine but the foreign key is empty for some reason. I am really curious what the problem will be.

ajnavarro commented 6 years ago

The borges versions that work with the old schema are the 0.11.x ones. You can get the borges binary from here: https://github.com/src-d/borges/releases/tag/v0.11.4

The old schema stored the references in jsonb format in a column of the repositories table; we didn't have foreign keys.

vmarkovtsev commented 6 years ago

@ajnavarro I updated the core-retrieval package in borges-indexer locally and ran it; it now uses exactly the same version as the modern borges. It compiled and almost worked, as seen in the logs... Would it be hard to update borges-indexer, or at least point me to where to investigate? The schema is the same on both ends, which means there should be an easy fix.

erizocosmico commented 6 years ago

There shouldn't really be anything to do in borges-indexer besides updating core-retrieval to the latest version.

vmarkovtsev commented 6 years ago

I assure you that this is what I did...

vmarkovtsev commented 6 years ago

I can post a DB dump here if you want.

erizocosmico commented 6 years ago

Don't worry, I'll take a look whenever I take this issue. For the time being, use borges 0.11.x as it's the version that we used when this was written.

vmarkovtsev commented 6 years ago

This means writing siva files again, but it looks like the only way now.

vmarkovtsev commented 6 years ago

@ajnavarro @erizocosmico bump

ajnavarro commented 6 years ago

I don't know if I'm wrong, but this is not a priority for us (@smola , @mcuadros ?). You can still use the borges version that we used to fetch PGA, and then use the borges indexer.

vmarkovtsev commented 6 years ago

We are going to present these tools to the community on May 30th and they are currently broken.

vmarkovtsev commented 6 years ago

The issue is aligned to https://github.com/src-d/okrs/issues/14

ajnavarro commented 6 years ago

Not at all, in my opinion. The problem here is that an outdated temporary tool created for a specific project does not work with the latest borges version. It does not work because we are updating and improving borges to reach that OKR.

smola commented 6 years ago

@vmarkovtsev Is there any problem with presenting the process for PGA generation as using a specific borges and borges-indexer version? You can even link to the exact GitHub release pages with binaries, at least for borges. We could also publish a working binary of borges-indexer here if needed.

I don't see a problem in presenting and documenting borges-indexer as what it is: a quick tool done for generation of the first version of the dataset and that is likely to not be present in the process for future versions of the dataset.

vmarkovtsev commented 6 years ago

@smola Recent borges versions include important bugfixes that allow cloning more repositories. Most of the people there use Windows, and we do not provide binary releases for it. This means they have to clone the repo into a specific directory under src, fetch the specific revision that is known to work, build it, and run it. Updating borges-indexer would at least remove the step of checking out a specific revision, so they could stick with a go get one-liner.

ajnavarro commented 6 years ago

PROPOSAL

Update borges-indexer dependencies to make it work with the new schema on borges versions >= 0.12.x.

These changes will make borges-indexer fail with prior versions. In other words, with this new version it will be impossible to regenerate the index file from the current PostgreSQL PGA database, which uses the borges 0.11.x schema.

No other changes will be made to borges-indexer, such as adding new columns; the goal is only to make it compatible with the new schema.
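As a rough illustration of the version boundary in this proposal (old schema for 0.11.x, new schema for 0.12.x and later), the following Go sketch classifies a borges version string. The function name `usesNewSchema` and the idea of checking the version string at all are assumptions for illustration only; nothing in this thread says borges-indexer performs such a check.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// usesNewSchema reports whether a borges version such as "0.11.4" or
// "v0.12.0" falls in the >= 0.12.x range named in the proposal.
// Hypothetical helper: it is not part of borges or borges-indexer.
func usesNewSchema(version string) (bool, error) {
	parts := strings.SplitN(strings.TrimPrefix(version, "v"), ".", 3)
	if len(parts) < 2 {
		return false, fmt.Errorf("malformed version %q", version)
	}
	major, err := strconv.Atoi(parts[0])
	if err != nil {
		return false, fmt.Errorf("malformed major version in %q: %w", version, err)
	}
	minor, err := strconv.Atoi(parts[1])
	if err != nil {
		return false, fmt.Errorf("malformed minor version in %q: %w", version, err)
	}
	// Only the major and minor components decide the schema generation.
	return major > 0 || (major == 0 && minor >= 12), nil
}

func main() {
	for _, v := range []string{"v0.11.4", "0.12.0"} {
		isNew, err := usesNewSchema(v)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s uses new schema: %t\n", v, isNew)
	}
}
```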

Caveats @smola @mcuadros ?

smola commented 6 years ago

@vmarkovtsev Will you need to use the up-to-date borges-indexer with our current (old) PostgreSQL PGA database?

https://github.com/src-d/datasets/issues/48#issuecomment-388327027

vmarkovtsev commented 6 years ago

@smola There is no such need.