ssc-oscar / lookup

A mirror of bitbucket.org/swcs/lookup
1 stars 4 forks source link

Clickhouse tables #24

Open sylviesworld opened 3 years ago

sylviesworld commented 3 years ago

@audrism Can you update the Clickhouse Tables?

audrism commented 3 years ago

@dlomax64 could you remind what they should contain (i.e. schema)?

sylviesworld commented 3 years ago
I think that the original b2cPtaPkgR_all table will work for what I want to do with it: b2cPtaPkgR_all: name type
blob FixedString(20)
commit FixedString(20)
project String
time UInt32
author String
language String
deps String

If I remember correctly, Chris was wanting to be able to get downstream commits from a dependency so this table should be fine for that + some things I want to add for the website.

audrism commented 3 years ago

TY. Will let you know once done. My plan is to have c2PtabllfPkgS_all where 'llf' part stands for language,language,file (the second language is per determination of ctags while the first is based on file extension

name type
commit FixedString(20)
Project String
time UInt32
author String
language String
language-from-ctags String
filename String
deps String
sylviesworld commented 3 years ago

Does the language extension and ctags for one commit often differ?

noah-22 commented 3 years ago

In my experience with the c2PtabllfPkgS map, ctags often raise the issue of existence rather than equality in comparison to WoC language. The WoC language field is never empty in the map, but 75,300,669 lines in the map do not contain a ctag.

For records where both language fields exist, they match a strong majority of the time.

Ctags also can become an issue when checking their value in scripts since they are not character-safe. For example, C# (the ctag for the C# language) may be a problem. The WoC equivelant is simply 'Cs', a safer string.

noah-22 commented 3 years ago

@audrism Can this table, or some other lookup program, be optimized for deps queries by somehow splitting deps into their own fields?

Currently, searching for all downstream commits for a pkg requires traversing all lines of each c2PtabllfPkgS file. A mapping ~/lookup/getValues/pkg2P would also be very useful. I can raise a new issue suggesting this if you agree it is of interest, independent from this clickhouse table.

audrism commented 3 years ago
  1. Got it, that means making a package a first-class object, but is the original table of use (in that case it is not a CH topic)?
  2. just added to CH commit_all: the schema is (tc it commit time (unix seconds), taz is author timezone and tcz is commiter tz) (sha1 FixedString(20), time Int32, tc Int32, tree FixedString(20), parent String, taz String, tcz String, author String, commiter String, project String, comment String
audrism commented 3 years ago

In my experience with the c2PtabllfPkgS map, ctags often raise the issue of existence rather than equality in comparison to WoC language. The WoC language field is never empty in the map, but 75,300,669 lines in the map do not contain a ctag.

For records where both language fields exist, they match a strong majority of the time.

Ctags also can become an issue when checking their value in scripts since they are not character-safe. For example, C# (the ctag for the C# language) may be a problem. The WoC equivelant is simply 'Cs', a safer string.

I included ctags classification mostly as a way to see where language types do not match, it may make sense to use linguist tool to identify language in such cases

noah-22 commented 3 years ago

I do not imagine the "original" c2PtabllfPkgS being useful for Clickhouse since it is a time series oriented database. Most queries on time series (in the work I've done) aim to collect all commits, authors, or blobs from a project in a set period of time. The c2PtabllfPkgS map would then report an inflated count for commits in the given period of time because some commits appear multiple times in the table (assuming records in the table would be identical to the existing files at /da1_data/basemaps/gz/c2PtabllfPkgFullS{0..127}.s).

What would be very useful is a similar table that uses Project as the key, such as p2cbtaPkg to allow for an accurate query of commits/blobs/authors in a given time frame per project.

audrism commented 3 years ago

Apart from the commit detail (and project) table in the CH noted above, it is not entirely clear what to add @noah-22 @dlomax64

a) One idea is to have monthly commit/blob counts (see issue #25) b) it seems APIs need to be firs-class object, so CH may not be a good target: create TCH table? c) I am thinking blob to first author/commit may be good CH use case since it has time?