Open sylviesworld opened 3 years ago
@dlomax64 could you remind what they should contain (i.e. schema)?
I think that the original b2cPtaPkgR_all table will work for what I want to do with it: b2cPtaPkgR_all: | name | type |
---|---|---|
blob | FixedString(20) | |
commit | FixedString(20) | |
project | String | |
time | UInt32 | |
author | String | |
language | String | |
deps | String |
If I remember correctly, Chris was wanting to be able to get downstream commits from a dependency so this table should be fine for that + some things I want to add for the website.
TY. Will let you know once done. My plan is to have c2PtabllfPkgS_all where 'llf' part stands for language,language,file (the second language is per determination of ctags while the first is based on file extension
name | type |
---|---|
commit | FixedString(20) |
Project | String |
time | UInt32 |
author | String |
language | String |
language-from-ctags | String |
filename | String |
deps | String |
Does the language extension and ctags for one commit often differ?
In my experience with the c2PtabllfPkgS
map, ctags often raise the issue of existence rather than equality in comparison to WoC language. The WoC language field is never empty in the map, but 75,300,669 lines in the map do not contain a ctag.
For records where both language fields exist, they match a strong majority of the time.
Ctags also can become an issue when checking their value in scripts since they are not character-safe. For example, C# (the ctag for the C# language) may be a problem. The WoC equivelant is simply 'Cs', a safer string.
@audrism Can this table, or some other lookup program, be optimized for deps
queries by somehow splitting deps
into their own fields?
Currently, searching for all downstream commits for a pkg requires traversing all lines of each c2PtabllfPkgS
file.
A mapping ~/lookup/getValues/pkg2P
would also be very useful. I can raise a new issue suggesting this if you agree it is of interest, independent from this clickhouse table.
In my experience with the
c2PtabllfPkgS
map, ctags often raise the issue of existence rather than equality in comparison to WoC language. The WoC language field is never empty in the map, but 75,300,669 lines in the map do not contain a ctag.For records where both language fields exist, they match a strong majority of the time.
Ctags also can become an issue when checking their value in scripts since they are not character-safe. For example, C# (the ctag for the C# language) may be a problem. The WoC equivelant is simply 'Cs', a safer string.
I included ctags classification mostly as a way to see where language types do not match, it may make sense to use linguist tool to identify language in such cases
I do not imagine the "original" c2PtabllfPkgS
being useful for Clickhouse since it is a time series oriented database. Most queries on time series (in the work I've done) aim to collect all commits, authors, or blobs from a project in a set period of time. The c2PtabllfPkgS
map would then report an inflated count for commits in the given period of time because some commits appear multiple times in the table (assuming records in the table would be identical to the existing files at /da1_data/basemaps/gz/c2PtabllfPkgFullS{0..127}.s
).
What would be very useful is a similar table that uses Project as the key, such as p2cbtaPkg
to allow for an accurate query of commits/blobs/authors in a given time frame per project.
Apart from the commit detail (and project) table in the CH noted above, it is not entirely clear what to add @noah-22 @dlomax64
a) One idea is to have monthly commit/blob counts (see issue #25) b) it seems APIs need to be firs-class object, so CH may not be a good target: create TCH table? c) I am thinking blob to first author/commit may be good CH use case since it has time?
@audrism Can you update the Clickhouse Tables?