ssc-oscar / lookup

A mirror of bitbucket.org/swcs/lookup

How are author IDs determined? #5

Closed by k----n 4 years ago

k----n commented 4 years ago

I counted the lines of the following output:

zcat /da0_data/basemaps/gz/a2cFullS.* | awk 'BEGIN{FS=OFS=";"}{NF--; print}' | uniq

and got 47246245.

According to https://bitbucket.org/swsc/overview/src/master/README.md there should be 47247366 author IDs.

Are there some author IDs that do not reference a commit? It seems strange to me that there are 1121 fewer authors when I query the authors mapped to commits.

audrism commented 4 years ago

awk is not so good with spaces: try

zcat /da0_data/basemaps/gz/a2cFullS.* | cut -d\; -f1 | uniq | wc

also

zcat /da0_data/basemaps/gz/aS.s | wc
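For anybody who wants to try the cut-based count without access to the WoC servers, here is a minimal sketch on made-up data. The sample file, the author strings, and the commit values are all hypothetical stand-ins for a2cFullS.*, and the input is assumed to be already grouped by author so that uniq collapses duplicates. Setting LC_ALL=C is an added precaution (not something the thread prescribes) so that cut and uniq treat the input as raw bytes regardless of encoding.

```shell
# Hypothetical sample standing in for a2cFullS.*: "author;commit", grouped by author.
tmp=$(mktemp -d)
printf '%s\n' \
  'Alice A <alice@example.com>;c1' \
  'Alice A <alice@example.com>;c2' \
  'Bob  B <bob@example.com>;c3' | gzip > "$tmp/sample.gz"

# Keep only the first ';'-separated field, collapse adjacent duplicates, count lines.
# LC_ALL=C (an assumption of this sketch) forces byte-wise handling of the input.
count=$(gzip -dc "$tmp/sample.gz" | LC_ALL=C cut -d\; -f1 | LC_ALL=C uniq | wc -l | tr -d ' ')
echo "$count"

rm -rf "$tmp"
```

cut only splits on the delimiter and never rebuilds the line, so repeated spaces and odd bytes inside an author ID pass through unchanged.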

k----n commented 4 years ago

So I sorted and did a diff and it turns out awk doesn't handle things with different encodings very well. :(

The reason I used awk in the first place was to account for any edge case where an author ID might contain ;. But I ran grep on the output of zcat /da0_data/basemaps/gz/aS.s and it doesn't appear that any author IDs contain ;.
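A check like the one described above could look roughly like this. The author list is hypothetical and stands in for the contents of aS.s (one author ID per line); the exact grep invocation used in the thread is not shown, so this is only one way to do it.

```shell
# Hypothetical author list standing in for aS.s, one author ID per line.
authors=$(printf '%s\n' 'Alice A <alice@example.com>' 'Bob B <bob@example.com>')

# grep -c prints the number of lines containing ';'.
# It exits with status 1 when nothing matches, hence the '|| true'.
with_semicolon=$(printf '%s\n' "$authors" | LC_ALL=C grep -c ';' || true)
echo "$with_semicolon"
```

A result of 0 means no author ID contains the delimiter, so splitting on ; with cut is safe for the first field.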

audrism commented 4 years ago

Author IDs have ; replaced with a space when extracted from the commit: see woc.pm:265
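To illustrate why that replacement makes cut safe, here is a small shell sketch. It is not the code from woc.pm (which is Perl); the raw author string and the commit value are made up, and tr is used only to mimic the effect of the substitution.

```shell
# Hypothetical raw author string that contains the field delimiter.
raw='Doe;John <john@example.com>'

# Mimic the sanitizing step: every ';' becomes a space.
clean=$(printf '%s' "$raw" | tr ';' ' ')
echo "$clean"

# After sanitizing, the first ';' in an "author;commit" record is always the field separator.
first_field=$(printf '%s;%s\n' "$clean" 'c1' | cut -d\; -f1)
echo "$first_field"
```

Without the replacement, cut -f1 on the raw record would return only the part of the author ID before its embedded ;.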

k----n commented 4 years ago

Thanks, https://github.com/ssc-oscar/lookup/blob/98e9aaee0c0f47f02049de40cd291fd393e9515d/woc.pm#L246 (for anybody else that might read this issue).

It also looks like ; is replaced with space for commit messages, which also lets me use cut with showCnt commit 2.

Really appreciate all the help you've given so far for understanding WoC. 👍