awk is not so good with spaces: try
zcat /da0_data/basemaps/gz/a2cFullS.* | cut -d\; -f1 | uniq | wc
also
zcat /da0_data/basemaps/gz/aS.s | wc
So I sorted and did a diff and it turns out awk doesn't handle things with different encodings very well. :(
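The sort-and-diff comparison mentioned above can be sketched roughly as follows. This is only an illustration on made-up data, not the commands that were actually run: the files `sample_a2c.gz` and `sample_aS.gz` and the author IDs in them are hypothetical stand-ins for `a2cFullS.*` and `aS.s`, and `LC_ALL=C` is an assumption added so that `sort` and `comm` compare raw bytes regardless of encoding.

```shell
#!/bin/sh
# Hypothetical sample data standing in for a2cFullS.* (author;commit)
# and aS.s (one author ID per line).
tmp=$(mktemp -d)
printf 'Alice <a@x.org>;c1\nAlice <a@x.org>;c2\nBob <b@x.org>;c3\n' | gzip > "$tmp/sample_a2c.gz"
printf 'Alice <a@x.org>\nBob <b@x.org>\nCarol <c@x.org>\n' | gzip > "$tmp/sample_aS.gz"

# Authors that appear in the author-to-commit map (first field only).
zcat "$tmp/sample_a2c.gz" | cut -d\; -f1 | LC_ALL=C sort -u > "$tmp/from_a2c"
# All known authors.
zcat "$tmp/sample_aS.gz" | LC_ALL=C sort -u > "$tmp/from_aS"

# comm -13 keeps lines found only in the second file:
# authors that have no commit in the map.
missing=$(LC_ALL=C comm -13 "$tmp/from_a2c" "$tmp/from_aS")
echo "$missing"
rm -rf "$tmp"
```

Byte-wise sorting on both sides keeps the two lists in the same order, so any remaining differences should come from the data rather than from locale handling.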
The reason I used awk in the first place was to account for any edge case where an author ID might contain a semicolon (;).
But I ran grep on the output of zcat /da0_data/basemaps/gz/aS.s and it doesn't appear that any author IDs contain a semicolon.
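A check along these lines could look like the sketch below. The sample file `sample_aS.gz` and its contents are hypothetical; on the real data the input would be `/da0_data/basemaps/gz/aS.s`. `LC_ALL=C` is again an assumption, added so that grep works on raw bytes.

```shell
#!/bin/sh
# Hypothetical sample standing in for aS.s: one author ID per line.
tmp=$(mktemp -d)
printf 'Alice <a@x.org>\nBob Smith <b@x.org>\n' | gzip > "$tmp/sample_aS.gz"

# grep -c prints the number of matching lines. It exits non-zero when
# nothing matches, so "|| true" keeps the script going under "set -e".
with_semicolon=$(zcat "$tmp/sample_aS.gz" | LC_ALL=C grep -c ';' || true)
echo "author IDs containing ';': $with_semicolon"
rm -rf "$tmp"
```

A count of 0 would mean that splitting on `;` with cut is safe for this file.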
Author ids have ; replaced with space when extracted from the commit: see woc.pm:265
Thanks, https://github.com/ssc-oscar/lookup/blob/98e9aaee0c0f47f02049de40cd291fd393e9515d/woc.pm#L246 (for anybody else that might read this issue).
It also looks like ; is replaced with a space in commit messages, which also lets me use cut with showCnt commit 2.
Really appreciate all the help you've given so far for understanding WoC. 👍
I counted the lines of the following output:
zcat /da0_data/basemaps/gz/a2cFullS.* | awk 'BEGIN{FS=OFS=";"}{NF--; print}' | uniq
and got 47246245. According to https://bitbucket.org/swsc/overview/src/master/README.md there should be 47247366 author IDs.
Are there some author IDs not referencing a commit? It seems strange to me that there are 1121 fewer authors when I query the authors mapped to commits.