#count reviewers by lang and sectors by lang of the reviewers #778

sylvaticus opened 3 years ago

sylvaticus commented 3 years ago

In case you were ever curious: here are some statistics on JOSS reviewers, computed from the public reviewer list.

*** The 20 most "best known" languages...
- python         ( 68.74 %)
- r              ( 27.52 %)
- c++            ( 18.85 %)
- c              ( 13.91 %)
- matlab         ( 8.3 %)
- java           ( 7.26 %)
- fortran        ( 5.76 %)
- javascript     ( 4.79 %)
- julia          ( 4.71 %)
- bash           ( 3.07 %)
- go             ( 2.02 %)
- perl           ( 1.65 %)
- c#             ( 1.57 %)
- rust           ( 1.5 %)
- php            ( 1.5 %)
- ruby           ( 1.27 %)
- sql            ( 1.12 %)
- scala          ( 0.9 %)
- haskell        ( 0.82 %)
- cuda           ( 0.75 %)
*** The 20 most "known" languages...
- python         ( 79.43 %)
- r              ( 33.88 %)
- c++            ( 31.41 %)
- c              ( 27.3 %)
- matlab         ( 17.88 %)
- java           ( 16.45 %)
- javascript     ( 12.86 %)
- fortran        ( 10.62 %)
- julia          ( 8.45 %)
- bash           ( 6.36 %)
- perl           ( 4.49 %)
- php            ( 3.89 %)
- c#             ( 3.66 %)
- go             ( 3.14 %)
- rust           ( 2.99 %)
- ruby           ( 2.84 %)
- sql            ( 2.24 %)
- scala          ( 2.09 %)
- html           ( 1.72 %)
- haskell        ( 1.5 %)
*** The 4 most common sectors for the 10 most "known" languages...
python      :   machine learning, bioinformatics, physics, statistics, 
r           :   bioinformatics, machine learning, statistics, genomics, 
c++         :   machine learning, bioinformatics, physics, statistics, 
c           :   machine learning, bioinformatics, astrophysics, statistics, 
matlab      :   machine learning, image processing, statistics, physics, 
java        :   machine learning, bioinformatics, software engineering, data science, 
javascript  :   machine learning, bioinformatics, data science, statistics, 
fortran     :   physics, astrophysics, computational fluid dynamics, computational chemistry, 
julia       :   machine learning, statistics, physics, data science, 
bash        :   bioinformatics, genomics, machine learning, computational biology, 
Generated with the following code (Julia):

```julia
# Source: reviewer database of JOSS at
using OdsIO

# Loading data..
dataFile = "joss_reviewers_20200724.ods"
db = ods_read(dataFile,range=((4,2),(1340,9)))

# removing email
db = hcat(db[:,1:2],db[:,5:end])

# replacing "nothing"....
# ..with empty string in the first three columns...
for r in eachrow(db)
    for cidx in 1:3
        r[cidx] = isnothing(r[cidx]) ? "" : r[cidx]
    end
end
# ..and with zero in the number of reviews...
for r in eachrow(db)
    for cidx in 4:6
        r[cidx] = isnothing(r[cidx]) ? 0 : r[cidx]
    end
end

# Converting first 3 columns to string and last 3 to integers
db = convert(Array{Union{String,Int64},2},db)

# Cleaning..
for r in eachrow(db)
    for cidx in 1:3
        # ugly...
        r[cidx] = replace(replace(replace(replace(replace(r[cidx], '/'=>','), '('=>','), ')'=> ','), '\n'=> ',') , "and"=> ',') |> strip |> lowercase
        r[cidx] = replace(r[cidx],", " => ',') # to avoid empty data
        r[cidx] = replace(r[cidx]," ," => ',') # to avoid empty data
        r[cidx] = replace(r[cidx], r",$" => "") # remove ending comma
    end
end

# Establishing vocabularies
vocLangs = Set{String}()
vocActivities = Set{String}()
for (ridx,r) in enumerate(eachrow(db))
    ##if ridx > 20 break end
    for cidx in 1:2
        #=
        debug = strip.(split(r[cidx],','))
        for l in debug
            if l == ""
                println(l)
                println(ridx)
                println(cidx)
            end
        end
        =#
        if r[cidx] == "" continue end
        push!(vocLangs,strip.(split(r[cidx],','))...)
    end
    for cidx in 3:3
        if r[cidx] == "" continue end
        push!(vocActivities,strip.(split(r[cidx],','))...)
    end
end
vocLangs = collect(vocLangs)
vocActivities = collect(vocActivities)

# Lookup tables: language / activity name -> index in the count arrays
langIdx = Dict{String,Int64}()
[langIdx[l] = id for (id,l) in enumerate(vocLangs)]
actIdx = Dict{String,Int64}()
[actIdx[a] = id for (id,a) in enumerate(vocActivities)]

nLangs = length(vocLangs)
nActs = length(vocActivities)
nRecords = size(db,1)
preferredLangCount = zeros(Int64,nLangs)
competentLangCount = zeros(Int64,nLangs)
actCountByLang = zeros(Int64,nLangs,nActs)

# Let's count!
for r in eachrow(db)
    plangs = strip.(split(r[1],','))
    olangs = strip.(split(r[2],','))
    langs = union(Set(plangs),Set(olangs))
    acts = strip.(split(r[3],','))
    [preferredLangCount[langIdx[l]] += 1 for l in plangs if l != ""]
    [competentLangCount[langIdx[l]] += 1 for l in langs if l != ""]
    [actCountByLang[langIdx[l],actIdx[a]] += 1 for l in langs, a in acts if l != "" && a != ""]
end

# Let's report:
n = 20
println("*** The $n most \"best known\" languages...")
sortIdx = reverse(sortperm(preferredLangCount))[1:n]
[println("- $(rpad(vocLangs[i],12))\t ( $(round(100*preferredLangCount[i]/nRecords,digits=2)) %)") for i in sortIdx]

n = 20
println("*** The $n most \"known\" languages...")
sortIdx = reverse(sortperm(competentLangCount))[1:n]
[println("- $(rpad(vocLangs[i],12))\t ( $(round(100*competentLangCount[i]/nRecords,digits=2)) %)") for i in sortIdx]

n = 10
n2 = 4
println("*** The $n2 most common sectors for the $n most \"known\" languages...")
sortIdx = reverse(sortperm(competentLangCount))[1:n]
for i in sortIdx
    lang = vocLangs[i]
    sortIdxActs = reverse(sortperm(actCountByLang[i,:]))[1:n2]
    print("$(rpad(lang,12)): \t")
    [print("$(vocActivities[j]), ") for j in sortIdxActs]
    print("\n")
end
```

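
To see the counting step in isolation, here is a minimal sketch on three made-up rows (invented for illustration, not taken from the JOSS list). The helper names `tokens` and `share_known` are hypothetical and not part of the script above. It mirrors how the "known" percentages are computed: a language counts once per reviewer if it appears in either the preferred or the other-languages column, and the count is divided by the number of reviewers. This is also why the percentages in the report add up to more than 100 %.

```julia
# Toy rows (invented for illustration, NOT the JOSS data):
# first field = preferred languages, second field = other languages
rows = [
    ("python,r", "c++"),
    ("python",   "python,bash"),
    ("julia",    ""),
]

# Split a comma-separated cell into clean, non-empty tokens
tokens(cell) = filter(!isempty, String.(strip.(split(cell, ','))))

# Percentage of rows that list each language in either column
function share_known(rows)
    counts = Dict{String,Int}()
    for (preferred, other) in rows
        # union() so a language listed in both columns counts once per reviewer
        for lang in union(tokens(preferred), tokens(other))
            counts[lang] = get(counts, lang, 0) + 1
        end
    end
    return Dict(l => round(100 * c / length(rows), digits = 2) for (l, c) in counts)
end

shares = share_known(rows)
for (lang, pct) in sort(collect(shares), by = last, rev = true)
    println("- $(rpad(lang, 12)) ( $pct %)")
end
```

Filtering out empty tokens up front avoids the `l != ""` guards needed in the original loops, at the cost of one extra pass per cell.
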
arfon commented 3 years ago

✨ thanks @sylvaticus! / cc @diehlpk who has been looking at the breakdown of languages of papers we've reviewed too.

diehlpk commented 3 years ago

OK, it would be interesting to add these to the paper and compare them with the programming languages used in the repos.