openeuler-riscv / oerv-team

OERV intern work center

Measure the proportion of community contributions made by the Institute of Software (ISCAS) #910

Open jiewu9823 opened 1 month ago

jiewu9823 commented 1 month ago

Task:

Write a Python script that crawls all pull requests (over all time) under gitee.com/openeuler and gitee.com/src-openeuler that carry the keyword riscv or risc-v (case-insensitive), collects the contributor email for each of those PRs, and computes the proportion of those emails with the iscas.ac.cn suffix.

Requirements:

1. Generate an Excel spreadsheet whose columns include owner, repo, status, created date, owner email, and belong to iscas.
2. Attach the generated Excel spreadsheet and the Python script in a comment on this issue, or post links to them in a comment.
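The percentage itself reduces to a suffix check over the collected emails. A minimal sketch of that final step, with an illustrative rows list standing in for the crawled PR data:

# Illustrative input only; in practice each row comes from a crawled PR.
rows = [
    {"owner": "alice", "email": "alice@iscas.ac.cn"},
    {"owner": "bob", "email": "bob@example.com"},
]

iscas = sum(1 for r in rows if r["email"].lower().endswith("iscas.ac.cn"))
print(f"iscas ratio: {iscas}/{len(rows)} = {iscas / len(rows):.2%}")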

Jingwiw commented 1 month ago

The emails are definitely in the git commit log; you could try git cloning the repo of every riscv-related PR.
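A hedged sketch of that idea: once a repo is cloned, the author emails can be read straight out of its log (the path below is only an example):

import subprocess

def author_emails(repo_path):
    # %ae prints one author email per commit
    out = subprocess.run(
        ["git", "-C", repo_path, "log", "--format=%ae"],
        capture_output=True, text=True, check=True,
    )
    return set(out.stdout.splitlines())

print(author_emails("./temp_clone/some-repo"))  # example path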

Jvlegod commented 1 month ago

Progress report (not yet finished):

  1. The script is mostly done, but the data set is large and the git clone phase can take a long time (the script itself has already been optimized), so I am cloning the repos one by one (no better approach for now).
  2. The script should work against any repo.
  3. Remaining rough edges: emails are hard to obtain and have to be recovered from git log, so (a) some contributors' local git email may not be an iscas one and will need further manual screening, and (b) some contributors' local git name differs from their gitee name and also needs manual screening. Everyone who needs manual screening is marked 'unknown' in the spreadsheet (see the sketch after this list).
  4. I plan to make the script more flexible when I have time.
  5. I am currently crawling the data with the script.
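A small sketch of that manual screening pass, assuming one of the spreadsheets attached later in this thread and the column layout the script writes (Owner, Date, URL, Email, Status, belong):

from openpyxl import load_workbook

# Print every row marked 'unknown' so it can be checked by hand.
ws = load_workbook("./save/openEuler-riscv.xlsx").active  # example file
for row in ws.iter_rows(min_row=2, values_only=True):
    if row[-1] == "unknown":  # the 'belong' column
        print(row)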

Script code

import os
import sys
import git
from openpyxl import Workbook
import argparse
import re
import shutil
import subprocess
import requests
from bs4 import BeautifulSoup

# References
# https://blog.csdn.net/boheliang99/article/details/122348239

"""
需求:
    使用 python 编写脚本,去抓取 gitee.com/openeuler . gitee.com/src-openeuler 下面所有时间段 pr ,带有 riscv 或者 risc-v (不区分大小写)的关键词,以及所有这些 pr 的贡献邮箱,统计这些邮箱 带有 iscas.ac.cn 后缀的比例

要求:
    1.要生成一个excel表格,表格中的栏位包括 owner,repo,status,created date,owner email ,belong to iscas
    2.将生成的 excel 表格和编写的 python 脚本附在本issue的评论中或者将他们的链接写在评论里

安装包:
    pip install os-sys gitpython openpyxl argparse requests beautifulsoup4

使用方法:
    1. 获取网站的 cookies (重要)
    2. 配置好 ssh 密钥 (哪个用户使用就配置哪个用户的密钥)
    3. 补充 BASE_URL 和 COOKIES
    4. 在脚本目录下执行cg.py (需要管理员运行, 不然有些功能用不了)

过程文件:
    temp_clone: 保存git clone文件的目录
    work_dir: 产生中间文件的目录, 目前有txt文本文件, 用来到处excel
    save: 结果生成文件

结果:
    生成 save 目录, excel 就在其中
"""

usages = """
Usage: python[3] cg.py -s 正则表达式 --url gitee的repo地址 [-d] [-o] [-h] [-s] [-p] [-n]
Options:
    -d,--delete     删除中间生成的文件
    -o,--output     输出中间过程
    -h,--help       展示帮助文档
    -s,--search     匹配内容, default=riscv
    -p,--pages      爬取repo时pull rquests的页数, default=600
    -n,--names      项目总repo的名字, default=openEuler
    --url           repo的路径 default=https://gitee.com/organizations/openeuler/pull_requests

Example:
    command:
    python[3] cg.py
    python[3] cg.py -s riscv --url https://gitee.com/organizations/openeuler/pull_requests

    help:
    python[3] -h
"""

#########################################################
# project
#########################################################
PRJ_NAME = "openEuler"
#########################################################
# base url
#########################################################
BASE_URL = "https://gitee.com"
REPO_URL = f"{BASE_URL}/organizations/openeuler/projects"
TASK_URL = f"{BASE_URL}/organizations/openeuler/issues"
PR_URL = f"{BASE_URL}/organizations/openeuler/pull_requests"
EVENTS_URL = f"{BASE_URL}/organizations/openeuler/events" 
MEMBERS_URL = f"{BASE_URL}/organizations/openeuler/members/list"
TARGET_URL = "" # 搜寻url
#########################################################
# git url
#########################################################
GIT_BASE_URL = "git@gitee.com"
GIT_URL = f"{GIT_BASE_URL}:"
#########################################################
# grep attr
#########################################################
ATTR_URL = "?" # 查询属性

# pull requests 有的元素
ATTR_STATUS = "status=" # [all, open, merged, closed]

# 共有的元素
ATTR_SEARCH = "search=" # 查询内容, 可用正则表达式
ATTR_PAGE = "page=" # [1, 2, 3, ..., n]
#########################################################
# temp file
#########################################################
TEMP_CLONE_DIR = ""
#########################################################
# excel info
#########################################################
EXCEL_WORK_DIR = "./work_dir"
EXCEL_TITLE = "pull requests"
EXCEL_FILE_NAME = ""
EXCEL_WORK_FILE = ""
EXCEL_HEADER_INFO = ['Owner', 'Date', 'URL', 'Email', 'Status', 'belong']
#########################################################
# result
#########################################################
RES_DIR = "./save"
#########################################################
# options parameters
#########################################################
OPTION_PAGE_NUM = 600 # number of pull_requests pages to crawl
OPTION_DELETE = False # flag for -d
OPTION_OUTPUT = False # flag for -o
START_PAGE = 1
#########################################################
# login's cookies
#########################################################
COOKIES = {
    # paste your own logged-in gitee.com cookie here (see "How to add cookies" below)
    '_gitee_session': "<your _gitee_session cookie value>"
}
#########################################################
# elements
#########################################################
elements = ['name', 'date', 'repo', 'status', 'email', 'belong']
#########################################################
# notices
#########################################################
WARNING = '[WARNINGS]'
ERROR = '[ERROR]'
INFO = '[INFO]'

# no login flow; authentication relies on the cookies supplied above
def create_session(cookies):
    # create a session object
    session = requests.Session()
    session.cookies.update(cookies)
    return session

def fetch_html(session, url):
    """
    Fetch the HTML at url using the given session.

    Parameters:
    - session: a requests.Session() object carrying the login state
    - url: the page URL to fetch

    Returns:
    - the page text on success, None on failure
    """
    try:
        # send a GET request with the configured cookies
        response = session.get(url)
        response.raise_for_status()
        return response.text
    except requests.RequestException as e:
        print(f"{ERROR}Error fetching {url}: {e}")
        return None

def parse_html_sel(html, attr, val, ele=''):
    """
    Select content from the HTML.

    Parameters:
    - html: the HTML text
    - ele: the HTML element name
    - attr: an attribute of ele
    - val: the value of attr

    Returns:
    - the selected HTML as text
    - the list of selected tags
    """
    soup = BeautifulSoup(html, 'html.parser')
    panel_lists = soup.select(f'{ele}[{attr}={val}]')
    return '\n'.join([str(tag) for tag in panel_lists]), panel_lists

def parse_html_ctx(html, ele, attr=None, val=None):
    """
    Parse the HTML and collect the matching text into a list.

    Parameters:
    - html: the HTML text
    - ele: the HTML element name
    - attr: an attribute of ele
    - val: the value of attr

    Returns:
    - the collected text joined with newlines
    - the list of collected strings

    States:
    - PARSE_GET_ALL: keep the whole element
    - PARSE_GET_ATTR: extract the element's attribute value
    - PARSE_GET_CONTENT: extract the element's text content
    """
    PARSE_GET_ALL = 1
    PARSE_GET_ATTR = 2
    PARSE_GET_CONTENT = 3

    soup = BeautifulSoup(html, 'html.parser')
    t_text_lists = []
    text_lists = []
    elements = []

    parse_state = None

    # keep whole elements
    if attr is None:
        elements = soup.find_all(ele)
        parse_state = PARSE_GET_ALL

    # extract an attribute value
    elif val is None:
        elements = soup.find_all(ele)
        parse_state = PARSE_GET_ATTR

    # extract the text content
    else:
        if attr == "class_":
            elements = soup.find_all(ele, class_=val)
        parse_state = PARSE_GET_CONTENT

    for element in elements:
        text = None
        if parse_state == PARSE_GET_ALL:
            text = str(element)  # keep the raw markup; None here would break the join below
        elif parse_state == PARSE_GET_ATTR:
            text = element.get(attr)
            text = text.strip()
        elif parse_state == PARSE_GET_CONTENT:
            text = element.text.strip()

        t_text_lists.append(text)  # collect the text

    text_lists.extend(t_text_lists)

    return '\n'.join(t_text_lists), text_lists

def build_base_url(*args):
    return "/".join(filter(None, args))

def build_attr_url(base_url, *args):
    """
    Build a URL with query parameters.

    Parameters:
    - base_url: the base URL
    - *args: a variable number of "key=value" strings

    Returns:
    - the assembled URL string
    """

    query_params = "&".join(filter(None, args))  # join all parameters into one string
    full_url = f"{base_url}?{query_params}"  # assemble the full URL

    return full_url

def extract_url_name(url, num):
    """
    Extract one of the '/'-separated fields of a URL.

    Parameters:
    - url: the URL string
    - num: the index of the field to extract
    """
    parts = url.split('/')
    field = parts[num]
    return field

# An earlier variant used subprocess because the clone can be interrupted mid-run:
# def git_clone(repo_url, clone_to_path):
#     """
#     Clone a repository with git clone and check whether it succeeded.

#     Parameters:
#     - repo_url: the repository URL to clone
#     - clone_to_path: the local path to clone into
#     """
#     try:
#         # check whether the target path already exists, to avoid conflicts
#         if os.path.exists(clone_to_path):
#             print(f"{WARNING}Path {clone_to_path} already exists.")
#             return True

#         print(f"Cloning repository {repo_url} to {clone_to_path}...")

#         # run the git clone command through subprocess
#         result = subprocess.run(['git', 'clone', repo_url, clone_to_path], check=True)

#         print(f"Successfully cloned to {clone_to_path}")
#     except subprocess.CalledProcessError as e:
#         print(f"{WARNING}Clone failed: {e}")
#         return False
#     except KeyboardInterrupt:
#         print(f"{ERROR}Clone operation interrupted by user")
#         sys.exit(1)
#     return True

class CloneException(Exception):
    pass

def git_clone(repo_url, clone_to_path, try_times=0):
    """
    Clone repo_url into clone_to_path, retrying up to try_times extra times;
    skip the clone if the target directory already exists.
    """
    # remember the current working directory
    cur_path = os.getcwd()
    while try_times >= 0:
        try:
            return_code = 1
            # directory the repo will be cloned into
            clone_path = clone_to_path
            clone_file_path = f"{clone_path}/{repo_url.split('/')[1][:-4]}"

            # create the clone directory if it does not exist
            if not os.path.exists(clone_path):
                os.makedirs(clone_path)

            # does the target repo directory already exist?
            if os.path.exists(clone_file_path):
                return_code = 0

            # switch into the clone directory
            os.chdir(clone_path)

            # run git clone and keep its exit status
            if return_code != 0:
                clone_command = f"git clone {repo_url}"
                return_code = os.system(clone_command)

            # switch back to the original working directory
            os.chdir(cur_path)

            # inspect the exit status to decide whether the clone succeeded
            if return_code == 0:
                print(f"{INFO}Successfully cloned repo: \'{repo_url.split(':')[1][:-4]}\' to \'{clone_path}\'")
                break
            else:
                # check whether the failure was caused by an existing directory
                if os.path.exists(clone_file_path):
                    print(f"{WARNING}Failed to clone \'{repo_url.split(':')[1][:-4]}\' to \'{clone_path}\'. The directory already exists.")
                    try_times = -1
                else:
                    try_times -= 1
                    if try_times >= 0:
                        print(f"{WARNING}Clone failed for some reason, trying again.")
                    else:
                        raise CloneException("Clone failed for some reason, please rerun the program.")
        except CloneException as e:
            print(f"{ERROR}{e}")
        except Exception as e:
            print(f"Unexpected error: {e}")
            break

def create_directory(directory_name):
    """
    Create a directory.

    Parameters:
    - directory_name: the directory path/name
    """
    try:
        # create the directory with os.makedirs
        os.makedirs(directory_name)
        print(f"Successfully created directory: \'{directory_name}\'")
    except FileExistsError:
        print(f"{WARNING}Directory \'{directory_name}\' already exists")
        return False

    return True

def git_log(repo_path):
    """
    Run git log in the given repository path.
    """
    repo = git.Repo(repo_path)
    log = repo.git.log()
    return log

def extract_git_log(repo_path, pattern_line, pattern_word=None):
    """
    Extract keyword information from the git log output; pattern_word is then
    matched inside each matched pattern_line to produce the second return value.

    Parameters:
    - repo_path: local path of the repository
    - pattern_line: regular expression matching whole lines
    - pattern_word: regular expression matching a keyword inside those lines
      (if omitted, only lines are matched and the second return value is unused)

    Returns:
    - logs: all lines matching pattern_line
    - key: the first pattern_word match found among those lines
    """

    log_output = git_log(repo_path)
    logs = set()
    key = None
    if log_output:
        lines = log_output.splitlines()
        for line in lines:
            # case-insensitive
            if re.search(pattern_line, line, flags=re.IGNORECASE):
                logs.add(line.strip())
    else:
        return [], None

    if pattern_word is not None:
        for log in logs:
            match = re.search(pattern_word, log, flags=re.IGNORECASE)
            if match:
                key = match.group(1)
                break

    return list(logs), key

def get_info(session):

    """
    Crawl the information behind the target URL.
    """

    global OPTION_PAGE_NUM
    lists = {ele: [] for ele in elements}

    # 1. Crawl the data:
    # walk every page of pull_requests
    # and collect name, date, repo.

    # First probe the real upper bound of OPTION_PAGE_NUM.
    for i in range(START_PAGE, OPTION_PAGE_NUM + 1):
        ATTR_PAGE = f"page={i}"
        tr_url = build_attr_url(TARGET_URL, ATTR_PAGE, ATTR_SEARCH, ATTR_STATUS)
        tr_html = fetch_html(session, tr_url)

        if tr_html:
            panel_lists = parse_html_sel(tr_html, 'class', 'panel-list')[1]
            if not panel_lists:
                OPTION_PAGE_NUM = i - 1
                break

    for pg_num in range(START_PAGE, OPTION_PAGE_NUM + 1):
        ATTR_PAGE = f"page={pg_num}"
        print(f"------------------------------------------start page {pg_num}'s action------------------------------------------")

        tr_url = build_attr_url(TARGET_URL, ATTR_PAGE, ATTR_SEARCH, ATTR_STATUS)
        tr_html = fetch_html(session, tr_url)

        if tr_html:
            exec_num = 1
            panel_lists = parse_html_sel(tr_html, 'class', 'panel-list')[1]

            if not panel_lists:
                break

            loads_num = len(panel_lists)
            # scan panel by panel, one repo per panel
            for panel_list in panel_lists:
                t_lists = {ele: [] for ele in elements}

                t_lists['name'] = parse_html_ctx(str(panel_list), "span", "class_", "author")[1]
                t_lists['date'] = parse_html_ctx(str(panel_list), "span", "class_", "timeago latest-timeago")[1]
                t_lists['repo'] = parse_html_ctx(str(panel_list), "check-runs", "pull-request-check-path")[1]

                # Collect each PR's status and email.
                # Every entry in this loop points at the same repo,
                # so one git clone per panel is enough.
                if_need_clone = True
                for repo_url, t_name in zip(t_lists['repo'], t_lists['name']):
                    repo_url = repo_url.split('?')[0]
                    pull_url = build_base_url(BASE_URL, repo_url[1:])
                    git_url = build_base_url(GIT_URL, f'{extract_url_name(repo_url, -3)}.git')
                    pull_html = fetch_html(create_session(COOKIES), pull_url)
                    repo_name = extract_url_name(repo_url, -3)
                    repo_path = build_base_url(TEMP_CLONE_DIR, repo_name)

                    if if_need_clone:
                        if_need_clone = False
                        git_clone(git_url, TEMP_CLONE_DIR, 2)

                    # After cloning, recover the email from git log:
                    # lines starting with Author and ending in iscas.ac.cn.
                    t_email = extract_git_log(repo_path, rf"^Author:\s*{re.escape(t_name)}\s*.*iscas\.ac\.cn>$", "<(.*?)>")[1]
                    if t_email is not None:
                        t_lists['email'].append(t_email)
                        t_lists['belong'].append("belong to iscas")
                    else:
                        # otherwise match whatever email git log records for this author
                        t_email = extract_git_log(repo_path, rf"^Author:\s*{re.escape(t_name)}\s*<(.*?)>$", "<(.*?)>")[1]
                        if t_email:
                            t_lists['email'].append(t_email)
                            t_lists['belong'].append("unknown")
                        else:
                            # nothing matched at all
                            t_lists['email'].append("unknown")
                            t_lists['belong'].append("unknown")

                    t_lists['status'] = parse_html_ctx(pull_html, "div", "class_", "labels-group d-inline-flex mr-05")[1]

                    # append to the result lists
                    lists['status'].extend(t_lists['status'])
                    lists['belong'].extend(t_lists['belong'])
                    t_lists['status'].clear()
                    t_lists['belong'].clear()

                # append to the result lists
                lists['name'].extend(t_lists['name'])
                lists['date'].extend(t_lists['date'])
                lists['repo'].extend(t_lists['repo'])
                lists['email'].extend(t_lists['email'])

                # progress report
                print(f'------------------------finished page {pg_num}\'s ({exec_num}/{loads_num}); {OPTION_PAGE_NUM - pg_num} pages remaining------------------------')
                exec_num += 1

                if OPTION_OUTPUT:
                    for name, date, repo_url, email, state, belong in zip(lists['name'], lists['date'], lists['repo'], lists['email'], lists['status'], lists['belong']):
                        repo_url = repo_url.split('?')[0]
                        print(f"{name}: {date}, https://gitee.com{repo_url}, {email}, {state}, {belong}")

    # 2. Write the data out to a text file
    create_directory(EXCEL_WORK_DIR)
    output_file_path = f'{EXCEL_WORK_DIR}/{EXCEL_WORK_FILE}.txt'
    if os.path.exists(output_file_path):
        os.remove(output_file_path)
    with open(output_file_path, 'a', encoding='utf-8') as file:
        for name, date, repo_url, email, state, belong in zip(lists['name'], lists['date'], lists['repo'], lists['email'], lists['status'], lists['belong']):
            repo_url = repo_url.split('?')[0]
            line = f"{name}, {date}, https://gitee.com{repo_url}, {email}, {state}, {belong}\n"
            file.write(line)

    lists_info = []
    try:
        with open(output_file_path, 'r', encoding='utf-8') as file:
            lists_info = file.readlines()
    except FileNotFoundError:
        print(f"File '{output_file_path}' not found.")
        return

    # 3. Export to Excel
    wb = Workbook()
    ws = wb.active
    ws.title = EXCEL_TITLE

    ws.append(EXCEL_HEADER_INFO)

    for line in lists_info:
        parts = line.split(', ')
        ws.append(parts[:6])

    create_directory(RES_DIR)
    excel_file = f'{RES_DIR}/{EXCEL_FILE_NAME}.xlsx'
    if os.path.exists(excel_file):
        os.remove(excel_file)
    wb.save(excel_file)

    # 4. Clean up
    if OPTION_DELETE:
        shutil.rmtree(EXCEL_WORK_DIR)  # a directory needs rmtree, not os.remove
        # shutil.rmtree(TEMP_CLONE_DIR)

def main():
    global ATTR_SEARCH
    global ATTR_STATUS
    global ATTR_PAGE
    global TARGET_URL
    global OPTION_DELETE
    global OPTION_OUTPUT
    global OPTION_PAGE_NUM
    global EXCEL_FILE_NAME
    global EXCEL_WORK_FILE
    global TEMP_CLONE_DIR
    global GIT_URL
    global PRJ_NAME
    global START_PAGE

    parser = argparse.ArgumentParser(description='Collect contribution statistics for a gitee repo', formatter_class=argparse.RawTextHelpFormatter)

    parser.add_argument('-s', '--search', default='riscv', help='pattern to match, default=riscv')
    parser.add_argument('-p', '--pages', default=OPTION_PAGE_NUM, type=int, help='number of pull request pages to crawl, default=600')
    parser.add_argument('--url', default=PR_URL, help='repo URL, default=https://gitee.com/organizations/openeuler/pull_requests')
    parser.add_argument('-d', '--delete', action='store_true', help='delete the intermediate files')
    parser.add_argument('-o', '--output', action='store_true', help='print intermediate progress')
    parser.add_argument('-f', '--from_', type=int, default=START_PAGE, help='page number to start from, default=1')
    parser.add_argument('-n', '--name', default=EXCEL_FILE_NAME, help='name of the umbrella repo')

    args = parser.parse_args()  # argparse handles -h/--help itself

    if args.delete:
        OPTION_DELETE = True
    if args.output:
        OPTION_OUTPUT = True

    # initialization
    ATTR_SEARCH += args.search
    TARGET_URL = args.url
    OPTION_PAGE_NUM = args.pages
    PRJ_NAME = f'{args.name}'
    EXCEL_FILE_NAME = f"{PRJ_NAME}-{args.search}"
    EXCEL_WORK_FILE = EXCEL_FILE_NAME
    TEMP_CLONE_DIR = f"./temp_clone_{PRJ_NAME}"
    GIT_URL += TARGET_URL.split('/')[-2]
    START_PAGE = args.from_
    ATTR_STATUS += "all"

    session = create_session(COOKIES)
    get_info(session)

if __name__ == "__main__":
    main()

How to add cookies

  1. Log in to www.gitee.com

    (screenshot)

  2. Obtain the cookies

    • Grab any cookie from the browser's developer tools

      (screenshot)

    • Copy the whole cookie string into COOKIES in the code, then start using the script

      (screenshot)
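A quick, hypothetical way to verify the pasted cookie before a long run: fetch the PR list and check that gitee does not bounce to the login page.

import requests

COOKIES = {'_gitee_session': "<value copied from devtools>"}  # placeholder

session = requests.Session()
session.cookies.update(COOKIES)
resp = session.get("https://gitee.com/organizations/openeuler/pull_requests")
print(resp.status_code, "login" in resp.url)  # expect: 200 False when logged in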

Jvlegod commented 1 month ago

1. Excel files

The keyword searches behind all of the statistics below are case-insensitive.

openEuler-all.xlsx: the combined statistics
openEuler-riscv.xlsx: openEuler, keyword riscv
openEuler-risc-v.xlsx: openEuler, keyword risc-v
src-openEuler-riscv.xlsx: src-openEuler, keyword riscv
src-openEuler-risc-v.xlsx: src-openEuler, keyword risc-v

2. Data

| title | iscas headcount | total headcount | iscas ratio | iscas ratio (excluding robots) |
| -- | -- | -- | -- | -- |
| openEuler-riscv | 129 | 357 | 36.13% | 36.86% |
| openEuler-risc-v | 49 | 110 | 44.55% | 44.95% |
| src-openEuler-riscv | 384 | 707 | 54.31% | 84.03% |
| src-openEuler-risc-v | 24 | 73 | 32.88% | 42.86% |
| openEuler-all | 586 | 1247 | 46.93% | 60.29% |

3. Notes on the results

  1. Because local git identities and gitee identities do not always match, some people were screened manually, and a few may still be impossible to verify as belong to iscas. A more complete result would only add entries, never remove them.
  2. The run needs to clone repos, and everything cloned is kept in a temp directory. A clone can fail for network reasons and force a script restart; I already retry a failed clone a few times, but it can still fail. Repos that fail to clone can be cloned manually inside the temp directory, or the script can be restarted. The script accepts a starting page number, so a restart does not have to begin from page one, saving time (see the example after this list). Once cloning is done, as long as the temp directory is kept, already-cloned repos are detected and not cloned again. The current script is still rough; I will polish and extend it when needed or when I have time.
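For example, a run resumed from a later page (page 42 is only an illustration) would look like:

    python3 cg.py --url https://gitee.com/organizations/openeuler/pull_requests -n openEuler -s riscv -f 42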

4. Commands used for this run

    python3 cg.py --url https://gitee.com/organizations/openeuler/pull_requests -n openEuler -s riscv

    python3 cg.py --url https://gitee.com/organizations/openeuler/pull_requests -n openEuler -s risc-v

    python3 cg.py --url https://gitee.com/organizations/src-openeuler/pull_requests -n src-openEuler -s riscv

    python3 cg.py --url https://gitee.com/organizations/src-openeuler/pull_requests -n src-openEuler -s risc-v

Jingwiw commented 1 month ago

It looks like the total headcount in the statistics should exclude the robot accounts, and only the robot-free results should be shown in the end. Also, the riscv and risc-v statistics should be merged into one.
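A minimal sketch of that merge, assuming the spreadsheets above, the column layout the script writes (Owner, Date, URL, Email, Status, belong), and a hypothetical name-based robot check:

from openpyxl import load_workbook

# Merge the riscv and risc-v sheets and drop robot rows.
rows = {}
for path in ("./save/openEuler-riscv.xlsx", "./save/openEuler-risc-v.xlsx"):
    ws = load_workbook(path).active
    for row in ws.iter_rows(min_row=2, values_only=True):
        owner, date, url, email, status, belong = row
        if "robot" in str(owner).lower():  # hypothetical robot filter
            continue
        rows[(url, owner)] = row  # dedupe PRs matched by both keywords

print(f"{len(rows)} unique non-robot PR rows after merging")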

Jingwiw commented 1 month ago

What is the definition of the headcount here: is what's actually being counted the number of PRs contributed by the Institute of Software, or the number of its contributors? The statistics need both the number of contributed PRs and the number of contributors.
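Both numbers fall out of the same rows; continuing the merge sketch above:

# one row per PR; contributors are the unique owners
iscas_rows = [r for r in rows.values() if str(r[3]).endswith("iscas.ac.cn")]
print(f"iscas PRs: {len(iscas_rows)}, iscas contributors: {len({r[0] for r in iscas_rows})}")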

Jvlegod commented 1 month ago

I've reorganized the numbers as requested; does this look OK?

Statistics

  1. Overall

| title | count |
| -- | -- |
| robot-keyword accounts | 3 |
| robot-keyword PRs | 275 |
| iscas contributors | 43 |
| iscas PRs | 598 |
| total contributors | 155 |
| total PRs | 1247 |
| iscas contributor ratio (excluding robots) | 28.29% |
| iscas PR ratio (excluding robots) | 61.52% |

  2. openEuler

| title | count |
| -- | -- |
| robot-keyword accounts | 3 |
| robot-keyword PRs | 8 |
| iscas contributors | 31 |
| iscas PRs | 187 |
| total contributors | 119 |
| total PRs | 467 |
| iscas contributor ratio (excluding robots) | 26.72% |
| iscas PR ratio (excluding robots) | 40.74% |

  3. src-openEuler

| title | count |
| -- | -- |
| robot-keyword accounts | 2 |
| robot-keyword PRs | 267 |
| iscas contributors | 30 |
| iscas PRs | 408 |
| total contributors | 71 |
| total PRs | 780 |
| iscas contributor ratio (excluding robots) | 43.48% |
| iscas PR ratio (excluding robots) | 79.53% |

Excel files

openEuler-all.xlsx: everything
src-openEuler.xlsx: src-openEuler
src-openEuler-riscv.xlsx: src-openEuler, keyword riscv
src-openEuler-risc-v.xlsx: src-openEuler, keyword risc-v
openEuler.xlsx: openEuler
openEuler-riscv.xlsx: openEuler, keyword riscv
openEuler-risc-v.xlsx: openEuler, keyword risc-v

Jingwiw commented 1 month ago

OK, this is the overall result; I'd still like each of the separate statistics shown as well.

Jvlegod commented 1 month ago

> OK, this is the overall result; I'd still like each of the separate statistics shown as well.

I've updated it; does it look OK now?