
PeaTMOSS-Demos - database of real-world uses of Pre-Trained Models. #754

Open irthomasthomas opened 2 months ago


PeaTMOSS-Demos

This repository contains information about the Pre-Trained Models in Open-Source Software (PeaTMOSS) dataset.


About

This repository contains a zipped sample of the PeaTMOSS dataset and a script that demonstrates possible interactions with the SQLite database that stores the metadata. The complete PeaTMOSS dataset contains snapshots of Pre-Trained machine learning Model (PTM) repositories and of the downstream open-source GitHub repositories that reuse the PTMs, metadata about the PTMs, the pull requests and issues of the GitHub repositories, and links between the downstream GitHub repositories and the PTMs. The schema of the SQLite database is specified by PeaTMOSS.py and PeaTMOSS.sql. The sample of the database is PeaTMOSS_sample.db. The full database and all captured repository snapshots are available through the Globus Share described below.

- Note: When unzipping .tar.gz snapshots, include the flag --strip-components=4 in the tar statement, like so:

tar --strip-components=4 -xvzf {name}.tar.gz

If you do not do this, you will have 4 extraneous parent directories that encase the repository.
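For readers who unpack snapshots from a script rather than a shell, the same idea can be sketched with Python's standard tarfile module. This is only an illustration: the function name extract_stripped is hypothetical, it is not part of the PeaTMOSS scripts, and it may not match tar's behaviour for every archive (link targets inside the archive, for example, are not rewritten).

```python
import tarfile
from pathlib import Path, PurePosixPath


def extract_stripped(archive: str, dest: str, strip: int = 4) -> list[str]:
    """Extract a .tar.gz archive into dest, dropping the first `strip` path components."""
    extracted = []
    dest_root = Path(dest).resolve()
    with tarfile.open(archive, "r:gz") as tar:
        for member in tar.getmembers():
            parts = PurePosixPath(member.name).parts
            if len(parts) <= strip:
                continue  # the entry is one of the wrapper directories itself
            member.name = str(PurePosixPath(*parts[strip:]))
            # Skip any entry that would resolve to a location outside dest.
            target = (dest_root / member.name).resolve()
            if dest_root not in target.parents:
                continue
            tar.extract(member, path=dest_root)
            extracted.append(member.name)
    return extracted
```

Calling extract_stripped("{name}.tar.gz", "snapshots") would place the repository contents directly under snapshots/ instead of under four nested wrapper directories.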

Globus

Globus Share

All zipped repos and the full metadata dataset are available through Globus Share.

If you do not have an account, follow the Globus docs on how to sign up. You may create an account through a partnered organization if you are a part of that organization, or through Google or ORCID accounts.

Globus Connect Personal

To access the metadata dataset using the globus.py script provided in the repository:

  1. Download Globus Connect Personal
  2. Create your own private Globus collection on Mac, Windows, or Linux
  3. Once this is created, make sure Globus Connect Personal is running before executing globus.py

NOTE: In some cases, you may run into permission issues on Globus when running the script. If so, replace local_endpoint.endpoint_id, located on line 29 of globus.py, with your private collection's UUID:

local_endpoint_id = local_endpoint.endpoint_id

To locate your private collection's UUID, click on the Globus icon on your taskbar and select "Web: Collection Details". On this page, scroll down to the bottom where the UUID field for your collection should be visible, and replace the variable with your collection's UUID expressed as a string. Then, use the activities tab to terminate the existing transfer and rerun globus.py.
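As a small illustration of "expressed as a string", the sketch below checks that a pasted value is a well-formed UUID before it is assigned. The helper name normalize_collection_id and the UUID shown are placeholders invented for this example; they do not come from globus.py.

```python
import uuid


def normalize_collection_id(raw: str) -> str:
    """Return the UUID in canonical lowercase form, or raise ValueError if it is malformed."""
    return str(uuid.UUID(raw.strip()))


# Placeholder value for illustration only; substitute the UUID shown on
# the "Web: Collection Details" page for your own collection.
local_endpoint_id = normalize_collection_id("01234567-89AB-CDEF-0123-456789ABCDEF")
```

A copy-and-paste mistake such as a missing character then fails immediately with a ValueError instead of surfacing later as a Globus error.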

Metadata Description

The following model hubs are captured in our database: Hugging Face and PyTorch Hub.

The content for each model hub is listed in the table below:

| Model hub | #PTMs | #Snapshotted Repos | #Discussions (PRs, issues) | #Links | Size of Zipped Snapshots |
| --- | --- | --- | --- | --- | --- |
| Hugging Face | 281,276 | 14,899 | 59,011 | 30,514 | 44TB |
| PyTorch Hub | 362 | 361 | 52,161 | 13,823 | 1.3GB |

We also offer two formats of our dataset to facilitate the mining challenge for participants. An overview of the two formats can be found in the table below:

| Format | Description | Size |
| --- | --- | --- |
| Metadata | Contains only the metadata of the PTM packages and a subset of the GitHub project metadata. | 7.12GB |
| Full | Contains all metadata, plus the PTM package contents in each published version and the git history of the main branches of the GitHub projects. | 48.2TB |

Dependencies

The scripts in the project depend upon the following software:

- Python 3.11
- SQLAlchemy v2.0 or greater

Package dependencies are given in environment.yml and handled by Anaconda.

How To Install

To run the scripts in this project, you must install Python 3.11 and SQLAlchemy v2.0 or greater.

These packages can be installed using the Anaconda environment manager:

  1. Install the latest version of Anaconda from here
  2. Run conda env create -f environment.yml to create the Anaconda environment PeaTMOSS
  3. Activate the environment using conda activate PeaTMOSS

Alternatively, you can navigate to each package's respective page and install it from there.

How to Run

After creating and activating the Anaconda environment, each demo script can be run using python3 script_name.py

Tutorial

This section will explain how to use SQL and SQLAlchemy to interact with the database to answer the research questions outlined in the proposal.

Using SQL to query the database

One option users have to interact with the metadata dataset is to use plain SQL. The metadata dataset is stored in a SQLite database file called PeaTMOSS.db, which can be found in the Globus Share. This file can be queried with standard SQL from a terminal using sqlite3, the SQLite CLI. A single query can be executed like so:

$ sqlite3 PeaTMOSS.db '{query statement}'

Alternatively, you can start an SQLite instance by simply executing

$ sqlite3 PeaTMOSS.db

which can be terminated with CTRL + D or .quit. To write query results to a file, use the .output command:

sqlite> .output {filename}.txt
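The same workflow can also be driven from Python's built-in sqlite3 module, which may be convenient when query results feed further analysis. The sketch below is an illustration rather than part of the PeaTMOSS scripts: the function name run_query_to_file is hypothetical, and the output format is chosen to resemble the CLI's default list mode (values joined by "|").

```python
import sqlite3
from pathlib import Path


def run_query_to_file(db_path: str, query: str, out_path: str) -> int:
    """Run one query against a read-only connection and write one row per line."""
    # mode=ro opens the database read-only, so the query cannot modify the dataset.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(query).fetchall()
    finally:
        conn.close()
    with open(out_path, "w", encoding="utf-8") as out:
        for row in rows:
            out.write("|".join("" if v is None else str(v) for v in row) + "\n")
    return len(rows)


# Example call (assumes PeaTMOSS.db is in the working directory):
# run_query_to_file(
#     "PeaTMOSS.db",
#     "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name",
#     "tables.txt",
# )
```

The commented example lists the table names in the database, which is a quick way to get oriented before writing longer queries.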

Research Question Example (SQL)

The following example has to do with research question GH2: "What do developers on GitHub discuss related to PTM use, e.g., in issues, and pull requests? What are developers’ sentiments regarding PTM use? Do the people do pull requests of PTMs have the right expertise?"

If someone wants to observe what developers on GitHub are currently discussing related to PTM usage, they can look at discussions in GitHub issues and pull requests. The following SQLite example shows queries that would help accomplish this task.

  1. First, we will create an sqlite3 instance:

    $ sqlite3 PeaTMOSS.db
  2. Then, we will create an output file for our issues query, then execute that query:

    sqlite> .output issues.txt
    sqlite> SELECT id, title FROM github_issue WHERE state = 'OPEN' ORDER BY updated_at DESC LIMIT 100;

    Output: [Figure: Issues Query]

The above query selects the ID and Title fields from the github_issue table, and chooses the 100 most recent issues that are still open.

  3. Next, we will create an output file for our pull requests query, then execute that query:
    sqlite> .output pull_requests.txt
    sqlite> SELECT id, title FROM github_pull_request WHERE state = 'OPEN' OR state = 'MERGED' ORDER BY updated_at DESC LIMIT 100;

    Output: [Figure: Pull Requests Query]

Notice that the query is very similar to the issues query, as we are looking for similar information. The above query selects the ID and Title fields from the github_pull_request table, and chooses the 100 most recent pull requests that are either open or merged.

Querying this data can assist when beginning to observe current/recent discussions in GitHub about PTMs. From here, you may adjust these queries to include more/less entries by changing the LIMIT value, or you may adjust which fields the queries return. For example, if you want more detailed information you could select the "body" field in either table.
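The adjustments described above (changing the LIMIT value, the accepted states, or adding the "body" field) can be collected into one small helper. The following is a sketch using Python's sqlite3 module; the function name recent_discussions and its parameters are hypothetical, and only the table and column names already used in the queries above are assumed to exist.

```python
import sqlite3

# Table names cannot be bound as SQL parameters, so the helper only accepts
# the two tables used in the queries above.
ALLOWED_TABLES = {"github_issue", "github_pull_request"}


def recent_discussions(conn, table, states=("OPEN",), limit=100, with_body=False):
    """Return the most recently updated rows of `table` whose state is in `states`."""
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unexpected table: {table}")
    columns = "id, title, body" if with_body else "id, title"
    placeholders = ", ".join("?" for _ in states)
    sql = (
        f"SELECT {columns} FROM {table} "
        f"WHERE state IN ({placeholders}) "
        "ORDER BY updated_at DESC LIMIT ?"
    )
    # States and the limit are passed as bound parameters, not pasted into the SQL text.
    return conn.execute(sql, (*states, limit)).fetchall()
```

For instance, recent_discussions(conn, "github_pull_request", states=("OPEN", "MERGED"), limit=20, with_body=True) would mirror the pull requests query with a smaller limit and the body included, given a connection conn to PeaTMOSS.db.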

Using ORMs to query the database

This section will include more details about the demo provided in the repository, PeaTMOSS_demo.py. Once again, this method requires the PeaTMOSS.db file, which can be found in the Globus Share. Prior to running this demo, ensure that the conda environment has been created and activated, or you may run into errors.

The purpose of the demo, as described by the comment at the top of its file, is to demonstrate how one may use SQLAlchemy to address one of the research questions. The question being addressed in the demo is I1: "It can be difficult to interpret model popularity numbers by download rates. To what extent does a PTM’s download rates correlate with the number of GitHub projects that rely on it, or the popularity of the GitHub projects?". The demo accomplishes this by looking at two main fields: the number of times a model is downloaded from its model hub, and the number of times a model is reused in a GitHub repository. The demo finds the 100 most downloaded models, and finds how many times each of those models is reused. Users can take this information and attempt to find a correlation.

Research Question Example (ORM)

PeaTMOSS_demo.py utilizes PeaTMOSS.py, which describes the structure of the database so that we may interact with it using SQLAlchemy. To begin, you must create an SQLAlchemy engine using the database file:

import sqlalchemy
engine = sqlalchemy.create_engine(f"sqlite:///{path}")

where path is a string that describes the filepath to the database file. Both relative and absolute file paths can be used.
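One detail worth guarding against: SQLite creates a new, empty database file when the given path does not exist, so a mistyped path tends to show up later as "no such table" errors rather than as a missing-file error. The sketch below builds the engine URL only after checking that the file exists. The helper name sqlite_url is hypothetical and not part of the demo.

```python
from pathlib import Path


def sqlite_url(db_file: str) -> str:
    """Return an SQLAlchemy SQLite URL for an existing database file."""
    path = Path(db_file).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"database file not found: {path}")
    # An absolute POSIX path yields four slashes in total, e.g. sqlite:////home/user/PeaTMOSS.db
    return f"sqlite:///{path.as_posix()}"


# Intended use with the engine call shown above:
# engine = sqlalchemy.create_engine(sqlite_url("PeaTMOSS.db"))
```

Resolving the path first also means the engine does not depend on the working directory the script happens to be started from.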

To find the 100 most downloaded models, we will query the model table

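import sqlalchemy
from sqlalchemy.orm import Session
import PeaTMOSS  # the schema module is also referenced by name in the reuse query
from PeaTMOSS import Model

# Select the ID, name (context_id), and download count of the 100 most downloaded models.
query_name_downloads = sqlalchemy.select(Model.id, Model.context_id, Model.downloads).order_by(sqlalchemy.desc(Model.downloads)).limit(100)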

and execute the query in a session bound to the engine

session = Session(engine)
models = session.execute(query_name_downloads).all()

For each of these models, we want to know how many times it is being reused. The model_to_reuse_repository table contains fields for model IDs and reuse repository IDs, effectively linking them together. If a model is reused in multiple repositories, its ID will show up multiple times in the model_to_reuse_repository table. Therefore, to see whether these highly downloaded models are also highly reused, we can query the model_to_reuse_repository table and select only the entries where the model_id field equals the current model's ID:

for model in models:
    #...
    query_num_reuses = sqlalchemy.select(PeaTMOSS.model_to_reuse_repository.columns.model_id)\
                                  .where(PeaTMOSS.model_to_reuse_repository.columns.model_id == model.id)

This query selects every row of the model_to_reuse_repository table in which the current model's ID appears. If we execute this query and count the number of elements in the result, we have the number of times that model has been reused:

num_reuses = len(session.execute(query_num_reuses).all())
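The demo issues one query per model and counts rows in Python. As an alternative sketch, the counting can be pushed into the database with a single outer join and GROUP BY. The two Table objects below are stand-ins defined only so the example is self-contained; in practice the definitions in PeaTMOSS.py would be used instead. The column name reuse_repository_id is an assumption, as is the function name reuse_counts.

```python
import sqlalchemy
from sqlalchemy.orm import Session

# Stand-in schema for illustration only; the real definitions live in PeaTMOSS.py.
metadata = sqlalchemy.MetaData()
model = sqlalchemy.Table(
    "model", metadata,
    sqlalchemy.Column("id", sqlalchemy.Integer, primary_key=True),
    sqlalchemy.Column("context_id", sqlalchemy.String),
    sqlalchemy.Column("downloads", sqlalchemy.Integer),
)
model_to_reuse_repository = sqlalchemy.Table(
    "model_to_reuse_repository", metadata,
    sqlalchemy.Column("model_id", sqlalchemy.Integer),
    sqlalchemy.Column("reuse_repository_id", sqlalchemy.Integer),  # assumed column name
)


def reuse_counts(session, top_n=100):
    """Return (context_id, downloads, num_reuses) for the top_n most downloaded models."""
    top = (
        sqlalchemy.select(model.c.id, model.c.context_id, model.c.downloads)
        .order_by(sqlalchemy.desc(model.c.downloads))
        .limit(top_n)
        .subquery()
    )
    link = model_to_reuse_repository
    query = (
        sqlalchemy.select(
            top.c.context_id,
            top.c.downloads,
            # COUNT over the joined column yields 0 for models with no reuse rows.
            sqlalchemy.func.count(link.c.model_id),
        )
        .join_from(top, link, top.c.id == link.c.model_id, isouter=True)
        .group_by(top.c.id, top.c.context_id, top.c.downloads)
        .order_by(sqlalchemy.desc(top.c.downloads))
    )
    return [tuple(row) for row in session.execute(query).all()]
```

The outer join keeps models that are never reused in the result with a count of zero, which matters when looking for a correlation later.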

In each iteration of the loop we can store this information in dictionaries, where the keys can be the names of the models:

for model in models:
    highly_downloaded[model.context_id] = model.downloads
    #...
    #...
    reused_rates[model.context_id] = num_reuses

And then at the end, we can simply print the results. From there, users may observe a level of correlation using a method they see fit.

Download Results: [Figure: Download Results]

Reuse Results: [Figure: Reuse Results]
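The demo leaves the choice of correlation method to the user. As one possible starting point, the sketch below computes Pearson and Spearman coefficients from the two dictionaries using only the standard library. The function names are hypothetical, and Spearman is included as an assumption that a rank-based measure may suit heavily skewed download counts better. Neither is prescribed by the demo.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least two values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


def correlate(highly_downloaded, reused_rates):
    """Return (pearson, spearman) over the models present in both dictionaries."""
    names = [name for name in highly_downloaded if name in reused_rates]
    downloads = [highly_downloaded[name] for name in names]
    reuses = [reused_rates[name] for name in names]
    return pearson(downloads, reuses), spearman(downloads, reuses)
```

Calling correlate(highly_downloaded, reused_rates) after the loop returns both coefficients. Values near 1 suggest that highly downloaded models also tend to be highly reused, while values near 0 suggest little relationship.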
