openteamsinc / Score

BSD 2-Clause "Simplified" License

Generate pypi website scraper dataset #13

Open srossross opened 1 month ago

srossross commented 1 month ago

Scrape PyPI for all the information needed in the research phase that cannot be obtained from the PyPI JSON API.

The dataset output format and location should be documented.

This should be a new command, such as python -m score.cli pypi-web

karamba228 commented 1 month ago

Fields unique to the PyPI web scraper:

All fields gathered by the PyPI web scraper:

All fields are of type string except for releases.

To run the scraper:

python -m score.cli scrape-pypi-web --letter 0-9

Here scrape-pypi-web invokes the web scraper, and the optional --letter flag specifies the range of letters to scrape.
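
To illustrate the shape of that command line, here is a minimal sketch using argparse. It is not the project's actual CLI code: the parser layout, the help strings, and the default of None for an omitted --letter are all assumptions. Only the subcommand name and the flag name come from the comment above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring `python -m score.cli scrape-pypi-web`."""
    parser = argparse.ArgumentParser(prog="score.cli")
    subcommands = parser.add_subparsers(dest="command", required=True)

    # Subcommand name taken from the thread; everything else is illustrative.
    scrape = subcommands.add_parser(
        "scrape-pypi-web", help="scrape the PyPI website"
    )
    # --letter is optional; None here stands for "no range given" (assumption).
    scrape.add_argument(
        "--letter", default=None, help="range of letters to scrape, e.g. 0-9"
    )
    return parser


args = build_parser().parse_args(["scrape-pypi-web", "--letter", "0-9"])
print(args.command, args.letter)
```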

The output is written to: ./score/output/web/letter={letter}/pypi_packages.parquet
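
As a quick illustration of how the {letter} placeholder expands, the sketch below builds the output path for a given --letter value. The helper name output_path and the OUTPUT_ROOT constant are hypothetical and are not taken from the repository; only the path template comes from the comment above.

```python
from pathlib import Path

# Root of the web scraper output, relative to the repository (from the thread).
OUTPUT_ROOT = Path("score") / "output" / "web"


def output_path(letter: str) -> Path:
    """Hypothetical helper: the parquet file for one --letter value."""
    # The directory name follows the letter={letter} template.
    return OUTPUT_ROOT / f"letter={letter}" / "pypi_packages.parquet"


print(output_path("0-9").as_posix())
```

Assuming pandas and a parquet engine such as pyarrow are installed, the resulting file could then be loaded with pandas.read_parquet.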