mendableai / firecrawl

🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.
https://firecrawl.dev
GNU Affero General Public License v3.0
17.7k stars 1.3k forks source link

[sdk] ci pipeline for publishing python/node sdk #194

Closed nickscamara closed 4 months ago

mattjoyce commented 5 months ago

I'm interested in this topic, it's an area I want to learn more about. could you share the current process?

nickscamara commented 5 months ago

@mattjoyce All we are doing right now is cleanup of the old build files, then we run:

  1. Build: python setup.py sdist bdist_wheel

  2. twine upload dist/*

Authenticate the above with token

nickscamara commented 5 months ago

For the token I will insert that in github envs as PYPI_AUTH_TOKEN

mattjoyce commented 4 months ago

I have been looking at a few popular python packages, there quite a bit of diversity. However some themes emerge.

While that's not specific to the build/publish process, I thought relevant.

For the build/publish process. Seems there is a transition away from setup.py, to describing the build process in pyproject.toml. The build process seems to fall into three main camps. SetupTools, Flit, Poetry

I think for your project, (which is simple on the python side), the simplest approach would be to use setuptools+pyproject.toml, and potentially continue to maintaining the setup.py for a period.

Here's the generic process differences. Step setup.py Process setuptools + pyproject.toml Process
1. Update version number in setup.py pyproject.toml
2. Define metadata and dependencies setup.py pyproject.toml
3. Specify build system requirements N/A (implicitly uses setuptools if not specified) pyproject.toml under [build-system]
4. Build source distribution python setup.py sdist python -m build
5. Build wheel distribution python setup.py bdist_wheel python -m build
6. Check package twine check dist/* twine check dist/*
7. Publish to Test PyPI twine upload --repository-url https://test.pypi.org/legacy/ dist/* twine upload --repository-url https://test.pypi.org/legacy/ dist/*
8. Publish to PyPI twine upload dist/* twine upload dist/*
9. Install package locally pip install . pip install .
10. CI/CD Configuration Typically scripts using setup.py Scripts using pyproject.toml and setuptools.build_meta

Noting, I'm just reading documents, and have no experience of this, it seems that implementing pre-commit and pyproject.toml would be good precursory steps, before automating the build\publish workflows (which I am now reading about!).

mattjoyce commented 4 months ago

Reading

pre-commit Introduction PyPi Packaging Projects

SetupTools - User Guide - PyProject config

mattjoyce commented 4 months ago

this also looks looks very manageable. Flit - Stable

This does not do the Build, it's just the Publish, but Firecrawl python-sdk is just a couple of files...

mattjoyce commented 4 months ago

OK, I was able to put together a pyproject.toml, and tweak setup.py to pull version from init.py then use Python -m build works fine.

This way build code does not need to be edited often, just update package code.

mattjoyce commented 4 months ago

Here's the pull request for the first step. Python-SDK transitional build setup for pyproject.toml #196

mattjoyce commented 4 months ago

So, then I think an approach could be.

Code is written, commits made, merges accepted, etc Maintainer want to publish

Workflow triggered checks - lint, formatter, isort pytest build publish

rafaelsideguide commented 4 months ago

Hey @mattjoyce ! I took a look at your comments here, and I think we can do something like this: on pull requests on the main branch, we can trigger a GitHub Action that checks if the version in __init__.py (for the Python SDK) or package.json (for the JS SDK) is higher than the version on PyPI (and npm for JavaScript). If it is, we'll run the tests, and if they pass, we'll publish the packages.

For checking the versions, we can use the pip list command and parse the output for the Python SDK, and for the JS SDK, we can use npm list or npm show <package_name> version to get the current published version. We might need to write a small script to automate this comparison. I've already implemented tests for the SDKs on the tests-sdks branch.

What do you think?

mattjoyce commented 4 months ago

@rafaelsideguide

That makes sense but there are couples thing to consider, and I'm not sure what the consequences are.

  1. change version in init, check against current published versions, trigger workflow - tests fail for some reason. What happens then, resolve the cause of the failed tests, and then manually trigger the workload, or revert the version change and commit it again?

  2. the onus will be on the code review to always check that version in init.py is not in pull requests. the decision to publish should be an active decision by your dev\maintainer team. if the version is incremented by a contributor and overlooked, pull request is accepted, it will just publish. My background is corporate IT, and somewhat risk adverse, so my instinct is to put a manual step in the process that cannot happen by accident - I thought the tagging might good for that, but actually not sure who can tag.

I think those scenarios need to be thought through, but your idea seems technically feasible. I need to read up on git hub actions...

mattjoyce commented 4 months ago

put together a script to check versions...

(firecrawl) W:\firecrawl\apps\python-sdk>python check_version_has_incremented.py js ../js-sdk/firecrawl  @mendable/firecrawl-js 
Local version: 0.0.22
Published version: 0.0.21
true

(firecrawl) W:\firecrawl\apps\python-sdk>python check_version_has_incremented.py python ./firecrawl firecrawl-py     
Local version: 0.0.11
Published version: 0.0.11
false
rafaelsideguide commented 4 months ago

@rafaelsideguide

That makes sense but there are couples thing to consider, and I'm not sure what the consequences are.

  1. change version in init, check against current published versions, trigger workflow - tests fail for some reason. What happens then, resolve the cause of the failed tests, and then manually trigger the workload, or revert the version change and commit it again?
  2. the onus will be on the code review to always check that version in init.py is not in pull requests. the decision to publish should be an active decision by your dev\maintainer team. if the version is incremented by a contributor and overlooked, pull request is accepted, it will just publish. My background is corporate IT, and somewhat risk adverse, so my instinct is to put a manual step in the process that cannot happen by accident - I thought the tagging might good for that, but actually not sure who can tag.

I think those scenarios need to be thought through, but your idea seems technically feasible. I need to read up on git hub actions...

@mattjoyce

  1. For the scenario with the tests failing after a version change: I think the developer should just fix whatever caused the failure and then bump up the version. Like, if I tried to publish version 0.0.54 and it didn't go through because of test issues, I'd fix what is causing the issues and move on to 0.0.55. Sounds good?

  2. Regarding the version numbers in init.py during code reviews: So far, we've been pretty thorough with checking everything in a pull request before anything gets merged. I think as long as we keep our reviews solid and all updates pass the tests, we should be in the clear. Let's keep things moving fast.

What do you think?

mattjoyce commented 4 months ago

yeah, sounds like a plan! Also:

  1. Tests are applied on merge requests, so maintainers should see fails there.
  2. the check verions test could be used to throw a warning if merge requests have changed version in init.py.

Do you have a branch for this stuff, I will up my test script to... /.github/workflows/scripts Need this merged in : https://github.com/mendableai/firecrawl/pull/196

mattjoyce commented 4 months ago

Here you go : https://github.com/mendableai/firecrawl/pull/213

mattjoyce commented 4 months ago

ok, here is a draft, or example of how the script would be used to - completely theoretical.

name: CI/CD

on:
  push:
    branches:
      - main

jobs:
  build-and-publish:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout repository
        uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.x'

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install setuptools wheel twine build requests packaging

      - name: Check if version is incremented
        id: check_version
        run: |
          VERSION_INCREMENTED=$(python .github/scripts/check_version.py python your_package your-python-package-name)
          echo "VERSION_INCREMENTED=$VERSION_INCREMENTED" >> $GITHUB_ENV

      - name: Build the package
        if: env.VERSION_INCREMENTED == 'true'
        run: |
          python -m build

      - name: Publish to PyPI
        if: env.VERSION_INCREMENTED == 'true'
        env:
          TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }}
          TWINE_PASSWORD: ${{ secrets.PYPI_PASSWORD }}
        run: |
          twine upload dist/*
rafaelsideguide commented 4 months ago

hey @mattjoyce this is looking good! If all goes well, we're going to test it today

rafaelsideguide commented 4 months ago

PyPi is not publishing for some reason 🫠. I'm gonna check it out again tomorrow morning.

rafaelsideguide commented 4 months ago

Fixed workflow