ParmaAI analytics repository providing data processing and inference.

parma-analytics


Core backend for the ParmaAI data mining and analytics platform.

Project Description

ParmaAI is a sophisticated platform combining advanced data mining with comprehensive analytics. At its core, Parma Analytics orchestrates the data lifecycle, scheduling and managing data mining modules, and processing the extracted data using tools like ChatGPT for in-depth analysis and report generation. Complementing this, Parma Web provides a user-friendly interface for data management and visualization, making the system's complex functionalities accessible and intuitive. Together, they form a powerful solution for businesses to monitor, analyze, and respond to corporate anomalies, leveraging data-driven insights for strategic decision-making.

parma-analytics

Parma Analytics is the dynamic engine at the heart of the ParmaAI ecosystem, playing a multifaceted role in data processing, analysis, and orchestration. Beyond analyzing data and generating insightful reports with tools like ChatGPT, it is also tasked with the critical function of scheduling and managing interactions with data mining modules. This repository is key for contributors focused on enhancing our system's capability to not only process and analyze data but also to efficiently coordinate the various stages of data mining. It's a nexus for integrating complex workflows, scheduling tasks, and ensuring seamless communication between different data mining modules.

Related repositories:

parma-ai architecture

The parma-analytics backend is the heart of the system: it connects the data mining processes with the frontend stack while being responsible for analytics and inference.

system's architecture

The parma ai backend consists of the following components:

(systems architecture diagram)

The whole deployment stack is replicated for staging and production environments.

Note: this is a simplified version of the architecture. More detailed process flows can be found below.

Detailed architecture


graph TD

        %% Subgraphs
    subgraph FrontendContainer[FRONTEND]
        Frontend[("Next.js<br>TypeScript")]
    end

    subgraph ApiBackendContainer[API BACKEND]
        ApiRestBackend[("Next.js<br>Node.js<br>TypeScript")]
        AuthUser(("Authentication<br>User Management"))
    end

    subgraph WebLayer[Web Layer]
        FrontendContainer
        ApiBackendContainer
    end

    subgraph DatabaseLayer[Database Layer]
        AnalyticsDatabase[("Analytics Database")]
        ProductionDatabase[("Production Database")]
        CrawlingDatabase[("Crawling Database")]
    end

    subgraph AnalyticsLayer[Analytics Layer]
        AnalyticsBackendContainer
    end

    subgraph AnalyticsBackendContainer[Analytics Backend]
        AnalyticsRestBackend[("Python Backend")]
        NormalizationService
        NotificationContainer
    end

    subgraph NotificationContainer[Notification Module]
        Slack
        Email
    end

    subgraph NormalizationService[Normalization Module]
    end

    subgraph DataSourcingLayer[Data Sourcing Layer]
        DataMiningModules
    end

    subgraph GoogleCloudScheduler[Google Cloud Scheduler]
    end

    subgraph DataMiningModules[Data Mining Modules]
        Module1[("Reddit")]
        Module2[("GitHub")]
        Module3[("Clearbit")]
        ModuleN[("People Data Labs")]
    end

        %% Links
    Users --> FrontendContainer
    FrontendContainer -->|REST| ApiRestBackend
    ApiRestBackend -->|SQL Framework| ProductionDatabase
    ProductionDatabase -->|Replicate| AnalyticsDatabase
    AnalyticsRestBackend -->|SQL Framework| ProductionDatabase
    AnalyticsRestBackend -->|SQL Framework| CrawlingDatabase
    AnalyticsRestBackend --> NotificationContainer
    NotificationContainer --> ExternalNotificationProvider["External Notification Provider"]
    AnalyticsRestBackend -->|Trigger Run| DataMiningModules
    DataMiningModules -->|Notify Finished / Send Raw Data| AnalyticsRestBackend
    GoogleCloudScheduler -->|Trigger Periodically| AnalyticsRestBackend
    AnalyticsRestBackend -->|Raw Data| NormalizationService
    NormalizationService -->|Normalized Data| AnalyticsRestBackend

        %% Link Styles
    linkStyle default stroke:#9E9D91,stroke-width:2px;

Data flow diagram

graph LR
    subgraph "parma-scheduler (Google Cloud Scheduler)"
        S(Cloud Scheduler)
    end
    subgraph "parma-mining (Serverless / GCP Cloud Run)"
        M1[Data Mining] --> M2[Data Preprocessing]
    end
    subgraph "parma-analytics (Serverless / GCP Cloud Run)"
        subgraph "parma-mining-db (NoSQL/Firebase)"
            FB1[NoSQL Mining Database]
        end
        M2 --> A0[Analytics REST API]
        A0 -->|persist raw data| A1[Raw Data Storage Adapter]
        A1 <--> FB1
        subgraph "data_processing (Python)"
            A2(Normalization)
        end
        A0 -->|handover to data processing| A2
        A2 --> A3["SQL Storage Adapter"]
        subgraph "analytics and reporting (Python)"
            AR1[Data Analytics] --> AR2[Data Inference]
            AR2 --> AR3[Data Visualization]
            AR3 --> AR4[Data Reporting / Alerting]
        end
        AR3 -->|persist reporting results| A1
        A2 -->|trigger analytics run| AR1
        A0 -->|manage sourcing schedules| A4(Sourcing scheduler)
        A3 <--> A4
    end
    S -->|trigger mining run| A0
    A4 -->|trigger mining run| M1
    subgraph "Notifications (Firebase)"
        AR4 -->|send notifications| N1[Slack]
        AR4 -->|send notifications| N2[Gmail]
    end
    subgraph "parma-prod-db (GCP Cloud SQL)"
        P[Postgres SQL]
    end
    A3 <--> P
    subgraph "parma-web (Vercel)"
        W1[Prisma database adapter] <--> W2[REST API]
        W2 <--> W3[Frontend]
    end
    P <--> W1
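
As the data flow diagram shows, raw mining results are persisted to the Firestore-based crawling database before normalization. The following is a minimal sketch of how such a write could look with the firebase-admin SDK; the collection name and document fields are illustrative assumptions, not the actual schema used by the raw data storage adapter in parma_analytics/db/mining.

    # Minimal sketch of persisting raw mining data to Firestore via firebase-admin.
    # Collection and field names are illustrative; the actual raw data storage
    # adapter in parma_analytics/db/mining may use a different schema.
    import firebase_admin
    from firebase_admin import credentials, firestore

    cred = credentials.Certificate(".secrets/la-famiglia-parma-ai-firebase-adminsdk.json")
    firebase_admin.initialize_app(cred)
    db = firestore.client()

    # store one raw payload as delivered by a mining module
    db.collection("mining_raw").add(
        {
            "source": "github",       # which mining module produced the data
            "company_id": "42",       # illustrative identifier
            "payload": {"stars": 3},  # raw, unnormalized data
        }
    )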

Process flow diagram: Mining Scheduling

graph TD

        %% Subgraphs
   subgraph FrontendContainer[FRONTEND]
   end

   subgraph ApiBackendContainer[API BACKEND]
   end

   subgraph WebLayer[Web Layer]
      FrontendContainer
      ApiBackendContainer
   end

   subgraph DatabaseLayer["PostgreSQL Database (Production)"]
      TaskConfigurations["data_source"]
      ScheduledTasks["scheduled_tasks"]
   end

   subgraph AnalyticsBackendContainer[Analytics Backend]
      ScheduleEndpoint["/schedule"]
      TriggerDataSourcesEndpoint["MiningModuleManager"]
      CrawlingFinishedEndpoint["/crawling-finished"]
   end

   subgraph DataRetrievalModules[Data Mining Modules]
   end

   subgraph GoogleCloudScheduler[Google Cloud Scheduler]
   end

        %% Links
   Users -->|1. Interacts with frontend| FrontendContainer
   FrontendContainer -->|2. Send Configuration| ApiBackendContainer
   ApiBackendContainer -->|3.a Add/Update Configuration| TaskConfigurations
   ApiBackendContainer -->|"3.b Add a row (on-demand)"| ScheduledTasks
   ApiBackendContainer -->|"3.b Trigger Scheduling (on-demand)"| ScheduleEndpoint
   GoogleCloudScheduler -->|"4. Triggers Scheduling (every x time)"| ScheduleEndpoint
   DatabaseLayer -->|"5.(a/b) Tables"| ScheduleEndpoint
   ScheduleEndpoint -->|5.c Add a Row| ScheduledTasks
   ScheduleEndpoint -->|5.d Trigger| TriggerDataSourcesEndpoint
   TriggerDataSourcesEndpoint -->|6.a Reserve Task| ScheduledTasks
   TriggerDataSourcesEndpoint -->|"6.b Call /companies endpoint"| DataRetrievalModules
   DataRetrievalModules -->|7. Notify Mining is Finished| CrawlingFinishedEndpoint
   CrawlingFinishedEndpoint -->|8. Set Task Status as Success| TriggerDataSourcesEndpoint

        %% Link Styles
   linkStyle default stroke:#9E9D91,stroke-width:2px;

Activity diagram: Mining Scheduling

graph TD
    A[Start] -->|Triggered by Scheduler or On-demand| B["/schedule Endpoint"]
    B --> C[Create Pending Scheduled Tasks for Active Data Sources]
    C --> D[Update Overdue Tasks]
    D --> E{Check Task Attempts}
    E -->|Attempts < 3| F[Set Tasks to Processing and Schedule]
    E -->|Attempts = 3| G[Mark Tasks as Failed]
    F --> H[Trigger Mining Module for Scheduled Tasks]
    G --> I[End]
    H --> J[Retrieve Company Data for Each Scheduled Data Source]
    J --> K{Check Handles for Each Company}
    K -->|Handle Exists and Valid| L[Trigger /companies Endpoint in Mining Module]
    K -->|Handle Missing/Expired| M[Call /discovery Endpoint and Update Handles]
    M --> L
    L --> I[End]
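
To make the attempt handling in the diagram concrete, here is a minimal Python sketch of that branch of the logic. All names (attempts, MAX_ATTEMPTS, trigger_mining_module) are illustrative assumptions and do not mirror the actual MiningModuleManager implementation.

    # Illustrative sketch of the attempt handling in the activity diagram above.
    # All names are hypothetical; the real logic lives in the MiningModuleManager.
    MAX_ATTEMPTS = 3

    def schedule_task(task) -> None:
        """Decide what happens to a pending or overdue scheduled task."""
        if task.attempts >= MAX_ATTEMPTS:
            task.status = "FAILED"      # give up after the third attempt
            return
        task.attempts += 1
        task.status = "PROCESSING"      # reserve the task for this run
        trigger_mining_module(task)     # eventually calls the module's /companies endpoint

    def trigger_mining_module(task) -> None:
        """Placeholder for the call into the data mining module."""
        ...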

How to add new data sources

ADDING_DATASOURCES.md

Getting Started

The following steps will get you started with the project.

NOTE: Although the general steps should also work on Windows, we highly recommend using a Linux-based machine for development. WSL is also an option. Use Windows at your own risk.

  1. Prerequisites: to be able to contribute to JST in this repository, make sure to fulfill the following prerequisites.

    • Configure GitHub access via an SSH key. Key-based authentication is highly encouraged. See GitHub Docs for more information.
    • Please make sure to have a GPG key configured for GitHub. See GitHub Docs for more information.
    • Install micromamba, a fast conda-compatible environment and package manager, as described here. Alternatively, conda or mamba installations should also work, but they are highly discouraged because of their slow performance.
    • Install Docker and Docker Compose. See Docker Docs for more information.
  2. Clone the repository

    git clone git@github.com:la-famiglia-jst2324/parma-analytics.git
  3. Precommit & environment setup:

    # spinning up the database container for local development
    docker-compose up -d
    
    make install
  4. Activating the environment:

    # Activate the new environment (do this every time you start a new terminal)
    # fyi. there are IDE extensions to automatically activate the environment
    micromamba activate parma-analytics
    
    # do the following only once
    pip install -e . # Install the project in editable mode
    pre-commit install
  5. Export environment variables for the database (a sketch of how these variables are consumed follows after this list):

    # fyi. there are IDE extensions to automatically load
    # environment variables from a .env file
    export POSTGRES_HOST=localhost
    export POSTGRES_PORT=9000
    export POSTGRES_USER=parma-prod-db
    export POSTGRES_PASSWORD=parma-prod-db
    export POSTGRES_DB=parma-prod-db
  6. Start the api server:

    make dev

    Open http://localhost:8000 with your browser to see the result.

    FastAPI provides interactive documentation of the API: use the Swagger UI at http://localhost:8000/docs or the ReDoc UI at http://localhost:8000/redoc.

  7. Optional: Running the pre-commit pipeline manually

    pre-commit run --all
  8. Test your code:

    make test
  9. Setup the Firestore DB (if necessary):

    If you want to connect to the crawling database, where we save the raw data, you have to get the credentials from Notion. Then create a .secrets folder in the repository root (or use the existing one) and add the credentials in a new file named la-famiglia-parma-ai-firebase-adminsdk.json.

  10. Setup the Firestore Emulator (for local testing):

    Follow the instructions in Notion to set up and use the Firestore Emulator.
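
The POSTGRES_* variables exported in step 5 are what the backend uses to reach the local Postgres container. As a rough sketch of how they can be turned into a database engine (assuming a SQLAlchemy-style setup; the actual logic lives in parma_analytics/db/prod/engine.py and may differ):

    # Rough sketch of building a database engine from the exported variables.
    # The real setup lives in parma_analytics/db/prod/engine.py and may differ.
    import os

    from sqlalchemy import create_engine

    url = (
        f"postgresql://{os.environ['POSTGRES_USER']}:{os.environ['POSTGRES_PASSWORD']}"
        f"@{os.environ['POSTGRES_HOST']}:{os.environ['POSTGRES_PORT']}"
        f"/{os.environ['POSTGRES_DB']}"
    )
    engine = create_engine(url)

    with engine.connect() as conn:
        print(conn.exec_driver_sql("SELECT 1").scalar())  # quick connectivity check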

Sendgrid

ChatGPT

Get Started with Slack Notifications:

To ensure you're set up to receive messages in your Slack channel, simply follow these steps:

  1. Create Your Slack App: Head over to the Slack messaging documentation and set up a Slack App for your workspace. Pay special attention to the Requesting the necessary permissions section to get your app up and running.

  2. Invite Your Slack Bot: Once your app is ready, bring it into the conversation by inviting it to the desired channel(s):

    • In the message field of your Slack channel, type /invite @YourAppName and hit enter. Be sure to replace YourAppName with the actual name of your app.
  3. Connect to Parma AI:

    • Locate your Bot User OAuth Token on your Slack App's page under the OAuth & Permissions section.
    • Enter this token and your desired Slack channel name(s) into the corresponding fields in your Parma AI settings.

And that's it! You're all set to receive notifications through Parma AI in your selected Slack channel.
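
Under the hood, sending such a notification boils down to a single API call with the Bot User OAuth Token. Here is a minimal sketch using the official slack_sdk package; the token variable and channel name are placeholders, as Parma AI reads them from the settings you entered above.

    # Minimal sketch of posting a Slack notification with a Bot User OAuth Token.
    # Token and channel are placeholders; Parma AI reads them from your settings.
    import os

    from slack_sdk import WebClient
    from slack_sdk.errors import SlackApiError

    client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])
    try:
        client.chat_postMessage(channel="#parma-notifications", text="New report is ready.")
    except SlackApiError as err:
        print(f"Slack notification failed: {err.response['error']}")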

PR workflow

  1. Create a new branch. linear.app offers a button to copy branch names from tickets. In case there is no ticket, feel free to use an arbitrary name or create a ticket. GitHub CI doesn't care about the branch name; only the PR title matters.

    # format: e.g. robinholzingr/meta-1-create-archtecture-drafts-diagrams-list-of-key-priorities
    git checkout -b <branch-name>
  2. Open a PR and use a conventional commit PR title (e.g. feat: add discovery endpoint or fix: handle missing handles).

  3. Wait for the CI pipeline to pass and, if you are happy with your changes, request a review.

  4. Merge the PR (using the "Squash and merge" option) and delete the branch. Remember to include co-authors if anyone else contributed to the PR.

  5. If you want to release a new version to production, create a new release on GitHub. The release version will be automatically derived from the PR titles (breaking changes yield new major versions, new features yield new minor versions).

Directory structure

.
├── parma_analytics: Main package for analytics backend
│   ├── analytics: subpackage for main analytical tasks
│   │   ├── inference: subpackage for custom models and inference
│   │   └── visualization
│   ├── api
│   │   ├── README.md: Guidelines for api design
│   │   ├── main.py: Api entrypoint for fastapi
│   │   ├── models: pydantic models for api
│   │   │   ├── README.md: Guidelines for pydantic models
│   │   │   └── dummy.py: Example pydantic models
│   │   └── routes: api routes for fastapi
│   │       └── dummy.py: Example api routes
│   ├── bl: Business logic for analytics backend
│   ├── db: Database subpackage
│   │   ├── mining: subpackage for mining database
│   │   └── prod: subpackage for production database
│   │       ├── README.md
│   │       ├── dummy.py: Example database connectors
│   │       ├── engine.py: Database engine utility
│   │       ├── models: Database models
│   │       │   ├── README.md: Guidelines for database models
│   │       │   ├── base.py: Base database model
│   │       │   └── dummy.py: Example database models
│   │       └── utils: Database utilities
│   │           └── paginate.py: Pagination utility
│   ├── etl: Data processing subpackage
│   └── reporting: Reporting subpackage
│       ├── gmail: subpackage for gmail reporting
│       └── slack: subpackage for slack reporting
├── tests:
│   ├── analytics
│   │   ├── inference
│   │   └── visualization
│   ├── api
│   ├── db
│   ├── etl
│   ├── reporting
│   │   ├── gmail
│   │   └── slack
│   └── test_dummy.py
├── Makefile: Recipes for easy simplified setup and local development
├── README.md
├── docker-compose.yml: Docker compose file for local database
├── environment.yml: conda environment file
├── pyproject.toml: Python project configuration file
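
To illustrate the api layout above: a route module under parma_analytics/api/routes is essentially an APIRouter that gets included by api/main.py, with its request/response schemas defined as pydantic models under api/models. The sketch below is illustrative only and does not reproduce the actual contents of dummy.py.

    # Illustrative route module in the style of parma_analytics/api/routes/dummy.py.
    # Names and paths are hypothetical, not the repository's actual routes.
    from fastapi import APIRouter
    from pydantic import BaseModel

    router = APIRouter()

    class HealthResponse(BaseModel):
        status: str

    @router.get("/health", response_model=HealthResponse)
    def read_health() -> HealthResponse:
        """Simple liveness probe."""
        return HealthResponse(status="ok")

    # api/main.py would register the router via app.include_router(router).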

Tech Stack

Core libraries that this project uses:

Deployment

The deployment of Parma Analytics is managed through a combination of Terraform for infrastructure management and GitHub Actions for continuous integration and delivery. Our deployment strategy ensures that our application is consistently deployed across different environments with high reliability and minimal manual intervention.

Infrastructure as Code with Terraform

We use Terraform for defining, provisioning, and managing the cloud infrastructure required for Parma Analytics. Our Terraform configuration files are organized under the terraform directory and divided into environments: staging (staging) and production (prod). Each environment has its own set of configurations and variables, ensuring isolation and control over the different deployment stages.

A pivotal aspect of our Terraform strategy is the use of a common module, which is housed in the module directory. This module encompasses the core infrastructure components that are shared across all environments. The utilization of a shared module ensures consistency and streamlines our infrastructure management.

Each environment, staging and production, references this common module but with its own set of environment-specific configurations and variables. This structure ensures that each environment, while based on a common foundation, is independently configurable and isolated, thus providing precise control over the deployment in various stages.

Key components of our infrastructure include:

Continuous Deployment with GitHub Actions

Our GitHub Actions workflow, defined in .github/workflows/deploy.yml, automates the deployment process. The workflow is triggered on pushes to the main branch and on published releases. It encompasses steps for:

Deployment Environments

We maintain two primary environments for our application:

Disclaimer

In case there are any issues with the initial setup or important architectural decisions/integrations missing, please contact the meta team or @robinholzi directly.