tech-greedy / singularity

New node software for large-scale clients with PB-scale data onboarding to Filecoin network

singularity

⛔️ DEPRECATION WARNING

The V1 Singularity is deprecated in favor of Singularity V2.

Check how they are different and development progress

Related Projects

Quick Start

Looking for a complete end-to-end demonstration? Try the Getting Started Guide.

Prerequisite

# Install nvm (https://github.com/nvm-sh/nvm#install--update-script)
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.1/install.sh | bash
source ~/.bashrc
# Install node v18
nvm install 18

Install globally from npm

npm i -g @techgreedy/singularity
singularity -h

Build and run from source

1. Transpile this project

git clone https://github.com/tech-greedy/singularity.git
cd singularity
npm ci
npm run build
npm link
singularity -h

2. Build Dependency

By default, npm will pull the pre-built binaries for dependencies. You can choose to build it from source and override the one pulled by npm.

# Make sure you have go v1.17+ installed
git clone https://github.com/tech-greedy/generate-car.git
cd generate-car
make

Then copy the generated binary to override the existing one from the PATH for your node environment, i.e.

Note that the path may change depending on the Node.js version. If you cannot find the folder above, try searching for the generate-car binary first (e.g. find ~/.nvm -name 'generate-car').
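
The search-and-copy step can be sketched as a small helper. This is only an illustration: the function name override_generate_car and both arguments are hypothetical, and it replaces every generate-car file found under the search root, which may be more than you want if several Node.js versions are installed.

```shell
# Replace every generate-car binary found under a search root with a
# freshly built one. Both arguments are placeholders for your own paths.
override_generate_car() (
  built=$1
  search_root=$2
  find "$search_root" -type f -name 'generate-car' | while IFS= read -r target; do
    cp "$built" "$target" && echo "replaced $target"
  done
)

# Example, after running make in the generate-car checkout:
#   override_generate_car ./generate-car ~/.nvm
```

Run singularity -h afterwards to confirm the tool still starts.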

Initialization (Optional)

To use the tool as a daemon, first initialize the config and the database:

singularity init

By default, a repository will be initialized at $HOME_DIR/.singularity. Set the environment variable SINGULARITY_PATH to override this behavior.

# Unix
export SINGULARITY_PATH=/the/path/to/the/repo
# Windows
set SINGULARITY_PATH=/the/path/to/the/repo
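
To check which repository path is in effect in the current shell, a sketch like the following may help. It assumes the default resolves to .singularity under $HOME, as described above; the resolve_repo name is only illustrative.

```shell
# Resolve the repository path: SINGULARITY_PATH if set, otherwise the
# default location under the home directory (assumed here to be $HOME).
resolve_repo() {
  printf '%s\n' "${SINGULARITY_PATH:-$HOME/.singularity}"
}

repo=$(resolve_repo)
echo "Singularity repo: $repo"
```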

Topology Choices

Since the tool is modularized, it can be deployed in different ways and have different components enabled or disabled.

Below are configurations for common scenarios.

Deal Preparation Only

This is useful if you only need deal preparation but not deal making. You can still have deal making enabled, but disabling it will use slightly less system resources.
In default.toml in your repo:

  1. change ipfs.enabled to false
  2. change deal_tracking_service.enabled to false
  3. change deal_replication_service.enabled to false
  4. change deal_replication_worker.enabled to false

Use External MongoDb database

This is useful if you are familiar with MongoDB and are hitting bottlenecks or issues with the built-in MongoDB.

  1. Setup your own MongoDb instance
  2. In default.toml from your repo
    1. change database.start_local to false
    2. change connection.database to the connection string of your own MongoDb database

Running Workers on different node for Deal Preparation

  1. On the master server, set deal_preparation_service.enabled and database.start_local to true, and disable all other modules
  2. On the worker servers, set deal_preparation_worker.enabled to true and disable all other modules. Change connection.database and connection.deal_preparation_service to point to the IP address of the master server

Usage

$ singularity
Usage: singularity [options] [command]

A tool for large-scale clients with PB-scale data onboarding to Filecoin network
Visit https://github.com/tech-greedy/singularity for more details

Options:
  -V, --version     output the version number
  -h, --help        display help for command

Commands:
  init              Initialize the configuration directory in SINGULARITY_PATH
                    If unset, it will be initialized at HOME_DIR/.singularity
  daemon            Start a daemon process for deal preparation and deal making
  preparation|prep  Manage deal preparation
  help [command]    display help for command

Start the Daemon

export SINGULARITY_PATH=/the/path/to/the/repo
singularity daemon

Deal Preparation

Deal preparation contains two parts: a scanning request for the dataset, and the generation requests produced from it.

$ singularity prep -h
Usage: singularity preparation|prep [options] [command]

Manage deal preparation

Options:
  -h, --help                                             display help for command

Commands:
  create [options] <datasetName> <datasetPath> <outDir>  Start deal preparation for a local dataset
  status [options] <dataset>                             Check the status of a deal preparation request
  list [options]                                         List all deal preparation requests
  generation-manifest [options] <generationId>           Get the Slingshot v3.x manifest data for a single deal generation request
  generation-status [options] <generationId>             Check the status of a single deal generation request
  pause                                                  Pause scanning or generation requests
  resume                                                 Resume scanning or generation requests
  retry                                                  Retry scanning or generation requests
  remove [options] <dataset>                             Remove all records from database for a dataset
  help [command]                                         display help for command

Create Deal Preparation Request

This will create a scanning request for a dataset. While the dataset is being scanned, it will also produce generation requests to be taken by workers.

$ singularity prep create -h
Usage: singularity preparation create [options] <datasetName> <datasetPath> <outDir>

Start deal preparation for a local dataset

Arguments:
  datasetName                  A unique name of the dataset
  datasetPath                  Directory path to the dataset
  outDir                       The output Directory to save CAR files

Options:
  -s, --deal-size <deal_size>  Target deal size, i.e. 32GiB (default: "32 GiB")
  -t, --tmp-dir <tmp_dir>      Optional temporary directory. May be useful when it is at least 2x faster than the dataset source, such as when the dataset is on network mount, and the I/O is the bottleneck
  -f, --skip-inaccessible-files  Skip inaccessible files. Scanning may take longer to complete.
  -m, --min-ratio <min_ratio>  Min ratio of deal to sector size, i.e. 0.55
  -M, --max-ratio <max_ratio>  Max ratio of deal to sector size, i.e. 0.95
  -h, --help                   display help for command

Support for public S3 bucket

Deal preparation supports public S3 buckets natively. A temporary directory is mandatory when using an S3 bucket, for example:

singularity prep create -t <tmp_dir> <dataset_name> s3://<bucket_name>/<optional_prefix>/ <out_dir>

Pause/Resume/Retry a request

Each dataset preparation request starts with a scanning request. Once enough files have been found to fill a single deal, the scan creates a generation request. In other words, each preparation request consists of a single scanning request and a number of generation requests.

You can pause/resume/retry the scanning request or generation requests.

singularity prep pause -h
singularity prep resume -h
singularity prep retry -h

Append more files to a request

Append a new directory to an existing dataset. This adds all entries under the new directory to the dataset. As with the singularity prep create command, the directory is treated as the root. You are responsible for making sure the dataset contains no duplicate entries; otherwise files with the same path may be corrupted during retrieval.

singularity preparation append <dataset> <newPath>

Example:

singularity prep create myData /my/data-2020 /my/out
singularity prep append myData /my/data-2021
singularity prep append myData /my/data-2022

Remove a request

A whole data preparation request can be removed from the database. All generated CAR files can also be deleted by specifying the --purge option.

singularity prep remove -h

List Deal Preparation Requests

List all the deal preparation requests, including whether scanning has completed and how many generation requests have completed or hit errors for each of them.

singularity prep list

Check Deal Preparation Request status

Check status for a specific deal preparation request, including the status of the initial scanning request and all corresponding generation requests.

singularity prep status -h

Check specific Deal Generation Request status

Look into a specific generation request, including which files or folders it contains and their corresponding size, cid, selector, etc.

singularity prep generation-status -h

Get Slingshot 3.x Manifest for a Generation Request

singularity prep generation-manifest -h

Upload Slingshot 3.x Manifest to web3.storage

WEB3_STORAGE_TOKEN="eyJ..." singularity prep upload-manifest -h

Monitor service health and download speed

singularity monitor

Deal Replication

The deal replication module supports both lotus-market and boost based storage providers (lotus-market support may be deprecated later). Currently, both the lotus and boost CLI binaries are required for this module to work.

Deal Replication Configuration

Look for default.toml in the initialized repo and verify, in the [deal_replication_worker] section, that both binaries can be accessed. Environment variables such as FULLNODE_API_INFO can also be specified there.

Setup Lotus Lite node

In order to make deals, we recommend setting up a lite node to use with the tool.

Once you have the lite node setup, you can import your wallet key for the verified client address.

Setup Boost Cli

If your target SP runs on Boost, the boost executable is also needed to make deals.

Once you have the boost cli initialized, you can import your wallet key for the verified client address.

Deal making

$ singularity repl start -h
Usage: singularity replication start [options] <datasetid> <storage-providers> <client> [# of replica]

Start deal replication for a prepared local dataset

Arguments:
  datasetid                                            Existing ID of dataset prepared.
  storage-providers                                    Comma separated storage provider list
  client                                               Client address where deals are proposed from
  # of replica                                         Number of targeting replica of the dataset (default: 10)

Options:
  -u, --url-prefix <urlprefix>                         URL prefix for car downloading. Must be reachable by provider's boostd node. (default: "http://127.0.0.1/")
  -p, --price <maxprice>                               Maximum price per epoch per GiB in Fil. (default: "0")
  -r, --verified <verified>                            Whether to propose deal as verified. true|false. (default: "true")
  -s, --start-delay <startdelay>                       Deal start delay in days. (StartEpoch) (default: "7")
  -d, --duration <duration>                            Duration in days for deal length. (default: "525")
  -o, --offline <offline>                              Propose as offline deal. (default: "true")
  -m, --max-deals <maxdeals>                           Max number of deals in this replication request per SP, per cron triggered. (default: "0")
  -c, --cron-schedule <cronschedule>                   Optional cron to send deals at interval. Use double quote to wrap the format containing spaces.
  -x, --cron-max-deals <cronmaxdeals>                  When cron schedule specified, limit the total number of deals across entire cron, per SP.
  -xp, --cron-max-pending-deals <cronmaxpendingdeals>  When cron schedule specified, limit the total number of pending deals determined by dealtracking service, per SP.
  -l, --file-list-path <filelistpath>                  Path to a txt file that will limit to replicate only from the list. Must be visible by deal replication worker.
  -n, --notes <notes>                                  Any notes or tag want to store along the replication request, for tracking purpose.
  -csv, --output-csv <outputCsv>                       Print CSV to specified folder after done. Folder must exist on worker.
  -f, --force                                          Force resend even if this pieceCID have been proposed / active by the provider. (default: false)
  -h, --help                                           display help for command

A simple example to send all car files in one prepared dataset "CommonCrawl" to one storage provider f01234 immediately:

singularity repl start CommonCrawl f01234 f15djc5avdxihgu234231rfrrzbvnnqvzurxe55kja

A more complex example: send up to 10 deals to each of the storage providers f01234 and f05678 every hour, at minute 1, from the prepared dataset "CommonCrawl", until all CAR files are dealt.

singularity repl start -m 10 -c "1 * * * *" CommonCrawl f01234,f05678 f15djc5avdxihgu234231rfrrzbvnnqvzurxe55kja

Deal Making Self Service

Purpose

  1. Storage providers have full control of deal making speed
  2. Client no longer needs to spend time to pause or adjust deal making speed

Policy Management

Policy Creation

$ singularity repl ss create -h
Usage: singularity replication selfservice create [options] <client> <provider> <dataset>

Create a deal making self service policy

Arguments:
  client                       Client address to send deals from
  provider                     Provider address to send deals to
  dataset                      Id or name of the dataset

Options:
  --minDelay <minDelay>        Minimum delay in days for the deal start epoch (default: "7")
  --maxDelay <maxDelay>        Maximum delay in days for the deal start epoch (default: "7")
  -r, --verified <verified>    Whether to propose deal as verified. true|false. (default: "true")
  -p, --price <price>          Maximum price per epoch per GiB in Fil. (default: "0")
  --minDuration <minDuration>  Minimum duration in days for the deal (default: "525")
  --maxDuration <maxDuration>  maxDuration duration in days for the deal (default: "525")
  -h, --help                   display help for command

Policy Deletion

$ singularity repl ss delete -h
Usage: singularity replication selfservice delete [options] <id>

Delete a deal making self service policy

Arguments:
  id          Policy id to delete

Options:
  -h, --help  display help for command

Policy Listing

$ singularity repl ss list -h
Usage: singularity replication selfservice list [options]

List all deal making self service policies

Options:
  --json      Output with JSON format
  -h, --help  display help for command

Self Service API

Get Eligible PieceCids

curl "http://localhost:7005/pieceCids?provider=f0xxxx&dataset=datasetName"

Propose a deal

# Without pieceCid
$ curl "http://localhost:7005/propose?provider=f0xxxx&dataset=datasetName"
# With pieceCid
$ curl "http://localhost:7005/propose?provider=f0xxxx&dataset=datasetName&pieceCid=bafyxxxx"
# All possible options
$ curl "http://localhost:7005/propose?\
> provider=f0xxxx&\
> dataset=datasetName&\
> pieceCid=bafyxxxx&\
> startDays=7&\
> durationDays=525&\
> client=f0xxxxx"

The logic behind the scenes is as follows:

  1. Try to find all policies that match the provider and dataset
  2. Filter all applicable policies by options provided, such as client, startDays, durationDays
  3. Randomly select one of the matching policies (more than one can match if multiple client addresses are used for the same dataset)
  4. If pieceCid is provided, then check if the pieceCid belongs to the dataset and has not been proposed
  5. Otherwise, find a pieceCid from the dataset that has not yet been proposed to the provider
  6. Propose the deal and return the proposalId

Expose the API to provider

To expose only the /pieceCids and /propose APIs to the SP, you can configure nginx as follows:

location /pieceCids {
    proxy_pass http://localhost:7005;
}
location /propose {
    proxy_pass http://localhost:7005;
}

Retrieval

The recommended way for Retrieval is via bitswap protocol. You need the storage provider to run booster-bitswap.

Then you may use ipfs get <RootCid>/sub/path/to/file to retrieve the file or folder. The ipfs version needs to be 0.18.0+.

The RootCid can be found in singularity prep list and will be automatically generated when the dataset is fully prepared.

If you find RootCid missing, or you're using an older version of Singularity (before 3.0.0), you can regenerate the RootCid by running singularity prep dag <dataset>. This will generate another CAR file that encapsulates the IPLD DAG of the whole dataset. You will need to get that new CAR file sealed before you can perform bitswap retrieval.

Configuration

Look for default.toml in the initialized repo.

[connection]

database

This sets the MongoDb connection string. The default value corresponds to the built-in MongoDb server shipped with this software. If you choose to use a standalone MongoDb service, set the connection string here.

deal_preparation_service

Sets the API endpoint of deal preparation service.

[database]

start_local

The software ships with a built-in MongoDB server. For small to medium-sized datasets, this should be sufficient.

For users who are onboarding large-scale datasets, we recommend running your own MongoDB service that fits into your infrastructure, by setting this value to false. To connect to a standalone MongoDB service, set its connection string in connection.database.

Note that the MongoDB server may consume as much as 80% of usable memory.

local_path, local_bind, local_port

The path of the database files the built-in MongoDb will be using, as well as the IP and port to bind the service to.

[deal_preparation_service]

Service to manage preparation requests

enabled, bind, port

Whether to enable the service and which IP and port to bind the service to

enable_cleanup

If the service crashes or is interrupted, there may be incomplete CAR files generated. Enabling this can clean them up.

minDealSizeRatio, maxDealSizeRatio

The default min/max ratio of CAR file size to the target deal size. The dataset is split using the following logic:

  1. Perform a Glob pattern match and get all files in sorted order
  2. Iterate through all the files and keep accumulating file sizes into a chunk
  3. Once the size of a chunk is between min and max ratio, pack this chunk to a CAR file and start with a new chunk
  4. If a file is too large to fit into a chunk, split the file to hit the min ratio
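
The splitting steps can be illustrated with a simplified sketch. This is not Singularity's actual implementation: plan_chunks is a hypothetical helper, sizes are plain integers in any unit, the ratios are given as whole percentages, and the file sizes are assumed to arrive already in sorted file order.

```shell
# Print the size of each planned chunk, one per line.
# Usage: plan_chunks <target> <min_pct> <max_pct> <file sizes...>
plan_chunks() (
  target=$1; min_pct=$2; max_pct=$3; shift 3
  min=$((target * min_pct / 100))
  max=$((target * max_pct / 100))
  if [ "$min" -le 0 ] || [ "$min" -gt "$max" ]; then
    echo "need 0 < min <= max" >&2
    exit 1
  fi
  chunk=0
  for size in "$@"; do
    remaining=$size
    while [ "$remaining" -gt 0 ]; do
      if [ $((chunk + remaining)) -le "$max" ]; then
        # The rest of the file fits into the current chunk.
        chunk=$((chunk + remaining))
        remaining=0
      else
        # Too large to fit: take just enough to bring the chunk to min.
        take=$((min - chunk))
        chunk=$((chunk + take))
        remaining=$((remaining - take))
      fi
      # A chunk within [min, max] is packed and a new one is started.
      if [ "$chunk" -ge "$min" ]; then
        echo "$chunk"
        chunk=0
      fi
    done
  done
  # Whatever is left over forms a final, smaller chunk.
  if [ "$chunk" -gt 0 ]; then echo "$chunk"; fi
)

plan_chunks 100 55 95 30 30 200 10
```

With a target of 100 and ratios of 0.55 and 0.95, the files of size 30, 30, 200 and 10 are planned as chunks of 60, 55, 55, 90 and 10: the first two files are packed together, the 200-unit file is split twice at the min ratio, and the last file is left as a small trailing chunk.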

[deal_preparation_worker]

Worker to scan the dataset, make a plan, and generate CAR files and CIDs

enabled, num_workers

Whether to enable the worker and how many worker instances. As a rule of thumb, use min(cpu_cores / 2, io_MBps / 20)
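
The rule of thumb can be worked out with a small helper. This is a sketch only: suggest_workers is a hypothetical name, it uses integer division, and the floor of one worker is an assumption added here.

```shell
# Suggest num_workers from CPU cores and sustained disk throughput (MB/s),
# following min(cpu_cores / 2, io_MBps / 20).
suggest_workers() (
  cores=$1
  io_mbps=$2
  by_cpu=$((cores / 2))
  by_io=$((io_mbps / 20))
  n=$by_cpu
  if [ "$by_io" -lt "$n" ]; then n=$by_io; fi
  # Assumption of this sketch: never suggest fewer than one worker.
  if [ "$n" -lt 1 ]; then n=1; fi
  echo "$n"
)

suggest_workers 16 200   # 16 cores, 200 MB/s
```

For 16 cores and 200 MB/s this prints 8, since the CPU bound (8) is lower than the I/O bound (10). On Linux with GNU coreutils, the core count could be taken from nproc.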

Performance

Resource usage

Each generation worker consumes negligible RAM, 20-50 MiB/s disk I/O and 100-250% of CPU.

Speed

Each 32GiB deal takes ~10 minutes to be generated on AMD EPYC CPU with NVME drive.
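
For rough planning, the figure above can be turned into a back-of-envelope estimate. This sketch rests on assumptions that are not from the project: workers scale linearly, every deal holds a full 32 GiB of data, and 10 minutes per deal applies to your hardware. Since the min/max ratios make real CAR files smaller than the target size, treat the result as a lower bound. The estimate_minutes name is hypothetical.

```shell
# Estimate generation time in minutes for a dataset of a given size (GiB)
# with a given number of workers, at ~10 minutes per 32 GiB deal.
estimate_minutes() (
  dataset_gib=$1
  workers=$2
  deals=$(( (dataset_gib + 31) / 32 ))               # round up to whole deals
  rounds=$(( (deals + workers - 1) / workers ))      # deals per worker, rounded up
  echo $((rounds * 10))
)

estimate_minutes 102400 8   # 100 TiB with 8 workers
```

For 100 TiB and 8 workers this prints 4000, a little under three days.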

Other factors

  1. When dealing with lots of small files, CPU usage increases while generation speed decreases. Meanwhile, IO may become the bottleneck if not using SSD.
  2. When using a public S3 bucket as the dataset, Internet speed may become the bottleneck

Backup

The repo ~/.singularity, or the folder specified by SINGULARITY_PATH, contains all state of the service. To back up, simply back up the repo folder.
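
One way to do this is to archive the folder with tar. The helper below is a sketch: backup_repo and the archive naming are hypothetical, and stopping the daemon before copying is an assumption made here so that the database files are not changing during the copy.

```shell
# Archive the Singularity repo into a timestamped tarball and print its path.
# Usage: backup_repo [repo_path] [destination_dir]
backup_repo() (
  repo=${1:-${SINGULARITY_PATH:-$HOME/.singularity}}
  dest_dir=${2:-.}
  archive="$dest_dir/singularity-backup-$(date +%Y%m%d-%H%M%S).tar.gz"
  # -C keeps only the repo folder name inside the archive, not its full path.
  tar -czf "$archive" -C "$(dirname "$repo")" "$(basename "$repo")" || exit 1
  echo "$archive"
)

# Example (with the daemon stopped):
#   backup_repo ~/.singularity /mnt/backups
```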

Usage Collection

Starting with version 2.0.0, anonymous data, including error messages and data preparation and deal making statistics, is collected to help us understand how the software is used and improve it. To disable this behavior, add metrics.enabled to default.toml and set it to false.

Working with Docker [Experimental]

docker pull techgreedy/singularity
docker tag techgreedy/singularity singularity

# Initialize the repo config [optional]
docker run \
  -v ~/.singularity:/root/.singularity \
  singularity init

# Start daemon service in background
# Use ~/.singularity as the repo for config, database and logs
# Use /mnt/storage as the storage
docker run -d \
  -v ~/.singularity:/root/.singularity \
  -v /mnt/storage:/app/storage \
  -p 7001:7001 \
  singularity daemon

# Stop daemon service
docker ps | grep singularity | cut -d' ' -f1 | xargs docker kill

# Interact with the daemon with native singularity CLI
singularity prep create --force testData /app/storage/dataset /app/storage/output

# Interact with the daemon with dockerized singularity CLI
docker run -it --rm --network=host \
  singularity prep create --force testData /app/storage/dataset /app/storage/output

# Interact with the daemon with HTTP API directly
curl http://localhost:7001/preparations

FAQ and common issues

How to handle inaccessible files

Use --skip-inaccessible-files when creating the data preparation request with singularity prep create.

For existing generation requests, use singularity prep retry gen --skip-inaccessible-files. This currently only works when tmpDir is used.

Does it work on Windows

This software is not extensively tested on Windows.

Error - too many open files

If one CAR contains more files than the OS allows to be open at once, increase the open file limit with ulimit, or with LimitNOFILE if using systemd.
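
A quick check of the current limit might look like the sketch below. The limit_ok helper is hypothetical, and 65536 is only an example value, not a recommendation from the project.

```shell
# Print the current soft limit on open files for this shell.
soft=$(ulimit -n)
echo "open file limit: $soft"

# Succeed if the limit is "unlimited" or at least the wanted value.
limit_ok() (
  limit=$1
  wanted=$2
  [ "$limit" = "unlimited" ] || [ "$limit" -ge "$wanted" ]
)

if ! limit_ok "$soft" 65536; then
  echo "consider raising the limit before starting the daemon, e.g. ulimit -n 65536"
fi
```

Raising the soft limit with ulimit -n only affects the current shell and its children, and cannot exceed the hard limit.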

Error: Reached heap limit Allocation failed - JavaScript heap out of memory

Depending on the version, Node.js by default has a max heap size of about 2 GB. To increase this limit, for example to 4 GB, set the environment variable NODE_OPTIONS="--max-old-space-size=4096".
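
The option takes a value in MiB, so a small sketch that derives it from a size in GiB may be convenient. The heap_gib variable is illustrative.

```shell
# Convert a heap size in GiB to the MiB value expected by --max-old-space-size.
heap_gib=4
export NODE_OPTIONS="--max-old-space-size=$((heap_gib * 1024))"
echo "$NODE_OPTIONS"

# Then start the daemon from the same shell, e.g.:
#   singularity daemon
```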

Error - open /some/file: remote I/O error

If you are using a network mount such as NFS or Goofys, a temporary network issue may cause the CAR file generation to fail. If the error rate is less than 10%, you may assume the errors are transient and can be fixed by performing a retry. If the error is consistent, you will need to dig into the root cause of what has gone wrong. It could be incorrectly configured permissions or a DNS resolver, etc. You can find more details in /var/log/syslog.

Installation failed when using root

Avoid using root, or try the fix below

chown -R $(whoami) ~/
npm config set unsafe-perm true
npm config set user 0

Error: Instance Exited before being ready and without throwing an error

Something went wrong while starting MongoDB. Run the daemon with debug output to see what failed:

MONGOMS_DEBUG=1 singularity daemon

If the error shows that libcrypto.so.1.1 cannot be found, try this solution.

Submit Feedback

Create a bug report or request a feature.