kalininalab / metaprofi

MetaProFi is a Bloom filter-based tool for storing and querying sequence data for accurate identification of functionally relevant genetic variants
GNU General Public License v2.0

MetaProFi: An ultrafast chunked Bloom filter for storing and querying protein and nucleotide sequence data for accurate identification of functionally relevant genetic variants

MetaProFi Workflow

Installation

Install MetaProFi as a command line tool using pip

Usage

Available subcommands

| Command | Summary |
| --- | --- |
| build | Builds the Bloom filter matrix and creates the index store |
| build-seq | Builds the Bloom filter matrix (every sequence in the input file (FASTA/FASTQ (.GZ)) is treated as a sample) and creates the index store |
| update | Builds the Bloom filter matrix for the new samples and appends/updates the index with the new data |
| update-seq | Builds the Bloom filter matrix for the new samples (every sequence in the input file (FASTA/FASTQ (.GZ)) is treated as a sample) and appends/updates the index with the new data |
| build_matrix | Builds the Bloom filter matrix only |
| build_index | Builds the index |
| update_index | Updates the index with the new data |
| search_index | Searches/queries the index |
| summary | Extracts a summary of the data from the index store |

General information

  1. Use hardware with the same endianness for the build, update, and query sessions. MetaProFi does not support mixing endianness across these sessions.
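One way to guard against mixing endianness is to record the byte order of the build machine and compare it before updating or querying. The sketch below relies on Python's `sys.byteorder`; the marker file name `byteorder.txt` and both helper functions are hypothetical and are not part of MetaProFi.

```python
import sys
import tempfile
from pathlib import Path

# Hypothetical marker file; MetaProFi itself does not create or read it.
MARKER_NAME = "byteorder.txt"


def record_byteorder(output_directory: str) -> str:
    """Write this machine's byte order ('little' or 'big') into the output directory."""
    marker = Path(output_directory) / MARKER_NAME
    marker.write_text(sys.byteorder)
    return sys.byteorder


def check_byteorder(output_directory: str) -> bool:
    """Return True if this machine matches the byte order recorded at build time."""
    marker = Path(output_directory) / MARKER_NAME
    return marker.read_text().strip() == sys.byteorder


# Demonstration in a throwaway directory standing in for output_directory.
with tempfile.TemporaryDirectory() as demo_dir:
    record_byteorder(demo_dir)
    print("same endianness:", check_byteorder(demo_dir))
```

Calling `record_byteorder` once after a build and `check_byteorder` before each update or query would catch a mismatch early, before any data is read.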

config.yml:

MetaProFi requires a config.yml file which can be downloaded from here
Note: Use the same config for updates as was used for the build (max_memory and nproc can be increased)
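The rule above can be checked mechanically before an update. The sketch below compares two configurations held as plain dictionaries and reports any key that changed outside an allow-list. The key names are the ones mentioned in this README; the flat dictionary layout, the example values, and the `changed_keys` helper are assumptions for illustration. `matrix_store_name` is on the allow-list because the update notes ask for it to be changed.

```python
# Keys this README says may differ between the build and update configs.
ALLOWED_TO_CHANGE = {"max_memory", "nproc", "matrix_store_name"}


def changed_keys(build_cfg: dict, update_cfg: dict) -> set:
    """Return keys whose values differ and that are not allowed to change."""
    all_keys = set(build_cfg) | set(update_cfg)
    differing = {k for k in all_keys if build_cfg.get(k) != update_cfg.get(k)}
    return differing - ALLOWED_TO_CHANGE


# Illustrative values only; the real config.yml may be laid out differently.
build_cfg = {
    "output_directory": "/path/to/output",
    "matrix_store_name": "matrix_v1",
    "index_store_name": "index",
    "sequence_type": "aminoacid",
    "max_memory": "8GiB",
    "nproc": 4,
}
update_cfg = dict(build_cfg, matrix_store_name="matrix_v2", nproc=16)

# An empty set means no disallowed key was changed.
print(changed_keys(build_cfg, update_cfg))
```

An empty result means the update config differs from the build config only in the permitted keys.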

Subcommands and their parameters

build

Builds the Bloom filter matrix and creates the index

| Positional | Summary |
| --- | --- |
| input_file | Input file in which each line specifies a sample identifier and one or more compressed or uncompressed FASTA or FASTQ file paths (refer here for format) |
| config_file | Path to the configuration file |
  Example:
  metaprofi build /path/to/input_file.txt /path/to/config.yml
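To make the terms concrete, the following is a minimal sketch of a Bloom filter matrix: one Bloom filter per sample, stored so that a k-mer query reads a few rows and ANDs them to find candidate samples. This is the generic technique, not MetaProFi's implementation; the filter size, the number of hash functions, the k-mer length, and the SHA-256-based hashing are all assumptions chosen for readability.

```python
import hashlib

M = 1024   # bits per Bloom filter (assumed)
H = 3      # hash functions per k-mer (assumed)
K = 5      # k-mer length (assumed)


def kmers(seq: str, k: int = K):
    """Yield all overlapping k-mers of a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def positions(kmer: str):
    """Map a k-mer to H bit positions using salted SHA-256 digests."""
    return [
        int.from_bytes(hashlib.sha256(f"{salt}:{kmer}".encode()).digest()[:8], "big") % M
        for salt in range(H)
    ]


def build_matrix(samples: dict):
    """Return (names, rows); bit j of rows[p] is set if sample j has bit p set."""
    names = list(samples)
    rows = [0] * M
    for j, name in enumerate(names):
        for kmer in kmers(samples[name]):
            for p in positions(kmer):
                rows[p] |= 1 << j
    return names, rows


def samples_with_kmer(kmer: str, names, rows):
    """AND the H rows for a k-mer; surviving bits mark candidate samples."""
    hit = (1 << len(names)) - 1
    for p in positions(kmer):
        hit &= rows[p]
    return [n for j, n in enumerate(names) if (hit >> j) & 1]


names, rows = build_matrix({"s1": "MKTAYIAKQR", "s2": "GSHMLEDPVD"})
print(samples_with_kmer("MKTAY", names, rows))
```

A Bloom filter never misses a k-mer that was inserted, so `s1` is always reported here; it can, however, report a sample that does not contain the k-mer (a false positive), with a probability that depends on the filter size and the number of hash functions.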

build-seq

Builds a sequence-level index (for example, as an alternative to BLAST). Every sequence in the input file (FASTA/FASTQ (.GZ)) is treated as a sample, and one Bloom filter per sequence is created in the matrix

NOTE: The sample identifier is extracted from the header of each sequence

| Positional | Summary |
| --- | --- |
| input_file | Input file can be either a FASTA or a FASTQ (.GZ) file with one or more sequences |
| config_file | Path to the configuration file |
  Example:
  metaprofi build-seq /path/to/input_file.gz /path/to/config.yml
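Because each sequence header becomes a sample identifier, it can help to preview the identifiers and spot duplicates before building. The sketch below reads a FASTA file (optionally gzip-compressed) and takes the first whitespace-delimited token of each header. That tokenization rule is an assumption, since this README does not state how MetaProFi parses headers, and `fasta_ids` and `duplicate_ids` are hypothetical helpers.

```python
import gzip
from collections import Counter


def fasta_ids(path: str):
    """Yield the first token of every '>' header line in a FASTA(.gz) file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                header = line[1:].strip()
                # Assumption: the identifier is the text up to the first whitespace.
                yield header.split()[0] if header else ""


def duplicate_ids(path: str):
    """Return identifiers that occur more than once, in sorted order."""
    counts = Counter(fasta_ids(path))
    return sorted(i for i, n in counts.items() if n > 1)
```

For example, `duplicate_ids("/path/to/input_file.gz")` returns an empty list when every header yields a distinct identifier under the assumed rule.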

update

Builds the Bloom filter matrix for the new samples and then appends/updates the index with the new data

NOTES

  1. Looks for the index store directory (index_store_name set in the config file) in the output_directory path set in the config file
  2. Remember to change the value of matrix_store_name in the config file for the update

| Positional | Summary |
| --- | --- |
| input_file | Input file in which each line specifies a sample identifier and one or more compressed or uncompressed FASTA or FASTQ file paths (refer here for format) |
| config_file | Path to the configuration file |
  Example:
  metaprofi update /path/to/input_file.txt /path/to/config.yml

update-seq

Builds the Bloom filter matrix for the new samples (every sequence in the input file (FASTA/FASTQ (.GZ)) is treated as a sample) and then appends/updates the index with the new data.

NOTES

  1. The sample identifier is extracted from the header of each sequence
  2. Looks for the index store directory (index_store_name set in the config file) in the output_directory path set in the config file
  3. Remember to change the value of matrix_store_name in the config file for the update

| Positional | Summary |
| --- | --- |
| input_file | Input file can be either a FASTA or a FASTQ (.GZ) file with one or more sequences |
| config_file | Path to the configuration file |
  Example:
  metaprofi update-seq /path/to/input_file.gz /path/to/config.yml

build_matrix

Alternative to the build subcommand, for creating only the Bloom filter matrix without building the index immediately.

| Positional | Summary |
| --- | --- |
| input_file | Input file in which each line specifies a sample identifier and one or more compressed or uncompressed FASTA or FASTQ file paths (refer here for format) |
| config_file | Path to the configuration file |
  Example:
  metaprofi build_matrix /path/to/input_file.txt /path/to/config.yml

build_index

Alternative to the build subcommand. First build the Bloom filter matrix with the build_matrix subcommand, then use this subcommand to build the index.

NOTE: Looks for the matrix store directory (matrix_store_name set in the config file) in the output_directory path set in the config file

| Positional | Summary |
| --- | --- |
| config_file | Path to the configuration file |
  Example:
  metaprofi build_index /path/to/config.yml
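The two-step build can be scripted so that the index is built only if the matrix step succeeds. The sketch below assembles the two command lines shown in this README and runs them with `subprocess.run(..., check=True)`. The `two_step_build` wrapper and its `dry_run` flag are hypothetical conveniences, not part of MetaProFi.

```python
import subprocess


def two_step_build(input_file: str, config_file: str, dry_run: bool = True):
    """Run build_matrix, then build_index; return the commands that were (or would be) run."""
    commands = [
        ["metaprofi", "build_matrix", input_file, config_file],
        ["metaprofi", "build_index", config_file],
    ]
    if not dry_run:
        for cmd in commands:
            # check=True raises CalledProcessError on a non-zero exit code,
            # so build_index never runs after a failed build_matrix step.
            subprocess.run(cmd, check=True)
    return commands


# Dry run: only print the commands instead of executing them.
for cmd in two_step_build("/path/to/input_file.txt", "/path/to/config.yml"):
    print(" ".join(cmd))
```

Passing `dry_run=False` executes both steps in order and requires `metaprofi` to be installed and on the `PATH`.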

update_index

Appends/inserts new data into the index. First use the build_matrix subcommand to build the Bloom filter matrix for the new data, then use this subcommand to append/update the index with that data.

NOTE: Looks for the matrix store directory (matrix_store_name set in the config file) and the index store directory (index_store_name set in the config file) in the output_directory path set in the config file

| Positional | Summary |
| --- | --- |
| config_file | Path to the configuration file |
  Example:
  metaprofi update_index /path/to/config.yml

search_index

Searches/queries one or more sequences against the index

| Positional | Summary |
| --- | --- |
| config_file | Path to the configuration file |

| Flags | Summary | Required | Default |
| --- | --- | --- | --- |
| -s | Provide an input sequence (nucleotides/aminoacids) to search against the index. NOTE: Use either -s or -f flag and not both | Yes (Check Note)* | None |
| -f | Provide an input FASTA/FASTQ (.GZ) file containing sequence(s) to search against the index. NOTE: Use either -s or -f flag and not both | Yes (Check Note)* | None |
| -i | Provide the type of the query sequence (e.g., nucleotide or aminoacid) | Yes | None |
| -t | Provide a threshold value to invoke approximate search (e.g., 50). NOTE: Number should be between 1 and 100 | No | 100 |

  Example1: Search for a sequence where at least 50% of k-mers are found (approximate search)
  metaprofi search_index /path/to/config.yml -s 'AGCCGGCCCGCCCGCCCGGGTCTGACC' -i nucleotide -t 50

  Example2: Search for a sequence where at least 75% of k-mers are found (approximate search)
  metaprofi search_index /path/to/config.yml -s 'HIMHLIHIRAFFLDYNIYCIHRFNQSHRA' -i aminoacid -t 75

  Example3: Search for all sequences in the FASTA file (exact search)
  metaprofi search_index /path/to/config.yml -f input_protein.fasta -i aminoacid -t 100

  Example4: Search for all sequences in the FASTQ file (exact search)
  metaprofi search_index /path/to/config.yml -f input_dna.fastq -i nucleotide -t 100

  Results: Written to a file named 'metaprofi_query_results_<datetime>_t<threshold>.txt' in the output directory path set in the config file
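The threshold semantics described above (a hit when at least `-t` percent of the query's k-mers are found) can be illustrated in a few lines. In this sketch a plain Python set stands in for one sample's Bloom filter and the k-mer length of 4 is an assumed value; neither is taken from MetaProFi, whose lookups go through the index.

```python
def kmers(seq: str, k: int):
    """Return the list of overlapping k-mers in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def matches(query: str, sample_kmers: set, k: int, threshold: int = 100) -> bool:
    """True if at least `threshold` percent of the query k-mers are in the sample."""
    if not 1 <= threshold <= 100:
        raise ValueError("threshold must be between 1 and 100")
    q = kmers(query, k)
    if not q:
        return False  # query is shorter than k, so there is nothing to look up
    found = sum(1 for kmer in q if kmer in sample_kmers)
    # Integer comparison avoids floating-point rounding at the boundary.
    return found * 100 >= threshold * len(q)


sample = set(kmers("AGCCGGCCCGCC", 4))
query = "AGCCGGTTTT"  # 7 k-mers, 3 of them present in the sample (about 43%)

print(matches(query, sample, 4, threshold=100))  # False
print(matches(query, sample, 4, threshold=40))   # True
```

With the default threshold of 100 every query k-mer must be present, which corresponds to the exact search in the examples; lower values tolerate missing k-mers.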

NOTE: The sequence_type set in the config file determines whether the search runs against the aminoacid or the nucleotide index, and whether six-frame translation must be performed on the input when a nucleotide query is used
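For readers unfamiliar with six-frame translation, the sketch below shows the general idea: translate the query in three reading frames on the forward strand and three on its reverse complement. It is an illustration under assumptions, not MetaProFi's code: it uses the standard genetic code, writes stop codons as `*`, and translates any codon containing a non-ACGT character as `X`.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (last base varies fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq: str) -> str:
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(dna: str):
    """Return translations for frames +1, +2, +3, then -1, -2, -3."""
    dna = dna.upper()
    reverse_complement = dna.translate(COMPLEMENT)[::-1]
    return [
        translate(strand[offset:])
        for strand in (dna, reverse_complement)
        for offset in range(3)
    ]


print(six_frames("ATGGCC"))  # ['MA', 'W', 'G', 'GH', 'A', 'P']
```

Each of the six protein strings could then be split into k-mers and looked up in an aminoacid index, which is why a nucleotide query can be searched against protein data.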

summary

Extracts a summary of the data from the index store

NOTE: Looks for the index store directory (index_store_name set in the config file) in the output_directory path set in the config file

| Positional | Summary |
| --- | --- |
| config_file | Path to the configuration file |
  Example:
  metaprofi summary /path/to/config.yml

Full example:

Create a MetaProFi index for all UniProt bacterial sequences

How to cite us

Sanjay K. Srikakulam, Sebastian Keller, Fawaz Dabbaghie, Robert Bals, Olga V. Kalinina. MetaProFi: A protein-based Bloom filter for storing and querying sequence data for accurate identification of functionally relevant genetic variants. Submitted.