1. Purpose

This pipeline takes a set of query genomes and searches a collection of reference genomes for those that fall within a given distance threshold of each query.

2. Input

2.1. Query profiles

The input will consist of cg/wgMLST allele profiles for each query, together with the reference allele profiles that define the scope of that query. This will be passed via the --input parameter and will look like the following:

querysheet.csv:

identifier	query_allele_profiles	reference_allele_profiles
query1	/path/to/query_allele_profiles	/path/to/reference_allele_profiles
query2	/path/to/query_allele_profiles	/path/to/reference_allele_profiles
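As an illustration of how the querysheet could be consumed, the sketch below parses it and checks that the three columns shown above are present. The function name `read_querysheet` and the `delimiter` argument are hypothetical. The delimiter is left configurable because the file is named `.csv` while the example above is laid out in columns.

```python
import csv
import io

# Column names taken from the querysheet example above.
REQUIRED_COLUMNS = (
    "identifier",
    "query_allele_profiles",
    "reference_allele_profiles",
)


def read_querysheet(handle, delimiter=","):
    """Parse a querysheet into a list of row dicts (hypothetical helper)."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    fieldnames = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in fieldnames]
    if missing:
        raise ValueError(f"querysheet is missing columns: {missing}")
    return list(reader)


# Usage with an in-memory sheet instead of a real file.
sheet = io.StringIO(
    "identifier,query_allele_profiles,reference_allele_profiles\n"
    "query1,/path/to/query_allele_profiles,/path/to/reference_allele_profiles\n"
)
rows = read_querysheet(sheet)
```

Each returned row maps a query identifier to the two profile paths that the query step would then load.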

2.1.1. Allele profiles (CSV)

Allele profiles in CSV format will follow the example below. Both uncompressed and gzipped files will be supported.

id	loci1	loci2	...	lociN
SampleA	be76	af5d	ce78	d877a
ID10	af5d	be76	?	d877a
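A minimal sketch of loading such a file, assuming that gzipped files can be recognised by a `.gz` suffix and that `?` marks a missing allele, as the ID10 row suggests. The helper name `read_profiles` and the configurable `delimiter` are assumptions made for illustration.

```python
import csv
import gzip
from pathlib import Path

MISSING = "?"  # assumed missing-allele marker, as in the ID10 example row


def read_profiles(path, delimiter=","):
    """Return {sample_id: {locus: allele or None}} from a plain or gzipped file."""
    path = Path(path)
    # Assumption: gzipped inputs are identified by their file suffix.
    opener = gzip.open if path.suffix == ".gz" else open
    profiles = {}
    with opener(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader)
        loci = header[1:]  # first column holds the sample id
        for row in reader:
            if not row:
                continue  # skip blank lines
            alleles = [None if a == MISSING else a for a in row[1:]]
            profiles[row[0]] = dict(zip(loci, alleles))
    return profiles
```

Storing missing alleles as `None` keeps them distinct from any real allele hash, so later comparisons can skip them explicitly.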

3. Steps

3.1. Perform query

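For each query listed in the input, the pipeline will search the corresponding reference profiles for genomes within a specified distance threshold of the query. Distances will be computed with profile_dists (https://github.com/phac-nml/profile_dists).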
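To make the idea of a threshold query concrete, the sketch below uses a simple Hamming distance over allele profiles. It is a simplified illustration and not the profile_dists implementation. Ignoring loci with a missing allele, treating the threshold as inclusive, and the names `hamming_distance` and `query_within_threshold` are all assumptions.

```python
def hamming_distance(a, b):
    """Count loci where both profiles have a called allele and the alleles differ."""
    return sum(
        1
        for locus, allele in a.items()
        if allele is not None
        and b.get(locus) is not None
        and b[locus] != allele
    )


def query_within_threshold(query, references, threshold):
    """Return (reference_id, distance) pairs with distance <= threshold, closest first."""
    hits = []
    for ref_id, ref in references.items():
        distance = hamming_distance(query, ref)
        if distance <= threshold:  # assumption: the threshold is inclusive
            hits.append((ref_id, distance))
    return sorted(hits, key=lambda hit: (hit[1], hit[0]))


# The two example rows from section 2.1.1, with None for the "?" allele.
# The locus names loci1..loci4 are placeholders.
sample_a = {"loci1": "be76", "loci2": "af5d", "loci3": "ce78", "loci4": "d877a"}
id10 = {"loci1": "af5d", "loci2": "be76", "loci3": None, "loci4": "d877a"}

hits = query_within_threshold(sample_a, {"ID10": id10}, threshold=2)
```

Here SampleA and ID10 differ at two called loci, so ID10 is reported at a threshold of 2 but not at a threshold of 1.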

4. Output

The results will be described in an output.json file with the following top-level structure:

{
    "files": { ... },
    "sample_metadata": { ... },
    "execution_metadata": { ... }
}
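As a rough sketch, a consumer of output.json could check that the three top-level sections are present before reading further. The contents of each section are elided above, so they are left empty here. The helper names `build_output` and `validate_output` are hypothetical.

```python
import json

# Top-level keys taken from the structure shown above.
TOP_LEVEL_KEYS = ("files", "sample_metadata", "execution_metadata")


def build_output(files=None, sample_metadata=None, execution_metadata=None):
    """Assemble the top-level output structure; section contents are left to the caller."""
    return {
        "files": files or {},
        "sample_metadata": sample_metadata or {},
        "execution_metadata": execution_metadata or {},
    }


def validate_output(text):
    """Parse output.json text and confirm that every top-level section is an object."""
    data = json.loads(text)
    for key in TOP_LEVEL_KEYS:
        if not isinstance(data.get(key), dict):
            raise ValueError(f"output.json is missing object section: {key}")
    return data


output_text = json.dumps(build_output(), indent=4)
parsed = validate_output(output_text)
```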

Add cg/wgMLST query pipeline #2