phac-nml / nf-pipelines

Add cg/wgMLST query pipeline #2

Open apetkau opened 1 year ago

apetkau commented 1 year ago

1. Purpose

The purpose of this pipeline is to take one or more query genomes and find the genomes in a reference collection that fall within a given distance threshold of each query.

2. Input

2.1. Query profiles

The input will consist of cg/wgMLST allele profiles for each query, together with the reference allele profiles that define the selection (scope) searched by that query. These are listed in a querysheet passed via the --input parameter, which will look like the following:

querysheet.csv:

| identifier | query_allele_profiles | reference_allele_profiles |
|---|---|---|
| query1 | /path/to/query_allele_profiles | /path/to/reference_allele_profiles |
| query2 | /path/to/query_allele_profiles | /path/to/reference_allele_profiles |
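As a rough illustration of how such a querysheet might be consumed, here is a minimal Python sketch. It assumes the file is comma-separated with a header row (inferred from the .csv extension; the proposal does not state the delimiter). `QueryRow` and `read_querysheet` are hypothetical names and are not part of the pipeline.

```python
import csv
import io
from dataclasses import dataclass

# Column names taken from the querysheet example above.
EXPECTED_COLUMNS = ["identifier", "query_allele_profiles", "reference_allele_profiles"]


@dataclass
class QueryRow:
    identifier: str
    query_allele_profiles: str
    reference_allele_profiles: str


def read_querysheet(handle):
    """Parse a querysheet and return one QueryRow per listed query.

    Assumes comma-separated values with a header row (an assumption,
    not something the proposal specifies).
    """
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    return [QueryRow(**record) for record in reader]


# In-memory stand-in for querysheet.csv, using the placeholder paths from the example.
example = io.StringIO(
    "identifier,query_allele_profiles,reference_allele_profiles\n"
    "query1,/path/to/query_allele_profiles,/path/to/reference_allele_profiles\n"
    "query2,/path/to/query_allele_profiles,/path/to/reference_allele_profiles\n"
)
rows = read_querysheet(example)
print([row.identifier for row in rows])
```

Checking the header up front means a malformed sheet fails before any distance computation starts.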

2.1.1. Allele profiles (CSV)

Allele profiles in CSV format will follow the example below. Both uncompressed and gzipped files will be supported.

| id | loci1 | loci2 | ... | lociN |
|---|---|---|---|---|
| SampleA | be76 | af5d | ce78 | d877a |
| ID10 | af5d | be76 | ? | d877a |
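A sketch of a reader for this format, handling both plain and gzipped files, could look like the following. It is illustrative only. The comma delimiter, the decision to store `?` as `None`, the function name `read_allele_profiles`, and the three concrete locus columns in the demo are all assumptions made for the example.

```python
import csv
import gzip
import os
import tempfile

MISSING = "?"  # missing-allele marker as it appears in the example above


def read_allele_profiles(path, delimiter=","):
    """Return {sample_id: {locus: allele or None}} from an allele profile file.

    Paths ending in .gz are read through gzip; anything else as plain text.
    The default delimiter is an assumption.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader)
        loci = header[1:]  # the first column holds the sample id
        profiles = {}
        for row in reader:
            if not row:
                continue  # skip blank lines
            if len(row) != len(header):
                raise ValueError(
                    f"row for {row[0]!r} has {len(row)} fields, expected {len(header)}"
                )
            alleles = [None if value == MISSING else value for value in row[1:]]
            profiles[row[0]] = dict(zip(loci, alleles))
    return profiles


# Demo: write a small gzipped profile file, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "profiles.csv.gz")
    with gzip.open(path, "wt", newline="") as out:
        out.write("id,loci1,loci2,loci3\nSampleA,be76,af5d,ce78\nID10,af5d,be76,?\n")
    profiles = read_allele_profiles(path)

print(profiles["ID10"])
```

Converting the missing marker to `None` at read time keeps the later comparison logic from having to know which string denotes a missing call.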

3. Steps

3.1. Perform query

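For each query listed in the querysheet, the pipeline will search the corresponding reference allele profiles for genomes within a particular distance threshold of the query. Distances will be calculated using https://github.com/phac-nml/profile_dists.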
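To make the query step concrete, the sketch below computes a simple Hamming distance between allele profiles and keeps the references at or below a threshold. This is a plain-Python stand-in and does not call profile_dists or reproduce its interface. The choice of Hamming distance, the skipping of loci with missing calls, the inclusive threshold, and the reference `RefC` are all assumptions made for illustration.

```python
def hamming_distance(profile_a, profile_b):
    """Count loci where both profiles have a call and the calls differ.

    Loci with a missing allele (None) in either profile are skipped;
    this handling of missing data is an assumption.
    """
    distance = 0
    for locus, allele_a in profile_a.items():
        allele_b = profile_b.get(locus)
        if allele_a is None or allele_b is None:
            continue
        if allele_a != allele_b:
            distance += 1
    return distance


def query_within_threshold(query, references, threshold):
    """Return (reference_id, distance) pairs with distance <= threshold, closest first."""
    hits = []
    for ref_id, ref_profile in references.items():
        distance = hamming_distance(query, ref_profile)
        if distance <= threshold:
            hits.append((ref_id, distance))
    return sorted(hits, key=lambda hit: (hit[1], hit[0]))


query = {"loci1": "be76", "loci2": "af5d", "loci3": "ce78"}
references = {
    "SampleA": {"loci1": "be76", "loci2": "af5d", "loci3": "ce78"},
    "ID10": {"loci1": "af5d", "loci2": "be76", "loci3": None},
    "RefC": {"loci1": "be76", "loci2": "0000", "loci3": "ce78"},  # hypothetical reference
}
hits = query_within_threshold(query, references, threshold=1)
print(hits)
```

With a threshold of 1, `SampleA` (distance 0) and `RefC` (distance 1) are returned, while `ID10` (distance 2 over the two comparable loci) is excluded.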

4. Output

The pipeline's outputs will be described in an output.json file with the following top-level structure:

{
    "files": { ... },
    "sample_metadata": { ... },
    "execution_metadata": { ... }
}
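Since the contents of each section are not spelled out here, the following sketch only checks the top-level shape: that an output.json document has the three sections named above and that each one is a JSON object. The function name `check_output_structure` is hypothetical, and the empty sections in the demo are placeholders, not the real contents.

```python
import json

# Top-level sections named in the structure above.
REQUIRED_SECTIONS = ("files", "sample_metadata", "execution_metadata")


def check_output_structure(text):
    """Parse output.json text and return a list of top-level structure problems."""
    document = json.loads(text)
    if not isinstance(document, dict):
        raise ValueError("top level must be a JSON object")
    problems = []
    for name in REQUIRED_SECTIONS:
        if name not in document:
            problems.append(f"missing section: {name}")
        elif not isinstance(document[name], dict):
            problems.append(f"section is not an object: {name}")
    return problems


# Placeholder document with empty sections; the real contents are not specified here.
skeleton = json.dumps({"files": {}, "sample_metadata": {}, "execution_metadata": {}})
ok_problems = check_output_structure(skeleton)
bad_problems = check_output_structure('{"files": {}}')
print(ok_problems)
print(bad_problems)
```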
apetkau commented 1 year ago

Example implementation is at https://github.com/apetkau/nf-core-queryprofiles

If you have Nextflow and Docker installed, this can be run directly from GitHub with:

nextflow run https://github.com/apetkau/nf-core-queryprofiles -profile docker,test -r dev --outdir results