phac-nml / nf-pipelines

Creative Commons Attribution 4.0 International

Add cg/wgMLST allele calling pipeline #3

Open apetkau opened 1 year ago

apetkau commented 1 year ago

1. Purpose

The cg/wgMLST allele calling pipeline will be used for calling alleles from genomic sequence data.

Note: This is an in-development description of this pipeline.

2. Input

2.1. Sequence data

The main input for this pipeline will be genomic sequence data, in the form of either reads or assemblies. It will be provided to Nextflow as a samplesheet via --input samplesheet.csv. The samplesheet will be structured as follows:

sample   assembly                   fastq_1                      fastq_2
SampleA  /path/to/SampleA.fasta.gz
SampleB                             /path/to/SampleB_1.fastq.gz  /path/to/SampleB_2.fastq.gz
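
As a rough illustration of how such a samplesheet might be read and checked, here is a minimal Python sketch. It is not part of the pipeline: the function name parse_samplesheet is hypothetical, and it assumes the file is comma-separated with empty cells for unused columns, that each sample supplies either an assembly or reads (not both), and that fastq_2 is optional.

```python
import csv
import io


def parse_samplesheet(handle):
    """Return a list of (sample, kind, paths); kind is 'assembly' or 'reads'.

    Assumption: each row provides exactly one kind of input.
    """
    records = []
    for row in csv.DictReader(handle):
        sample = (row.get("sample") or "").strip()
        assembly = (row.get("assembly") or "").strip()
        fastq_1 = (row.get("fastq_1") or "").strip()
        fastq_2 = (row.get("fastq_2") or "").strip()
        if not sample:
            raise ValueError("row is missing a sample name")
        if assembly and (fastq_1 or fastq_2):
            raise ValueError(f"{sample}: provide an assembly or reads, not both")
        if assembly:
            records.append((sample, "assembly", [assembly]))
        elif fastq_1:
            # Keep fastq_2 only when it is present (paired-end reads).
            records.append((sample, "reads", [p for p in (fastq_1, fastq_2) if p]))
        else:
            raise ValueError(f"{sample}: no assembly or reads given")
    return records


# The example rows from the table above, written as CSV (assumed layout).
EXAMPLE = """sample,assembly,fastq_1,fastq_2
SampleA,/path/to/SampleA.fasta.gz,,
SampleB,,/path/to/SampleB_1.fastq.gz,/path/to/SampleB_2.fastq.gz
"""

records = parse_samplesheet(io.StringIO(EXAMPLE))
```

Whether a sample may carry both an assembly and reads is not stated here, so the "not both" check is only one possible reading.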

2.2. MLST scheme

An MLST scheme will be provided, using the following parameters:

3. Steps

The pipeline will generate a (cg/wg)MLST profile for each sample from the input data.

4. Output

4.1. Tabular allele files

A table of all allele identifiers for every locus in the scheme will be provided.

sample locus1 locus2 ...
SampleA 5 10 ...
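
A table like this could be written from per-sample allele calls roughly as follows. This is only a sketch: the function name write_allele_table is hypothetical, and the tab delimiter and the blank cell for a missing call are assumptions, since the file format is not fixed in this description.

```python
import csv
import io


def write_allele_table(profiles, loci, handle):
    """Write one row per sample with one column per locus.

    Assumptions: tab-separated output; a locus with no call is left blank.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["sample", *loci])
    for sample in sorted(profiles):
        calls = profiles[sample]
        writer.writerow([sample, *(calls.get(locus, "") for locus in loci)])


profiles = {"SampleA": {"locus1": 5, "locus2": 10}}
buffer = io.StringIO()
write_allele_table(profiles, ["locus1", "locus2"], buffer)
table_text = buffer.getvalue()
```

Passing the locus list explicitly keeps the column order tied to the scheme rather than to whichever loci happen to be called in a given sample.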

4.2. JSON metadata

A JSON file output.json will be provided with all the allele calls structured in a way that they can be loaded by other systems (e.g., IRIDA Next). This will look like:

{
    "SampleA": {
        "listeria_cgmlst": {
            "locus1": 5,
            "locus2": 10
        }
    },
    "SampleB": {
        "listeria_cgmlst": {
            "locus1": 1,
            "locus2": 10
        }
    }
}
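
For illustration, the nested sample / scheme / locus structure could be produced with Python's standard json module along these lines. The function name build_metadata is hypothetical, and the scheme key "listeria_cgmlst" is simply taken from the example; how the scheme name is actually chosen is not specified here.

```python
import json


def build_metadata(profiles, scheme_name):
    """Nest each sample's allele calls under the scheme name."""
    return {sample: {scheme_name: dict(calls)} for sample, calls in profiles.items()}


profiles = {
    "SampleA": {"locus1": 5, "locus2": 10},
    "SampleB": {"locus1": 1, "locus2": 10},
}
metadata = build_metadata(profiles, "listeria_cgmlst")

# json.dumps emits strict JSON, so the text can be parsed back by any JSON reader.
output_text = json.dumps(metadata, indent=4)
reloaded = json.loads(output_text)
```

Serializing through a JSON library rather than by hand means the file round-trips cleanly when loaded by a downstream system.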

apetkau commented 1 year ago

Test implementation at https://github.com/apetkau/nf-core-mlstprofiler