phac-nml / nf-pipelines

Creative Commons Attribution 4.0 International

Add nomenclature creation pipeline #1

Open apetkau opened 10 months ago

apetkau commented 10 months ago

1. Purpose

The nomenclature creation pipeline will generate a nomenclature from a collection of (cg/wg)MLST allelic profiles. It will make use of https://github.com/phac-nml/genomic_address_service.

2. Input

2.1. Allelic profiles

The main input will be a collection of allelic profiles, passed to the --input parameter as a CSV file. This CSV file will be structured as follows:

profilesheet.csv:

id                   profiles_format  allele_profiles
profile_identifier1  csv              /path/to/listeria.allele_profiles
profile_identifier2  parquet          /path/to/salmonella.allele_profiles

The following will be valid fields for the input.

2.1.1. Allele profiles (CSV)

The following example shows the format that will be used for allele profiles in CSV format (both uncompressed and gzipped files will be supported).

id loci1 loci2 ... lociN
SampleA be76 af5d ce78 d877a
ID10 af5d be76 ? d877a

Missing data will be represented as: ?, 0, -, or a space.
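As a sketch of how this input could be parsed, the following (hypothetical, not part of the pipeline) Python function reads an allele profile CSV and normalizes all of the missing-data codes above to a single value; the column name "id" matches the examples in this issue:

```python
# Hypothetical sketch: read an allele profile CSV, mapping any of the
# missing-data codes (?, 0, -, space) to None. Function and file names
# are illustrative only.
import csv

MISSING_CODES = {"?", "0", "-", "", " "}

def read_profiles(path):
    """Return {sample_id: {locus: allele-or-None}} from a profile CSV."""
    profiles = {}
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        for row in reader:
            sample_id = row.pop("id")
            profiles[sample_id] = {
                locus: (None if allele.strip() in MISSING_CODES else allele)
                for locus, allele in row.items()
            }
    return profiles
```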

Other format structures are described in

3. Steps

3.1. Deduplication

If the allele profiles referenced by the profilesheet.csv are identified by samples, then deduplication may be required to collapse samples with identical profiles.

3.1.1. Input

The input is an allele profile file like the following.

id loci1 loci2 ... lociN
SampleA be76 af5d ce78 d877a
SampleB be76 af5d ce78 d877a

3.1.2. Output

The output consists of a deduplicated profiles file and a mapping back to the original profiles.

profiles.deduplicated.csv:

id     loci1 loci2 ... lociN
123abc be76  af5d  ce78 d877a

profiles.samples.json:

{
    "123abc": ["SampleA", "SampleB"]
}
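The deduplication step could be sketched as follows. This is an illustrative assumption, not the pipeline's actual implementation: profile identifiers are derived here from a truncated hash of the sorted locus/allele pairs, and the mapping mirrors profiles.samples.json above:

```python
# Hypothetical deduplication sketch: samples with identical allele profiles
# are collapsed under a single hash-derived identifier, with a mapping back
# to the original sample names. The hashing scheme is an assumption.
import hashlib
import json

def deduplicate(profiles):
    """Collapse identical profiles; return (dedup_profiles, id_to_samples)."""
    dedup = {}
    mapping = {}
    for sample_id, alleles in profiles.items():
        # Stable key from the sorted (locus, allele) pairs.
        key_src = ";".join(f"{k}={v}" for k, v in sorted(alleles.items()))
        profile_id = hashlib.sha256(key_src.encode()).hexdigest()[:6]
        dedup[profile_id] = alleles
        mapping.setdefault(profile_id, []).append(sample_id)
    return dedup, mapping

# Example: SampleA and SampleB share a profile and collapse to one entry.
profiles = {
    "SampleA": {"loci1": "be76", "loci2": "af5d"},
    "SampleB": {"loci1": "be76", "loci2": "af5d"},
}
dedup, mapping = deduplicate(profiles)
print(json.dumps(mapping, indent=4))
```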

3.2. Construction of distance matrix

For every entry in the profilesheet.csv, a separate distance matrix will be constructed. This will make use of https://github.com/phac-nml/profile_dists.
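Conceptually, profile_dists computes pairwise distances between allele profiles; a minimal stand-in sketch (not the actual profile_dists implementation, and with the assumption that missing alleles are skipped rather than counted) could look like:

```python
# Minimal sketch of a pairwise Hamming-style distance over allele profiles,
# standing in for profile_dists. Skipping missing alleles (None) is one
# common convention and an assumption here.
def hamming(a, b):
    """Count loci where both calls are present and differ."""
    return sum(
        1
        for locus in a
        if a[locus] is not None and b.get(locus) is not None and a[locus] != b[locus]
    )

def distance_matrix(profiles):
    """Return {(query_id, ref_id): distance} over all pairs of profiles."""
    ids = sorted(profiles)
    return {(q, r): hamming(profiles[q], profiles[r]) for q in ids for r in ids}

profiles = {
    "123abc": {"loci1": "be76", "loci2": "af5d", "loci3": "ce78"},
    "456def": {"loci1": "af5d", "loci2": "af5d", "loci3": None},
}
dm = distance_matrix(profiles)
print(dm[("123abc", "456def")])  # loci1 differs; loci3 is missing and skipped
```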

3.3. Creation of nomenclature data

For the created distance matrices, a collection of nomenclature data will be created using https://github.com/phac-nml/genomic_address_service.
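To illustrate the idea of a hierarchical address (e.g., "1.2.3"), the following hedged sketch assigns cluster numbers at a series of distance thresholds via single-linkage merging and joins them into a dotted address. This is only a conceptual model of what genomic_address_service produces; the thresholds and join logic are assumptions:

```python
# Hedged sketch of hierarchical address assignment: at each threshold,
# samples within that distance are merged (single linkage), and the
# cluster numbers across levels form a dotted address like "1.2.3".
def clusters_at(ids, dist, threshold):
    """Single-linkage clusters: union samples with distance <= threshold."""
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for q in ids:
        for r in ids:
            if dist[(q, r)] <= threshold:
                parent[find(q)] = find(r)

    labels, assignment = {}, {}
    for i in ids:
        root = find(i)
        labels.setdefault(root, len(labels) + 1)
        assignment[i] = labels[root]
    return assignment

def addresses(ids, dist, thresholds):
    """Join per-threshold cluster numbers into dotted addresses."""
    levels = [clusters_at(ids, dist, t) for t in thresholds]
    return {i: ".".join(str(level[i]) for level in levels) for i in ids}

# A and B are identical; C is 10 alleles away from both.
ids = ["A", "B", "C"]
dist = {(q, r): 0 for q in ids for r in ids}
dist.update({("A", "C"): 10, ("C", "A"): 10, ("B", "C"): 10, ("C", "B"): 10})
print(addresses(ids, dist, thresholds=[10, 5, 0]))
```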

3.4. Creation of output metadata (output.json)

This step creates the output.json file from the nomenclature file.

4. Output

The following output will be provided. This will be communicated via an output.json file with the following overall structure:

{
    "files": { ... },
    "metadata": { ... }
}

4.1. Output files

The output.json data for files (the "files" section defined above) will look like:

{
    "profile_identifier1": {
        "distances": "identifier.distances.text",
        "thresholds": "identifier.thresholds.json",
        "clusters": "identifier.clusters.text",
        "tree": "identifier.tree.newick",
        "run": "identifier.run.json"
    },
    "profile_identifier2": { ... }
}

Where "profile_identifier1" is derived from the identifiers in the `profilesheet.csv`.

The output files (produced by https://github.com/phac-nml/genomic_address_service) consist of:

  1. ${identifier}.distances.{text|parquet} - Three column file of [query_id, ref_id, distance]
  2. ${identifier}.thresholds.json - JSON formatted mapping of columns to distance thresholds
  3. ${identifier}.clusters.{text|parquet} - Either symmetric distance matrix or three column file of [query_id, ref_id, distance]
  4. ${identifier}.tree.newick - Newick formatted dendrogram of the linkage matrix produced by SciPy
  5. ${identifier}.run.json - Contains logging information for the run including parameters, newick tree, and threshold mapping info

Here ${identifier} is derived from the input profilesheet.csv.
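Assembling the "files" section from a list of identifiers could be sketched as follows; the file-name template is taken from the list above, while the function name and the fmt parameter are illustrative assumptions:

```python
# Illustrative sketch of assembling the "files" section of output.json
# from per-identifier outputs; file-name template follows the list above.
import json

def files_section(identifiers, fmt="text"):
    """Build the {identifier: {output-kind: file-name}} mapping."""
    return {
        ident: {
            "distances": f"{ident}.distances.{fmt}",
            "thresholds": f"{ident}.thresholds.json",
            "clusters": f"{ident}.clusters.{fmt}",
            "tree": f"{ident}.tree.newick",
            "run": f"{ident}.run.json",
        }
        for ident in identifiers
    }

output = {"files": files_section(["profile_identifier1"]), "metadata": {}}
print(json.dumps(output, indent=4))
```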

4.2. Output metadata

The following metadata will be provided:

{
    "files": { ... },

    "metadata": {
        "samples": {
            "SampleA": {
                "listeria_cgmlst": {
                    "address": "1.2.3"
                }
            },
            "SampleB": {
                "salmonella_cgmlst": {
                    "address": "5.9.4"
                }
            }
        }
    }
}

The idea is that every sample's metadata will be stored under its "SampleX" key in data storage, which could then be accessed under listeria_cgmlst.address.

Question: Would it be better to store the deduplicated profile identifier mapped to a sample here, which will be a smaller set of data, and handle expanding to individual samples elsewhere?
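If addresses are assigned to deduplicated profile identifiers, expanding them back to individual samples via the profiles.samples.json mapping from Section 3.1 could be sketched as below; the function name and the scheme key are illustrative assumptions:

```python
# Hypothetical expansion of addresses assigned to deduplicated profile
# identifiers back to per-sample metadata, using the profiles.samples.json
# mapping from the deduplication step. The scheme name is illustrative.
def expand_to_samples(profile_addresses, profile_to_samples, scheme):
    """Map {profile_id: address} onto {sample: {scheme: {"address": ...}}}."""
    samples = {}
    for profile_id, address in profile_addresses.items():
        for sample in profile_to_samples.get(profile_id, []):
            samples[sample] = {scheme: {"address": address}}
    return samples

expanded = expand_to_samples(
    {"123abc": "1.2.3"},
    {"123abc": ["SampleA", "SampleB"]},
    scheme="listeria_cgmlst",
)
print(expanded["SampleB"])  # both samples inherit the profile's address
```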

5. Integration of data with IRIDA Next

In order for IRIDA Next to load results, it will look for the output.json file as described in Section 4.

5.1. Storing files

Anything under the files section will be stored in IRIDA Next and associated with the analysis pipeline execution. These files will be accessible by their keys in the files section; for example, the clusters key will give the file identifier.clusters.text.

5.2. Storing sample metadata

Sample metadata will be loaded and associated with samples. For every sample identified in the metadata.samples section, the associated metadata will be stored.

{
    "SampleA": {
        "listeria_cgmlst": {
            "address": "1.2.3"
        }
    }
}

In IRIDA Next, there will be a parallel table that stores pipeline execution metadata for each field. For example:

{
    "SampleA": {
        "listeria_cgmlst": {
            "source": "analysis",
            "source_id": "1234"
        }
    }
}
apetkau commented 10 months ago

Initial implementation of pipeline here https://github.com/apetkau/nf-core-genomicnomenclature

You can run the pipeline tests with (assuming you have Nextflow and Docker installed):

nextflow run apetkau/nf-core-genomicnomenclature -profile docker,test -r dev -latest --outdir results