phac-nml / nf-pipelines

Creative Commons Attribution 4.0 International

Add nomenclature creation pipeline #1

Open apetkau opened 10 months ago

apetkau commented 10 months ago

1. Purpose

The nomenclature creation pipeline will generate a nomenclature from a collection of (cg/wg)MLST allelic profiles. It will make use of https://github.com/phac-nml/genomic_address_service.

2. Input

2.1. Allelic profiles

The main input will be a collection of allelic profiles, passed to the --input parameter as a CSV file. This CSV file will be structured as follows:

profilesheet.csv:

id                   profiles_format  allele_profiles
profile_identifier1  csv              /path/to/listeria.allele_profiles
profile_identifier2  parquet          /path/to/salmonella.allele_profiles

The following will be valid fields for the input.

2.1.1. Allele profiles (CSV)

The following example shows the format that will be used for allele profiles in CSV format (both uncompressed and gzipped files will be supported).

id loci1 loci2 ... lociN
SampleA be76 af5d ce78 d877a
ID10 af5d be76 ? d877a

Missing data will be represented as: ?, 0, -, or a space.
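As a sketch of how this input could be parsed, the following (hypothetical, not part of the pipeline) Python function reads an allele profile CSV and normalizes all of the missing-data codes above to a single value; the column name "id" matches the examples in this issue:

```python
# Hypothetical sketch: read an allele profile CSV, mapping any of the
# missing-data codes (?, 0, -, space) to None. Function and file names
# are illustrative only.
import csv

MISSING_CODES = {"?", "0", "-", "", " "}

def read_profiles(path):
    """Return {sample_id: {locus: allele-or-None}} from a profile CSV."""
    profiles = {}
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        for row in reader:
            sample_id = row.pop("id")
            profiles[sample_id] = {
                locus: (None if allele.strip() in MISSING_CODES else allele)
                for locus, allele in row.items()
            }
    return profiles
```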

Other format structures are described in

3. Steps

3.1. Deduplication

If the allele profiles referenced by the profilesheet.csv are identified by samples, then deduplication may be required to collapse samples with identical profiles.

3.1.1. Input

The input is an allele profile file like the following.

id loci1 loci2 ... lociN
SampleA be76 af5d ce78 d877a
SampleB be76 af5d ce78 d877a

3.1.2. Output

The output consists of a deduplicated profiles file and a mapping back to the original profiles.

profiles.deduplicated.csv:

id     loci1 loci2 ... lociN
123abc be76  af5d  ce78 d877a

profiles.samples.json:

{
    "123abc": ["SampleA", "SampleB"]
}
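The deduplication step could be sketched as follows. This is an illustrative assumption, not the pipeline's actual implementation: profile identifiers are derived here from a truncated hash of the sorted locus/allele pairs, and the mapping mirrors profiles.samples.json above:

```python
# Hypothetical deduplication sketch: samples with identical allele profiles
# are collapsed under a single hash-derived identifier, with a mapping back
# to the original sample names. The hashing scheme is an assumption.
import hashlib
import json

def deduplicate(profiles):
    """Collapse identical profiles; return (dedup_profiles, id_to_samples)."""
    dedup = {}
    mapping = {}
    for sample_id, alleles in profiles.items():
        # Stable key from the sorted (locus, allele) pairs.
        key_src = ";".join(f"{k}={v}" for k, v in sorted(alleles.items()))
        profile_id = hashlib.sha256(key_src.encode()).hexdigest()[:6]
        dedup[profile_id] = alleles
        mapping.setdefault(profile_id, []).append(sample_id)
    return dedup, mapping

# Example: SampleA and SampleB share a profile and collapse to one entry.
profiles = {
    "SampleA": {"loci1": "be76", "loci2": "af5d"},
    "SampleB": {"loci1": "be76", "loci2": "af5d"},
}
dedup, mapping = deduplicate(profiles)
print(json.dumps(mapping, indent=4))
```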

3.2. Construction of distance matrix

For every entry in the profilesheet.csv, a separate distance matrix will be constructed. This will make use of https://github.com/phac-nml/profile_dists.
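Conceptually, profile_dists computes pairwise distances between allele profiles; a minimal stand-in sketch (not the actual profile_dists implementation, and with the assumption that missing alleles are skipped rather than counted) could look like:

```python
# Minimal sketch of a pairwise Hamming-style distance over allele profiles,
# standing in for profile_dists. Skipping missing alleles (None) is one
# common convention and an assumption here.
def hamming(a, b):
    """Count loci where both calls are present and differ."""
    return sum(
        1
        for locus in a
        if a[locus] is not None and b.get(locus) is not None and a[locus] != b[locus]
    )

def distance_matrix(profiles):
    """Return {(query_id, ref_id): distance} over all pairs of profiles."""
    ids = sorted(profiles)
    return {(q, r): hamming(profiles[q], profiles[r]) for q in ids for r in ids}

profiles = {
    "123abc": {"loci1": "be76", "loci2": "af5d", "loci3": "ce78"},
    "456def": {"loci1": "af5d", "loci2": "af5d", "loci3": None},
}
dm = distance_matrix(profiles)
print(dm[("123abc", "456def")])  # loci1 differs; loci3 is missing and skipped
```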

3.3. Creation of nomenclature data

For the created distance matrices, a collection of nomenclature data will be created using https://github.com/phac-nml/genomic_address_service.
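To illustrate the idea of a hierarchical address (e.g., "1.2.3"), the following hedged sketch assigns cluster numbers at a series of distance thresholds via single-linkage merging and joins them into a dotted address. This is only a conceptual model of what genomic_address_service produces; the thresholds and join logic are assumptions:

```python
# Hedged sketch of hierarchical address assignment: at each threshold,
# samples within that distance are merged (single linkage), and the
# cluster numbers across levels form a dotted address like "1.2.3".
def clusters_at(ids, dist, threshold):
    """Single-linkage clusters: union samples with distance <= threshold."""
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for q in ids:
        for r in ids:
            if dist[(q, r)] <= threshold:
                parent[find(q)] = find(r)

    labels, assignment = {}, {}
    for i in ids:
        root = find(i)
        labels.setdefault(root, len(labels) + 1)
        assignment[i] = labels[root]
    return assignment

def addresses(ids, dist, thresholds):
    """Join per-threshold cluster numbers into dotted addresses."""
    levels = [clusters_at(ids, dist, t) for t in thresholds]
    return {i: ".".join(str(level[i]) for level in levels) for i in ids}

# A and B are identical; C is 10 alleles away from both.
ids = ["A", "B", "C"]
dist = {(q, r): 0 for q in ids for r in ids}
dist.update({("A", "C"): 10, ("C", "A"): 10, ("B", "C"): 10, ("C", "B"): 10})
print(addresses(ids, dist, thresholds=[10, 5, 0]))
```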

3.4. Creation of output metadata (output.json)

This step creates the output.json file from the nomenclature file.

4. Output

The following output will be provided. This will be communicated via an output.json file with the following overall structure:

{
    "files": { ... },
    "metadata": { ... }
}

4.1. Output files

The output.json data for files (the "files" section defined above) will look like:

{
    "profile_identifier1": {
        "distances": "identifier.distances.text",
        "thresholds": "identifier.thresholds.json",
        "clusters": "identifier.clusters.text",
        "tree": "identifier.tree.newick",
        "run": "identifier.run.json"
    },
    "profile_identifier2": { ... }
}

Where "profile_identifier1" is derived from the identifiers in the `profilesheet.csv`.

The output files (produced by https://github.com/phac-nml/genomic_address_service) consist of:

  1. ${identifier}.distances.{text|parquet} - Three column file of [query_id, ref_id, distance]
  2. ${identifier}.thresholds.json - JSON formatted mapping of columns to distance thresholds
  3. ${identifier}.clusters.{text|parquet} - Either symmetric distance matrix or three column file of [query_id, ref_id, distance]
  4. ${identifier}.tree.newick - Newick formatted dendrogram of the linkage matrix produced by SciPy
  5. ${identifier}.run.json - Contains logging information for the run including parameters, newick tree, and threshold mapping info

Here ${identifier} is derived from the input profilesheet.csv.
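Assembling the "files" section from a list of identifiers could be sketched as follows; the file-name template is taken from the list above, while the function name and the fmt parameter are illustrative assumptions:

```python
# Illustrative sketch of assembling the "files" section of output.json
# from per-identifier outputs; file-name template follows the list above.
import json

def files_section(identifiers, fmt="text"):
    """Build the {identifier: {output-kind: file-name}} mapping."""
    return {
        ident: {
            "distances": f"{ident}.distances.{fmt}",
            "thresholds": f"{ident}.thresholds.json",
            "clusters": f"{ident}.clusters.{fmt}",
            "tree": f"{ident}.tree.newick",
            "run": f"{ident}.run.json",
        }
        for ident in identifiers
    }

output = {"files": files_section(["profile_identifier1"]), "metadata": {}}
print(json.dumps(output, indent=4))
```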

4.2. Output metadata

The following metadata will be provided:

{
    "files": { ... },

    "metadata": {
        "samples": {
            "SampleA": {
                "listeria_cgmlst": {
                    "address": "1.2.3"
                }
            },
            "SampleB": {
                "salmonella_cgmlst": {
                    "address": "5.9.4"
                }
            }
        }
    }
}

The idea is that every sample's metadata will be stored under its "SampleX" key in data storage, which could then be accessed under listeria_cgmlst.address.

Question: Would it be better to store the deduplicated profile identifier mapped to a sample here, which will be a smaller set of data, and handle expanding to individual samples elsewhere?
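If addresses are assigned to deduplicated profile identifiers, expanding them back to individual samples via the profiles.samples.json mapping from Section 3.1 could be sketched as below; the function name and the scheme key are illustrative assumptions:

```python
# Hypothetical expansion of addresses assigned to deduplicated profile
# identifiers back to per-sample metadata, using the profiles.samples.json
# mapping from the deduplication step. The scheme name is illustrative.
def expand_to_samples(profile_addresses, profile_to_samples, scheme):
    """Map {profile_id: address} onto {sample: {scheme: {"address": ...}}}."""
    samples = {}
    for profile_id, address in profile_addresses.items():
        for sample in profile_to_samples.get(profile_id, []):
            samples[sample] = {scheme: {"address": address}}
    return samples

expanded = expand_to_samples(
    {"123abc": "1.2.3"},
    {"123abc": ["SampleA", "SampleB"]},
    scheme="listeria_cgmlst",
)
print(expanded["SampleB"])  # both samples inherit the profile's address
```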

5. Integration of data with IRIDA Next

In order for IRIDA Next to load results, it will look for the output.json file as described in Section 4.

5.1. Storing files

Anything under the files section will be stored in IRIDA Next and associated with the analysis pipeline execution. These files will be accessible by their keys in the files section; for example, the clusters key will give the file identifier.clusters.text.

5.2. Storing sample metadata

Sample metadata will be loaded and associated with samples. For every sample identified in the metadata.samples section, the associated metadata will be stored.

{
    "SampleA": {
        "listeria_cgmlst": {
            "address": "1.2.3"
        }
    }
}

In IRIDA Next, there will be a parallel table that stores pipeline execution metadata for each field. For example:

{
    "SampleA": {
        "listeria_cgmlst": {
            "source": "analysis",
            "source_id": "1234"
        }
    }
}
apetkau commented 10 months ago

Initial implementation of pipeline here https://github.com/apetkau/nf-core-genomicnomenclature

You can run the pipeline tests with (assuming you have Nextflow and Docker installed):

nextflow run apetkau/nf-core-genomicnomenclature -profile docker,test -r dev -latest --outdir results