apetkau opened this issue 10 months ago
Initial implementation of pipeline here https://github.com/apetkau/nf-core-genomicnomenclature
You can run the pipeline tests with (assuming you have Nextflow and Docker installed):

```shell
nextflow run apetkau/nf-core-genomicnomenclature -profile docker,test -r dev -latest --outdir results
```
1. Purpose
The nomenclature creation pipeline will generate a nomenclature from a collection of (cg/wg)MLST allelic profiles. It will make use of https://github.com/phac-nml/genomic_address_service.
2. Input
2.1. Allelic profiles
The main input will be a collection of allelic profiles, passed as a CSV file via the `--input` parameter. This CSV file will be structured as follows. The following will be valid fields for the input: `csv` or `parquet` (these could also be auto-detected from the file extension or data).

2.1.1. Allele profiles (CSV)
The following example format will be used for allele profiles in CSV (both uncompressed and gzipped files will be supported).
Missing data will be represented as: `?`, `0`, `-`, or a space.
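As a small sketch of handling the missing-data codes above (the profile structure and function name here are illustrative, not the pipeline's exact schema), a parser could normalize all four codes to a single sentinel value:

```python
# Sketch: normalize the supported missing-data codes ("?", "0", "-", or a
# blank/space) to None when reading allelic profiles.
MISSING_CODES = {"?", "0", "-", ""}

def normalize_allele(value):
    """Return the allele call as a string, or None if it is a missing-data code."""
    value = value.strip()
    return None if value in MISSING_CODES else value

# Hypothetical CSV row: one sample identifier plus one column per locus.
row = {"sample": "SampleA", "locus1": "1", "locus2": "?", "locus3": " "}
profile = {locus: normalize_allele(call)
           for locus, call in row.items() if locus != "sample"}
print(profile)  # {'locus1': '1', 'locus2': None, 'locus3': None}
```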
Other format structures described in
3. Steps
3.1. Deduplication
If the allele profiles referenced by the profilesheet.csv are identified by samples, then deduplication may be required to collapse samples with identical profiles.
3.1.1. Input
The input is an allele profile file like the following.
3.1.2. Output
The output consists of a deduplicated profiles file and a mapping back to the original profiles.
`profiles.samples.json`:
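The deduplication idea can be sketched as follows: collapse samples whose allele profiles are identical, keeping one representative per unique profile plus a mapping from representative back to the original samples. The exact layout of `profiles.samples.json` in the pipeline may differ from this illustration.

```python
import json

# Illustrative input: sample -> tuple of allele calls.
profiles = {
    "SampleA": ("1", "2", "3"),
    "SampleB": ("1", "2", "3"),   # identical to SampleA
    "SampleC": ("1", "2", "4"),
}

unique = {}    # profile -> representative sample
mapping = {}   # representative -> list of original samples
for sample, profile in profiles.items():
    rep = unique.setdefault(profile, sample)
    mapping.setdefault(rep, []).append(sample)

# Deduplicated profiles keyed by the representative sample.
deduplicated = {rep: profile for profile, rep in unique.items()}
print(json.dumps(mapping))
# {"SampleA": ["SampleA", "SampleB"], "SampleC": ["SampleC"]}
```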
3.2. Construction of distance matrix
For every entry in the profilesheet.csv, a separate distance matrix will be constructed. This will make use of https://github.com/phac-nml/profile_dists
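The pipeline will use profile_dists for this step; purely as an illustration of the kind of distance involved, the sketch below counts the number of loci with differing allele calls, skipping loci where either call is missing. Whether missing loci are skipped or counted is a configurable convention in real tools; this is just one choice.

```python
# Illustrative Hamming-style distance between two allelic profiles.
# Loci where either call is None (missing) are skipped.
def profile_distance(p1, p2):
    return sum(1 for a, b in zip(p1, p2)
               if a is not None and b is not None and a != b)

profiles = {
    "SampleA": ("1", "2", "3"),
    "SampleB": ("1", "2", "4"),
    "SampleC": ("1", None, "4"),
}

samples = list(profiles)
matrix = {s1: {s2: profile_distance(profiles[s1], profiles[s2])
               for s2 in samples} for s1 in samples}
print(matrix["SampleA"]["SampleB"])  # 1
```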
3.3. Creation of nomenclature data
For the created distance matrices, a collection of nomenclature data will be created using https://github.com/phac-nml/genomic_address_service
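To give a sense of the "genomic address" concept behind genomic_address_service, here is a rough sketch (the thresholds, the single-linkage rule, and the address format are all assumptions for illustration): samples are clustered at a series of distance thresholds, and each sample's address is the dot-separated list of its cluster identifiers from loosest to tightest threshold.

```python
# Illustrative only: cluster samples at one distance threshold using
# single-linkage via union-find, returning a cluster label per sample.
def single_linkage_clusters(samples, dist, threshold):
    parent = {s: s for s in samples}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path halving
            s = parent[s]
        return s

    for i, s1 in enumerate(samples):
        for s2 in samples[i + 1:]:
            if dist[s1][s2] <= threshold:
                parent[find(s1)] = find(s2)

    labels, clusters = {}, {}
    for s in samples:
        root = find(s)
        clusters.setdefault(root, len(clusters) + 1)
        labels[s] = clusters[root]
    return labels

samples = ["SampleA", "SampleB", "SampleC"]
dist = {
    "SampleA": {"SampleA": 0, "SampleB": 1, "SampleC": 5},
    "SampleB": {"SampleA": 1, "SampleB": 0, "SampleC": 5},
    "SampleC": {"SampleA": 5, "SampleB": 5, "SampleC": 0},
}

thresholds = [10, 2]  # hypothetical thresholds, loosest first
addresses = {s: ".".join(str(single_linkage_clusters(samples, dist, t)[s])
                         for t in thresholds) for s in samples}
print(addresses)  # {'SampleA': '1.1', 'SampleB': '1.1', 'SampleC': '1.2'}
```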
3.4. Creation of output metadata (`output.json`)
This step creates the `output.json` file from the nomenclature file.

4. Output
The following output will be provided. This will be communicated with an `output.json` file with the following larger structure:

4.1. Output files
The `output.json` data for files (the `"files"` section defined above) will look like the following, where `"identifier1"` is derived from the identifiers in the `profilesheet.csv`.
The output files consist of the output of https://github.com/phac-nml/genomic_address_service. Here, `${identifier}` is derived from the input `profilesheet.csv`.

4.2. Output metadata
The following metadata will be provided:
The idea is that every sample will have its metadata stored under "SampleX" in data storage, which could then be accessed under `listeria_cgmlst.address`.
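For illustration, the `metadata` section might look like the following (the `listeria_cgmlst.address` key comes from the text above; the nesting and the address value are assumptions):

```json
{
  "metadata": {
    "samples": {
      "SampleX": {
        "listeria_cgmlst.address": "1.1.2"
      }
    }
  }
}
```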
Question: Would it be better to store the deduplicated profile identifier mapped to a sample here, which will be a smaller set of data, and handle expanding to individual samples elsewhere?
5. Integration of data with IRIDA Next
In order for IRIDA Next to load results, it will look for the `output.json` file as described in Section 4.

5.1. Storing files
Anything under the `files` section will be stored in IRIDA, associated with the analysis pipeline execution. These will be accessible by the key in the `files` section; for example, `clusters` will give the file `identifier.clusters.text`.

5.2. Storing sample metadata
Sample metadata will be loaded and associated with samples. For every sample identified in the `metadata.samples` section, the associated metadata will be stored.

In IRIDA Next, there will be a parallel table that stores pipeline execution metadata for each field. For example: