
seqneut-pipeline for analyzing sequencing-based neutralization assays

This is a modular pipeline for analyzing high-throughput sequencing-based neutralization assays of the type developed in the Bloom lab. See Loes et al (2024) for a description of these assays.

Please cite Loes et al (2024) if you use this pipeline for your scientific study.

See here for a list of the contributors to this pipeline.

Overview

This pipeline goes from the FASTQ files that represent the counts of each barcoded viral variant to the computed neutralization titers for each serum. The titers are computed by fitting Hill-curve style neutralization curves using the neutcurve package; see the documentation for the details of these curves. The titers represent the reciprocal serum dilutions at which half the viral infectivity is neutralized. The pipeline provides options to compute these titers either as the midpoint of the neutralization curve (midpoint) or as the reciprocal dilution at which the fraction infectivity reaches 50% (nt50).

When the curves are fit with a top plateau of 1 and a bottom plateau of zero, these two different ways of calculating the titers are identical, and represent the serum dilution factor at which the serum neutralizes half of the viral infectivity. Typically there will be multiple replicates, and the final reported titer is the median among replicates.
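
To make this concrete, the Hill curves fit by neutcurve are roughly of the form

    f(c) = b + (t - b) / (1 + (c / m)^s)

where f(c) is the fraction infectivity at serum concentration c, t and b are the top and bottom plateaus, m is the midpoint, and s is the Hill slope (see the neutcurve documentation for the exact parameterization). When t = 1 and b = 0, then f(m) = 0.5, which is why the midpoint and NT50 coincide in that case.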

The pipeline also performs extensive quality control at different steps using configurable options described below.

Using this pipeline

This pipeline is designed to be included as a modular portion of a larger snakemake analysis. This pipeline processes FASTQ files to get the counts of each barcode, analyzes those to determine the fraction infectivity at each serum concentration, and then fits and plots neutralization curves.

To use the pipeline, first create another repository specific to your project. Then include this pipeline as a git submodule in your repository, where it will be present in a subdirectory named seqneut-pipeline. So the overall structure will look like this:

<your_project_repo>
├── seqneut-pipeline [added as git submodule]
│   ├── seqneut-pipeline.smk [snakemake rules for pipeline]
│   ├── environment.yml [conda environment for pipeline]
│   └── ... [other contents of seqneut-pipeline]
├── README.md [README for your project]
├── config.yml [YAML configuration for project]
├── Snakefile [top-level snakemake file that includes `seqneut-pipeline.smk`]
├── data [subdirectory with input data for your project]
├── results [subdirectory with results created by pipeline]
├── docs [HTML summary of results created by pipeline]
└── <other files / subdirectories that are part of project>

So after you have created your project repo, add this pipeline as a git submodule with:

  git submodule add https://github.com/jbloomlab/seqneut-pipeline

This creates a file called .gitmodules and the seqneut-pipeline submodule, which can then be committed to the repo. If at some point you want to update the version of the pipeline, simply cd into the seqneut-pipeline subdirectory and pull or checkout the version you want.

To use the pipeline, you then need to add a few things to your main-level repo. The first is a top-level Snakefile that includes your configuration, seqneut-pipeline, and the outputs of seqneut-pipeline as targets of the all rule. So minimally that top-level Snakefile should contain the following lines (it can also contain additional rules if you run project-specific analyses on the output of the pipeline):

import os

configfile: "config.yml"

include: os.path.join(config["seqneut-pipeline"], "seqneut-pipeline.smk")

rule all:
    input:
        seqneut_pipeline_outputs

In this Snakefile, seqneut_pipeline_outputs specifies the files created by the pipeline. Several of these may be of special interest as inputs to additional rules you define in Snakefile.

In addition, you need to create the configuration file config.yml and ensure it includes the appropriate configuration for seqneut-pipeline as described below.

To track the correct files in the created results, we suggest you copy the ./test_example/.gitignore file to be the .gitignore for your main repo. This will track key results files, but not an excessive number of non-essential files.

Finally, you need to create a conda environment that minimally includes the packages needed to run the pipeline, which are a recent version of snakemake and pandas. You can either create your own environment containing these, or simply build and use the one specified in the environment.yml file of seqneut-pipeline, which is named seqneut-pipeline. So if you are using that environment, you can simply run the pipeline with:

conda activate seqneut-pipeline
snakemake -j <n_jobs> --software-deployment-method conda

Note also that a few rules have rule-specific conda environments in ./envs/.

Configuring the pipeline

The configuration for the pipeline is in a file called config.yml. An example configuration file is in ./test_example/config.yml (although some of the QC thresholds are set more leniently to make the test example work for small data as described in the comments in that YAML).

Here we describe the required keys in this YAML file. (You can also add additional information specific to your repo, but we advise using comments to keep it in a part of the YAML separate from the seqneut-pipeline configuration.) For background on YAML format, including the anchor (&) and merge (<<: *) operators that can help simplify the YAML file, see here and here.
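
As a minimal illustration of the anchor / merge syntax (the key names here are purely hypothetical), the following YAML defines a set of defaults once and then reuses it, overriding a single key; this is the same pattern used below for the per-plate qc_thresholds, curvefit_params, and curvefit_qc defaults:

my_defaults: &my_defaults  # define an anchor holding default values
  min_count: 100
  max_fold_change: 3

some_entry:
  <<: *my_defaults  # merge in the defaults
  min_count: 500  # override just this key for this entry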

The top-level keys in the YAML are:

seqneut-pipeline

Location of the seqneut-pipeline relative to the top-level repo. This will almost always be a subdirectory of the same name, so this key will be as shown below unless you have a good reason to do otherwise:

    seqneut-pipeline: seqneut-pipeline

docs

Location where we create the ./docs/ subdirectory with HTMLs for rendering on GitHub pages. This will almost always be docs, so this key will be as shown below unless you have a good reason to do otherwise:

    docs: docs

description

Description of the pipeline, used in the HTML docs rendering. Should include a title (with a markdown # heading), authors and/or a citation, and a link to the GitHub repo. For instance:

description: |
  # <title>
  <short description>

  <authors and/or link to citation>

  See <GitHub repo link> for code and numerical data.

viral_libraries

A dictionary (mapping) of viral library names to CSV files holding these libraries. So in general this key will look like:

viral_libraries:
  pdmH1N1_lib2023_loes: data/viral_libraries/pdmH1N1_lib2023_loes.csv
  <potentially more viral libraries specified as name: CSV pairs>

The recommended way to organize the viral libraries (as indicated above) is to put them in a ./data/viral_libraries/ subdirectory.

The CSV files themselves will have columns specifying the viral barcode and the strain it corresponds to, such as:

barcode,strain
ACGGAATCCCCTGAGA,A/Washington/23/2020
GCATGGATCCTTTACT,A/Togo/845/2020
<additional lines>

viral_strain_plot_order

A CSV with a column named "strain" that lists the strains in the order they should be plotted. If this key is not specified or is set to "null", strains are plotted alphabetically. If specified, the CSV must include all strains being used. The key should look like this:

viral_strain_plot_order: data/viral_strain_plot_order.csv

The CSV file itself will just have a column named "strain" specifying the order, such as:

strain
A/California/07/2009
A/Michigan/45/2015
<additional lines>

neut_standard_sets

A dictionary (mapping) of neutralization-standard set names to CSV files holding the barcodes for the neutralization standard set. So in general, this key will look like:

neut_standard_sets:
  loes2023: data/neut_standard_sets/loes2023_neut_standards.csv
  <potentially more neut standard sets specified as name: CSV pairs>

The recommended way to organize the neutralization-standard sets (as indicated above) is to put them in a ./data/neut_standard_sets/ subdirectory.

The CSV files need just a single column specifying the neutralization standard barcode, such as:

barcode
CTTTAAATTATAGTCT
CATACAGAGTTTGTTG
<additional lines>

illumina_barcode_parser_params

A dictionary (mapping) specifying how to parse the Illumina FASTQ files to barcode counts. This is a global dictionary that is applied to all plates, but it can be augmented or overridden on a per-plate basis by specifying plate-specific illumina_barcode_parser_params as described in the plate configuration below. This mapping should just specify keyword arguments that can be passed to dms_variants.illuminabarcodeparser.IlluminaBarcodeParser; note that the barcode-length (bclen) parameter should not be specified as it is inferred from the length of the barcodes.

So in general, this key will look like this:

illumina_barcode_parser_params:
  upstream: CTCCCTACAATGTCGGATTTGTATTTAATAG
  downstream: ''
  minq: 20
  upstream_mismatch: 4
  bc_orientation: R2

plates

This dictionary (mapping) contains the heart of the configuration, and may be quite large. Essentially, it specifies which samples are contained in each plate, how those samples should be processed, the QC thresholds, and any specific barcodes or samples that should be dropped. In addition, each plate is assigned to a group, which might be "serum" or "pilot" (if you are mixing analyses of your sera with pilot experiments), or additional groups if you have distinct sets of sera.

The basic structure is that plates maps plate names to configurations for the plates. Specifically, it should look like this:

plates:

  plate1:
    group: serum
    date: 2023-08-01
    viral_library: pdmH1N1_lib2023_loes
    neut_standard_set: loes2023
    samples_csv: data/plates/plate1_samples.csv
    manual_drops: {}
    qc_thresholds:
      <<: *default_process_plate_qc_thresholds
    curvefit_params:
      <<: *default_process_plate_curvefit_params
    curvefit_qc:
      <<: *default_process_plate_curvefit_qc
    illumina_barcode_parser_params:  # optional argument
        upstream2: GCTACA

  <additional_plates>

The above example shows the configuration of a plate called plate1, and there may be many additional plates. The elements under each plate-mapping are in turn as follows:

group

The group that this plate is assigned to (cannot contain any underscores). Typically this might be "serum" or "pilot" or however you are categorizing the runs.

date

The date key specifies the date on which the plate was processed in YYYY-MM-DD format.

viral_library

The viral_library key gives the name of a key in viral_libraries that specifies the barcodes / strains for the viral library used for this plate.

neut_standard_set

The neut_standard_set key gives the name of a key in neut_standard_sets that specifies the neutralization-standard set barcodes used for this plate.

samples_csv

The samples_csv key gives the name of a CSV file specifying the samples for that plate. The recommended way to organize these sample CSVs is to put them in a ./data/plates/ subdirectory. As in the example below, the CSV file must have columns giving the well, the serum (with no-serum wells indicated by "none"), the dilution_factor (left blank for no-serum wells), the replicate, and the path to the FASTQ file (fastq).

Here are a few lines of an example CSV file:

well,serum,dilution_factor,replicate,fastq
A1,none,,1,/fh/fast/bloom_j/SR/ngs/illumina/aloes/230801_VH00319_391_AACWKHTM5/Unaligned/Project_aloes/Plate1_Noserum1_S1_R1_001.fastq.gz
A2,Y106d182,20.0,1,/fh/fast/bloom_j/SR/ngs/illumina/aloes/230801_VH00319_391_AACWKHTM5/Unaligned/Project_aloes/Y106_d182_conc1_S9_R1_001.fastq.gz
<additional lines>

manual_drops

As you analyze plates, you may find specific barcodes, wells, etc. that appear problematic for some reason and that you want to drop even though they do not fail the QC. If so, specify them using this key (if you don't want to manually drop any data for the plate, just set this key to an empty dictionary {}).

The manual drops can have the following keys:

So for instance, you could use the following specification if you wanted to drop barcode AGTCCTATCCTCAAAT for all wells of serum-replicate M099d0:

manual_drops:
  barcode_serum_replicates:
    - [AGTCCTATCCTCAAAT, M099d0]

qc_thresholds

This key defines a mapping of the quality-control thresholds for processing the sequencing counts to get fraction infectivities. These thresholds are used for the QC in the process_count rule (see section below on quality-control for more details).

Since it is a bit complex, you typically will want to use the YAML anchor / merge syntax to define a default that you then merge for specific plates. The default can be defined like this:

default_process_plate_qc_thresholds: &default_process_plate_qc_thresholds
  avg_barcode_counts_per_well: 500
  min_neut_standard_frac_per_well: 0.005
  no_serum_per_viral_barcode_filters:
    min_frac: 0.0005
    max_fold_change: 3
    max_wells: 2
  per_neut_standard_barcode_filters:
    min_frac: 0.005
    max_fold_change: 3
    max_wells: 2
  min_neut_standard_count_per_well: 1000
  min_no_serum_count_per_viral_barcode_well: 30
  max_frac_infectivity_per_viral_barcode_well: 5
  min_dilutions_per_barcode_serum_replicate: 6

and then for specific plates you can merge this default in and overwrite any specific keys as needed. For instance, the below would merge the above defaults but then overwrite avg_barcode_counts_per_well to a different value:

plates:

  plate1:
    <other keys>
    qc_thresholds:
      <<: *default_process_plate_qc_thresholds  # merge in defaults
      avg_barcode_counts_per_well: 1000  # overwrite default for this key for this plate

  <other plates>

The QC-thresholds defined here are applied in order to drop data (wells, barcodes, etc) when processing the plates. Specifically:

curvefit_params

This key defines some parameters specifying how the neutralization curves are fit, which is done using the Hill curves defined in the neutcurve package.

You typically want to use the YAML anchor/merge syntax to define a default that you then merge for specific plates. The default can be defined like this:

default_process_plate_curvefit_params: &default_process_plate_curvefit_params
  frac_infectivity_ceiling: 1
  fixtop: [0.75, 1.0]
  fixbottom: 0
  fixslope: [0.8, 10]

The specific meanings of these curve-fitting parameters are as follows:

curvefit_qc

This key defines some parameters on quality-control performed after the curve-fitting; viral-barcode / serum-replicate combinations that fail this QC are dropped.

You typically want to use the YAML anchor/merge syntax to define a default that you then merge for specific plates. The default can be defined like this:

default_process_plate_curvefit_qc:  &default_process_plate_curvefit_qc
  max_frac_infectivity_at_least: 0
  goodness_of_fit:
    min_R2: 0.75
    max_RMSD: 0.05
  serum_replicates_ignore_curvefit_qc: []
  barcode_serum_replicates_ignore_curvefit_qc: []

The specific meanings of these QC parameters are:

illumina_barcode_parser_params

This key defines parameters for the IlluminaBarcodeParser that override anything set in the global illumina_barcode_parser_params above. It is optional; if it is not defined, just the global params are used. If it is defined, it is used to update the global params (adding new params and overriding any shared ones). The main anticipated use case is if you add plate-specific indices in the round-1 PCR and want to specify those indices here using upstream2 and upstream2_mismatch.
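
For instance, a sketch of such a per-plate override (the index sequence and mismatch value are illustrative, not recommendations) could look like:

plates:

  plate1:
    <other keys for this plate>
    illumina_barcode_parser_params:
      upstream2: GCTACA  # plate-specific round-1 PCR index (illustrative)
      upstream2_mismatch: 1  # allowed mismatches in that index (illustrative)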

default_serum_titer_as

Specifies how we compute the final titers in serum_titers. Can be either midpoint or nt50 depending on whether you want to report the midpoint of the curve or the reciprocal dilution at which the fraction infectivity reaches 50%, so it should be either

default_serum_titer_as: midpoint

or

default_serum_titer_as: nt50

The difference only becomes relevant if some of your curves have plateaus substantially different from zero and one.

If you want to handle specific sera differently, see sera_override_defaults.

default_serum_qc_thresholds

Default QC we apply to each serum-virus pair when reporting the final titers (medians across replicates) in serum_titers. Any serum-virus pair that fails this QC does not have a titer reported unless it is specified in sera_override_defaults.

Should look like this:

default_serum_qc_thresholds: &default_serum_qc_thresholds
  min_replicates: 2
  max_fold_change_from_median: 3
  viruses_ignore_qc: []

where:

sera_override_defaults

Override default_serum_titer_as or default_serum_qc_thresholds for specific sera in each group (recall groups are assigned per-plate). For instance, this could look like:

sera_override_defaults:
  serum:
    M099d30:
      qc_thresholds:
        <<: *default_serum_qc_thresholds
        viruses_ignore_qc:
          - A/Belgium/H0017/2022
    Y044d30:
      qc_thresholds:
        <<: *default_serum_qc_thresholds
        max_fold_change_from_median: 4
      titer_as: nt50

The above means that in the group called serum, for serum M099d30 we override the default_serum_qc_thresholds so that virus A/Belgium/H0017/2022 is exempted from the QC, and for serum Y044d30 we override the defaults to allow a greater fold-change from the median for individual replicates and to compute the titer as nt50. Anything not listed here is handled by the defaults in default_serum_titer_as and default_serum_qc_thresholds.

miscellaneous_plates

This is an optional key that can be used to specify plates for which you just want to count barcodes, and then analyze those counts outside the main pipeline. This might be useful for library pooling or QC, for instance, or if you want to look at some failed plates that you don't actually want to fit curves for.

If you do not want to specify any miscellaneous plates, either leave this key out or set it to an empty dictionary ({}).

The key should look like this:

miscellaneous_plates:

  <plate_name_1>:
    date: <date>
    viral_library: <viral library>
    neut_standard_set: <standard set>
    samples_csv: <filename>
    illumina_barcode_parser_params:  # optional key
        <parser params to override global>

  <plate_name_2>:
    ...

The plate name is just the name assigned to the plate. The date, viral_library, neut_standard_set, and illumina_barcode_parser_params keys have the same meaning as for the plates specified under plates.

The samples_csv should specify the samples to analyze in a CSV that has columns named "well" and "fastq", and optionally other columns as well.
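
As a concrete sketch (reusing the viral-library and neutralization-standard names from the examples above; the plate name and file path are hypothetical), an entry might look like:

miscellaneous_plates:

  pooling_qc_plate:
    date: 2023-09-15
    viral_library: pdmH1N1_lib2023_loes
    neut_standard_set: loes2023
    samples_csv: data/miscellaneous_plates/pooling_qc_plate_samples.csv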

The output is that for each plate, the following files are created:

Results of running the pipeline

The results of running the pipeline are put in the ./results/ subdirectory of your main repo. We recommend copying ./test_example/.gitignore to your main repo so that only key results are tracked in your GitHub repo. If the pipeline runs to completion, the key results are in ./results/aggregated_titers/titers_{group}.csv for each group of sera. The full set of created outputs is as follows (note that only some will be tracked, depending on your .gitignore):

Examining the output and setting appropriate QC values in the configuration

When you run the pipeline, the QC values in the configuration are automatically applied, and HTML notebooks summarizing the processing of each plate and serum are rendered in ./docs, alongside a summary of all QC across all plates / sera. YAML summaries of the QC are also created.

While the QC defaults are designed to be reasonable, you should always look carefully through these notebooks after adding new data, and potentially adjust the QC in the configuration and re-run.

Rendering HTML plots and notebooks in docs

If the pipeline runs to completion, it will create HTML documentation with plots of the overall titers, per-serum titer analyses, per-plate analyses, and an overall QC summary in a docs subdirectory, which will typically be named ./docs/ (if you use the suggested key in the configuration YAML). This HTML documentation can be rendered via GitHub Pages from the ./docs/ directory.

Looking at this documentation is a good way to QC the data and understand the results.

The documentation for the test example for this pipeline is at https://jbloomlab.github.io/seqneut-pipeline/.

If you want to add additional HTML files to the docs, specify a dict in the top-level Snakefile with the name add_htmls_to_docs like this:

add_htmls_to_docs = {
    "Additional files": {
        "Example HTML file":  "results/extra_htmls/example_html.html",
        <other keys specifying file names and their paths>
    },
    <other nested dicts with a heading and then name: file key-value pairs>
}

Test example and testing via GitHub Actions

The ./test_example subdirectory contains a small test example that illustrates use of the pipeline.

The code is tested by running this example, as well as formatted with black and snakefmt and linted with ruff and snakemake --lint via the GitHub Action specified in .github/workflows/test.yaml.