scholl-lab / plasmicheck

Detect and quantify plasmid DNA contamination in sequencing data
MIT License
0 stars 0 forks source link
bioinformatics contamination-detection genomics minimap2 next-generation-sequencing plasmid splicing

plasmicheck: Detect and quantify plasmid DNA contamination in sequencing data

plasmicheck Logo

'plasmicheck' is a comprehensive tool for detecting and quantifying plasmid DNA contamination in sequencing data. It provides a fully integrated pipeline for handling all steps, even with large batches of sequencing files and plasmid inputs. The tool is efficient and easy to use, and it automates all necessary processes, from initial data conversion to final report generation.

Logic and Key Steps

The core logic of plasmicheck revolves around three key steps:

  1. Spliced Alignment: To accurately detect plasmid contamination, 'plasmicheck' uses a spliced alignment in which the cDNA insert (representing spliced mRNA) is mapped against the human genome. This step determines which human genomic regions correspond to the cDNA insert, resulting in a plasmid-specific human reference sequence. This approach is agnostic to the plasmid reference and requires no additional knowledge about the gene cloned into the plasmid, making it widely applicable.

  2. Read Alignment: The sequencing reads are aligned to both the plasmid reference and a plasmid-specific human reference generated by spliced alignment. This alignment step is critical for determining where the reads map, which serves as the foundation for contaminant detection.

  3. Alignment Comparison: After the reads have been aligned to both references, the tool compares the alignments to see if there are more reads aligned to the plasmid reference than the human reference. Performing this comparison is crucial for detecting the presence of plasmid DNA contamination, as a greater degree of alignment with the plasmid indicates the presence of contamination.

Additional Functionality and Conversion Steps

Beyond its core logic, 'plasmicheck' offers a comprehensive set of functionalities aimed at streamlining the entire contamination detection process, even when dealing with multiple sequencing files and plasmid inputs. The tool provides complete pipeline integration, automating every step from raw data to final reports, resulting in a smooth and efficient workflow.

  1. Plasmid File Conversion: plasmicheck supports the conversion of plasmid sequences from GenBank / xDNA format to FASTA, facilitating compatibility with alignment tools.

  2. Indexing: To accelerate the alignment process, plasmicheck automatically generates indices for both plasmid and human reference sequences using minimap2 and samtools.

  3. Report Generation: After comparing the alignments, plasmicheck generates a detailed report summarizing the results. This report includes contamination verdicts, read assignments, and visual representations such as boxplots and dotplots.

  4. Summary Reports: For users working with multiple samples, plasmicheck can generate summary reports that aggregate results across different plasmids and sequencing files. The heatmap plots and tabular representations help users visualize the extent of contamination across different samples and plasmids, making it easier to interpret the results.

Installation

Option 1: Using pip

You can install plasmicheck using pip:

pip install .

Option 2: Using Conda Environment

Alternatively, you can set up a Conda environment using the provided plasmicheck_full_conda.yml file (which includes all dependencies) as follows:

  1. Clone the repository:

    git clone https://github.com/berntpopp/plasmicheck.git
    cd plasmicheck
  2. Create the Conda environment:

    conda env create -f conda/conda/plasmicheck_full_conda.yml
  3. Activate the environment:

    conda activate plasmicheck

Required Tools and Python Packages

Make sure you have the following tools and packages installed:

You can also install the Python packages using pip:

pip install biopython pysam jinja2 weasyprint matplotlib seaborn pandas scipy plotly statsmodels numpy

Configuration

The tool uses a config.json file for various settings, including the number of threads to use for minimap2 and samtools. Below is an example configuration snippet:

{
  "alignment": {
    "minimap2_threads": 8,
    "samtools_threads": 4
  },
  "indexing": {
    "minimap2_options": ["-d"],
    "samtools_options": []
  },
  ...
}

Usage

plasmicheck provides a command-line interface with the following commands:

Run the Full Pipeline

plasmicheck pipeline <human_fasta> <plasmid_files> <sequencing_files> <output_folder> [--keep_intermediate] [--shift_bases <shift_bases>] [--generate_shifted] [--overwrite] [--padding <padding>] [--threshold <threshold>]

Convert GenBank Files to FASTA

plasmicheck convert <input_file> <output_file> [--shift_bases <shift_bases>] [--generate_shifted] [--overwrite]

Create Minimap2 and Samtools Indexes for a FASTA File

plasmicheck index <fasta_file> [--overwrite]

Align Reads to References

plasmicheck align <reference_index> <input_file> <output_bam> <alignment_type> [--fastq2 <fastq2>]

Compare Alignments and Assign Reads

plasmicheck compare <plasmid_bam> <human_bam> <output_basename> [--threshold <threshold>]

Perform Spliced Alignment and Extract Human Reference Regions

plasmicheck spliced <output_fasta> <human_index> <plasmid_fasta> <output_bam> [--human_fasta <human_fasta>] [--padding <padding>]

Generate a Report

plasmicheck report <reads_assignment_file> <summary_file> <output_folder> [--threshold <threshold>]

Generate Summary Reports for Multiple Samples and Plasmids

plasmicheck summary_reports -i <input_dir> -o <output_dir> [--threshold <threshold>]

Example Usage

Here is an example workflow using plasmicheck:

  1. Convert GenBank files to FASTA:

    plasmicheck convert ./genbank_files/plasmid.gb ./fasta_files/plasmid.fasta
  2. Create Minimap2 and Samtools indexes:

    plasmicheck index ./fasta_files/plasmid.fasta --overwrite
    plasmicheck index ./fasta_files/human.fasta --overwrite
  3. Align reads to plasmid and human references:

    plasmicheck align ./indexes/plasmid_index.mmi ./indexes/human_index.mmi ./reads/sample_R1.fastq.gz ./reads/sample_R2.fastq.gz ./alignments/plasmid_aligned.bam ./alignments/human_aligned.bam
  4. Compare alignments and assign reads:

    plasmicheck compare ./alignments/plasmid_aligned.bam ./alignments/human_aligned.bam ./results/read_assignments
  5. Run the full pipeline:

    plasmicheck pipeline ./reference/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna ./reference/plasmid/pcdna5_cacna1h_genbank.gb ./data/APA19-N.merged.dedup.bqsr.apa-genes.bam ./output/ --threshold 0.8
  6. Generate a report:

    plasmicheck report ./output/APA19-N.merged.dedup.bqsr.apa-genes/pcdna5_cacna1h_genbank/comparison_result.reads_assignment.tsv ./output/APA19-N.merged.dedup.bqsr.apa-genes/pcdna5_cacna1h_genbank/comparison_result.summary.tsv ./output/
  7. Generate summary reports:

    plasmicheck summary_reports -i ./output/ -o ./summary/ --threshold 0.8

Pipeline Diagram

graph TD
    A[GenBank File] -->|convert_gb_to_fasta.py| B[FASTA File]
    B -->|create_indexes.py| C[Plasmid Index]
    D[Human FASTA File] -->|create_indexes.py| E[Human Index]
    E -->|spliced_alignment.py| F[Spliced BAM]
    F -->|extract_human_reference.py| G[Spliced Human FASTA]
    G -->|create_indexes.py| H[Spliced Human Index]
    C -->|align_reads.py| I[Plasmid BAM]
    H -->|align_reads.py| J[Spliced Human BAM]
    I -->|compare_alignments.py| K[Comparison Result]
    J -->|compare_alignments.py| K
    K -->|generate_report.py| L[HTML/PDF Report]

License

This project is licensed under the MIT License.