rraadd88 / beditor

A Computational Workflow for Designing Libraries of sgRNAs for CRISPR-Mediated Base Editing, and much more
GNU General Public License v3.0

beditor(v2)

A Computational Workflow for Designing Libraries of sgRNAs for CRISPR-Mediated Base Editing, and much more


Usage

🖱️ GUI-mode

beditor gui

Note: The GUI is recommended for designing small libraries and for prioritizing the guides.

▶️ CLI-mode

beditor cli --editor BE1 -m path/to/mutations.tsv -o path/to/output_directory/ --species human --ensembl-release 110
or
beditor cli -c beditor_config.yml
Parameters

usage: beditor cli [--editor EDITOR] [-m MUTATIONS_PATH] [-o OUTPUT_DIR_PATH]
                   [--species SPECIES] [--ensembl-release ENSEMBL_RELEASE]
                   [--genome-path GENOME_PATH] [--gtf-path GTF_PATH]
                   [-r RNA_PATH] [-p PRT_PATH] [-c CONFIG_PATH]
                   [--search-window SEARCH_WINDOW] [-n] [-w WD_PATH]
                   [-t THREADS] [-k KERNEL_NAME] [-v VERBOSE]
                   [-i IGV_PATH_PREFIX] [--ext EXT] [-f] [-d] [--skip SKIP]

optional arguments:
  -h, --help
        show this help message and exit
  --editor EDITOR
        base-editing method; available methods can be listed using the command 'beditor resources'
  -m MUTATIONS_PATH, --mutations-path MUTATIONS_PATH
        path to the mutation file, the format of which is available at https://github.com/rraadd88/beditor/README.md#Input-format
  -o OUTPUT_DIR_PATH, --output-dir-path OUTPUT_DIR_PATH
        path to the directory where the outputs should be saved
  --species SPECIES
        species name
  --ensembl-release ENSEMBL_RELEASE
        Ensembl release number
  --genome-path GENOME_PATH
        path to the genome file, which is not available on Ensembl
  --gtf-path GTF_PATH
        path to the gene annotations file, which is not available on Ensembl
  -r RNA_PATH, --rna-path RNA_PATH
        path to the transcript sequences file, which is not available on Ensembl
  -p PRT_PATH, --prt-path PRT_PATH
        path to the protein sequences file, which is not available on Ensembl
  --search-window SEARCH_WINDOW
        number of bases to search on either side of a target; if not specified, it is inferred by beditor
  -n, --not-be
        do not process as a base editor (default: False)
  -c CONFIG_PATH, --config-path CONFIG_PATH
        path to the configuration file
  -w WD_PATH, --wd-path WD_PATH
        path to the working directory
  -t THREADS, --threads THREADS
        number of threads for parallel processing (default: 1)
  -k KERNEL_NAME, --kernel-name KERNEL_NAME
        name of the jupyter kernel (default: 'beditor')
  -v VERBOSE, --verbose VERBOSE
        verbosity, logging levels: DEBUG > INFO > WARNING > ERROR > CRITICAL (default: 'WARNING')
  -i IGV_PATH_PREFIX, --igv-path-prefix IGV_PATH_PREFIX
        prefix to be added to the IGV URL
  --ext EXT
        file extensions of the output tables
  -f, --force
        (default: False)
  -d, --dbug
        (default: False)
  --skip SKIP
        skip sections of the workflow

Notes:

Required parameters for assigning a species: species and ensembl_release, or genome_path, gtf_path, rna_path and prt_path.

Installation

Virtual environment and naming the kernel (recommended)

conda create -n beditor python=3.9;               # options: conda/mamba, python=3.9/3.8
python -m ipykernel install --user --name beditor

Installation of the package

pip install beditor[all]                           

Optional dependencies, as required:

pip install beditor                                # only cli
pip install beditor[gui]                           # plus gui

For fast processing of large genomes (highly recommended for human genome):

conda install bioconda::ucsc-fatotwobit bioconda::ucsc-twobittofa bioconda::ucsc-twobitinfo # options: conda/mamba

Else, for moderately fast processing,

conda install bioconda::bedtools                   # options: conda/mamba

Input format

Note: The coordinates are 1-based (i.e. X:1-1 instead of X:0-1) and the IDs correspond to the chosen genome assemblies (e.g. from Ensembl).

Point mutations

chrom start  end strand mutation
    5  1123 1123 +      C

Position scanning

chrom start  end strand
    5  1123 1123 +     

Region scanning

chrom start  end strand
    5  1123 2123 +     

Protein point mutations

protein id aa pos mutation
  ENSP1123     43        S    

Protein position scanning

protein id aa pos
  ENSP1123     43    

Protein region scanning

protein id aa start aa end
  ENSP1123       43    143

Note: Ensembl protein IDs are used.
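As a minimal sketch, an input table like the ones above could be written with Python's standard csv module. It assumes the file is tab-separated (suggested only by the mutations.tsv name in the CLI example); the helper name write_mutations and the output file name are hypothetical.

```python
import csv

def write_mutations(path, rows, columns):
    """Write a mutation table as a tab-separated file (assumed format)."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=columns, delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(rows)

# Point mutation from the example above; coordinates are 1-based.
point_mutations = [
    {"chrom": "5", "start": 1123, "end": 1123, "strand": "+", "mutation": "C"},
]
write_mutations(
    "mutations.tsv",
    point_mutations,
    columns=["chrom", "start", "end", "strand", "mutation"],
)
```

The other input types would differ only in the column list, e.g. dropping the mutation column for position scanning.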

Output format

Note: The output uses 0-based coordinates.

guide sequence          guide locus          offtargets score {columns in the input}
AGCGTTTGGCAAATCAAACAAAA 4:1003215-1003238(+)          0     1 ..
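Because the inputs are 1-based and the outputs 0-based, a locus from the output table may need converting before it is compared with the input. The sketch below parses a locus string like the one above and converts it. It assumes the 0-based interval is half-open, as in BED files; the 23-nt example guide spanning 1003238 - 1003215 = 23 positions is consistent with that, but the README does not state it. The helper names are hypothetical and are not beditor's own parse_locus.

```python
import re

LOCUS_PATTERN = re.compile(
    r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)\((?P<strand>[+-])\)$"
)

def parse_locus_string(locus):
    """Split a locus such as '4:1003215-1003238(+)' into its parts."""
    match = LOCUS_PATTERN.match(locus)
    if match is None:
        raise ValueError(f"unrecognized locus: {locus!r}")
    return (match["chrom"], int(match["start"]), int(match["end"]), match["strand"])

def to_one_based(start, end):
    """Convert a 0-based, half-open interval to a 1-based, inclusive one."""
    return start + 1, end

chrom, start, end, strand = parse_locus_string("4:1003215-1003238(+)")
print(end - start)               # interval length under the half-open assumption
print(to_one_based(start, end))  # the same interval as 1-based, inclusive
```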

Supported base editing methods

method       nucleotide  nucleotide mutation  window start  window end  guide length  PAM     PAM position
A3A-BE3      C           T                    4             8           20            NGG     down
ABE7.10      A           G                    4             7           20            NGG     down
ABE7.10*     A           G                    4             8           20            NGG     down
ABE7.9       A           G                    5             8           20            NGG     down
ABESa        A           G                    6             12          21            NNGRRT  down
BE-PLUS      C           T                    4             14          20            NGG     down
BE1          C           T                    4             8           20            NGG     down
BE2          C           T                    4             8           20            NGG     down
BE3          C           T                    4             8           20            NGG     down
BE4-Gam      C           T                    4             8           20            NGG     down
BE4/BE4max   C           T                    4             8           20            NGG     down
Cas12a-BE    C           T                    10            12          23            TTTV    up
eA3A-BE3     C           T                    4             8           20            NGG     down
EE-BE3       C           T                    5             6           20            NGG     down
HF-BE3       C           T                    4             8           20            NGG     down
Sa(KKH)-ABE  A           G                    6             12          21            NNNRRT  down
SA(KKH)-BE3  C           T                    3             12          21            NNNRRT  down
SaBE3        C           T                    3             12          21            NNGRRT  down
SaBE4        C           T                    3             12          21            NNGRRT  down
SaBE4-Gam    C           T                    3             12          21            NNGRRT  down
Target-AID   C           T                    2             4           20            NGG     down
Target-AID   C           T                    2             4           20            NG      down
VQR-ABE      A           G                    4             6           20            NGA     down
VQR-BE3      C           T                    4             11          20            NGAN    down
VRER-ABE     A           G                    4             6           20            NGCG    down
VRER-BE3     C           T                    3             10          20            NGCG    down
xBE3         C           T                    4             8           20            NG      down
YE1-BE3      C           T                    5             7           20            NGG     down
YE2-BE3      C           T                    5             6           20            NGG     down
YEE-BE3      C           T                    5             6           20            NGG     down
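The PAM column above uses IUPAC degenerate nucleotide codes (N = any base, R = A or G, V = A, C or G). As a rough sketch, such a PAM can be translated into a regular expression and searched on the forward strand of a sequence. The function names are hypothetical, and this is not necessarily how beditor searches for PAMs internally.

```python
import re

# Standard IUPAC nucleotide codes for DNA.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def pam_to_regex(pam):
    """Translate a degenerate PAM such as 'NNGRRT' into a regex pattern."""
    return "".join(IUPAC[base] for base in pam.upper())

def find_pams(seq, pam):
    """Return 0-based start positions of the PAM on the given strand."""
    # A lookahead is used so that overlapping PAM sites are all reported.
    pattern = re.compile(f"(?=({pam_to_regex(pam)}))")
    return [match.start() for match in pattern.finditer(seq.upper())]

print(pam_to_regex("NGG"))
print(find_pams("ACGGGT", "NGG"))
```

Sites on the opposite strand would need the reverse complement of either the sequence or the PAM.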

Favorite base editor not listed?
Please send the required info using a PR, or an issue.

Change log

v2

New features:

  1. Design libraries for base or amino acid mutational scanning, at defined positions and regions.
  2. The gui contains library filtering and prioritization options.
  3. Non-base editing applications, e.g. CRISPR-tiling, using not_be option.

Key updates:

  1. Quicker installation due to a reduced number of dependencies (bwa comes with the package, and samtools is not needed).
  2. Faster run-time compared to v1, because of improvements in the dependencies, e.g. pandas.
  3. Faster run-time on large genomes, e.g. the human genome, because of the use of 2bit tools.
  4. Direct command-line options for using non-model species, e.g. those not indexed on Ensembl.
  5. Configuration made optional.

Technical updates:

  1. The gui is powered by mercury, thus overcoming the limitations of v1.
  2. Use of one base editor (method) per run, instead of multiple.
  3. Due to overall faster run-times, parallelization within a run is disabled. However, multiple runs can be parallelized externally, e.g. using Python's built-in multiprocessing.
  4. Only the sgRNAs for which the target lies within the optimal activity window are reported. Therefore, the penalty for a target not being in the activity window is no longer used, but the options are retained for backward compatibility.
  5. Many refactored functions can now be imported and executed independently for "much more" applications.
  6. Reports are generated for each run in the form of a jupyter notebook.
  7. Automated testing on GitHub for continuous integration.
  8. The cli is compatible with python 3.8 and 3.9 (higher versions are untested); however, the gui is not supported on python 3.7 due to a lack of dependencies.
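The external parallelization mentioned in the technical updates could look roughly like the sketch below, which launches one beditor cli process per editor. It uses a thread pool from concurrent.futures rather than multiprocessing, since each run is already a separate operating-system process. The flags are the ones listed under Usage; the paths, the editor list and the helper names are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def build_command(editor, mutations_path, output_dir_path,
                  species="human", ensembl_release=110):
    """Assemble one 'beditor cli' call from flags listed under Usage."""
    return [
        "beditor", "cli",
        "--editor", editor,
        "-m", mutations_path,
        "-o", output_dir_path,
        "--species", species,
        "--ensembl-release", str(ensembl_release),
    ]

def run_all(commands, max_workers=2):
    """Run the commands concurrently and return their exit codes in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda cmd: subprocess.run(cmd).returncode, commands)
        return list(results)

if __name__ == "__main__":
    # Hypothetical inputs: one run per editor, each with its own output directory.
    editors = ["BE3", "ABE7.10"]
    commands = [
        build_command(name, "path/to/mutations.tsv", f"path/to/output_{name}/")
        for name in editors
    ]
    print(commands)
    # exit_codes = run_all(commands)  # uncomment where beditor is installed
```

Giving each run its own output directory avoids runs overwriting each other's files.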

Future directions, for which contributions are welcome:

Similar projects:

How to cite?

v2

  1. Using BibTeX:

    @software{Dandage_beditor,
    title   = {beditor: A Computational Workflow for Designing Libraries of sgRNAs for CRISPR-Mediated Base Editing},
    author  = {Dandage, Rohan},
    year    = {2024},
    url     = {https://doi.org/10.5281/zenodo.10648264},
    version = {v2.0.1},
    note    = {The URL is a DOI link to the permanent archive of the software.},
    }
  2. DOI link: https://doi.org/10.5281/zenodo.10648264, or

  3. Using citation information from CITATION.CFF file.

v1

  1. Using BibTeX:

    @software{Dandage_beditorv1,
    title   = {beditor: A Computational Workflow for Designing Libraries of sgRNAs for CRISPR-Mediated Base Editing},
    author  = {Dandage, Rohan},
    year    = {2019},
    url     = {https://doi.org/10.1534/genetics.119.302089},
    version = {v1},
    }

module beditor.lib.get_mutations

Mutation co-ordinates using pyensembl


function get_protein_cds_coords

get_protein_cds_coords(annots, protein_id: str) → DataFrame

Get protein CDS coordinates

Args:

Returns:


function get_protein_mutation_coords

get_protein_mutation_coords(data: DataFrame, aapos: int, test=False) → tuple

Get protein mutation coordinates

Args:

Raises:

Returns:


function map_coords

map_coords(df_: DataFrame, df1_: DataFrame, verbose: bool = False) → DataFrame

Map coordinates

Args:

Returns:


function get_mutation_coords_protein

get_mutation_coords_protein(
    df0: DataFrame,
    annots,
    search_window: int,
    outd: str = None,
    force: bool = False,
    verbose: bool = False
) → DataFrame

Get mutation coordinates for protein

Args:

Returns:


function get_mutation_coords

get_mutation_coords(
    df0: DataFrame,
    annots,
    search_window: int,
    verbose: bool = False,
    **kws_protein
) → DataFrame

Get mutation coordinates

Args:

Returns:

module beditor.lib.get_scores

Scores


function get_ppamdist

get_ppamdist(
    guide_length: int,
    pam_len: int,
    pam_pos: str,
    ppamdist_min: int
) → DataFrame

Get penalties set based on distances of the mismatch(es) from PAM

:param guide_length: length of guide sequence
:param pam_len: length of PAM sequence
:param pam_pos: PAM location 3' or 5'
:param ppamdist_min: minimum penalty
:param pmutatpam: penalty for mismatch at PAM

TODOs: Use different scoring function for different methods.


function get_beditorscore_per_alignment

get_beditorscore_per_alignment(
    NM: int,
    alignment: str,
    pam_len: int,
    pam_pos: str,
    pentalty_genic: float = 0.5,
    pentalty_intergenic: float = 0.9,
    pentalty_dist_from_pam: float = 0.1,
    verbose: bool = False
) → float

Calculates beditor score per alignment between guide and genomic DNA.

:param NM: Hamming distance
:param mismatches_max: Maximum mismatches allowed in alignment
:param alignment: Symbol '|' means a match, '.' means mismatch and ' ' means gap. e.g. |||||.||||||||||.||||.|
:param pentalty_genic: penalty for genic alignment
:param pentalty_intergenic: penalty for intergenic alignment
:param pentalty_dist_from_pam: maximum penalty for a mismatch at PAM
:returns: beditor score per alignment.
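To make the alignment string concrete, the sketch below reads the mismatch positions from it and measures how far each lies from the end nearest the PAM. It assumes that 'down' places the PAM at the right (3') end of the string and 'up' at the left (5') end, following the "PAM position" values in the methods table. The helper names are hypothetical, and this is not beditor's scoring code.

```python
def mismatch_positions(alignment):
    """0-based indices of '.' (mismatch) symbols in the alignment string."""
    return [i for i, symbol in enumerate(alignment) if symbol == "."]

def distances_from_pam_end(alignment, pam_pos="down"):
    """Distance of each mismatch from the end of the string nearest the PAM.

    Assumption: 'down' means the PAM is at the right end, 'up' at the left end.
    """
    positions = mismatch_positions(alignment)
    if pam_pos == "down":
        last = len(alignment) - 1
        return [last - i for i in positions]
    if pam_pos == "up":
        return positions
    raise ValueError(f"pam_pos must be 'down' or 'up', got {pam_pos!r}")

alignment = "|||||.||||||||||.||||.|"  # example string from the docstring above
print(mismatch_positions(alignment))
print(distances_from_pam_end(alignment, "down"))
```

With no gaps in the alignment, the number of '.' symbols equals the Hamming distance (NM).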


function get_beditorscore_per_guide

get_beditorscore_per_guide(
    guide_seq: str,
    strategy: str,
    align_seqs_scores: DataFrame,
    dBEs: DataFrame,
    penalty_activity_window: float = 0.5,
    test: bool = False
) → float

Calculates beditor score per guide.

:param guide_seq: guide sequence (23 nt)
:param strategy: strategy string, e.g. ABE;+;@-14;ACT:GCT;T:A;
:param align_seqs_scores: list of beditor scores per alignment for all the alignments between guide and genomic DNA
:param penalty_activity_window: if editable base is not in activity window, penalty_activity_window=0.5
:returns: beditor score per guide.


function revcom

revcom(s)

function calc_cfd

calc_cfd(wt, sg, pam)

function get_cfdscore

get_cfdscore(wt, off)

module beditor.lib.get_specificity

Specificities


function run_alignment

run_alignment(
    src_path: str,
    genomep: str,
    guidesfap: str,
    guidessamp: str,
    guidel: int,
    mismatches_max: int = 2,
    threads: int = 1,
    force: bool = False,
    verbose: bool = False
) → str

Run alignment

Args:

Returns:


function read_sam

read_sam(align_path: str) → DataFrame

read alignment file

Args:

Returns:

Notes:

Tag  Meaning
NM   Edit distance
MD   Mismatching positions/bases
AS   Alignment score
BC   Barcode sequence
X0   Number of best hits
X1   Number of suboptimal hits found by BWA
XN   Number of ambiguous bases in the reference
XM   Number of mismatches in the alignment
XO   Number of gap opens
XG   Number of gap extensions
XT   Type: Unique/Repeat/N/Mate-sw
XA   Alternative hits; format: (chr,pos,CIGAR,NM;)*
XS   Suboptimal alignment score
XF   Support from forward/reverse alignment
XE   Number of supporting seeds

Reference: https://bio-bwa.sourceforge.net/bwa.shtml


function parse_XA

parse_XA(XA: str) → DataFrame

Parse XA tags

Args:

Notes:

format: (chr,pos,CIGAR,NM;)

Example: XA='4,+908051,23M,0;4,+302823,23M,0;4,-183556,23M,0;4,+1274932,23M,0;4,+207765,23M,0;4,+456906,23M,0;4,-1260135,23M,0;4,+454215,23M,0;4,-1177442,23M,0;4,+955254,23M,1;4,+1167921,23M,1;4,-613257,23M,1;4,+857893,23M,1;4,-932678,23M,2;4,-53825,23M,2;4,+306783,23M,2;'
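The XA format above is simple to split by hand. The following is a minimal sketch of such a parser, returning a list of dictionaries rather than the DataFrame that parse_XA returns. The function name parse_xa and the output keys are hypothetical.

```python
def parse_xa(xa):
    """Parse a BWA XA tag value into a list of alternative hits."""
    hits = []
    for entry in xa.strip().split(";"):
        if not entry:
            continue  # the value ends with ';', which leaves an empty last item
        chrom, pos, cigar, nm = entry.split(",")
        hits.append({
            "chrom": chrom,
            "strand": pos[0],     # pos carries a leading '+' or '-'
            "pos": int(pos[1:]),
            "CIGAR": cigar,
            "NM": int(nm),
        })
    return hits

# A shortened version of the example above.
hits = parse_xa("4,+908051,23M,0;4,-183556,23M,0;4,+955254,23M,1;")
for hit in hits:
    print(hit)
```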


function get_extra_alignments

get_extra_alignments(
    df1: DataFrame,
    genome: str,
    bed_path: str,
    alignments_max: int = 10,
    threads: int = 1
) → DataFrame

Get extra alignments

Args:

Returns:

TODOs: 1. apply parallel processing to get_seq


function to_pam_coord

to_pam_coord(
    pam_pos: str,
    pam_len: int,
    align_start: int,
    align_end: int,
    strand: str
) → tuple

Get PAM coords

Args:

Returns:


function get_alignments

get_alignments(
    align_path: str,
    genome: str,
    alignments_max: int,
    pam_pos: str,
    pam_len: int,
    guide_len: int,
    pam_pattern: str,
    pam_bed_path: str,
    extra_bed_path: str,
    **kws_xa
) → DataFrame

Get alignments

Args:

Returns:


function get_penalties

get_penalties(
    aligns: DataFrame,
    guides: DataFrame,
    annots: DataFrame
) → DataFrame

Get penalties

Args:

Returns:


function score_alignments

score_alignments(
    df4: DataFrame,
    pam_len: int,
    pam_pos: str,
    pentalty_genic: float = 0.5,
    pentalty_intergenic: float = 0.9,
    pentalty_dist_from_pam: float = 0.1,
    verbose: bool = False
) → tuple

Score alignments

Args:

Returns:

Note:

  1. Low value corresponds to high penalty and vice versa, because values are multiplied.
  2. High penalty means consequential offtarget alignment and vice versa.
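The note on multiplied values can be illustrated with a small sketch. It assumes every penalty is a factor between 0 and 1 and that the factors are combined by a plain product; this illustrates the convention only and is not beditor's actual scoring formula. The function name is hypothetical; the example values are the defaults of pentalty_genic and pentalty_intergenic in the signature above.

```python
import math

def combine_penalties(penalties):
    """Multiply penalty factors; a lower product means a stronger penalty."""
    for value in penalties:
        if not 0 <= value <= 1:
            raise ValueError(f"penalty factors are assumed to lie in [0, 1]: {value}")
    return math.prod(penalties)

genic = combine_penalties([0.5])        # default pentalty_genic
intergenic = combine_penalties([0.9])   # default pentalty_intergenic
print(genic < intergenic)               # the genic alignment is penalized more
```

Under this convention an empty list gives 1, i.e. no penalty at all.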

function score_guides

score_guides(
    guides: DataFrame,
    scores: DataFrame,
    not_be: bool = False
) → DataFrame

Score guides

Args:

Returns:

Changes: penalty_activity_window disabled as only the sgRNAs with target in the window are reported.

module beditor.lib.io

Input/Output


function download_annots

download_annots(species_name: str, release: int) → bool

Download annotations using pyensembl

Args:

Returns:


function cache_subdirectory

cache_subdirectory(
    reference_name: str = None,
    annotation_name: str = None,
    annotation_version: int = None,
    CACHE_BASE_SUBDIR: str = 'beditor'
) → str

Which cache subdirectory to use for a given annotation database over a particular reference. All arguments can be omitted to just get the base subdirectory for all pyensembl cached datasets.

Args:

Returns:


function cached_path

cached_path(path_or_url: str, cache_directory_path: str)

When downloading remote files, the default behavior is to name local files the same as their remote counterparts.


function to_downloaded_cached_path

to_downloaded_cached_path(
    url: str,
    annots=None,
    reference_name: str = None,
    annotation_name: str = 'ensembl',
    ensembl_release: str = None,
    CACHE_BASE_SUBDIR: str = 'pyensembl'
) → str

To downloaded cached path

Args:

Returns:


function download_genome

download_genome(
    species: str,
    ensembl_release: int,
    force: bool = False,
    verbose: bool = False
) → str

Download genome

Args:

Returns:


function read_genome

read_genome(genome_path: str, fast=True)

Read genome

Args:


function to_fasta

to_fasta(
    sequences: dict,
    output_path: str,
    molecule_type: str,
    force: bool = True,
    **kws_SeqRecord
) → str

Save fasta file.

Args:

Returns:


function to_2bit

to_2bit(
    genome_path: str,
    src_path: str = None,
    force: bool = False,
    verbose: bool = False
) → str

To 2bit

Args:

Returns:


function to_fasta_index

to_fasta_index(
    genome_path: str,
    bgzip: bool = False,
    bgzip_path: str = None,
    threads: int = 1,
    verbose: bool = True,
    force: bool = False,
    indexed: bool = False
) → str

To fasta index

Args:

Returns:


function to_bed

to_bed(
    df: DataFrame,
    outp: str,
    cols: list = ['chrom', 'start', 'end', 'locus', 'score', 'strand']
) → str

To bed path

Args:

Returns:


function read_bed

read_bed(
    p: str,
    cols: list = ['chrom', 'start', 'end', 'locus', 'score', 'strand']
) → DataFrame

Read bed file

Args:

Returns:


function to_viz_inputs

to_viz_inputs(
    gtf_path: str,
    genome_path: str,
    output_dir_path: str,
    output_ext: str = 'tsv',
    threads: int = 1,
    force: bool = False
) → dict

To viz inputs for the IGV

Args:

Returns:


function to_igv_path_prefix

to_igv_path_prefix() → str

Get IGV path prefix

Returns:


function to_session_path

to_session_path(p: str, path_prefix: str = None, outp: str = None) → str

To session path

Args:

Returns:


function read_cytobands

read_cytobands(
    cytobands_path: str,
    col_chrom: str = 'chromosome',
    remove_prefix: str = 'chr'
) → DataFrame

Read cytobands

Args:

Returns:


function to_output

to_output(inputs: DataFrame, guides: DataFrame, scores: DataFrame) → DataFrame

To output table

Args:

Returns:

module beditor.lib.make_guides

Designing the sgRNAs


function get_guide_pam

get_guide_pam(
    match: str,
    pam_stream: str,
    guidel: int,
    seq: str,
    pos_codon: int = None
)

function get_pam_searches

get_pam_searches(dpam: DataFrame, seq: str, pos_codon: int) → DataFrame

Search PAM occurrence

:param dpam: dataframe with PAM sequences
:param seq: target sequence
:param pos_codon: reading frame
:param test: debug mode on
:returns dpam_searches: dataframe with positions of pams


function get_guides

get_guides(
    data: DataFrame,
    dpam: DataFrame,
    guide_len: int,
    base_fraction_max: float = 0.8
) → DataFrame

Get guides

Args:

Returns:


function to_locusby_pam

to_locusby_pam(
    chrom: str,
    pam_start: int,
    pam_end: int,
    pam_position: str,
    strand: str,
    length: int,
    start_off: int = 0
) → str

To locus by PAM from PAM coords.

Args:

Returns:


function to_pam_coord

to_pam_coord(
    startf: int,
    endf: int,
    startp: int,
    endp: int,
    strand: str
) → tuple

To PAM coordinates

Args:

Returns:


function get_distances

get_distances(df2: DataFrame, df3: DataFrame, cfg_method: dict) → DataFrame

Get distances

Args:

Returns:


function get_windows_seq

get_windows_seq(s: str, l: str, wl: str, verbose: bool = False) → str

Sequence by guide strand

Args:

Returns:


function filter_guides

filter_guides(
    df1: DataFrame,
    cfg_method: dict,
    verbose: bool = False
) → DataFrame

Filter sgRNAs

Args:

Returns:


function get_window_target_overlap

get_window_target_overlap(
    tstart: int,
    tend: int,
    wl: str,
    ws: str,
    nt: str,
    verbose: bool = False
) → tuple

Get window target overlap

Args:

Returns:


function get_mutated_codon

get_mutated_codon(
    ts: str,
    tl: str,
    tes: str,
    tel: str,
    strand: str,
    verbose: bool = False
) → str

Get mutated codon

Args:

Returns:


function get_coedits_base

get_coedits_base(
    ws: str,
    wl: str,
    wts: str,
    wtl: str,
    nt: str,
    verbose: bool = False
) → str

Get co-edited bases

Args:

Returns:

module beditor.lib

module beditor.lib.methods

Global Variables


function dpam2dpam_strands

dpam2dpam_strands(dpam: DataFrame, pams: list) → DataFrame

Duplicates dpam dataframe to be compatible for searching PAMs on - strand

Args:

Returns:


function get_be2dpam

get_be2dpam(
    din: DataFrame,
    methods: list = None,
    test: bool = False,
    cols_dpam: list = ['PAM', 'PAM position', 'guide length']
) → dict

Make BE to dpam mapping i.e. dict

Args:

Returns:

module beditor.lib.utils

Utilities

Global Variables


function get_src_path

get_src_path() → str

Get the beditor source directory path.

Returns:


function runbashcmd

runbashcmd(cmd: str, test: bool = False, logf=None)

Run a bash command

Args:


function log_time_elapsed

log_time_elapsed(start)

Log time elapsed.

Args:

Returns:


function rescale

rescale(
    a: <built-in function array>,
    mn: float = None
) → <built-in function array>

Rescale a vector.

Args:

Returns:


function get_nt2complement

get_nt2complement()

function s2re

s2re(s: str, ss2re: dict) → str

String to regex patterns

Args:

Returns:


function parse_locus

parse_locus(s: str, zero_based: bool = True) → tuple

Parse locus

Args:

Returns:

Notes:

beditor outputs (including bed files) use 0-based loci.
pyensembl and IGV use 1-based locations.


function get_pos

get_pos(s: str, l: str, reverse: bool = True, zero_based: bool = True) → Series

Expand locus to positions mapped to nucleotides.

Args:

Returns:


function get_seq

get_seq(
    genome: str,
    contig: str,
    start: int,
    end: int,
    strand: str,
    out_type: str = 'str',
    verbose: bool = False
) → str

Extract a sequence from a genome file based on start and end positions using streaming.

Args:

Raises:

Returns:


function read_fasta

read_fasta(
    fap: str,
    key_type: str = 'id',
    duplicates: bool = False,
    out_type='dict'
) → dict

Read fasta

Args:

Returns:

Notes:

  1. If duplicates is set, key_type is set to description instead of id.

function format_coords

format_coords(df: DataFrame) → DataFrame

Format coordinates

Args:

Returns:


function fetch_sequences_bp

fetch_sequences_bp(p: str, genome: str) → DataFrame

Fetch sequences using biopython.

Args:

Returns:


function fetch_sequences

fetch_sequences(
    p: str,
    genome_path: str,
    outp: str = None,
    src_path: str = None,
    revcom: bool = True,
    method='2bit',
    out_type='df'
) → DataFrame

Fetch sequences

Args:

Returns:


function get_sequences

get_sequences(
    df1: DataFrame,
    p: str,
    genome_path: str,
    outp: str = None,
    src_path: str = None,
    revcom: bool = True,
    out_type: str = 'df',
    renames: dict = {},
    **kws_fetch_sequences
) → DataFrame

Get sequences for the loci in a table

Args:

Returns:

Notes:

Input is 1-based.
Output is 0-based.
Saves bed file and gets the sequences.


function to_locus

to_locus(
    chrom: str = 'chrom',
    start: str = 'start',
    end: str = 'end',
    strand: str = 'strand',
    x: Series = None
) → str

To locus

Args:

Returns:


function get_flanking_seqs

get_flanking_seqs(
    df1: DataFrame,
    targets_path: str,
    flanks_path: str,
    genome: str = None,
    search_window: list = None
) → DataFrame

Get flanking sequences

Args:

Returns:


function get_strand

get_strand(
    genome,
    df1: DataFrame,
    col_start: str,
    col_end: str,
    col_chrom: str,
    col_strand: str,
    col_seq: str
) → DataFrame

Get strand by comparing the aligned and fetched sequence

Args:

Returns:

Notes:

used for tests.


function reverse_complement_multintseq

reverse_complement_multintseq(seq: str, nt2complement: dict) → str

Reverse complement multi-nucleotide sequence

Args:

Returns:
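As an illustration of reverse-complementing with a nucleotide-to-complement mapping, here is a minimal sketch. It assumes "multi-nucleotide" refers to degenerate IUPAC codes such as N or R, which this reference does not spell out. The mapping and the function name are hypothetical stand-ins, not beditor's own nt2complement.

```python
# Complements of the standard IUPAC DNA codes.
NT2COMPLEMENT = {
    "A": "T", "T": "A", "C": "G", "G": "C",
    "R": "Y", "Y": "R", "S": "S", "W": "W",
    "K": "M", "M": "K", "B": "V", "V": "B",
    "D": "H", "H": "D", "N": "N",
}

def reverse_complement(seq, nt2complement=NT2COMPLEMENT):
    """Complement each base, reading the sequence from its last base to its first."""
    return "".join(nt2complement[nt] for nt in reversed(seq.upper()))

print(reverse_complement("AACG"))
print(reverse_complement("NNGRRT"))  # a degenerate PAM, as seen on the - strand
```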


function reverse_complement_multintseqreg

reverse_complement_multintseqreg(
    seq: str,
    multint2regcomplement: dict,
    nt2complement: dict
) → str

Reverse complement multi-nucleotide regex patterns

Args:

Returns:


function hamming_distance

hamming_distance(s1: str, s2: str) → int

Return the Hamming distance between equal-length sequences

Args:

Raises:

Returns:
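For reference, the Hamming distance between two equal-length sequences is the number of positions at which they differ. A minimal sketch follows; the function name hamming and the choice of ValueError for unequal lengths are assumptions, since the exception raised by hamming_distance is not named here.

```python
def hamming(s1, s2):
    """Count the positions at which two equal-length sequences differ."""
    if len(s1) != len(s2):
        # Assumed error type; the distance is undefined for unequal lengths.
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(s1, s2))

print(hamming("ACGT", "ACCT"))
```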


function align

align(
    q: str,
    s: str,
    test: bool = False,
    psm: float = 2,
    pmm: float = 0.5,
    pgo: float = -3,
    pge: float = -1
) → str

Creates pairwise local alignment between sequences.

Args:

Returns:

Notes:

REF: http://biopython.org/DIST/docs/api/Bio.pairwise2-module.html

The match parameters are:

CODE  DESCRIPTION
x     No parameters. Identical characters have score of 1, otherwise 0.
m     A match score is the score of identical chars, otherwise mismatch score.
d     A dictionary returns the score of any pair of characters.
c     A callback function returns scores.

The gap penalty parameters are:

CODE  DESCRIPTION
x     No gap penalties.
s     Same open and extend gap penalties for both sequences.
d     The sequences have different open and extend gap penalties.
c     A callback function returns the gap penalties.


function get_orep

get_orep(seq: str) → int

Get the overrepresentation


function get_polyt_length

get_polyt_length(s: str) → int

Counts the length of the longest polyT stretch (RNA pol3 terminator) in sequence

:param s: sequence in string format
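Finding the longest polyT stretch amounts to finding the longest run of consecutive T characters. A minimal sketch using a regular expression is shown below; the function name longest_polyt is hypothetical, and the case-insensitive handling is an assumption.

```python
import re

def longest_polyt(seq):
    """Length of the longest run of consecutive T's; 0 if there is none."""
    runs = re.findall(r"T+", seq.upper())
    return max((len(run) for run in runs), default=0)

print(longest_polyt("AGCGTTTGGCAAATCAAACAAAA"))  # guide from the output example
```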


function get_annots_installed

get_annots_installed() → DataFrame

Get a list of annotations installed.

Returns:


function get_annots

get_annots(
    species_name: str = None,
    release: int = None,
    gtf_path: str = None,
    transcript_path: str = None,
    protein_path: str = None,
    reference_name: str = 'assembly',
    annotation_name: str = 'source',
    verbose: bool = False,
    **kws_Genome
)

Get pyensembl annotation instance

Args:

Returns: pyensembl annotation instance


function to_pid

to_pid(annots, gid: str) → str

To protein ID

Args:

Returns:


function to_one_based_coordinates

to_one_based_coordinates(df: DataFrame) → DataFrame

To one based coordinates

Args:

Returns:

module beditor.lib.viz

Visualizations.


function to_igv

to_igv(
    cfg: dict = None,
    gtf_path: str = None,
    genome_path: str = None,
    output_dir_path: str = None,
    threads: int = 1,
    output_ext: str = None,
    force: bool = False
) → str

To IGV session file.

Args:

Returns:


function get_nt_composition

get_nt_composition(seqs: list) → DataFrame

Get nt composition.

Args:

Returns:


function plot_ntcompos

plot_ntcompos(
    seqs: list,
    pam_pos: str,
    pam_len: int,
    window: list = None,
    ax: Axes = None,
    color_pam: str = 'lime',
    color_window: str = 'gold'
) → Axes

Plot nucleotide composition

Args:

Returns:


function plot_ontarget

plot_ontarget(
    guide_loc: str,
    pam_pos: str,
    pam_len: int,
    guidepam_seq: str,
    window: list = None,
    show_title: bool = False,
    figsize: list = [10, 2],
    verbose: bool = False,
    kws_sg: dict = {}
) → Axes

Plot on-target

Args:

Returns:

TODOs:

  1. convert to 1-based coordinates
  2. features from the GTF file


function get_plot_inputs

get_plot_inputs(df2: DataFrame) → list

Get plot inputs.

Args:

Returns:


function plot_library_stats

plot_library_stats(
    dfs: list,
    palette: dict = {True: 'b', False: 'lightgray'},
    cutoffs: dict = None,
    not_be: bool = True,
    dbug: bool = False,
    figsize: list = [10, 2.5]
) → list

Plot library stats

Args:

Returns:

module beditor.run

Command-line options


function validate_params

validate_params(parameters: dict) → bool

Validate the parameters.

Args:

Returns:


function cli

cli(
    editor: str = None,
    mutations_path: str = None,
    output_dir_path: str = None,
    species: str = None,
    ensembl_release: int = None,
    genome_path: str = None,
    gtf_path: str = None,
    rna_path: str = None,
    prt_path: str = None,
    search_window: int = None,
    not_be: bool = False,
    config_path: str = None,
    wd_path: str = None,
    threads: int = 1,
    kernel_name: str = 'beditor',
    verbose='WARNING',
    igv_path_prefix=None,
    ext: str = None,
    force: bool = False,
    dbug: bool = False,
    skip=None,
    **kws
)

beditor command-line (CLI)

Args:

Examples: beditor cli -c inputs/mutations/protein/positions.yml

Notes:

Required parameters for a run: editor, mutations_path and output_dir_path, or config_path.
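The required-parameter rules stated in the notes (for a run, and for assigning a species) can be expressed as a small check. The sketch below is one hypothetical reading of those rules and is not the validate_params implementation. In particular, skipping the species check when config_path is given is an assumption (the configuration file might supply those values).

```python
def missing_required(params):
    """Return a list of problems; an empty list means the parameters suffice."""
    def given(*names):
        return all(params.get(name) is not None for name in names)

    problems = []
    if not (given("config_path")
            or given("editor", "mutations_path", "output_dir_path")):
        problems.append(
            "need editor, mutations_path and output_dir_path, or config_path"
        )
    if not given("config_path"):  # assumption: a config file may define the species
        by_ensembl = given("species", "ensembl_release")
        by_files = given("genome_path", "gtf_path", "rna_path", "prt_path")
        if not (by_ensembl or by_files):
            problems.append(
                "need species and ensembl_release, or genome_path, gtf_path, "
                "rna_path and prt_path"
            )
    return problems

print(missing_required({"editor": "BE1"}))
```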


function gui

gui()

function resources

resources()