`beditor`(v2)

A Computational Workflow for Designing Libraries of sgRNAs for CRISPR-Mediated Base Editing, and much more

Usage

🖱️ GUI-mode

beditor gui

Note: GUI is recommended for designing small libraries and prioritization of the guides.

▶️ CLI-mode

beditor cli --editor BE1 -m path/to/mutations.tsv -o path/to/output_directory/ --species human --ensembl-release 110
or
beditor cli -c beditor_config.yml

Parameters

usage: beditor cli [--editor EDITOR] [-m MUTATIONS_PATH] [-o OUTPUT_DIR_PATH]
                   [--species SPECIES] [--ensembl-release ENSEMBL_RELEASE]
                   [--genome-path GENOME_PATH] [--gtf-path GTF_PATH]
                   [-r RNA_PATH] [-p PRT_PATH] [-c CONFIG_PATH]
                   [--search-window SEARCH_WINDOW] [-n] [-w WD_PATH]
                   [-t THREADS] [-k KERNEL_NAME] [-v VERBOSE]
                   [-i IGV_PATH_PREFIX] [--ext EXT] [-f] [-d] [--skip SKIP]

optional arguments:
  -h, --help
        show this help message and exit
  --editor EDITOR
        base-editing method, available methods can be listed using command: 'beditor resources'
  -m MUTATIONS_PATH, --mutations-path MUTATIONS_PATH
        path to the mutation file, the format of which is available at https://github.com/rraadd88/beditor/README.md#Input-format.
  -o OUTPUT_DIR_PATH, --output-dir-path OUTPUT_DIR_PATH
        path to the directory where the outputs should be saved.
  --species SPECIES
        species name.
  --ensembl-release ENSEMBL_RELEASE
        ensemble release number.
  --genome-path GENOME_PATH
        path to the genome file, which is not available on Ensembl.
  --gtf-path GTF_PATH
        path to the gene annotations file, which is not available on Ensembl.
  -r RNA_PATH, --rna-path RNA_PATH
        path to the transcript sequences file, which is not available on Ensembl.
  -p PRT_PATH, --prt-path PRT_PATH
        path to the protein sequences file, which is not available on Ensembl.
  --search-window SEARCH_WINDOW
        number of bases to search on either side of a target, if not specified, it is inferred by beditor.
  -n, --not-be
        False
        do not process as a base editor.
  -c CONFIG_PATH, --config-path CONFIG_PATH
        path to the configuration file.
  -w WD_PATH, --wd-path WD_PATH
        path to the working directory.
  -t THREADS, --threads THREADS
        1
        number of threads for parallel processing.
  -k KERNEL_NAME, --kernel-name KERNEL_NAME
        'beditor'
        name of the jupyter kernel.
  -v VERBOSE, --verbose VERBOSE
        'WARNING'
        verbose, logging levels: DEBUG > INFO > WARNING > ERROR (default) > CRITICAL.
  -i IGV_PATH_PREFIX, --igv-path-prefix IGV_PATH_PREFIX
        prefix to be added to the IGV URL.
  --ext EXT
        file extensions of the output tables.
  -f, --force
        False
  -d, --dbug
        False
  --skip SKIP
        skip sections of the workflow

Examples:

Notes:
    Required parameters for assigning a species:
        species ensembl_release
        or
        genome_path gtf_path rna_path prt_path

Installation

Virtual environment and naming the kernel (recommended)

conda create -n beditor python=3.9                # options: conda/mamba, python=3.9/3.8
python -m ipykernel install --user --name beditor

Installation of the package

pip install beditor[all]

Optional dependencies, as required:

pip install beditor                                # only cli
pip install beditor[gui]                           # plus gui

For fast processing of large genomes (highly recommended for human genome):

conda install bioconda::ucsc-fatotwobit bioconda::ucsc-twobittofa bioconda::ucsc-twobitinfo # options: conda/mamba

Else, for moderately fast processing,

conda install bioconda::bedtools                   # options: conda/mamba

Input format

Note: The coordinates are 1-based (i.e. X:1-1 instead of X:0-1) and the IDs correspond to the chosen genome assembly (e.g. from Ensembl).

Point mutations

chrom start  end strand mutation
    5  1123 1123 +      C
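
A table in this layout can be produced with any TSV writer. The following is a minimal sketch using Python's standard `csv` module. The helper name `write_point_mutations` is hypothetical, and tab separation is an assumption based on the `.tsv` extension used in the CLI example; neither is part of beditor.

```python
import csv
import io

# Column names taken from the point-mutation example above.
COLUMNS = ["chrom", "start", "end", "strand", "mutation"]


def write_point_mutations(rows, handle):
    """Write point mutations (1-based coordinates) as a tab-separated table."""
    writer = csv.DictWriter(
        handle, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# In-memory demonstration; pass an open file handle to write to disk instead.
buffer = io.StringIO()
write_point_mutations(
    [{"chrom": "5", "start": 1123, "end": 1123, "strand": "+", "mutation": "C"}],
    buffer,
)
print(buffer.getvalue())
```

Dropping the `mutation` column from `COLUMNS` would give the position-scanning layout shown next.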

Position scanning

chrom start  end strand
    5  1123 1123 +

Region scanning

chrom start  end strand
    5  1123 2123 +

Protein point mutations

protein id aa pos mutation
  ENSP1123     43        S

Protein position scanning

protein id aa pos
  ENSP1123     43

Protein region scanning

protein id aa start aa end
  ENSP1123       43    143

Note: Ensembl protein IDs are used.

Output format

Note: the output uses 0-based coordinates.

guide sequence          guide locus          offtargets score {columns in the input}
AGCGTTTGGCAAATCAAACAAAA 4:1003215-1003238(+)          0     1 ..
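
For downstream processing, the `guide locus` column can be split into its parts. The sketch below assumes the `chrom:start-end(strand)` layout seen in the example row; the function name `parse_guide_locus` is hypothetical and is not the library's own `parse_locus`.

```python
import re

# Pattern for loci written as chrom:start-end(strand), as in the example row above.
LOCUS_PATTERN = re.compile(
    r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)\((?P<strand>[+-])\)$"
)


def parse_guide_locus(locus):
    """Split a locus string into (chrom, start, end, strand)."""
    match = LOCUS_PATTERN.match(locus)
    if match is None:
        raise ValueError(f"unrecognized locus: {locus!r}")
    return (
        match["chrom"],
        int(match["start"]),
        int(match["end"]),
        match["strand"],
    )


chrom, start, end, strand = parse_guide_locus("4:1003215-1003238(+)")
# Compare the span of the locus with the length of the guide sequence in the same row.
print(end - start, len("AGCGTTTGGCAAATCAAACAAAA"))
```

In the example row the span `end - start` equals the length of the guide sequence, which is what 0-based, end-exclusive coordinates would give.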

Supported base editing methods

method	nucleotide	nucleotide mutation	window start	window end	guide length	PAM	PAM position
A3A-BE3	C	T	4	8	20	NGG	down
ABE7.10	A	G	4	7	20	NGG	down
ABE7.10*	A	G	4	8	20	NGG	down
ABE7.9	A	G	5	8	20	NGG	down
ABESa	A	G	6	12	21	NNGRRT	down
BE-PLUS	C	T	4	14	20	NGG	down
BE1	C	T	4	8	20	NGG	down
BE2	C	T	4	8	20	NGG	down
BE3	C	T	4	8	20	NGG	down
BE4-Gam	C	T	4	8	20	NGG	down
BE4/BE4max	C	T	4	8	20	NGG	down
Cas12a-BE	C	T	10	12	23	TTTV	up
eA3A-BE3	C	T	4	8	20	NGG	down
EE-BE3	C	T	5	6	20	NGG	down
HF-BE3	C	T	4	8	20	NGG	down
Sa(KKH)-ABE	A	G	6	12	21	NNNRRT	down
SA(KKH)-BE3	C	T	3	12	21	NNNRRT	down
SaBE3	C	T	3	12	21	NNGRRT	down
SaBE4	C	T	3	12	21	NNGRRT	down
SaBE4-Gam	C	T	3	12	21	NNGRRT	down
Target-AID	C	T	2	4	20	NGG	down
Target-AID	C	T	2	4	20	NG	down
VQR-ABE	A	G	4	6	20	NGA	down
VQR-BE3	C	T	4	11	20	NGAN	down
VRER-ABE	A	G	4	6	20	NGCG	down
VRER-BE3	C	T	3	10	20	NGCG	down
xBE3	C	T	4	8	20	NG	down
YE1-BE3	C	T	5	7	20	NGG	down
YE2-BE3	C	T	5	6	20	NGG	down
YEE-BE3	C	T	5	6	20	NGG	down

Favorite base editor not listed?
Please send the required info using a PR, or an issue.

Change log

v2

New features:

Design libraries for base or amino acid mutational scanning, at defined positions and regions.
The gui contains library filtering and prioritization options.
Non-base editing applications, e.g. CRISPR-tiling, using not_be option.

Key updates:

Quicker installation due to reduced number of dependencies (bwa comes in the package, and samtools not needed).
Faster run-time, compared to v1, because of the improvements in the dependencies e.g. pandas etc.
Faster run-time on large genomes e.g. human genome, because of the use of 2bit tools.
Direct command-line options to use non-model species, e.g. those not indexed on Ensembl.
Configuration made optional.

Technical updates:

The gui is powered by mercury, thus overcoming the limitations of v1.
Use of one base editor (method) per run, instead of multiple.
Due to overall faster run-times, parallelization within a run is disabled. However, multiple runs can be parallelized, externally e.g. using Python's built-in multiprocessing.
Only the sgRNAs whose target lies within the optimal activity window are reported. Therefore, the penalty for a target lying outside the activity window is no longer applied, but the options are retained for backward compatibility.
Many refactored functions can now be imported and executed independently for "much more" applications.
Reports generated for each run in the form of a jupyter notebook.
Automated testing on GitHub for continuous integration.
The cli is compatible with Python 3.8 and 3.9 (higher versions are untested); however, the gui is not supported on Python 3.7 due to a lack of dependencies.
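
Because each run handles one base editor, several runs can be launched side by side from a small driver script. The sketch below is one possible way to do this and is not shipped with beditor. It uses `concurrent.futures` with `subprocess` rather than `multiprocessing`, since the threads only wait on external processes. The helper names and the paths are hypothetical; the flags are the ones shown in the Usage section.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(editor, mutations_path, output_dir,
                  species="human", ensembl_release=110):
    """Assemble one `beditor cli` call from the options shown in the Usage section."""
    return [
        "beditor", "cli",
        "--editor", editor,
        "-m", mutations_path,
        "-o", output_dir,
        "--species", species,
        "--ensembl-release", str(ensembl_release),
    ]


def run_all(commands, max_workers=2, runner=subprocess.run):
    """Run each command as a separate process; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(runner, commands))


# Hypothetical paths; one run (and one output directory) per base editor.
commands = [
    build_command(editor, "path/to/mutations.tsv", f"path/to/output_{editor}/")
    for editor in ("BE1", "ABE7.10")
]
# run_all(commands)  # uncomment to launch the runs
```

Writing each run to its own output directory keeps the runs from overwriting each other's files.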

Future directions, for which contributions are welcome:

[ ] Adding option to provide 0-based co-ordinates in the input.

Similar projects:

How to cite?

v2

Using BibTeX:

@software{Dandage_beditor,
title   = {beditor: A Computational Workflow for Designing Libraries of sgRNAs for CRISPR-Mediated Base Editing},
author  = {Dandage, Rohan},
year    = {2024},
url     = {https://doi.org/10.5281/zenodo.10648264},
version = {v2.0.1},
note    = {The URL is a DOI link to the permanent archive of the software.},
}

DOI link: https://doi.org/10.5281/zenodo.10648264, or
Using citation information from CITATION.CFF file.

v1

Using BibTeX:

```
@software{Dandage_beditorv1,
title   = {beditor: A Computational Workflow for Designing Libraries of sgRNAs for CRISPR-Mediated Base Editing},
author  = {Dandage, Rohan},
year    = {2019},
url     = {https://doi.org/10.1534/genetics.119.302089},
version = {v1},
}
```

`module` `beditor.lib.get_mutations`

Mutation co-ordinates using pyensembl

`function` `get_protein_cds_coords`

get_protein_cds_coords(annots, protein_id: str) → DataFrame

Get protein CDS coordinates

Args:

annots: pyensembl annotations
protein_id (str): protein ID

Returns:

pd.DataFrame: output table

`function` `get_protein_mutation_coords`

get_protein_mutation_coords(data: DataFrame, aapos: int, test=False) → tuple

Get protein mutation coordinates

Args:

data (pd.DataFrame): input table
aapos (int): amino acid position
test (bool, optional): test-mode. Defaults to False.

Raises:

ValueError: invalid positions

Returns:

tuple: aapos,start,end,seq
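
The arithmetic behind mapping an amino acid position to its codon can be sketched as follows. This is an illustration on a plain coding sequence, not the library's implementation; it ignores introns and strand, which a genomic mapping has to handle. The function names are hypothetical.

```python
def codon_offsets(aapos):
    """Return 0-based, end-exclusive CDS offsets of the codon at a 1-based amino acid position."""
    if aapos < 1:
        raise ValueError("amino acid positions are 1-based")
    start = (aapos - 1) * 3
    return start, start + 3


def codon_at(cds, aapos):
    """Return the codon encoding the amino acid at position `aapos` of `cds`."""
    start, end = codon_offsets(aapos)
    if end > len(cds):
        raise ValueError("position lies outside the coding sequence")
    return cds[start:end]


# Toy coding sequence: Met-Ala-Ser.
print(codon_at("ATGGCTTCA", 2))
```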

`function` `map_coords`

map_coords(df_: DataFrame, df1_: DataFrame, verbose: bool = False) → DataFrame

Map coordinates

Args:

df_ (pd.DataFrame): input table

Returns:

pd.DataFrame: output table

`function` `get_mutation_coords_protein`

get_mutation_coords_protein(
    df0: DataFrame,
    annots,
    search_window: int,
    outd: str = None,
    force: bool = False,
    verbose: bool = False
) → DataFrame

Get mutation coordinates for protein

Args:

df0 (pd.DataFrame): input table
annots (type): pyensembl annotations
search_window (int): search window length on either side of the target
outd (str, optional): output directory path. Defaults to None.
force (bool, optional): force. Defaults to False.
verbose (bool, optional): verbose. Defaults to False.

Returns:

pd.DataFrame: output table

`function` `get_mutation_coords`

get_mutation_coords(
    df0: DataFrame,
    annots,
    search_window: int,
    verbose: bool = False,
    **kws_protein
) → DataFrame

Get mutation coordinates

Args:

df0 (pd.DataFrame): input table
annots (type): pyensembl annotation
search_window (int): search window length on either side of the target
verbose (bool, optional): verbose. Defaults to False.

Returns:

pd.DataFrame: output table

`module` `beditor.lib.get_scores`

Scores

`function` `get_ppamdist`

get_ppamdist(
    guide_length: int,
    pam_len: int,
    pam_pos: str,
    ppamdist_min: int
) → DataFrame

Get penalties set based on distances of the mismatch/es from PAM

Args:

guide_length (int): length of guide sequence
pam_len (int): length of PAM sequence
pam_pos (str): PAM location 3' or 5'
ppamdist_min (int): minimum penalty
pmutatpam: penalty for mismatch at PAM

TODOs: Use different scoring function for different methods.

`function` `get_beditorscore_per_alignment`

get_beditorscore_per_alignment(
    NM: int,
    alignment: str,
    pam_len: int,
    pam_pos: str,
    pentalty_genic: float = 0.5,
    pentalty_intergenic: float = 0.9,
    pentalty_dist_from_pam: float = 0.1,
    verbose: bool = False
) → float

Calculates beditor score per alignment between guide and genomic DNA.

Args:

NM (int): Hamming distance
mismatches_max: maximum mismatches allowed in alignment
alignment (str): symbol '|' means a match, '.' means mismatch and ' ' means gap, e.g. |||||.||||||||||.||||.|
pentalty_genic (float, optional): penalty for genic alignment. Defaults to 0.5.
pentalty_intergenic (float, optional): penalty for intergenic alignment. Defaults to 0.9.
pentalty_dist_from_pam (float, optional): maximum penalty for a mismatch at PAM. Defaults to 0.1.

Returns:

float: beditor score per alignment.

`function` `get_beditorscore_per_guide`

get_beditorscore_per_guide(
    guide_seq: str,
    strategy: str,
    align_seqs_scores: DataFrame,
    dBEs: DataFrame,
    penalty_activity_window: float = 0.5,
    test: bool = False
) → float

Calculates beditor score per guide.

Args:

guide_seq (str): guide sequence, 23 nts
strategy (str): strategy string, e.g. ABE;+;@-14;ACT:GCT;T:A;
align_seqs_scores (pd.DataFrame): beditor scores per alignment for all the alignments between guide and genomic DNA
penalty_activity_window (float, optional): penalty applied if the editable base is not in the activity window. Defaults to 0.5.

Returns:

float: beditor score per guide.

`function` `revcom`

revcom(s)

`function` `calc_cfd`

calc_cfd(wt, sg, pam)

`function` `get_cfdscore`

get_cfdscore(wt, off)

`module` `beditor.lib.get_specificity`

Specificities

`function` `run_alignment`

run_alignment(
    src_path: str,
    genomep: str,
    guidesfap: str,
    guidessamp: str,
    guidel: int,
    mismatches_max: int = 2,
    threads: int = 1,
    force: bool = False,
    verbose: bool = False
) → str

Run alignment

Args:

src_path (str): source path
genomep (str): genome path
guidesfap (str): guide fasta path
guidessamp (str): guide sam path
threads (int, optional): threads. Defaults to 1.
force (bool, optional): force. Defaults to False.
verbose (bool, optional): verbose. Defaults to False.

Returns:

str: alignment file.

`function` `read_sam`

read_sam(align_path: str) → DataFrame

read alignment file

Args:

align_path (str): path to the alignment file

Returns:

pd.DataFrame: output table

Notes:

Tag	Meaning
NM	Edit distance
MD	Mismatching positions/bases
AS	Alignment score
BC	Barcode sequence
X0	Number of best hits
X1	Number of suboptimal hits found by BWA
XN	Number of ambiguous bases in the reference
XM	Number of mismatches in the alignment
XO	Number of gap opens
XG	Number of gap extensions
XT	Type: Unique/Repeat/N/Mate-sw
XA	Alternative hits; format: (chr,pos,CIGAR,NM;)*
XS	Suboptimal alignment score
XF	Support from forward/reverse alignment
XE	Number of supporting seeds

Reference: https://bio-bwa.sourceforge.net/bwa.shtml

`function` `parse_XA`

parse_XA(XA: str) → DataFrame

Parse XA tags

Args:

XA (str): XA tag

Notes:

format: (chr,pos,CIGAR,NM;)

Example: XA='4,+908051,23M,0;4,+302823,23M,0;4,-183556,23M,0;4,+1274932,23M,0;4,+207765,23M,0;4,+456906,23M,0;4,-1260135,23M,0;4,+454215,23M,0;4,-1177442,23M,0;4,+955254,23M,1;4,+1167921,23M,1;4,-613257,23M,1;4,+857893,23M,1;4,-932678,23M,2;4,-53825,23M,2;4,+306783,23M,2;'
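
The format above can be parsed with plain string operations. The following is a minimal sketch that returns a list of dictionaries rather than the DataFrame the library function returns; the name `parse_xa` and the output keys are hypothetical.

```python
def parse_xa(xa):
    """Parse a BWA XA tag value, formatted as (chr,pos,CIGAR,NM;)*, into a list of hits."""
    hits = []
    for entry in xa.strip(";").split(";"):
        if not entry:
            continue  # empty tag or stray separator
        chrom, pos, cigar, nm = entry.split(",")
        hits.append(
            {
                "chrom": chrom,
                "strand": pos[0],     # leading sign encodes the strand
                "pos": int(pos[1:]),  # position without the sign
                "cigar": cigar,
                "NM": int(nm),        # edit distance of the alternative hit
            }
        )
    return hits


# Two entries taken from the example above.
for hit in parse_xa("4,+908051,23M,0;4,-932678,23M,2;"):
    print(hit)
```

The sketch assumes that contig names contain no commas or semicolons.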

`function` `get_extra_alignments`

get_extra_alignments(
    df1: DataFrame,
    genome: str,
    bed_path: str,
    alignments_max: int = 10,
    threads: int = 1
) → DataFrame

Get extra alignments

Args:

df1 (pd.DataFrame): input table
alignments_max (int, optional): alignments max. Defaults to 10.
threads (int, optional): threads. Defaults to 1.

Returns:

pd.DataFrame: output table

TODOs: 1. apply parallel processing to get_seq

`function` `to_pam_coord`

to_pam_coord(
    pam_pos: str,
    pam_len: int,
    align_start: int,
    align_end: int,
    strand: str
) → tuple

Get PAM coords

Args:

pam_pos (str): PAM position
pam_len (int): PAM length
align_start (int): alignment start
align_end (int): alignment end
strand (str): strand

Returns:

tuple: start,end

`function` `get_alignments`

get_alignments(
    align_path: str,
    genome: str,
    alignments_max: int,
    pam_pos: str,
    pam_len: int,
    guide_len: int,
    pam_pattern: str,
    pam_bed_path: str,
    extra_bed_path: str,
    **kws_xa
) → DataFrame

Get alignments

Args:

align_path (str): alignment path
genome (str): genome path
pam_pos (str): PAM position
pam_len (int): PAM length
guide_len (int): sgRNA length
pam_pattern (str): PAM pattern
pam_bed_path (str): PAM bed path

Returns:

pd.DataFrame: output table

`function` `get_penalties`

get_penalties(
    aligns: DataFrame,
    guides: DataFrame,
    annots: DataFrame
) → DataFrame

Get penalties

Args:

aligns (pd.DataFrame): alignments
guides (pd.DataFrame): guides
annots (pd.DataFrame): annotations

Returns:

pd.DataFrame: output table

`function` `score_alignments`

score_alignments(
    df4: DataFrame,
    pam_len: int,
    pam_pos: str,
    pentalty_genic: float = 0.5,
    pentalty_intergenic: float = 0.9,
    pentalty_dist_from_pam: float = 0.1,
    verbose: bool = False
) → tuple

Score alignments

Args:

df4 (pd.DataFrame): input table
pam_pos (str): PAM position
pentalty_genic (float, optional): penalty for offtarget in genic locus. Defaults to 0.5.
pentalty_intergenic (float, optional): penalty for offtarget in intergenic locus. Defaults to 0.9.
pentalty_dist_from_pam (float, optional): penalty for offtarget wrt distance from PAM. Defaults to 0.1.
verbose (bool, optional): verbose. Defaults to False.

Returns:

tuple: tables

Note:

1. Low value corresponds to high penalty and vice versa, because values are multiplied.
2. High penalty means consequential offtarget alignment and vice versa.
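
The multiplicative convention in this note can be illustrated with a toy calculation. This is not the beditor scoring function; `combine_penalties` is a hypothetical helper, and the values 0.5 and 0.9 are simply the defaults from the signature above.

```python
import math


def combine_penalties(penalties):
    """Multiply penalty values in (0, 1]; a lower product means a stronger overall penalty."""
    return math.prod(penalties)


# A single intergenic off-target (0.9) penalizes less than a genic one (0.5).
print(combine_penalties([0.9]), combine_penalties([0.5]))
# Two off-targets together penalize more than either one alone.
print(combine_penalties([0.5, 0.9]))
```

An empty list gives 1, i.e. no penalty when there are no off-target alignments.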

`function` `score_guides`

score_guides(
    guides: DataFrame,
    scores: DataFrame,
    not_be: bool = False
) → DataFrame

Score guides

Args:

guides (pd.DataFrame): guides
scores (pd.DataFrame): scores
not_be (bool, optional): not a base editor. Defaults to False.

Returns:

pd.DataFrame: output table

Changes: penalty_activity_window disabled as only the sgRNAs with target in the window are reported.

`module` `beditor.lib.io`

Input/Output

`function` `download_annots`

download_annots(species_name: str, release: int) → bool

Download annotations using pyensembl

Args:

species_name (str): species name
release (int): release number

Returns:

bool: whether annotation is downloaded or not

`function` `cache_subdirectory`

cache_subdirectory(
    reference_name: str = None,
    annotation_name: str = None,
    annotation_version: int = None,
    CACHE_BASE_SUBDIR: str = 'beditor'
) → str

Which cache subdirectory to use for a given annotation database over a particular reference. All arguments can be omitted to just get the base subdirectory for all pyensembl cached datasets.

Args:

reference_name (str, optional): reference name. Defaults to None.
annotation_name (str, optional): annotation name. Defaults to None.
annotation_version (int, optional): annotation version. Defaults to None.
CACHE_BASE_SUBDIR (str, optional): cache path. Defaults to 'beditor'.

Returns:

str: output path

`function` `cached_path`

cached_path(path_or_url: str, cache_directory_path: str)

When downloading remote files, the default behavior is to name local files the same as their remote counterparts.

`function` `to_downloaded_cached_path`

to_downloaded_cached_path(
    url: str,
    annots=None,
    reference_name: str = None,
    annotation_name: str = 'ensembl',
    ensembl_release: str = None,
    CACHE_BASE_SUBDIR: str = 'pyensembl'
) → str

To downloaded cached path

Args:

url (str): URL
annots (optional): pyensembl annotation. Defaults to None.
reference_name (str, optional): reference name. Defaults to None.
annotation_name (str, optional): annotation name. Defaults to 'ensembl'.
ensembl_release (str, optional): ensembl release. Defaults to None.
CACHE_BASE_SUBDIR (str, optional): cache path. Defaults to 'pyensembl'.

Returns:

str: output path

`function` `download_genome`

download_genome(
    species: str,
    ensembl_release: int,
    force: bool = False,
    verbose: bool = False
) → str

Download genome

Args:

species (str): species name
ensembl_release (int): release
force (bool, optional): force. Defaults to False.
verbose (bool, optional): verbose. Defaults to False.

Returns:

str: output path

`function` `read_genome`

read_genome(genome_path: str, fast=True)

Read genome

Args:

genome_path (str): genome path
fast (bool, optional): fast mode. Defaults to True.

`function` `to_fasta`

to_fasta(
    sequences: dict,
    output_path: str,
    molecule_type: str,
    force: bool = True,
    **kws_SeqRecord
) → str

Save fasta file.

Args:

sequences (dict): dictionary mapping the sequence name to the sequence.
output_path (str): path of the fasta file.
force (bool): overwrite if file exists.

Returns:

output_path (str): path of the fasta file

`function` `to_2bit`

to_2bit(
    genome_path: str,
    src_path: str = None,
    force: bool = False,
    verbose: bool = False
) → str

To 2bit

Args:

genome_path (str): genome path
src_path (str, optional): source path. Defaults to None.
verbose (bool, optional): verbose. Defaults to False.

Returns:

str: output path

`function` `to_fasta_index`

to_fasta_index(
    genome_path: str,
    bgzip: bool = False,
    bgzip_path: str = None,
    threads: int = 1,
    verbose: bool = True,
    force: bool = False,
    indexed: bool = False
) → str

To fasta index

Args:

genome_path (str): genome path
bgzip_path (str, optional): bgzip path. Defaults to None.
threads (int, optional): threads. Defaults to 1.
verbose (bool, optional): verbose. Defaults to True.
force (bool, optional): force. Defaults to False.
indexed (bool, optional): indexed or not. Defaults to False.

Returns:

str: output path

`function` `to_bed`

to_bed(
    df: DataFrame,
    outp: str,
    cols: list = ['chrom', 'start', 'end', 'locus', 'score', 'strand']
) → str

To bed path

Args:

df (pd.DataFrame): input table
outp (str): output path
cols (list, optional): columns. Defaults to ['chrom','start','end','locus','score','strand'].

Returns:

str: output path
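
For readers unfamiliar with the BED layout, the default column order can be written out with the standard library alone. The sketch below is illustrative only and works on plain dictionaries instead of a DataFrame; `write_bed` is a hypothetical name, and the example record reuses the locus from the Output format section.

```python
import csv
import io

# Default column order from the signature above.
BED_COLUMNS = ["chrom", "start", "end", "locus", "score", "strand"]


def write_bed(records, handle, cols=BED_COLUMNS):
    """Write records as headerless, tab-separated lines in the given column order."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for record in records:
        writer.writerow([record[col] for col in cols])


bed = io.StringIO()
write_bed(
    [
        {
            "chrom": "4",
            "start": 1003215,
            "end": 1003238,
            "locus": "4:1003215-1003238(+)",
            "score": 0,
            "strand": "+",
        }
    ],
    bed,
)
print(bed.getvalue())
```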

`function` `read_bed`

read_bed(
    p: str,
    cols: list = ['chrom', 'start', 'end', 'locus', 'score', 'strand']
) → DataFrame

Read bed file

Args:

p (str): path
cols (list, optional): columns. Defaults to ['chrom','start','end','locus','score','strand'].

Returns:

pd.DataFrame: output table

`function` `to_viz_inputs`

to_viz_inputs(
    gtf_path: str,
    genome_path: str,
    output_dir_path: str,
    output_ext: str = 'tsv',
    threads: int = 1,
    force: bool = False
) → dict

To viz inputs for the IGV

Args:

gtf_path (str): GTF path
genome_path (str): genome path
output_dir_path (str): output directory path
output_ext (str, optional): output extension. Defaults to 'tsv'.
threads (int, optional): threads. Defaults to 1.
force (bool, optional): force. Defaults to False.

Returns:

dict: configuration

`function` `to_igv_path_prefix`

to_igv_path_prefix() → str

Get IGV path prefix

Returns:

str: URL

`function` `to_session_path`

to_session_path(p: str, path_prefix: str = None, outp: str = None) → str

To session path

Args:

p (str): session configuration path
path_prefix (str, optional): path prefix. Defaults to None.
outp (str, optional): output path. Defaults to None.

Returns:

str: output path

`function` `read_cytobands`

read_cytobands(
    cytobands_path: str,
    col_chrom: str = 'chromosome',
    remove_prefix: str = 'chr'
) → DataFrame

Read cytobands

Args:

cytobands_path (str): path
col_chrom (str, optional): column with contig. Defaults to 'chromosome'.

Returns:

pd.DataFrame: output table

`function` `to_output`

to_output(inputs: DataFrame, guides: DataFrame, scores: DataFrame) → DataFrame

To output table

Args:

inputs (pd.DataFrame): inputs
guides (pd.DataFrame): guides
scores (pd.DataFrame): scores

Returns:

pd.DataFrame: output table

`module` `beditor.lib.make_guides`

Designing the sgRNAs

`function` `get_guide_pam`

get_guide_pam(
    match: str,
    pam_stream: str,
    guidel: int,
    seq: str,
    pos_codon: int = None
)

`function` `get_pam_searches`

get_pam_searches(dpam: DataFrame, seq: str, pos_codon: int) → DataFrame

Search PAM occurrences

Args:

dpam (pd.DataFrame): dataframe with PAM sequences
seq (str): target sequence
pos_codon (int): reading frame
test: debug mode on

Returns:

pd.DataFrame: dpam_searches, dataframe with positions of PAMs

`function` `get_guides`

get_guides(
    data: DataFrame,
    dpam: DataFrame,
    guide_len: int,
    base_fraction_max: float = 0.8
) → DataFrame

Get guides

Args:

data (pd.DataFrame): input table
dpam (pd.DataFrame): table with PAM info
guide_len (int): guide length
base_fraction_max (float, optional): base fraction max. Defaults to 0.8.

Returns:

pd.DataFrame: output table

`function` `to_locusby_pam`

to_locusby_pam(
    chrom: str,
    pam_start: int,
    pam_end: int,
    pam_position: str,
    strand: str,
    length: int,
    start_off: int = 0
) → str

To locus by PAM from PAM coords.

Args:

chrom (str): chrom
pam_start (int): PAM start
pam_end (int): PAM end
pam_position (str): PAM position
strand (str): strand
length (int): length

Returns:

str: locus

`function` `to_pam_coord`

to_pam_coord(
    startf: int,
    endf: int,
    startp: int,
    endp: int,
    strand: str
) → tuple

To PAM coordinates

Args:

startf (int): start flank start
endf (int): start flank end
startp (int): start PAM start
endp (int): start PAM end
strand (str): strand

Returns:

tuple: start,end

`function` `get_distances`

get_distances(df2: DataFrame, df3: DataFrame, cfg_method: dict) → DataFrame

Get distances

Args:

df2 (pd.DataFrame): input table #1
df3 (pd.DataFrame): input table #2
cfg_method (dict): config for the method

Returns:

pd.DataFrame: output table

`function` `get_windows_seq`

get_windows_seq(s: str, l: str, wl: str, verbose: bool = False) → str

Sequence by guide strand

Args:

s (str): sequence
l (str): locus
wl (str): window locus
verbose (bool, optional): verbose. Defaults to False.

Returns:

str: window sequence

`function` `filter_guides`

filter_guides(
    df1: DataFrame,
    cfg_method: dict,
    verbose: bool = False
) → DataFrame

Filter sgRNAs

Args:

df1 (pd.DataFrame): input table
cfg_method (dict): config of the method
verbose (bool, optional): verbose. Defaults to False.

Returns:

pd.DataFrame: output table

`function` `get_window_target_overlap`

get_window_target_overlap(
    tstart: int,
    tend: int,
    wl: str,
    ws: str,
    nt: str,
    verbose: bool = False
) → tuple

Get window target overlap

Args:

tstart (int): target start
tend (int): target end
wl (str): window locus
ws (str): window sequence
nt (str): nucleotide
verbose (bool, optional): verbose. Defaults to False.

Returns:

tuple: window_overlaps_the_target,wts,nt_in_overlap,wtl

`function` `get_mutated_codon`

get_mutated_codon(
    ts: str,
    tl: str,
    tes: str,
    tel: str,
    strand: str,
    verbose: bool = False
) → str

Get mutated codon

Args:

ts (str): target sequence
tl (str): target locus
tes (str): target edited sequence
tel (str): target edited locus
strand (str): strand
verbose (bool, optional): verbose. Defaults to False.

Returns:

str: mutated codon

`function` `get_coedits_base`

get_coedits_base(
    ws: str,
    wl: str,
    wts: str,
    wtl: str,
    nt: str,
    verbose: bool = False
) → str

Get co-edited bases

Args:

ws (str): window sequence
wl (str): window locus
wts (str): window target overlap sequence
wtl (str): window target overlap locus
nt (str): nucleotide
verbose (bool, optional): verbose. Defaults to False.

Returns:

str: coedits

`module` `beditor.lib`

`module` `beditor.lib.methods`

Global Variables

multint2reg
multint2regcomplement

`function` `dpam2dpam_strands`

dpam2dpam_strands(dpam: DataFrame, pams: list) → DataFrame

Duplicates dpam dataframe to be compatible for searching PAMs on - strand

Args:

dpam (pd.DataFrame): dataframe with pam information
pams (list): pams to be used for actual designing of guides.

Returns:

pd.DataFrame: table

`function` `get_be2dpam`

get_be2dpam(
    din: DataFrame,
    methods: list = None,
    test: bool = False,
    cols_dpam: list = ['PAM', 'PAM position', 'guide length']
) → dict

Make BE to dpam mapping i.e. dict

Args:

din (pd.DataFrame): table with BE and PAM info all cols_dpam needed
methods (list, optional): method names. Defaults to None.
test (bool, optional): test-mode. Defaults to False.
cols_dpam (list, optional): columns to be used. Defaults to ['PAM', 'PAM position', 'guide length'].

Returns:

dict: output dictionary.

`module` `beditor.lib.utils`

Utilities

Global Variables

cols_muts
multint2reg
multint2regcomplement

`function` `get_src_path`

get_src_path() → str

Get the beditor source directory path.

Returns:

str: path

`function` `runbashcmd`

runbashcmd(cmd: str, test: bool = False, logf=None)

Run a bash command

Args:

cmd (str): command
test (bool, optional): test-mode. Defaults to False.
logf (optional): log file instance. Defaults to None.

`function` `log_time_elapsed`

log_time_elapsed(start)

Log time elapsed.

Args:

start (datetime): start time

Returns:

datetime: difference in time.

`function` `rescale`

rescale(
    a: <built-in function array>,
    mn: float = None
) → <built-in function array>

Rescale a vector.

Args:

a (np.array): vector.
mn (float, optional): minimum value. Defaults to None.

Returns:

np.array: output vector

`function` `get_nt2complement`

get_nt2complement()

`function` `s2re`

s2re(s: str, ss2re: dict) → str

String to regex patterns

Args:

s (str): string
ss2re (dict): substrings to regex patterns.

Returns:

str: string with regex patterns.
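
One use of such a substitution is turning a PAM written with IUPAC ambiguity codes into a regular expression. The sketch below shows the idea under the assumption that the mapping is a plain dictionary from code to character class; the names `IUPAC_TO_REGEX`, `pam_to_regex` and `find_pams` are hypothetical, and only the codes that occur in the PAMs listed in this README (N, R, V) are included.

```python
import re

# IUPAC ambiguity codes that occur in the PAMs of the supported methods.
IUPAC_TO_REGEX = {"N": "[ACGT]", "R": "[AG]", "V": "[ACG]"}


def pam_to_regex(pam, mapping=IUPAC_TO_REGEX):
    """Replace each ambiguity code with a character class; plain bases pass through."""
    return "".join(mapping.get(base, base) for base in pam)


def find_pams(seq, pam):
    """Return 0-based start positions of all PAM matches, including overlapping ones."""
    # The lookahead lets matches overlap, which a plain pattern would skip.
    pattern = re.compile(f"(?=({pam_to_regex(pam)}))")
    return [match.start() for match in pattern.finditer(seq)]


print(pam_to_regex("NNGRRT"))
print(find_pams("AGGGTGG", "NGG"))
```

This searches the given strand only; the opposite strand can be covered by searching the reverse complement of the sequence or of the pattern.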

`function` `parse_locus`

parse_locus(s: str, zero_based: bool = True) → tuple

Parse locus

Args:

s (str): location string.
zero_based (bool, optional): zero-based coordinates. Defaults to True.

Returns:

tuple: chrom, start, end, strand

Notes:

beditor outputs (including bed files) use 0-based loci.
pyensembl and IGV use 1-based locations.
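
Moving between the two conventions only shifts the start coordinate. The sketch below assumes that the 0-based intervals are end-exclusive, as in the BED convention; the function names are hypothetical.

```python
def to_one_based(start, end):
    """Convert a 0-based, end-exclusive interval to a 1-based, inclusive one."""
    return start + 1, end


def to_zero_based(start, end):
    """Convert a 1-based, inclusive interval to a 0-based, end-exclusive one."""
    return start - 1, end


# A single base: 0-based 0-1 corresponds to 1-based 1-1.
print(to_one_based(0, 1))
# A 1-based input position such as 1123-1123 becomes 1122-1123 in 0-based form.
print(to_zero_based(1123, 1123))
```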

`function` `get_pos`

get_pos(s: str, l: str, reverse: bool = True, zero_based: bool = True) → Series

Expand locus to positions mapped to nucleotides.

Args:

s (str): sequence
l (str): locus
reverse (bool, optional): reverse the - strand. Defaults to True.
zero_based (bool, optional): zero based coordinates. Defaults to True.

Returns:

pd.Series: output.

`function` `get_seq`

get_seq(
    genome: str,
    contig: str,
    start: int,
    end: int,
    strand: str,
    out_type: str = 'str',
    verbose: bool = False
) → str

Extract a sequence from a genome file based on start and end positions using streaming.

Args:

genome (str): The path to the genome file in FASTA format.
contig (str): chrom
start (int): start
end (int): end
strand (str): strand
out_type (str, optional): type of the output. Defaults to 'str'.
verbose (bool, optional): verbose. Defaults to False.

Raises:

ValueError: invalid strand.

Returns:

str: The extracted sequence.

`function` `read_fasta`

read_fasta(
    fap: str,
    key_type: str = 'id',
    duplicates: bool = False,
    out_type='dict'
) → dict

Read fasta

Args:

fap (str): path
key_type (str, optional): key type. Defaults to 'id'.
duplicates (bool, optional): duplicates present. Defaults to False.

Returns:

dict: data.

Notes:

If duplicates is True, key_type is set to 'description' instead of 'id'.

`function` `format_coords`

format_coords(df: DataFrame) → DataFrame

Format coordinates

Args:

df (pd.DataFrame): table

Returns:

pd.DataFrame: formatted table

`function` `fetch_sequences_bp`

fetch_sequences_bp(p: str, genome: str) → DataFrame

Fetch sequences using biopython.

Args:

p (str): path to the bed file.
genome (str): genome path.

Returns:

pd.DataFrame: sequences.

`function` `fetch_sequences`

fetch_sequences(
    p: str,
    genome_path: str,
    outp: str = None,
    src_path: str = None,
    revcom: bool = True,
    method='2bit',
    out_type='df'
) → DataFrame

Fetch sequences

Args:

p (str): path to the bed file
genome_path (str): genome path
outp (str, optional): output path for fasta file. Defaults to None.
src_path (str, optional): source path. Defaults to None.
revcom (bool, optional): reverse-complement. Defaults to True.
method (str, optional): method name. Defaults to '2bit'.
out_type (str, optional): type of the output. Defaults to 'df'.

Returns:

pd.DataFrame: sequences.

`function` `get_sequences`

get_sequences(
    df1: DataFrame,
    p: str,
    genome_path: str,
    outp: str = None,
    src_path: str = None,
    revcom: bool = True,
    out_type: str = 'df',
    renames: dict = {},
    **kws_fetch_sequences
) → DataFrame

Get sequences for the loci in a table

Args:

df1 (pd.DataFrame): input table
p (str): path to the bed file
outp (str, optional): output path. Defaults to None.
src_path (str, optional): source path. Defaults to None.
revcom (bool, optional): reverse complement. Defaults to True.
out_type (str, optional): output type. Defaults to 'df'.
renames (dict, optional): renames. Defaults to {}.

Returns:

pd.DataFrame: output sequences

Notes:

Input is 1-based.
Output is 0-based.
Saves bed file and gets the sequences.

`function` `to_locus`

to_locus(
    chrom: str = 'chrom',
    start: str = 'start',
    end: str = 'end',
    strand: str = 'strand',
    x: Series = None
) → str

To locus

Args:

chrom (str, optional): chrom. Defaults to 'chrom'.
start (str, optional): start. Defaults to 'start'.
end (str, optional): end. Defaults to 'end'.
strand (str, optional): strand. Defaults to 'strand'.
x (pd.Series, optional): row of the dataframe. Defaults to None.

Returns:

str: locus

`function` `get_flanking_seqs`

get_flanking_seqs(
    df1: DataFrame,
    targets_path: str,
    flanks_path: str,
    genome: str = None,
    search_window: list = None
) → DataFrame

Get flanking sequences

Args:

df1 (pd.DataFrame): input table
targets_path (str): target sequences path
flanks_path (str): flank sequences path
genome (str, optional): genome path. Defaults to None.
search_window (list, optional): search window around the target. Defaults to None.

Returns:

pd.DataFrame: output table with sequences

`function` `get_strand`

get_strand(
    genome,
    df1: DataFrame,
    col_start: str,
    col_end: str,
    col_chrom: str,
    col_strand: str,
    col_seq: str
) → DataFrame

Get strand by comparing the aligned and fetched sequence

Args:

genome: genome instance
df1 (pd.DataFrame): input table.
col_start (str): start
col_end (str): end
col_chrom (str): chrom
col_strand (str): strand
col_seq (str): sequences

Returns:

pd.DataFrame: output table

Notes:

used for tests.
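
The comparison described above can be illustrated with a minimal sketch. It is not the beditor implementation: the helper names `reverse_complement` and `infer_strand` are hypothetical, and it assumes the fetched sequence is always read from the forward strand of the genome.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of a plain A/C/G/T/N sequence."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[nt] for nt in reversed(seq.upper()))


def infer_strand(fetched_seq: str, aligned_seq: str):
    """Hypothetical helper: infer the strand of an aligned sequence.

    Assumes `fetched_seq` was read from the forward strand.
    Returns '+' if the two sequences are identical, '-' if the aligned
    sequence equals the reverse complement of the fetched one, and
    None if neither orientation matches.
    """
    fetched_seq, aligned_seq = fetched_seq.upper(), aligned_seq.upper()
    if fetched_seq == aligned_seq:
        return "+"
    if reverse_complement(fetched_seq) == aligned_seq:
        return "-"
    return None


strand_same = infer_strand("ATGC", "ATGC")
strand_flipped = infer_strand("ATGC", "GCAT")
```

A palindromic sequence matches both orientations; this sketch reports it as '+'.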

`function` `reverse_complement_multintseq`

reverse_complement_multintseq(seq: str, nt2complement: dict) → str

Reverse complement multi-nucleotide sequence

Args:

seq (str): sequence
nt2complement (dict): nucleotide to complement

Returns:

str: sequence
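
As a rough illustration of the idea, the sketch below reverses a sequence and maps each character through a complement dictionary. The function name and the example mapping, which includes a few IUPAC ambiguity codes, are assumptions made for this example and are not taken from beditor.

```python
def reverse_complement_sketch(seq: str, nt2complement: dict) -> str:
    """Reverse the sequence, then complement each character.

    Every character of `seq` must be a key of `nt2complement`.
    """
    return "".join(nt2complement[nt] for nt in reversed(seq))


# Hypothetical mapping: standard bases plus the ambiguity codes
# N (any base), R (A or G) and Y (C or T).
nt2complement = {
    "A": "T", "T": "A", "G": "C", "C": "G",
    "N": "N", "R": "Y", "Y": "R",
}

pam_revcom = reverse_complement_sketch("NGG", nt2complement)
mixed_revcom = reverse_complement_sketch("ARGC", nt2complement)
```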

`function` `reverse_complement_multintseqreg`

reverse_complement_multintseqreg(
    seq: str,
    multint2regcomplement: dict,
    nt2complement: dict
) → str

Reverse complement multi-nucleotide regex patterns

Args:

seq (str): sequence
multint2regcomplement (dict): mapping.
nt2complement (dict): nucleotide to complement

Returns:

str: regex pattern

`function` `hamming_distance`

hamming_distance(s1: str, s2: str) → int

Return the Hamming distance between equal-length sequences

Args:

s1 (str): sequence #1
s2 (str): sequence #2

Raises:

ValueError: Undefined for sequences of unequal length

Returns:

int: distance.
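
A minimal sketch of the computation, written independently of the beditor source: the Hamming distance is the number of positions at which two equal-length sequences differ. The name `hamming_distance_sketch` is used only for this example.

```python
def hamming_distance_sketch(s1: str, s2: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    # Each mismatching pair contributes True (counted as 1) to the sum.
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))


distance = hamming_distance_sketch("GATTACA", "GACTATA")
```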

`function` `align`

align(
    q: str,
    s: str,
    test: bool = False,
    psm: float = 2,
    pmm: float = 0.5,
    pgo: float = -3,
    pge: float = -1
) → str

Creates pairwise local alignment between sequences.

Args:

q (str): query
s (str): subject
test (bool, optional): test-mode. Defaults to False.

Returns:

str: alignment with symbols.

Notes:

REF: http://biopython.org/DIST/docs/api/Bio.pairwise2-module.html

The match parameters are (code: description):

x: No parameters. Identical characters have score of 1, otherwise 0.
m: A match score is the score of identical chars, otherwise mismatch score.
d: A dictionary returns the score of any pair of characters.
c: A callback function returns scores.

The gap penalty parameters are (code: description):

x: No gap penalties.
s: Same open and extend gap penalties for both sequences.
d: The sequences have different open and extend gap penalties.
c: A callback function returns the gap penalties.

`function` `get_orep`

get_orep(seq: str) → int

Get the overrepresentation

`function` `get_polyt_length`

get_polyt_length(s: str) → int

Counts the length of the longest polyT stretch (RNA pol3 terminator) in sequence

Args:

s (str): sequence in string format
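
One way to compute such a value is sketched below with a regular expression. This is an illustration under stated assumptions, not the beditor code: the name `longest_polyt` is hypothetical, the input is treated case-insensitively, and only T (not U) is counted.

```python
import re


def longest_polyt(s: str) -> int:
    """Length of the longest uninterrupted run of T in the sequence."""
    runs = re.findall(r"T+", s.upper())
    # `default=0` covers sequences that contain no T at all.
    return max((len(run) for run in runs), default=0)


polyt_len = longest_polyt("ACGTTTTAGTT")
```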

`function` `get_annots_installed`

get_annots_installed() → DataFrame

Get a list of annotations installed.

Returns:

pd.DataFrame: output.

`function` `get_annots`

get_annots(
    species_name: str = None,
    release: int = None,
    gtf_path: str = None,
    transcript_path: str = None,
    protein_path: str = None,
    reference_name: str = 'assembly',
    annotation_name: str = 'source',
    verbose: bool = False,
    **kws_Genome
)

Get pyensembl annotation instance

Args:

species_name (str, optional): species name. Defaults to None.
release (int, optional): release number. Defaults to None.
gtf_path (str, optional): GTF path. Defaults to None.
transcript_path (str, optional): transcripts path. Defaults to None.
protein_path (str, optional): protein path. Defaults to None.
reference_name (str, optional): reference name. Defaults to 'assembly'.
annotation_name (str, optional): annotation name. Defaults to 'source'.
verbose (bool, optional): verbose. Defaults to False.

Returns:

pyensembl annotation instance

`function` `to_pid`

to_pid(annots, gid: str) → str

To protein ID

Args:

annots: pyensembl annotation instance
gid (str): gene ID

Returns:

str: protein ID

`function` `to_one_based_coordinates`

to_one_based_coordinates(df: DataFrame) → DataFrame

To one based coordinates

Args:

df (pd.DataFrame): input table

Returns:

pd.DataFrame: output table.
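
For orientation, the usual conversion from 0-based, half-open intervals (as in BED files) to 1-based, closed intervals adds 1 to the start and leaves the end unchanged. The sketch below shows that arithmetic on plain dictionaries instead of a DataFrame. The function name and the 'start' and 'end' keys are assumptions for this example, and whether beditor applies exactly this rule is not confirmed here.

```python
def to_one_based_sketch(records: list) -> list:
    """Convert 0-based, half-open intervals to 1-based, closed intervals.

    Assumes each record has integer 'start' and 'end' keys (hypothetical
    column names). Only 'start' changes; the input is not modified.
    """
    return [{**record, "start": record["start"] + 1} for record in records]


bed_like = [{"chrom": "1", "start": 99, "end": 120}]
one_based = to_one_based_sketch(bed_like)
```

Both representations describe the same 21 bases: 120 - 99 in the 0-based form and 120 - 100 + 1 in the 1-based form.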

`module` `beditor.lib.viz`

Visualizations.

`function` `to_igv`

to_igv(
    cfg: dict = None,
    gtf_path: str = None,
    genome_path: str = None,
    output_dir_path: str = None,
    threads: int = 1,
    output_ext: str = None,
    force: bool = False
) → str

To IGV session file.

Args:

cfg (dict, optional): configuration of the run. Defaults to None.
gtf_path (str, optional): path to the gtf file. Defaults to None.
genome_path (str, optional): path to the genome file. Defaults to None.
output_dir_path (str, optional): path to the output directory. Defaults to None.
threads (int, optional): threads. Defaults to 1.
output_ext (str, optional): extension of the output. Defaults to None.
force (bool, optional): force. Defaults to False.

Returns:

str: path to the session file.

`function` `get_nt_composition`

get_nt_composition(seqs: list) → DataFrame

Get nt composition.

Args:

seqs (list): list of sequences

Returns:

pd.DataFrame: table with the frequencies of the nucleotides.
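
The sketch below shows one plausible reading of this calculation: the frequency of each nucleotide at each position across equal-length sequences. It uses only the standard library and returns a list of dictionaries (one per position) rather than a DataFrame. The function name and the per-position layout are assumptions for illustration.

```python
from collections import Counter


def nt_composition_sketch(seqs: list) -> list:
    """Per-position nucleotide frequencies for equal-length sequences.

    Returns one dict per position, mapping A, C, G and T to the
    fraction of sequences carrying that nucleotide at the position.
    """
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(seq) != length for seq in seqs):
        raise ValueError("all sequences must have the same length")
    table = []
    # zip(*seqs) yields the nucleotides of all sequences at one position.
    for column in zip(*(seq.upper() for seq in seqs)):
        counts = Counter(column)
        table.append({nt: counts.get(nt, 0) / len(seqs) for nt in "ACGT"})
    return table


composition = nt_composition_sketch(["ACGT", "ACGA"])
```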

`function` `plot_ntcompos`

plot_ntcompos(
    seqs: list,
    pam_pos: str,
    pam_len: int,
    window: list = None,
    ax: Axes = None,
    color_pam: str = 'lime',
    color_window: str = 'gold'
) → Axes

Plot nucleotide composition

Args:

seqs (list): list of sequences.
pam_pos (str): PAM position.
pam_len (int): PAM length.
window (list, optional): activity window bounds. Defaults to None.
ax (plt.Axes, optional): subplot. Defaults to None.
color_pam (str, optional): color of the PAM. Defaults to 'lime'.
color_window (str, optional): color of the window. Defaults to 'gold'.

Returns:

plt.Axes: subplot

`function` `plot_ontarget`

plot_ontarget(
    guide_loc: str,
    pam_pos: str,
    pam_len: int,
    guidepam_seq: str,
    window: list = None,
    show_title: bool = False,
    figsize: list = [10, 2],
    verbose: bool = False,
    kws_sg: dict = {}
) → Axes

Plot the sgRNA and PAM at the on-target locus.

Args:

guide_loc (str): sgRNA locus
pam_pos (str): PAM position
pam_len (int): PAM length
guidepam_seq (str): sgRNA and PAM sequence
window (list, optional): activity window bounds. Defaults to None.
show_title (bool, optional): show the title. Defaults to False.
figsize (list, optional): figure size. Defaults to [10,2].
verbose (bool, optional): verbose. Defaults to False.
kws_sg (dict, optional): keyword arguments to plot the sgRNA. Defaults to {}.

Returns:

plt.Axes: subplot

TODOs:

1. Convert to 1-based coordinates.
2. Features from the GTF file.

`function` `get_plot_inputs`

get_plot_inputs(df2: DataFrame) → list

Get plot inputs.

Args:

df2 (pd.DataFrame): table.

Returns:

list: list of tables.

`function` `plot_library_stats`

plot_library_stats(
    dfs: list,
    palette: dict = {True: 'b', False: 'lightgray'},
    cutoffs: dict = None,
    not_be: bool = True,
    dbug: bool = False,
    figsize: list = [10, 2.5]
) → list

Plot library stats

Args:

dfs (list): list of tables.
palette (dict, optional): color palette. Defaults to {True:'b',False:'lightgray'}.
cutoffs (dict, optional): cutoffs to be applied. Defaults to None.
not_be (bool, optional): not a base editor. Defaults to True.
dbug (bool, optional): debug mode. Defaults to False.
figsize (list, optional): figure size. Defaults to [10,2.5].

Returns:

list: list of subplots.

`module` `beditor.run`

Command-line options

`function` `validate_params`

validate_params(parameters: dict) → bool

Validate the parameters.

Args:

parameters (dict): parameters

Returns:

bool: whether the parameters are valid or not

`function` `cli`

cli(
    editor: str = None,
    mutations_path: str = None,
    output_dir_path: str = None,
    species: str = None,
    ensembl_release: int = None,
    genome_path: str = None,
    gtf_path: str = None,
    rna_path: str = None,
    prt_path: str = None,
    search_window: int = None,
    not_be: bool = False,
    config_path: str = None,
    wd_path: str = None,
    threads: int = 1,
    kernel_name: str = 'beditor',
    verbose='WARNING',
    igv_path_prefix=None,
    ext: str = None,
    force: bool = False,
    dbug: bool = False,
    skip=None,
    **kws
)

beditor command-line (CLI)

Args:

editor (str, optional): base-editing method, available methods can be listed using command: 'beditor resources'. Defaults to None.
mutations_path (str, optional): path to the mutation file, the format of which is available at https://github.com/rraadd88/beditor/README.md#Input-format. Defaults to None.
output_dir_path (str, optional): path to the directory where the outputs should be saved. Defaults to None.
species (str, optional): species name. Defaults to None.
ensembl_release (int, optional): Ensembl release number. Defaults to None.
genome_path (str, optional): path to the genome file, which is not available on Ensembl. Defaults to None.
gtf_path (str, optional): path to the gene annotations file, which is not available on Ensembl. Defaults to None.
rna_path (str, optional): path to the transcript sequences file, which is not available on Ensembl. Defaults to None.
prt_path (str, optional): path to the protein sequences file, which is not available on Ensembl. Defaults to None.
search_window (int, optional): number of bases to search on either side of a target, if not specified, it is inferred by beditor. Defaults to None.
not_be (bool, optional): do not process as a base editor. Defaults to False.
config_path (str, optional): path to the configuration file. Defaults to None.
wd_path (str, optional): path to the working directory. Defaults to None.
threads (int, optional): number of threads. Defaults to 1.
kernel_name (str, optional): name of the jupyter kernel. Defaults to "beditor".
verbose (str, optional): verbosity of the logging, levels in decreasing order of detail: DEBUG > INFO > WARNING > ERROR > CRITICAL. Defaults to "WARNING".
igv_path_prefix (str, optional): prefix to be added to the IGV URL. Defaults to None.
ext (str, optional): file extensions of the output tables. Defaults to None.
force (bool, optional): overwrite the outputs if they exist. Defaults to False.
dbug (bool, optional): debug mode (developer). Defaults to False.
skip (type, optional): skip sections of the workflow (developer). Defaults to None.

Examples:

beditor cli -c inputs/mutations/protein/positions.yml

Notes:

Required parameters for a run: either editor, mutations_path and output_dir_path, or config_path.
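
The requirement in the note above can be expressed as a small check. This is a sketch only: `has_required_params` is a hypothetical helper, not part of beditor, and it does not reproduce whatever additional checks `validate_params` performs.

```python
def has_required_params(parameters: dict) -> bool:
    """Hypothetical check of the minimal inputs for a run.

    A run needs either a configuration file (config_path) or all three
    of editor, mutations_path and output_dir_path.
    """
    if parameters.get("config_path"):
        return True
    required = ("editor", "mutations_path", "output_dir_path")
    # Missing keys and None values both count as "not provided".
    return all(parameters.get(key) for key in required)


from_config = has_required_params({"config_path": "beditor_config.yml"})
from_options = has_required_params(
    {
        "editor": "BE1",
        "mutations_path": "path/to/mutations.tsv",
        "output_dir_path": "path/to/output_directory/",
    }
)
incomplete = has_required_params({"editor": "BE1"})
```

The example values mirror the two command-line forms shown in the Usage section.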

`function` `gui`

gui()

`function` `resources`

resources()
