beditor
(v2) A Computational Workflow for Designing Libraries of sgRNAs for CRISPR-Mediated Base Editing, and much more
beditor gui
Note: GUI is recommended for designing small libraries and prioritization of the guides.
beditor cli --editor BE1 -m path/to/mutations.tsv -o path/to/output_directory/ --species human --ensembl-release 110
or
beditor cli -c beditor_config.yml
conda create -n beditor python=3.9; # options: conda/mamba, python=3.9/3.8
python -m ipykernel install --user --name beditor
pip install beditor[all]
pip install beditor # only cli
pip install beditor[gui] # plus gui
For fast processing of large genomes (highly recommended for human genome):
conda install bioconda::ucsc-fatotwobit bioconda::ucsc-twobittofa bioconda::ucsc-twobitinfo # options: conda/mamba
Else, for moderately fast processing,
conda install bioconda::bedtools # options: conda/mamba
Note: The coordinates are 1-based (i.e. X:1-1 instead of X:0:1) and IDs correspond to the chosen genome assemblies (e.g. from Ensembl).
Point mutations
chrom start end strand mutation
5 1123 1123 + C
Position scanning
chrom start end strand
5 1123 1123 +
Region scanning
chrom start end strand
5 1123 2123 +
Protein point mutations
protein id aa pos mutation
ENSP1123 43 S
Protein position scanning
protein id aa pos
ENSP1123 43
Protein region scanning
protein id aa start aa end
ENSP1123 43 143
Note: Ensembl protein IDs are used.
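As an illustration of the input layouts above, a point-mutation table can be written with the standard library. This is a minimal sketch that assumes the file is tab-separated (suggested by the `mutations.tsv` name in the command-line example); `to_tsv` is a hypothetical helper, not part of beditor.

```python
import csv
import io

# Column names follow the "Point mutations" layout; coordinates are 1-based.
columns = ["chrom", "start", "end", "strand", "mutation"]
rows = [{"chrom": "5", "start": 1123, "end": 1123, "strand": "+", "mutation": "C"}]


def to_tsv(rows, columns):
    """Serialize rows as tab-separated text (assumed input format)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


tsv_text = to_tsv(rows, columns)
print(tsv_text)
```

Writing `tsv_text` to a file gives a table that could be passed with `-m`; the other layouts differ only in their columns.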
Note: The output uses 0-based coordinates.
guide sequence guide locus offtargets score {columns in the input}
AGCGTTTGGCAAATCAAACAAAA 4:1003215-1003238(+) 0 1 ..
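The guide locus in the example row can be split into its parts with a regular expression. The sketch below assumes the `chrom:start-end(strand)` layout shown above and a half-open interval, which is consistent with the example (end minus start equals the 23 nt of the guide sequence). `split_locus` is a hypothetical helper, not the package's own parser.

```python
import re

# Layout assumed from the example row: chrom:start-end(strand)
LOCUS_PATTERN = re.compile(
    r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)\((?P<strand>[+-])\)$"
)


def split_locus(locus: str) -> tuple:
    """Split a locus string into (chrom, start, end, strand)."""
    match = LOCUS_PATTERN.match(locus)
    if match is None:
        raise ValueError(f"unrecognized locus: {locus!r}")
    return match["chrom"], int(match["start"]), int(match["end"]), match["strand"]


chrom, start, end, strand = split_locus("4:1003215-1003238(+)")
guide = "AGCGTTTGGCAAATCAAACAAAA"
# With 0-based, half-open coordinates the span equals the sequence length.
print(end - start == len(guide))
```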
method | nucleotide | nucleotide mutation | window start | window end | guide length | PAM | PAM position |
---|---|---|---|---|---|---|---|
A3A-BE3 | C | T | 4 | 8 | 20 | NGG | down |
ABE7.10 | A | G | 4 | 7 | 20 | NGG | down |
ABE7.10* | A | G | 4 | 8 | 20 | NGG | down |
ABE7.9 | A | G | 5 | 8 | 20 | NGG | down |
ABESa | A | G | 6 | 12 | 21 | NNGRRT | down |
BE-PLUS | C | T | 4 | 14 | 20 | NGG | down |
BE1 | C | T | 4 | 8 | 20 | NGG | down |
BE2 | C | T | 4 | 8 | 20 | NGG | down |
BE3 | C | T | 4 | 8 | 20 | NGG | down |
BE4-Gam | C | T | 4 | 8 | 20 | NGG | down |
BE4/BE4max | C | T | 4 | 8 | 20 | NGG | down |
Cas12a-BE | C | T | 10 | 12 | 23 | TTTV | up |
eA3A-BE3 | C | T | 4 | 8 | 20 | NGG | down |
EE-BE3 | C | T | 5 | 6 | 20 | NGG | down |
HF-BE3 | C | T | 4 | 8 | 20 | NGG | down |
Sa(KKH)-ABE | A | G | 6 | 12 | 21 | NNNRRT | down |
SA(KKH)-BE3 | C | T | 3 | 12 | 21 | NNNRRT | down |
SaBE3 | C | T | 3 | 12 | 21 | NNGRRT | down |
SaBE4 | C | T | 3 | 12 | 21 | NNGRRT | down |
SaBE4-Gam | C | T | 3 | 12 | 21 | NNGRRT | down |
Target-AID | C | T | 2 | 4 | 20 | NGG | down |
Target-AID | C | T | 2 | 4 | 20 | NG | down |
VQR-ABE | A | G | 4 | 6 | 20 | NGA | down |
VQR-BE3 | C | T | 4 | 11 | 20 | NGAN | down |
VRER-ABE | A | G | 4 | 6 | 20 | NGCG | down |
VRER-BE3 | C | T | 3 | 10 | 20 | NGCG | down |
xBE3 | C | T | 4 | 8 | 20 | NG | down |
YE1-BE3 | C | T | 5 | 7 | 20 | NGG | down |
YE2-BE3 | C | T | 5 | 6 | 20 | NGG | down |
YEE-BE3 | C | T | 5 | 6 | 20 | NGG | down |
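To make the window columns concrete, the sketch below lists the positions in a guide that carry the editable nucleotide inside the activity window. The table does not state how positions are counted, so this assumes 1-based, inclusive positions counted from the 5' end of the guide; `editable_positions` and the example sequence are hypothetical.

```python
def editable_positions(
    guide: str, nucleotide: str, window_start: int, window_end: int
) -> list:
    """Return 1-based guide positions inside the window that match the nucleotide.

    Assumption: positions are counted from the 5' end of the guide, inclusive.
    """
    return [
        pos
        for pos in range(window_start, window_end + 1)
        if pos <= len(guide) and guide[pos - 1] == nucleotide
    ]


# BE3 row of the table: nucleotide C, window 4-8, guide length 20.
print(editable_positions("GACCTCAGTTGACCTGAACG", "C", 4, 8))  # [4, 6]
```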
Favorite base editor not listed?
Please send the required info using a PR, or an issue.
New features:
- gui contains library filtering and prioritization options.
- not_be option.
Key updates:
- bwa comes in the package, and samtools not needed).
- pandas etc.
Technical updates:
- gui is powered by mercury, thus overcoming the limitations of v1.
- method) per run, instead of multiple.
- multiprocessing.
- cli is compatible with Python 3.8 and 3.9 (and possibly higher, untested versions); however, the gui is not supported on Python 3.7 due to a lack of dependencies.
Using BibTeX:
@software{Dandage_beditor,
title = {beditor: A Computational Workflow for Designing Libraries of sgRNAs for CRISPR-Mediated Base Editing},
author = {Dandage, Rohan},
year = {2024},
url = {https://doi.org/10.5281/zenodo.10648264},
version = {v2.0.1},
note = {The URL is a DOI link to the permanent archive of the software.},
}
Using citation information from CITATION.CFF file.
beditor.lib.get_mutations
Mutation co-ordinates using pyensembl
get_protein_cds_coords
get_protein_cds_coords(annots, protein_id: str) → DataFrame
Get protein CDS coordinates
Args:
annots: pyensembl annotations
protein_id (str): protein ID
Returns:
pd.DataFrame: output table
get_protein_mutation_coords
get_protein_mutation_coords(data: DataFrame, aapos: int, test=False) → tuple
Get protein mutation coordinates
Args:
data (pd.DataFrame): input table
aapos (int): amino acid position
test (bool, optional): test-mode. Defaults to False.
Raises:
ValueError: invalid positions
Returns:
tuple: aapos,start,end,seq
map_coords
map_coords(df_: DataFrame, df1_: DataFrame, verbose: bool = False) → DataFrame
Map coordinates
Args:
df_ (pd.DataFrame): input table
Returns:
pd.DataFrame: output table
get_mutation_coords_protein
get_mutation_coords_protein(
df0: DataFrame,
annots,
search_window: int,
outd: str = None,
force: bool = False,
verbose: bool = False
) → DataFrame
Get mutation coordinates for protein
Args:
df0 (pd.DataFrame): input table
annots (type): pyensembl annotations
search_window (int): search window length on either side of the target
outd (str, optional): output directory path. Defaults to None.
force (bool, optional): force. Defaults to False.
verbose (bool, optional): verbose. Defaults to False.
Returns:
pd.DataFrame: output table
get_mutation_coords
get_mutation_coords(
df0: DataFrame,
annots,
search_window: int,
verbose: bool = False,
**kws_protein
) → DataFrame
Get mutation coordinates
Args:
df0 (pd.DataFrame): input table
annots (type): pyensembl annotation
search_window (int): search window length on either side of the target
verbose (bool, optional): verbose. Defaults to False.
Returns:
pd.DataFrame: output table
beditor.lib.get_scores
Scores
get_ppamdist
get_ppamdist(
guide_length: int,
pam_len: int,
pam_pos: str,
ppamdist_min: int
) → DataFrame
Get the set of penalties based on the distances of the mismatch(es) from the PAM.
:param guide_length: length of guide sequence
:param pam_len: length of PAM sequence
:param pam_pos: PAM location 3' or 5'
:param ppamdist_min: minimum penalty
:param pmutatpam: penalty for mismatch at PAM
TODOs: Use different scoring function for different methods.
get_beditorscore_per_alignment
get_beditorscore_per_alignment(
NM: int,
alignment: str,
pam_len: int,
pam_pos: str,
pentalty_genic: float = 0.5,
pentalty_intergenic: float = 0.9,
pentalty_dist_from_pam: float = 0.1,
verbose: bool = False
) → float
Calculates beditor score per alignment between guide and genomic DNA.
:param NM: Hamming distance
:param mismatches_max: Maximum mismatches allowed in alignment
:param alignment: Symbol '|' means a match, '.' means mismatch and ' ' means gap. e.g. |||||.||||||||||.||||.|
:param pentalty_genic: penalty for genic alignment
:param pentalty_intergenic: penalty for intergenic alignment
:param pentalty_dist_from_pam: maximum penalty for a mismatch at PAM
:returns: beditor score per alignment.
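The alignment string described above can be scanned for mismatches and their distance from the PAM end. This is only an illustrative sketch, not the scoring used by beditor: it assumes the PAM adjoins the last character of the string when the PAM is downstream ('down') and the first character when it is upstream ('up'). The helper name and the 'up'/'down' values are assumptions.

```python
def mismatch_distances_from_pam(alignment: str, pam_pos: str) -> list:
    """Distance of each mismatch ('.') from the PAM-proximal end (1 = adjacent).

    Assumption: the PAM adjoins the last character for 'down', the first for 'up'.
    """
    if pam_pos not in ("up", "down"):
        raise ValueError(f"invalid pam_pos: {pam_pos!r}")
    length = len(alignment)
    distances = []
    for index, symbol in enumerate(alignment):
        if symbol == ".":
            distances.append(length - index if pam_pos == "down" else index + 1)
    return distances


# Example string from the parameter description above (23 symbols, 3 mismatches).
print(mismatch_distances_from_pam("|||||.||||||||||.||||.|", "down"))  # [18, 7, 2]
```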
get_beditorscore_per_guide
get_beditorscore_per_guide(
guide_seq: str,
strategy: str,
align_seqs_scores: DataFrame,
dBEs: DataFrame,
penalty_activity_window: float = 0.5,
test: bool = False
) → float
Calculates beditor score per guide.
:param guide_seq: guide sequence (23 nt)
:param strategy: strategy string e.g. ABE;+;@-14;ACT:GCT;T:A;
:param align_seqs_scores: list of beditor scores per alignments for all the alignments between guide and genomic DNA
:param penalty_activity_window: if editable base is not in activity window, penalty_activity_window=0.5
:returns: beditor score per guide.
revcom
revcom(s)
calc_cfd
calc_cfd(wt, sg, pam)
get_cfdscore
get_cfdscore(wt, off)
beditor.lib.get_specificity
Specificities
run_alignment
run_alignment(
src_path: str,
genomep: str,
guidesfap: str,
guidessamp: str,
guidel: int,
mismatches_max: int = 2,
threads: int = 1,
force: bool = False,
verbose: bool = False
) → str
Run alignment
Args:
src_path (str): source path
genomep (str): genome path
guidesfap (str): guide fasta path
guidessamp (str): guide sam path
threads (int, optional): threads. Defaults to 1.
force (bool, optional): force. Defaults to False.
verbose (bool, optional): verbose. Defaults to False.
Returns:
str: alignment file.
read_sam
read_sam(align_path: str) → DataFrame
Read alignment file
Args:
align_path (str): path to the alignment file
Returns:
pd.DataFrame: output table
Notes:
Tag | Meaning |
---|---|
NM | Edit distance |
MD | Mismatching positions/bases |
AS | Alignment score |
BC | Barcode sequence |
X0 | Number of best hits |
X1 | Number of suboptimal hits found by BWA |
XN | Number of ambiguous bases in the reference |
XM | Number of mismatches in the alignment |
XO | Number of gap opens |
XG | Number of gap extensions |
XT | Type: Unique/Repeat/N/Mate-sw |
XA | Alternative hits; format: (chr,pos,CIGAR,NM;)* |
XS | Suboptimal alignment score |
XF | Support from forward/reverse alignment |
XE | Number of supporting seeds |
Reference: https://bio-bwa.sourceforge.net/bwa.shtml
parse_XA
parse_XA(XA: str) → DataFrame
Parse XA tags
Args:
XA (str): XA tag
Notes:
format: (chr,pos,CIGAR,NM;)*
Example: XA='4,+908051,23M,0;4,+302823,23M,0;4,-183556,23M,0;4,+1274932,23M,0;4,+207765,23M,0;4,+456906,23M,0;4,-1260135,23M,0;4,+454215,23M,0;4,-1177442,23M,0;4,+955254,23M,1;4,+1167921,23M,1;4,-613257,23M,1;4,+857893,23M,1;4,-932678,23M,2;4,-53825,23M,2;4,+306783,23M,2;'
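The XA format above can be split into records with plain string operations. The following is a minimal sketch rather than the package's `parse_XA` (which returns a DataFrame); `parse_xa_tag` and its output keys are hypothetical.

```python
def parse_xa_tag(xa: str) -> list:
    """Parse a BWA XA tag of the form (chr,pos,CIGAR,NM;)* into dictionaries."""
    records = []
    for entry in xa.split(";"):
        if not entry:  # skip the empty piece after the trailing ';'
            continue
        chrom, pos, cigar, nm = entry.split(",")
        records.append(
            {
                "chrom": chrom,
                "strand": pos[0],    # leading '+' or '-' encodes the strand
                "pos": int(pos[1:]),
                "CIGAR": cigar,
                "NM": int(nm),       # edit distance of the alternative hit
            }
        )
    return records


hits = parse_xa_tag("4,+908051,23M,0;4,-183556,23M,0;4,+955254,23M,1;")
print(len(hits), hits[1])
```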
get_extra_alignments
get_extra_alignments(
df1: DataFrame,
genome: str,
bed_path: str,
alignments_max: int = 10,
threads: int = 1
) → DataFrame
Get extra alignments
Args:
df1 (pd.DataFrame): input table
alignments_max (int, optional): alignments max. Defaults to 10.
threads (int, optional): threads. Defaults to 1.
Returns:
pd.DataFrame: output table
TODOs:
1. apply parallel processing to get_seq
to_pam_coord
to_pam_coord(
pam_pos: str,
pam_len: int,
align_start: int,
align_end: int,
strand: str
) → tuple
Get PAM coords
Args:
pam_pos (str): PAM position
pam_len (int): PAM length
align_start (int): alignment start
align_end (int): alignment end
strand (str): strand
Returns:
tuple: start,end
get_alignments
get_alignments(
align_path: str,
genome: str,
alignments_max: int,
pam_pos: str,
pam_len: int,
guide_len: int,
pam_pattern: str,
pam_bed_path: str,
extra_bed_path: str,
**kws_xa
) → DataFrame
Get alignments
Args:
align_path (str): alignment path
genome (str): genome path
pam_pos (str): PAM position
pam_len (int): PAM length
guide_len (int): sgRNA length
pam_pattern (str): PAM pattern
pam_bed_path (str): PAM bed path
Returns:
pd.DataFrame: output table
get_penalties
get_penalties(
aligns: DataFrame,
guides: DataFrame,
annots: DataFrame
) → DataFrame
Get penalties
Args:
aligns (pd.DataFrame): alignments
guides (pd.DataFrame): guides
annots (pd.DataFrame): annotations
Returns:
pd.DataFrame: output table
score_alignments
score_alignments(
df4: DataFrame,
pam_len: int,
pam_pos: str,
pentalty_genic: float = 0.5,
pentalty_intergenic: float = 0.9,
pentalty_dist_from_pam: float = 0.1,
verbose: bool = False
) → tuple
Score alignments
Args:
df4 (pd.DataFrame): input table
pam_pos (str): PAM position
pentalty_genic (float, optional): penalty for offtarget in genic locus. Defaults to 0.5.
pentalty_intergenic (float, optional): penalty for offtarget in intergenic locus. Defaults to 0.9.
pentalty_dist_from_pam (float, optional): penalty for offtarget wrt distance from PAM. Defaults to 0.1.
verbose (bool, optional): verbose. Defaults to False.
Returns:
tuple: tables
Note:
1. Low value corresponds to high penalty and vice versa, because values are multiplied.
2. High penalty means consequential offtarget alignment and vice versa.
score_guides
score_guides(
guides: DataFrame,
scores: DataFrame,
not_be: bool = False
) → DataFrame
Score guides
Args:
guides (pd.DataFrame): guides
scores (pd.DataFrame): scores
not_be (bool, optional): not a base editor. Defaults to False.
Returns:
pd.DataFrame: output table
Changes: penalty_activity_window disabled as only the sgRNAs with target in the window are reported.
beditor.lib.io
Input/Output
download_annots
download_annots(species_name: str, release: int) → bool
Download annotations using pyensembl
Args:
species_name (str): species name
release (int): release number
Returns:
bool: whether annotation is downloaded or not
cache_subdirectory
cache_subdirectory(
reference_name: str = None,
annotation_name: str = None,
annotation_version: int = None,
CACHE_BASE_SUBDIR: str = 'beditor'
) → str
Which cache subdirectory to use for a given annotation database over a particular reference. All arguments can be omitted to just get the base subdirectory for all pyensembl cached datasets.
Args:
reference_name (str, optional): reference name. Defaults to None.
annotation_name (str, optional): annotation name. Defaults to None.
annotation_version (int, optional): annotation version. Defaults to None.
CACHE_BASE_SUBDIR (str, optional): cache path. Defaults to 'beditor'.
Returns:
str: output path
cached_path
cached_path(path_or_url: str, cache_directory_path: str)
When downloading remote files, the default behavior is to name local files the same as their remote counterparts.
to_downloaded_cached_path
to_downloaded_cached_path(
url: str,
annots=None,
reference_name: str = None,
annotation_name: str = 'ensembl',
ensembl_release: str = None,
CACHE_BASE_SUBDIR: str = 'pyensembl'
) → str
To downloaded cached path
Args:
url (str): URL
annots (optional): pyensembl annotation. Defaults to None.
reference_name (str, optional): reference name. Defaults to None.
annotation_name (str, optional): annotation name. Defaults to 'ensembl'.
ensembl_release (str, optional): ensembl release. Defaults to None.
CACHE_BASE_SUBDIR (str, optional): cache path. Defaults to 'pyensembl'.
Returns:
str: output path
download_genome
download_genome(
species: str,
ensembl_release: int,
force: bool = False,
verbose: bool = False
) → str
Download genome
Args:
species (str): species name
ensembl_release (int): release
force (bool, optional): force. Defaults to False.
verbose (bool, optional): verbose. Defaults to False.
Returns:
str: output path
read_genome
read_genome(genome_path: str, fast=True)
Read genome
Args:
genome_path (str): genome path
fast (bool, optional): fast mode. Defaults to True.
to_fasta
to_fasta(
sequences: dict,
output_path: str,
molecule_type: str,
force: bool = True,
**kws_SeqRecord
) → str
Save fasta file.
Args:
sequences (dict): dictionary mapping the sequence name to the sequence.
output_path (str): path of the fasta file.
force (bool): overwrite if file exists.
Returns:
output_path (str): path of the fasta file
to_2bit
to_2bit(
genome_path: str,
src_path: str = None,
force: bool = False,
verbose: bool = False
) → str
To 2bit
Args:
genome_path (str): genome path
src_path (str, optional): source path. Defaults to None.
verbose (bool, optional): verbose. Defaults to False.
Returns:
str: output path
to_fasta_index
to_fasta_index(
genome_path: str,
bgzip: bool = False,
bgzip_path: str = None,
threads: int = 1,
verbose: bool = True,
force: bool = False,
indexed: bool = False
) → str
To fasta index
Args:
genome_path (str): genome path
bgzip_path (str, optional): bgzip path. Defaults to None.
threads (int, optional): threads. Defaults to 1.
verbose (bool, optional): verbose. Defaults to True.
force (bool, optional): force. Defaults to False.
indexed (bool, optional): indexed or not. Defaults to False.
Returns:
str: output path
to_bed
to_bed(
df: DataFrame,
outp: str,
cols: list = ['chrom', 'start', 'end', 'locus', 'score', 'strand']
) → str
To bed path
Args:
df (pd.DataFrame): input table
outp (str): output path
cols (list, optional): columns. Defaults to ['chrom','start','end','locus','score','strand'].
Returns:
str: output path
read_bed
read_bed(
p: str,
cols: list = ['chrom', 'start', 'end', 'locus', 'score', 'strand']
) → DataFrame
Read bed file
Args:
p (str): path
cols (list, optional): columns. Defaults to ['chrom','start','end','locus','score','strand'].
Returns:
pd.DataFrame: output table
to_viz_inputs
to_viz_inputs(
gtf_path: str,
genome_path: str,
output_dir_path: str,
output_ext: str = 'tsv',
threads: int = 1,
force: bool = False
) → dict
To viz inputs for the IGV
Args:
gtf_path (str): GTF path
genome_path (str): genome path
output_dir_path (str): output directory path
output_ext (str, optional): output extension. Defaults to 'tsv'.
threads (int, optional): threads. Defaults to 1.
force (bool, optional): force. Defaults to False.
Returns:
dict: configuration
to_igv_path_prefix
to_igv_path_prefix() → str
Get IGV path prefix
Returns:
str: URL
to_session_path
to_session_path(p: str, path_prefix: str = None, outp: str = None) → str
To session path
Args:
p (str): session configuration path
path_prefix (str, optional): path prefix. Defaults to None.
outp (str, optional): output path. Defaults to None.
Returns:
str: output path
read_cytobands
read_cytobands(
cytobands_path: str,
col_chrom: str = 'chromosome',
remove_prefix: str = 'chr'
) → DataFrame
Read cytobands
Args:
cytobands_path (str): path
col_chrom (str, optional): column with contig. Defaults to 'chromosome'.
Returns:
pd.DataFrame: output table
to_output
to_output(inputs: DataFrame, guides: DataFrame, scores: DataFrame) → DataFrame
To output table
Args:
inputs (pd.DataFrame): inputs
guides (pd.DataFrame): guides
scores (pd.DataFrame): scores
Returns:
pd.DataFrame: output table
beditor.lib.make_guides
Designing the sgRNAs
get_guide_pam
get_guide_pam(
match: str,
pam_stream: str,
guidel: int,
seq: str,
pos_codon: int = None
)
get_pam_searches
get_pam_searches(dpam: DataFrame, seq: str, pos_codon: int) → DataFrame
Search PAM occurrence
:param dpam: dataframe with PAM sequences
:param seq: target sequence
:param pos_codon: reading frame
:param test: debug mode on
:returns dpam_searches: dataframe with positions of pams
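Searching for PAM occurrences can be illustrated by expanding IUPAC codes into a regular expression and scanning with a lookahead, so that overlapping matches are reported. This sketch covers the forward strand only and is not the implementation of `get_pam_searches`; `pam_to_regex` and `find_pams` are hypothetical helpers.

```python
import re

# Standard IUPAC nucleotide codes used by the PAMs in the editor table.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "N": "[ACGT]", "R": "[AG]", "Y": "[CT]", "V": "[ACG]",
}


def pam_to_regex(pam: str) -> str:
    """Expand a PAM written with IUPAC codes into a regex pattern."""
    return "".join(IUPAC[nt] for nt in pam)


def find_pams(seq: str, pam: str) -> list:
    """Return (0-based start, matched PAM) for every occurrence, overlaps included."""
    pattern = re.compile(f"(?=({pam_to_regex(pam)}))")
    return [(m.start(1), m.group(1)) for m in pattern.finditer(seq)]


print(find_pams("ACGGTGGA", "NGG"))  # [(1, 'CGG'), (4, 'TGG')]
```

The lookahead matters for PAMs such as NGG, where runs of G produce overlapping hits that a plain `findall` would skip.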
get_guides
get_guides(
data: DataFrame,
dpam: DataFrame,
guide_len: int,
base_fraction_max: float = 0.8
) → DataFrame
Get guides
Args:
data (pd.DataFrame): input table
dpam (pd.DataFrame): table with PAM info
guide_len (int): guide length
base_fraction_max (float, optional): base fraction max. Defaults to 0.8.
Returns:
pd.DataFrame: output table
to_locusby_pam
to_locusby_pam(
chrom: str,
pam_start: int,
pam_end: int,
pam_position: str,
strand: str,
length: int,
start_off: int = 0
) → str
To locus by PAM from PAM coords.
Args:
chrom (str): chrom
pam_start (int): PAM start
pam_end (int): PAM end
pam_position (str): PAM position
strand (str): strand
length (int): length
Returns:
str: locus
to_pam_coord
to_pam_coord(
startf: int,
endf: int,
startp: int,
endp: int,
strand: str
) → tuple
To PAM coordinates
Args:
startf (int): flank start
endf (int): flank end
startp (int): PAM start
endp (int): PAM end
strand (str): strand
Returns:
tuple: start,end
get_distances
get_distances(df2: DataFrame, df3: DataFrame, cfg_method: dict) → DataFrame
Get distances
Args:
df2 (pd.DataFrame): input table #1
df3 (pd.DataFrame): input table #2
cfg_method (dict): config for the method
Returns:
pd.DataFrame: output table
get_windows_seq
get_windows_seq(s: str, l: str, wl: str, verbose: bool = False) → str
Sequence by guide strand
Args:
s (str): sequence
l (str): locus
wl (str): window locus
verbose (bool, optional): verbose. Defaults to False.
Returns:
str: window sequence
filter_guides
filter_guides(
df1: DataFrame,
cfg_method: dict,
verbose: bool = False
) → DataFrame
Filter sgRNAs
Args:
df1 (pd.DataFrame): input table
cfg_method (dict): config of the method
verbose (bool, optional): verbose. Defaults to False.
Returns:
pd.DataFrame: output table
get_window_target_overlap
get_window_target_overlap(
tstart: int,
tend: int,
wl: str,
ws: str,
nt: str,
verbose: bool = False
) → tuple
Get window target overlap
Args:
tstart (int): target start
tend (int): target end
wl (str): window locus
ws (str): window sequence
nt (str): nucleotide
verbose (bool, optional): verbose. Defaults to False.
Returns:
tuple: window_overlaps_the_target,wts,nt_in_overlap,wtl
get_mutated_codon
get_mutated_codon(
ts: str,
tl: str,
tes: str,
tel: str,
strand: str,
verbose: bool = False
) → str
Get mutated codon
Args:
ts (str): target sequence
tl (str): target locus
tes (str): target edited sequence
tel (str): target edited locus
strand (str): strand
verbose (bool, optional): verbose. Defaults to False.
Returns:
str: mutated codon
get_coedits_base
get_coedits_base(
ws: str,
wl: str,
wts: str,
wtl: str,
nt: str,
verbose: bool = False
) → str
Get co-edited bases
Args:
ws (str): window sequence
wl (str): window locus
wts (str): window target overlap sequence
wtl (str): window target overlap locus
nt (str): nucleotide
verbose (bool, optional): verbose. Defaults to False.
Returns:
str: coedits
beditor.lib
beditor.lib.methods
dpam2dpam_strands
dpam2dpam_strands(dpam: DataFrame, pams: list) → DataFrame
Duplicates the dpam dataframe so that PAMs can also be searched on the - strand
Args:
dpam (pd.DataFrame): dataframe with pam information
pams (list): pams to be used for actual designing of guides.
Returns:
pd.DataFrame: table
get_be2dpam
get_be2dpam(
din: DataFrame,
methods: list = None,
test: bool = False,
cols_dpam: list = ['PAM', 'PAM position', 'guide length']
) → dict
Make BE to dpam mapping i.e. dict
Args:
din (pd.DataFrame): table with BE and PAM info; all cols_dpam are needed
methods (list, optional): method names. Defaults to None.
test (bool, optional): test-mode. Defaults to False.
cols_dpam (list, optional): columns to be used. Defaults to ['PAM', 'PAM position', 'guide length'].
Returns:
dict: output dictionary.
beditor.lib.utils
Utilities
get_src_path
get_src_path() → str
Get the beditor source directory path.
Returns:
str: path
runbashcmd
runbashcmd(cmd: str, test: bool = False, logf=None)
Run a bash command
Args:
cmd (str): command
test (bool, optional): test-mode. Defaults to False.
logf (optional): log file instance. Defaults to None.
log_time_elapsed
log_time_elapsed(start)
Log time elapsed.
Args:
start (datetime): start time
Returns:
datetime: difference in time.
rescale
rescale(
a: <built-in function array>,
mn: float = None
) → <built-in function array>
Rescale a vector.
Args:
a (np.array): vector.
mn (float, optional): minimum value. Defaults to None.
Returns:
np.array: output vector
get_nt2complement
get_nt2complement()
s2re
s2re(s: str, ss2re: dict) → str
String to regex patterns
Args:
s (str): string
ss2re (dict): substrings to regex patterns.
Returns:
str: string with regex patterns.
parse_locus
parse_locus(s: str, zero_based: bool = True) → tuple
Parse a locus string
Args:
s (str): location string.
zero_based (bool, optional): zero-based coordinates. Defaults to True.
Returns:
tuple: chrom, start, end, strand
Notes:
beditor outputs (including bed files) use 0-based loci
pyensembl and IGV use 1-based locations
get_pos
get_pos(s: str, l: str, reverse: bool = True, zero_based: bool = True) → Series
Expand locus to positions mapped to nucleotides.
Args:
s (str): sequence
l (str): locus
reverse (bool, optional): reverse the - strand. Defaults to True.
zero_based (bool, optional): zero based coordinates. Defaults to True.
Returns:
pd.Series: output.
get_seq
get_seq(
genome: str,
contig: str,
start: int,
end: int,
strand: str,
out_type: str = 'str',
verbose: bool = False
) → str
Extract a sequence from a genome file based on start and end positions using streaming.
Args:
genome (str): The path to the genome file in FASTA format.
contig (str): chrom
start (int): start
end (int): end
strand (str): strand
out_type (str, optional): type of the output. Defaults to 'str'.
verbose (bool, optional): verbose. Defaults to False.
Raises:
ValueError: invalid strand.
Returns:
str: The extracted sequence.
read_fasta
read_fasta(
fap: str,
key_type: str = 'id',
duplicates: bool = False,
out_type='dict'
) → dict
Read fasta
Args:
fap (str): path
key_type (str, optional): key type. Defaults to 'id'.
duplicates (bool, optional): duplicates present. Defaults to False.
Returns:
dict: data.
Notes:
- If duplicates, key_type is set to description instead of id.
format_coords
format_coords(df: DataFrame) → DataFrame
Format coordinates
Args:
df (pd.DataFrame): table
Returns:
pd.DataFrame: formatted table
fetch_sequences_bp
fetch_sequences_bp(p: str, genome: str) → DataFrame
Fetch sequences using biopython.
Args:
p (str): path to the bed file.
genome (str): genome path.
Returns:
pd.DataFrame: sequences.
fetch_sequences
fetch_sequences(
p: str,
genome_path: str,
outp: str = None,
src_path: str = None,
revcom: bool = True,
method='2bit',
out_type='df'
) → DataFrame
Fetch sequences
Args:
p (str): path to the bed file
genome_path (str): genome path
outp (str, optional): output path for fasta file. Defaults to None.
src_path (str, optional): source path. Defaults to None.
revcom (bool, optional): reverse-complement. Defaults to True.
method (str, optional): method name. Defaults to '2bit'.
out_type (str, optional): type of the output. Defaults to 'df'.
Returns:
pd.DataFrame: sequences.
get_sequences
get_sequences(
df1: DataFrame,
p: str,
genome_path: str,
outp: str = None,
src_path: str = None,
revcom: bool = True,
out_type: str = 'df',
renames: dict = {},
**kws_fetch_sequences
) → DataFrame
Get sequences for the loci in a table
Args:
df1 (pd.DataFrame): input table
p (str): path to the bed file
outp (str, optional): output path. Defaults to None.
src_path (str, optional): source path. Defaults to None.
revcom (bool, optional): reverse complement. Defaults to True.
out_type (str, optional): output type. Defaults to 'df'.
renames (dict, optional): renames. Defaults to {}.
Returns:
pd.DataFrame: output sequences
Notes:
Input is 1-based
Output is 0-based
Saves bed file and gets the sequences
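The switch between 1-based input and 0-based output can be sketched as below. This assumes the 0-based form is half-open, as in BED files, while the 1-based form is inclusive; both helper names are hypothetical and are not part of beditor.

```python
def to_zero_based(start: int, end: int) -> tuple:
    """1-based, inclusive interval -> 0-based, half-open interval (BED-like)."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return start - 1, end


def to_one_based(start: int, end: int) -> tuple:
    """0-based, half-open interval -> 1-based, inclusive interval."""
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    return start + 1, end


# The single position 5:1123-1123 from the input examples.
print(to_zero_based(1123, 1123))  # (1122, 1123)
```

Only the start shifts under this convention; the interval length is `end - start` in the 0-based form and `end - start + 1` in the 1-based form.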
to_locus
to_locus(
chrom: str = 'chrom',
start: str = 'start',
end: str = 'end',
strand: str = 'strand',
x: Series = None
) → str
To locus
Args:
chrom (str, optional): chrom. Defaults to 'chrom'.
start (str, optional): start. Defaults to 'start'.
end (str, optional): end. Defaults to 'end'.
strand (str, optional): strand. Defaults to 'strand'.
x (pd.Series, optional): row of the dataframe. Defaults to None.
Returns:
str: locus
get_flanking_seqs
get_flanking_seqs(
df1: DataFrame,
targets_path: str,
flanks_path: str,
genome: str = None,
search_window: list = None
) → DataFrame
Get flanking sequences
Args:
df1 (pd.DataFrame): input table
targets_path (str): target sequences path
flanks_path (str): flank sequences path
genome (str, optional): genome path. Defaults to None.
search_window (list, optional): search window around the target. Defaults to None.
Returns:
pd.DataFrame: output table with sequences
get_strand
get_strand(
genome,
df1: DataFrame,
col_start: str,
col_end: str,
col_chrom: str,
col_strand: str,
col_seq: str
) → DataFrame
Get strand by comparing the aligned and fetched sequence
Args:
genome: genome instance
df1 (pd.DataFrame): input table.
col_start (str): start
col_end (str): end
col_chrom (str): chrom
col_strand (str): strand
col_seq (str): sequences
Returns:
pd.DataFrame: output table
Notes:
used for tests.
reverse_complement_multintseq
reverse_complement_multintseq(seq: str, nt2complement: dict) → str
Reverse complement multi-nucleotide sequence
Args:
seq (str): sequence
nt2complement (dict): nucleotide to complement
Returns:
str: sequence
reverse_complement_multintseqreg
reverse_complement_multintseqreg(
seq: str,
multint2regcomplement: dict,
nt2complement: dict
) → str
Reverse complement multi-nucleotide regex patterns
Args:
seq (str): description
multint2regcomplement (dict): mapping.
nt2complement (dict): nucleotide to complement
Returns:
str: regex pattern
hamming_distance
hamming_distance(s1: str, s2: str) → int
Return the Hamming distance between equal-length sequences
Args:
s1 (str): sequence #1
s2 (str): sequence #2
Raises:
ValueError: Undefined for sequences of unequal length
Returns:
int: distance.
align
align(
q: str,
s: str,
test: bool = False,
psm: float = 2,
pmm: float = 0.5,
pgo: float = -3,
pge: float = -1
) → str
Creates pairwise local alignment between sequences.
Args:
q (str): query
s (str): subject
test (bool, optional): test-mode. Defaults to False.
Returns:
str: alignment with symbols.
Notes:
REF: http://biopython.org/DIST/docs/api/Bio.pairwise2-module.html
The match parameters are:
CODE | DESCRIPTION |
---|---|
x | No parameters. Identical characters have score of 1, otherwise 0. |
m | A match score is the score of identical chars, otherwise mismatch score. |
d | A dictionary returns the score of any pair of characters. |
c | A callback function returns scores. |
The gap penalty parameters are:
CODE | DESCRIPTION |
---|---|
x | No gap penalties. |
s | Same open and extend gap penalties for both sequences. |
d | The sequences have different open and extend gap penalties. |
c | A callback function returns the gap penalties. |
get_orep
get_orep(seq: str) → int
Get the overrepresentation
get_polyt_length
get_polyt_length(s: str) → int
Counts the length of the longest polyT stretch (RNA pol3 terminator) in sequence
:param s: sequence in string format
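Counting the longest polyT stretch can be sketched with a regular expression. This is an illustrative equivalent under the description above, not the package's code; `longest_polyt` is a hypothetical name, and upper-casing the input is an added assumption.

```python
import re


def longest_polyt(seq: str) -> int:
    """Length of the longest run of consecutive T's (0 if there is none)."""
    runs = re.findall(r"T+", seq.upper())
    return max((len(run) for run in runs), default=0)


print(longest_polyt("ACTTTTGTTA"))  # 4
```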
get_annots_installed
get_annots_installed() → DataFrame
Get a list of annotations installed.
Returns:
pd.DataFrame: output.
get_annots
get_annots(
species_name: str = None,
release: int = None,
gtf_path: str = None,
transcript_path: str = None,
protein_path: str = None,
reference_name: str = 'assembly',
annotation_name: str = 'source',
verbose: bool = False,
**kws_Genome
)
Get pyensembl annotation instance
Args:
species_name (str, optional): species name. Defaults to None.
release (int, optional): release number. Defaults to None.
gtf_path (str, optional): GTF path. Defaults to None.
transcript_path (str, optional): transcripts path. Defaults to None.
protein_path (str, optional): protein path. Defaults to None.
reference_name (str, optional): reference name. Defaults to 'assembly'.
annotation_name (str, optional): annotation name. Defaults to 'source'.
verbose (bool, optional): verbose. Defaults to False.
Returns:
pyensembl annotation instance
to_pid
to_pid(annots, gid: str) → str
To protein ID
Args:
annots: pyensembl annotation instance
gid (str): gene ID
Returns:
str: protein ID
to_one_based_coordinates
to_one_based_coordinates(df: DataFrame) → DataFrame
To one based coordinates
Args:
df (pd.DataFrame): input table
Returns:
pd.DataFrame: output table.
beditor.lib.viz
Visualizations.
to_igv
to_igv(
cfg: dict = None,
gtf_path: str = None,
genome_path: str = None,
output_dir_path: str = None,
threads: int = 1,
output_ext: str = None,
force: bool = False
) → str
To IGV session file.
Args:
cfg (dict, optional): configuration of the run. Defaults to None.
gtf_path (str, optional): path to the gtf file. Defaults to None.
genome_path (str, optional): path to the genome file. Defaults to None.
output_dir_path (str, optional): path to the output directory. Defaults to None.
threads (int, optional): threads. Defaults to 1.
output_ext (str, optional): extension of the output. Defaults to None.
force (bool, optional): force. Defaults to False.
Returns:
str: path to the session file.
get_nt_composition
get_nt_composition(seqs: list) → DataFrame
Get nt composition.
Args:
seqs (list): list of sequences
Returns:
pd.DataFrame: table with the frequencies of the nucleotides.
plot_ntcompos
plot_ntcompos(
seqs: list,
pam_pos: str,
pam_len: int,
window: list = None,
ax: Axes = None,
color_pam: str = 'lime',
color_window: str = 'gold'
) → Axes
Plot nucleotide composition
Args:
seqs (list): list of sequences.
pam_pos (str): PAM position.
pam_len (int): PAM length.
window (list, optional): activity window bounds. Defaults to None.
ax (plt.Axes, optional): subplot. Defaults to None.
color_pam (str, optional): color of the PAM. Defaults to 'lime'.
color_window (str, optional): color of the window. Defaults to 'gold'.
Returns:
plt.Axes: subplot
plot_ontarget
plot_ontarget(
guide_loc: str,
pam_pos: str,
pam_len: int,
guidepam_seq: str,
window: list = None,
show_title: bool = False,
figsize: list = [10, 2],
verbose: bool = False,
kws_sg: dict = {}
) β Axes
Plot the on-target site of an sgRNA.
Args:
guide_loc (str): sgRNA locus.
pam_pos (str): PAM position.
pam_len (int): PAM length.
guidepam_seq (str): sgRNA and PAM sequence.
window (list, optional): activity window bounds. Defaults to None.
show_title (bool, optional): show the title. Defaults to False.
figsize (list, optional): figure size. Defaults to [10, 2].
verbose (bool, optional): verbose. Defaults to False.
kws_sg (dict, optional): keyword arguments to plot the sgRNA. Defaults to {}.
Returns:
plt.Axes: subplot.
TODOs:
1. Convert to 1-based coordinates.
2. Features from the GTF file.
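The first TODO above concerns coordinate conventions, which a short sketch may make concrete. The example below parses a guide locus string of the form shown in the output table of this README (`4:1003215-1003238(+)`) and converts it to 1-based coordinates. It is an illustration only, not beditor's implementation: `Locus`, `parse_locus`, `to_one_based` and `format_locus` are hypothetical names, and the 0-based loci are assumed to be half-open intervals, an assumption chosen because the 23-nt guide+PAM example spans 1003215 to 1003238.

```python
import re
from typing import NamedTuple


class Locus(NamedTuple):
    """Hypothetical container for a parsed locus (not part of beditor)."""
    chrom: str
    start: int
    end: int
    strand: str


# Pattern for loci written as chrom:start-end(strand), e.g. 4:1003215-1003238(+).
_LOCUS_RE = re.compile(
    r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)\((?P<strand>[+-])\)$"
)


def parse_locus(locus: str) -> Locus:
    """Split a locus string into chromosome, start, end and strand."""
    m = _LOCUS_RE.match(locus.strip())
    if m is None:
        raise ValueError(f"unrecognized locus: {locus!r}")
    return Locus(m["chrom"], int(m["start"]), int(m["end"]), m["strand"])


def to_one_based(locus: Locus) -> Locus:
    """Convert a 0-based, half-open interval to a 1-based, inclusive one.

    Assumption: the input is half-open, so only the start shifts by one.
    """
    return locus._replace(start=locus.start + 1)


def format_locus(locus: Locus) -> str:
    """Format a locus back into the chrom:start-end(strand) notation."""
    return f"{locus.chrom}:{locus.start}-{locus.end}({locus.strand})"


zero_based = parse_locus("4:1003215-1003238(+)")
one_based = to_one_based(zero_based)
print(format_locus(one_based))
```

If the loci were instead 0-based and inclusive at both ends, the end coordinate would also need to shift by one, so the convention should be checked against real output before relying on such a conversion.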
get_plot_inputs
get_plot_inputs(df2: DataFrame) → list
Get plot inputs.
Args:
df2 (pd.DataFrame): table.
Returns:
list: list of tables.
plot_library_stats
plot_library_stats(
dfs: list,
palette: dict = {True: 'b', False: 'lightgray'},
cutoffs: dict = None,
not_be: bool = True,
dbug: bool = False,
figsize: list = [10, 2.5]
) → list
Plot library stats.
Args:
dfs (list): list of tables.
palette (dict, optional): color palette. Defaults to {True: 'b', False: 'lightgray'}.
cutoffs (dict, optional): cutoffs to be applied. Defaults to None.
not_be (bool, optional): not a base editor. Defaults to True.
dbug (bool, optional): debug mode. Defaults to False.
figsize (list, optional): figure size. Defaults to [10, 2.5].
Returns:
list: list of subplots.
beditor.run
Command-line options
validate_params
validate_params(parameters: dict) → bool
Validate the parameters.
Args:
parameters (dict): parameters.
Returns:
bool: whether the parameters are valid.
cli
cli(
editor: str = None,
mutations_path: str = None,
output_dir_path: str = None,
species: str = None,
ensembl_release: int = None,
genome_path: str = None,
gtf_path: str = None,
rna_path: str = None,
prt_path: str = None,
search_window: int = None,
not_be: bool = False,
config_path: str = None,
wd_path: str = None,
threads: int = 1,
kernel_name: str = 'beditor',
verbose='WARNING',
igv_path_prefix=None,
ext: str = None,
force: bool = False,
dbug: bool = False,
skip=None,
**kws
)
beditor command-line interface (CLI).
Args:
editor (str, optional): base-editing method; the available methods can be listed with the command 'beditor resources'. Defaults to None.
mutations_path (str, optional): path to the mutation file, the format of which is described at https://github.com/rraadd88/beditor/README.md#Input-format. Defaults to None.
output_dir_path (str, optional): path to the directory where the outputs should be saved. Defaults to None.
species (str, optional): species name. Defaults to None.
ensembl_release (int, optional): Ensembl release number. Defaults to None.
genome_path (str, optional): path to the genome file, for a genome that is not available on Ensembl. Defaults to None.
gtf_path (str, optional): path to the gene annotations file, for annotations that are not available on Ensembl. Defaults to None.
rna_path (str, optional): path to the transcript sequences file, for sequences that are not available on Ensembl. Defaults to None.
prt_path (str, optional): path to the protein sequences file, for sequences that are not available on Ensembl. Defaults to None.
search_window (int, optional): number of bases to search on either side of a target; if not specified, it is inferred by beditor. Defaults to None.
not_be (bool, optional): do not process as a base editor. Defaults to False.
config_path (str, optional): path to the configuration file. Defaults to None.
wd_path (str, optional): path to the working directory. Defaults to None.
threads (int, optional): number of threads. Defaults to 1.
kernel_name (str, optional): name of the jupyter kernel. Defaults to "beditor".
verbose (str, optional): logging level, one of DEBUG, INFO, WARNING, ERROR or CRITICAL (in order of decreasing verbosity). Defaults to "WARNING".
igv_path_prefix (type, optional): prefix to be added to the IGV url. Defaults to None.
ext (str, optional): file extension of the output tables. Defaults to None.
force (bool, optional): overwrite the outputs if they exist. Defaults to False.
dbug (bool, optional): debug mode (developer). Defaults to False.
skip (type, optional): skip sections of the workflow (developer). Defaults to None.
Examples:
beditor cli -c inputs/mutations/protein/positions.yml
Notes:
Required parameters for a run: editor, mutations_path and output_dir_path; or config_path.
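The rule in the note above can be sketched as a small pre-flight check that also assembles the command line. This is a hedged illustration rather than beditor's own `validate_params`: the helpers `missing_required` and `build_command` are hypothetical, and the flags used (`--editor`, `-m`, `-o`, `--species`, `--ensembl-release`, `-c`) are the ones shown in the usage examples at the top of this README.

```python
# Parameters needed when no configuration file is given (from the note above).
REQUIRED_WITHOUT_CONFIG = ("editor", "mutations_path", "output_dir_path")


def missing_required(parameters: dict) -> list:
    """Return the required parameter names that are absent or None.

    Hypothetical helper: a config_path alone is treated as sufficient.
    """
    if parameters.get("config_path") is not None:
        return []
    return [k for k in REQUIRED_WITHOUT_CONFIG if parameters.get(k) is None]


def build_command(parameters: dict) -> list:
    """Assemble an argument list for 'beditor cli' without running it."""
    missing = missing_required(parameters)
    if missing:
        raise ValueError(f"missing required parameters: {', '.join(missing)}")
    if parameters.get("config_path") is not None:
        return ["beditor", "cli", "-c", parameters["config_path"]]
    command = [
        "beditor", "cli",
        "--editor", parameters["editor"],
        "-m", parameters["mutations_path"],
        "-o", parameters["output_dir_path"],
    ]
    # Optional genome selection, added only when provided.
    if parameters.get("species") is not None:
        command += ["--species", parameters["species"]]
    if parameters.get("ensembl_release") is not None:
        command += ["--ensembl-release", str(parameters["ensembl_release"])]
    return command


example = {
    "editor": "BE1",
    "mutations_path": "path/to/mutations.tsv",
    "output_dir_path": "path/to/output_directory/",
    "species": "human",
    "ensembl_release": 110,
}
print(" ".join(build_command(example)))
```

The returned list could be passed to `subprocess.run` once beditor is installed; building the list separately keeps the check testable without invoking the tool.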
gui
gui()
resources
resources()