allgenesconsidered closed 6 years ago
A few other points: I've removed the `--bed` option and made the script a "one-size-fits-all" solution. It can easily detect a multi-locus gens file on its own, and I've removed a ton of repetitive code in favor of this solution.
The biggest problem now is speed, which was always a problem with this script. Single-locus gens files are annotated at the same speed as before, but multi-locus gens files take minutes to run. Something to note.
Large changes to annot_variants.py. Before, annot_var only checked the first row's `chrom` value, making it incompatible with multi-locus gens files. It was also not very strict about this, and attempted to use the position value of a variant even if the chromosome was different. I changed annot_var to first gather all chromosomes in a gens file; if the script detects a multi-locus gens file, the gens dataframe is split by chromosome. The script then iterates through the list of per-chromosome gens dataframes and joins the results at the end. I've tested this script with several cas lists and with both single-locus and multi-locus gens files, and the output seems to be fine. The script also produces error messages for missing pam.npy files or a missing FASTA file.
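For reviewers, the split-by-chromosome flow described above can be sketched roughly as follows. This is a minimal illustration with pandas, not the actual code in annot_variants.py: the function names `annotate_locus` and `annotate_by_chrom`, the `pos` column, and the `annotated` column are hypothetical, and only the `chrom` column name comes from the description.

```python
import pandas as pd


def annotate_locus(locus_df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for the per-locus annotation step."""
    out = locus_df.copy()
    out["annotated"] = True  # placeholder for the real annotation columns
    return out


def annotate_by_chrom(gens_df: pd.DataFrame, chrom_col: str = "chrom") -> pd.DataFrame:
    """Annotate a gens dataframe, splitting it by chromosome when needed."""
    # Gather every chromosome present, not just the first row's value.
    chroms = gens_df[chrom_col].unique()

    # Single-locus file: annotate in one pass.
    if len(chroms) == 1:
        return annotate_locus(gens_df)

    # Multi-locus file: one sub-dataframe per chromosome, annotated separately.
    parts = [
        annotate_locus(part)
        for _, part in gens_df.groupby(chrom_col, sort=False)
    ]

    # Join the pieces; sort_index restores the original row order
    # (an assumption about the desired output order).
    return pd.concat(parts).sort_index()
```

Because each sub-dataframe is annotated independently, a variant's position is only ever compared against data from its own chromosome. Running the annotation once per chromosome is also a plausible reason multi-locus files are slower.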