Right now there is an internal global counts data frame generated.
https://github.com/sheltzer-lab/crispr-screening/blob/1a6f8c1cbe94433e4abfc02d47247ba92c21ade4/bin/extract-reads.py#L53-L63
However, we do not write this file out for processing. Rather, we immediately move into attempting to pair data.
https://github.com/sheltzer-lab/crispr-screening/blob/1a6f8c1cbe94433e4abfc02d47247ba92c21ade4/bin/extract-reads.py#L66-L72
This is fragile: the workflow breaks whenever pairs cannot be found, and all of the counting (where most of the time is spent) must then be repeated. I have run into this twice: 1) I had a typo in my sample sheet (see #2), and 2) we wanted to count a single sample separately (i.e., only one half of a pair).
What I propose is writing the global count out as a separate file -- after the first code block above, add something like

```python
df.to_csv("global-count.csv", ...)
```

-- and then having a separate "pairing" process that performs the contents starting at the second code block above.
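A minimal sketch of what this split could look like. The filename `global-count.csv` comes from the proposal above, but the function names and the `guide`/`timepoint` columns are illustrative assumptions, not the actual layout used by `extract-reads.py`:

```python
import pandas as pd


def write_global_counts(df: pd.DataFrame, path: str = "global-count.csv") -> None:
    # Persist the expensive counting result so that a pairing failure
    # (bad sample sheet, unpaired sample) never forces a recount.
    df.to_csv(path, index=False)


def pair_counts(path: str = "global-count.csv") -> pd.DataFrame:
    # Separate "pairing" process: read the saved global counts back in and
    # join the two halves of each pair. An inner merge simply drops samples
    # that have no partner instead of crashing the whole workflow.
    df = pd.read_csv(path)
    early = df[df["timepoint"] == "early"]
    late = df[df["timepoint"] == "late"]
    return early.merge(late, on="guide", suffixes=("_early", "_late"))
```

With this split, rerunning after a sample-sheet fix only repeats the cheap pairing step, and counting a single unpaired sample just means stopping after `write_global_counts`.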