sheltzer-lab / crispr-screening

A Nextflow script which conducts the computational analysis associated with CRISPR screening as done within the Sheltzer Lab.
MIT License

Write a global count as a separate process #4

Open rhagenson opened 1 year ago

rhagenson commented 1 year ago

Right now, a global counts data frame is generated internally.

https://github.com/sheltzer-lab/crispr-screening/blob/1a6f8c1cbe94433e4abfc02d47247ba92c21ade4/bin/extract-reads.py#L53-L63

However, we do not write this data frame out to a file for later processing. Instead, we immediately attempt to pair the data.

https://github.com/sheltzer-lab/crispr-screening/blob/1a6f8c1cbe94433e4abfc02d47247ba92c21ade4/bin/extract-reads.py#L66-L72

This is fragile: if pairs cannot be found, the workflow breaks and all of the counting (where most of the time is spent) must be repeated. I have run into this twice: 1) I had a typo in my sample sheet (see #2), and 2) we wanted to count a single sample separately (i.e., only one half of a pair).

I propose writing the global count to a separate file: after the first code block above, add df.to_csv("global-count.csv", ...), then add a "pairing" process that performs the steps starting at the second code block above.
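
To make the split concrete, here is a rough sketch of the two steps as separate functions. It is not the code in bin/extract-reads.py: the function names, the "guide" column, the one-column-per-sample layout, and the (control, treatment) pair format are all assumptions for illustration.

```python
import os
import tempfile

import pandas as pd


def write_global_count(counts, path):
    """Step 1 (counting): persist the global counts before any pairing."""
    counts.to_csv(path, index=False)
    return path


def pair_samples(path, pairs):
    """Step 2 (pairing): read the saved counts and build one table per pair.

    Pairs with a missing sample are reported instead of raising, so a
    sample-sheet typo does not force the counting step to be repeated.
    """
    counts = pd.read_csv(path)
    paired, missing = {}, []
    for control, treatment in pairs:
        if control in counts.columns and treatment in counts.columns:
            # Hypothetical layout: a "guide" column plus one column per sample.
            paired[(control, treatment)] = counts[["guide", control, treatment]]
        else:
            missing.append((control, treatment))
    return paired, missing


# Toy data standing in for the real counts (assumed layout, made-up values).
demo_counts = pd.DataFrame(
    {"guide": ["g1", "g2"], "ctrl_A": [10, 4], "treat_A": [3, 8]}
)
demo_path = write_global_count(
    demo_counts, os.path.join(tempfile.mkdtemp(), "global-count.csv")
)

# The second pair is deliberately absent, mimicking a sample-sheet typo.
demo_paired, demo_missing = pair_samples(
    demo_path, [("ctrl_A", "treat_A"), ("ctrl_B", "treat_B")]
)
print(sorted(demo_paired), demo_missing)
```

With the counts on disk, a failed or partial pairing only requires rerunning the cheap second step, and a single unpaired sample can still be counted by running the first step alone.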