tsailabSJ / changeseq

CHANGE-seq analysis pipeline
GNU Affero General Public License v3.0
11 stars 7 forks source link

Issues recreating Figure 4F from CHANGE-Seq paper #13

Closed Austin-s-h closed 1 year ago

Austin-s-h commented 1 year ago

Hello and thank you for developing CHANGE-Seq!

I wanted to reach out because I am having issues recreating the off-target results (specifically TRAC site 2 and CTL4 site 9) as presented in Figure 4F of the Nature Biotechnology publication. From SRA, I downloaded (what I think) are the appropriate files

SRR11612309-CTL4_site_9_treated_rep_3_sgRNA_a_R1.fastq.gz
SRR11612309-CTL4_site_9_treated_rep_3_sgRNA_a_R2.fastq.gz
SRR11612316-TRAC_site_2_treated_rep_3_sgRNA_b_R1.fastq.gz
SRR11612316-TRAC_site_2_treated_rep_3_sgRNA_b_R2.fastq.gz
SRR11612317-TRAC_site_2_treated_rep_2_sgRNA_b_R1.fastq.gz
SRR11612317-TRAC_site_2_treated_rep_2_sgRNA_b_R2.fastq.gz
SRR11612318-CTL4_site_9_treated_rep_2_sgRNA_a_R1.fastq.gz
SRR11612318-CTL4_site_9_treated_rep_2_sgRNA_a_R2.fastq.gz
SRR11612319-TRAC_site_2_treated_rep_1_sgRNA_b_R1.fastq.gz
SRR11612319-TRAC_site_2_treated_rep_1_sgRNA_b_R2.fastq.gz
SRR11612320-TRAC_site_2_control_rep_3_no_sgRNA_b_R1.fastq.gz
SRR11612320-TRAC_site_2_control_rep_3_no_sgRNA_b_R2.fastq.gz
SRR11612321-TRAC_site_2_control_rep_2_no_sgRNA_b_R1.fastq.gz
SRR11612321-TRAC_site_2_control_rep_2_no_sgRNA_b_R2.fastq.gz
SRR11612322-TRAC_site_2_control_rep_1_no_sgRNA_b_R1.fastq.gz
SRR11612322-TRAC_site_2_control_rep_1_no_sgRNA_b_R2.fastq.gz
SRR11612329-CTL4_site_9_treated_rep_1_sgRNA_a_R1.fastq.gz
SRR11612329-CTL4_site_9_treated_rep_1_sgRNA_a_R2.fastq.gz
SRR11612330-CTL4_site_9_treated_rep_3_sgRNA_b_R1.fastq.gz
SRR11612330-CTL4_site_9_treated_rep_3_sgRNA_b_R2.fastq.gz
SRR11612331-CTL4_site_9_treated_rep_2_sgRNA_b_R1.fastq.gz
SRR11612331-CTL4_site_9_treated_rep_2_sgRNA_b_R2.fastq.gz
SRR11612332-CTL4_site_9_treated_rep_1_sgRNA_b_R1.fastq.gz
SRR11612332-CTL4_site_9_treated_rep_1_sgRNA_b_R2.fastq.gz
SRR11612333-CTL4_site_9_control_rep_3_no_sgRNA_b_R1.fastq.gz
SRR11612333-CTL4_site_9_control_rep_3_no_sgRNA_b_R2.fastq.gz
SRR11612334-CTL4_site_9_control_rep_2_no_sgRNA_b_R1.fastq.gz
SRR11612334-CTL4_site_9_control_rep_2_no_sgRNA_b_R2.fastq.gz
SRR11612335-CTL4_site_9_control_rep_1_no_sgRNA_b_R1.fastq.gz
SRR11612335-CTL4_site_9_control_rep_1_no_sgRNA_b_R2.fastq.gz

Processing via CHANGE-Seq v1.2.9.1 using the default arguments results in the pipeline crashing after alignment. The alignment files look okay (reasonable size compared to successful samples), but the _CONTROL, _NUCLEASE, and _count.txt files are all very small (~1/100th the size of a successful sample). I attempted to combine the replicates at the FASTQ level but still ended with the same results. In addition, it appears that there are no real differences between the control and nuclease samples, as read counts are similar at the positions that are collected. The end result is no _identified_matched.txt file being created, or if there is one it is empty.

Could you help provide any information that may help recreate this off-target result? Are these input files correct? What steps within the identification script may result in this behavior?

Thank you very much!

Error logging

[06/22 05:21:26PM][INFO][changeseq] Identifying off-target cleavage sites.
[06/22 05:21:26PM][INFO][changeseq] Window: 3, MAPQ: 50, Gap: 3, Start 1, Mismatches 6, Search_Radius 20
[06/22 05:21:26PM][INFO][findCleavageSites] Writing counts to identified/CTL4_site_9_treated_combined_sgRNA_b_count.txt
[06/22 05:21:26PM][INFO][findCleavageSites] Writing counts to identified/CTL4_site_9_treated_combined_sgRNA_b_count.txt
[06/22 05:21:26PM][INFO][findCleavageSites] Tabulating merged start positions.
[06/22 05:21:26PM][INFO][findCleavageSites] Tabulating merged start positions.
[W::hts_idx_load3] The index file is older than the data file: aligned/CTL4_site_9_treated_combined_sgRNA_b.bam.bai
0.1 0.2 0.3 .... 49.1 
[06/22 05:52:24PM][INFO][findCleavageSites] Tabulating control merged start positions.
[06/22 05:52:24PM][INFO][findCleavageSites] Tabulating control merged start positions.
[W::hts_idx_load3] The index file is older than the data file: aligned/control_CTL4_site_9_treated_combined_sgRNA_b.bam.bai
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.......58.1 58.2
[06/22 06:28:52PM][INFO][findCleavageSites] Writing matched table
[06/22 06:28:52PM][INFO][findCleavageSites] Writing matched table

CRASH -- The matched table is empty

head of _CONTROL_coordinates.txt

#Name   Targetsite_Sequence     Cells   BAM     Read1_chr       Read1_start_position    Read1_strand    Read2_chr       Read2_start_position    Read2_strand
CTL4_site_9_treated_combined_sgRNA_b    GTGTGTGAGTATGCATCTCCNGG changeseq_pub_CTL4      control_CTL4_site_9_treated_combined_sgRNA_b.bam        chr1    14759656        -       chr1    14759657        +
CTL4_site_9_treated_combined_sgRNA_b    GTGTGTGAGTATGCATCTCCNGG changeseq_pub_CTL4      control_CTL4_site_9_treated_combined_sgRNA_b.bam        chr1    14759656        -       chr1    14759657        +

head of _NUCLEASE_coordinates.txt

#Name   Targetsite_Sequence     Cells   BAM     Read1_chr       Read1_start_position    Read1_strand    Read2_chr       Read2_start_position    Read2_strand
CTL4_site_9_treated_combined_sgRNA_b    GTGTGTGAGTATGCATCTCCNGG changeseq_pub_CTL4      CTL4_site_9_treated_combined_sgRNA_b.bam        chr1    14759656        -       chr1    14759657        +
CTL4_site_9_treated_combined_sgRNA_b    GTGTGTGAGTATGCATCTCCNGG changeseq_pub_CTL4      CTL4_site_9_treated_combined_sgRNA_b.bam        chr1    14759656        -       chr1    14759657        +

head of _count.txt

Chromosome      zero_based_Position     Nuclease_Position_Reads Control_Position_Reads  Nuclease_Window_Reads   Control_Window_Reads    p_Value narrow_p_Value  control_p_Value control_narrow_p_Value
chr1    14759656        113.0   136.0   226.0   272.0   1       1       1       1
chr1    14759657        113.0   136.0   226.0   272.0   1       1       1       1
chr1    14759798        0.0     1.0     25.0    18.0    1       1       1       1
chr1    14759799        12.0    9.0     26.0    18.0    1       1       1       1
chr1    14759800        12.0    8.0     26.0    18.0    1       1       1       1
shengdar commented 1 year ago

This is because the files you've referenced above are not the right files. They correspond to our targeted amplicon sequencing experiments, and are not CHANGE-seq FASTQ files. The CHANGE-seq runs are classified as 'OTHER' for the purposes of NCBI SRA and are listed in the same project.