tsailabSJ / changeseq

CHANGE-seq analysis pipeline
GNU Affero General Public License v3.0
11 stars 7 forks source link

Could not retrieve index file for _sorted.bam files #7

Closed weihongqi closed 3 years ago

weihongqi commented 4 years ago

Dear All,

I installed changeseq via conda (miniconda3). Analysis of the test data seemed ok.

When I analyze my own dataset. The pipeline produced outputs. But in the log, there were warnings obout failing to create index files for sorted bam files:

[main] Version: 0.7.17-r1188 [main] CMD: bwa mem /srv/GT/reference/Homo_sapiens/Ensembl/GRCh38.p10/Sequence/BWAIndex/genome.fa /export/local/susanne/circseq/data/20200904.0-HEK_circle_control_R1.fastq.gz /export/local/susan ne/circseq/data/20200904.0-HEK_circle_control_R2.fastq.gz [main] Real time: 2525.530 sec; CPU: 2551.220 sec [09/18 09:13:09PM][INFO][alignReads] Paired end mapping for control_HEK_PCSK9 completed. [09/18 09:13:09PM][INFO][alignReads] samtools sort -o /export/local/susanne/circseq/wqi/aligned/control_HEK_PCSK9.bam /export/local/susanne/circseq/wqi/aligned/control_HEK_PCSK9.sam [bam_sort_core] merging from 3 files and 1 in-memory blocks... [09/18 09:14:30PM][INFO][alignReads] Sorting by coordinate position for control_HEK_PCSK9 complete. [09/18 09:14:30PM][INFO][alignReads] samtools index /export/local/susanne/circseq/wqi/aligned/control_HEK_PCSK9.bam [09/18 09:14:37PM][INFO][alignReads] Indexing for control_HEK_PCSK9 complete. [09/18 09:14:37PM][INFO][alignReads] samtools sort -o /export/local/susanne/circseq/wqi/aligned/control_HEK_PCSK9_sorted.bam -n /export/local/susanne/circseq/wqi/aligned/control_HEK_PCSK9.bam [bam_sort_core] merging from 3 files and 1 in-memory blocks... [09/18 09:16:39PM][INFO][alignReads] Sorting for control_HEK_PCSK9 by name complete. [09/18 09:16:39PM][INFO][changeseq] Finished aligning reads to genome. [09/18 09:16:39PM][INFO][changeseq] Identifying off-target cleavage sites. [09/18 09:16:39PM][INFO][changeseq] Window: 3, MAPQ: 50, Gap: 3, Start 1, Mismatches 6, Search_Radius 20 Writing counts to /export/local/susanne/circseq/wqi/identified/HEK_PCSK9_count.txt Tabulate nuclease standard start positions. [E::idx_find_and_load] Could not retrieve index file for '/export/local/susanne/circseq/wqi/aligned/HEK_PCSK9_sorted.bam' /usr/local/ngseq/lib/python/HTSeq/init.py:589: UserWarning: Read M01742:268:000000000-CRWLL:1:1101:2089:11405 claims to have an aligned mate which could not be found in an adjacent line. "which could not be found in an adjacent line." ) 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4.0 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4. 9 5.0 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 6.0 6.1 6.2 6.3 6.4 /usr/local/ngseq/lib/python/HTSeq/init.py:622: UserWarning: 28370 reads with missing mate encountered. warnings.warn( "%d reads with missing mate encountered." % mate_missing_count[0] ) Tabulate control standard start positions. [E::idx_find_and_load] Could not retrieve index file for '/export/local/susanne/circseq/wqi/aligned/control_HEK_PCSK9_sorted.bam' /usr/local/ngseq/lib/python/HTSeq/init.py:589: UserWarning: Read M01742:268:000000000-CRWLL:1:1101:1963:11844 claims to have an aligned mate which could not be found in an adjacent line. "which could not be found in an adjacent line." ) 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4.0 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4. 9 5.0 5.1 /usr/local/ngseq/lib/python/HTSeq/init.py:622: UserWarning: 19918 reads with missing mate encountered. warnings.warn( "%d reads with missing mate encountered." % mate_missing_count[0] ) Writing matched table Writing unmatched table

when I run samtools index from command line, the sorted bam files seemed unsorted; [E::hts_idx_push] Unsorted positions on sequence #20: 1675272 followed by 1675156 samtools index: failed to create index for "/export/local/susanne/circseq/wqi/aligned/control_HEK_PCSK9_sorted.bam"

when I rerun samtools sort manually, the sorted bam file (HEK_PCSK9.sorted.bam) has a different file size than the one (HEK_PCSK9_sorted.bam) generated in the changseq pipeline: -rw-rw-r--+ 1 qiwei SG_Employees 2977672 Sep 21 09:24 wqi/aligned/HEK_PCSK9.sorted.bam.bai -rw-rw-r--+ 1 qiwei SG_Employees 849354420 Sep 21 09:04 wqi/aligned/HEK_PCSK9.sorted.bam -rw-rw-r--+ 1 qiwei SG_Employees 918399388 Sep 18 20:31 wqi/aligned/HEK_PCSK9_sorted.bam -rw-rw-r--+ 1 qiwei SG_Employees 2977672 Sep 18 20:28 wqi/aligned/HEK_PCSK9.bam.bai -rw-rw-r--+ 1 qiwei SG_Employees 849354420 Sep 18 20:28 wqi/aligned/HEK_PCSK9.bam -rw-rw-r--+ 1 qiwei SG_Employees 3560235402 Sep 18 20:26 wqi/aligned/HEK_PCSK9.sam

samtools version in the pipeline: Version: 1.9 (using htslib 1.9) bwa version: 0.7.17-r1188

Kind regards, Weihong

weihongqi commented 4 years ago

I noticed that this happened when merged_analysis was set to false. In this case, in the function findCleavageSites bam files sorted by names (_sorted.bam) were used. These bam files don't have index files.

if self.merged_analysis: sorted_bam_file = os.path.join(self.analysis_folder, 'aligned', sample + '.bam') control_sorted_bam_file = os.path.join(self.analysis_folder, 'aligned', 'control_' + sample + '.bam') else: sorted_bam_file = os.path.join(self.analysis_folder, 'aligned', sample + '_sorted.bam') control_sorted_bam_file = os.path.join(self.analysis_folder, 'aligned', 'control_' + sample + '_sorted.bam')

YichaoOU commented 4 years ago

Hello,

We only use merged_analysis=True. As Shengdar mentioned in the email, this is more sensitive. Also, I think the code for merged analysis is also well developed.

Thanks, Yichao