open2c / pairtools

Extract 3D contacts (.pairs) from sequencing alignments
MIT License
104 stars 32 forks source link

pairtools sort aborting with header error #196

Closed jctourtellotte closed 11 months ago

jctourtellotte commented 11 months ago

I am encountering a header error when trying to run pairtools sort.

[main_samview] fail to read the header from "-".

This is the contents of the shell script I am sending as a job to a cluster:

#!/bin/bash
CONDA_BASE=$(conda info --base) ; source $CONDA_BASE/etc/profile.d/conda.sh

sra_num=$1
data_src="data/GSE141139_Kang/"

cd
conda activate hic_env

pairtools sort -o ${data_src}hic_aligned/${sra_num}.sorted.pairs.bam ${data_src}hic_aligned/${sra_num}.pairs.gz

My header contents are as follows, via samtools:

> @SQ     SN:1    LN:249250621
> @SQ     SN:2    LN:243199373
> @SQ     SN:3    LN:198022430
> @SQ     SN:4    LN:191154276
> @SQ     SN:5    LN:180915260
> @SQ     SN:6    LN:171115067
> @SQ     SN:7    LN:159138663
> @SQ     SN:8    LN:146364022
> @SQ     SN:9    LN:141213431
> @SQ     SN:10   LN:135534747
> @SQ     SN:11   LN:135006516
> @SQ     SN:12   LN:133851895
> @SQ     SN:13   LN:115169878
> @SQ     SN:14   LN:107349540
> @SQ     SN:15   LN:102531392
> @SQ     SN:16   LN:90354753
> @SQ     SN:17   LN:81195210
> @SQ     SN:18   LN:78077248
> @SQ     SN:19   LN:59128983
> @SQ     SN:20   LN:63025520
> @SQ     SN:21   LN:48129895
> @SQ     SN:22   LN:51304566
> @SQ     SN:X    LN:155270560
> @SQ     SN:Y    LN:59373566
> @SQ     SN:MT   LN:16569
> @SQ     SN:GL000207.1   LN:4262
> @SQ     SN:GL000226.1   LN:15008
> @SQ     SN:GL000229.1   LN:19913
> @SQ     SN:GL000231.1   LN:27386
> @SQ     SN:GL000210.1   LN:27682
> @SQ     SN:GL000239.1   LN:33824
> @SQ     SN:GL000235.1   LN:34474
> @SQ     SN:GL000201.1   LN:36148
> @SQ     SN:GL000247.1   LN:36422
> @SQ     SN:GL000245.1   LN:36651
> @SQ     SN:GL000197.1   LN:37175
> @SQ     SN:GL000203.1   LN:37498
> @SQ     SN:GL000246.1   LN:38154
> @SQ     SN:GL000249.1   LN:38502
> @SQ     SN:GL000196.1   LN:38914
> @SQ     SN:GL000248.1   LN:39786
> @SQ     SN:GL000244.1   LN:39929
> @SQ     SN:GL000238.1   LN:39939
> @SQ     SN:GL000202.1   LN:40103
> @SQ     SN:GL000234.1   LN:40531
> @SQ     SN:GL000232.1   LN:40652
> @SQ     SN:GL000206.1   LN:41001
> @SQ     SN:GL000240.1   LN:41933
> @SQ     SN:GL000236.1   LN:41934
> @SQ     SN:GL000241.1   LN:42152
> @SQ     SN:GL000243.1   LN:43341
> @SQ     SN:GL000242.1   LN:43523
> @SQ     SN:GL000230.1   LN:43691
> @SQ     SN:GL000237.1   LN:45867
> @SQ     SN:GL000233.1   LN:45941
> @SQ     SN:GL000204.1   LN:81310
> @SQ     SN:GL000198.1   LN:90085
> @SQ     SN:GL000208.1   LN:92689
> @SQ     SN:GL000191.1   LN:106433
> @SQ     SN:GL000227.1   LN:128374
> @SQ     SN:GL000228.1   LN:129120
> @SQ     SN:GL000214.1   LN:137718
> @SQ     SN:GL000221.1   LN:155397
> @SQ     SN:GL000209.1   LN:159169
> @SQ     SN:GL000218.1   LN:161147
> @SQ     SN:GL000220.1   LN:161802
> @SQ     SN:GL000213.1   LN:164239
> @SQ     SN:GL000211.1   LN:166566
> @SQ     SN:GL000199.1   LN:169874
> @SQ     SN:GL000217.1   LN:172149
> @SQ     SN:GL000216.1   LN:172294
> @SQ     SN:GL000215.1   LN:172545
> @SQ     SN:GL000205.1   LN:174588
> @SQ     SN:GL000219.1   LN:179198
> @SQ     SN:GL000224.1   LN:179693
> @SQ     SN:GL000223.1   LN:180455
> @SQ     SN:GL000195.1   LN:182896
> @SQ     SN:GL000212.1   LN:186858
> @SQ     SN:GL000222.1   LN:186861
> @SQ     SN:GL000200.1   LN:187035
> @SQ     SN:GL000193.1   LN:189789
> @SQ     SN:GL000194.1   LN:191469
> @SQ     SN:GL000225.1   LN:211173
> @SQ     SN:GL000192.1   LN:547496
> @PG     ID:bwa  PN:bwa  VN:0.7.17-r1188 CL:bwa mem -SP5M -t 40 ref_genomes/hg19/bwa_hg19 data/GSE141139_Kang/fastq/SRR10539842_1.fastq.gz data/GSE141139_Kang/fastq/SRR10539842_2.fastq.gz

I built the chromsizes file using faidx. The python environment activated was setup specifically for using pairtools.

Where am I going wrong? It feels like it might be quite obvious, but perhaps not.

Phlya commented 11 months ago

pairtools sort is designed to sort pair files, not sam... If I am not completely missing the point somehow (also, I think your input and output arguments are swapped, i.e. -o should point to pairs.gz, not .bam)

jctourtellotte commented 11 months ago

pairtools sort is designed to sort pair files, not sam... If I am not completely missing the point somehow (also, I think your input and output arguments are swapped, i.e. -o should point to pairs.gz, not .bam)

pairtools sort --nproc 5 -o test.sorted.pairs.gz test.pairs.gz (from the docs)

I do not have the input and output swapped, see code example above from the docs, output is before input–which generally feels counter inituive to me but has been the case for a few scripts. What I DO have, and thank you for pointing it out, is a ".bam" that should be a ".gz". I do not know how I did not catch that! Thank you @Phlya Hopefully, fixing that fixes the issue.

Phlya commented 11 months ago

OK I just assumed you wanted to sort a bam file and output pairs, hence I thought you swapped the paths. Hope this solved the problem, feel free to reopen if you have more issues with this.

FYI if you don't provide the -o argument pairtools will simply output to stdout, so you can redirect it to any tool or file of choice, and then have the output after input.