open2c / pairtools

Extract 3D contacts (.pairs) from sequencing alignments
MIT License
104 stars 32 forks source link

pairtools parse keyError #88

Closed sckinta closed 4 years ago

sckinta commented 4 years ago

I am trying to use pairtools parse to convert a bam file generated by hicup to pairs.

pairtools parse -c chrom_hg19.sizes --drop-sam --cmd-in "samtools view --no-PG" dedup.bam

It kept reporting error

Traceback (most recent call last):
  File "/home/suc1/.conda/envs/py36/bin/pairtools", line 11, in <module>
    sys.exit(cli())
  File "/home/suc1/.conda/envs/py36/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/suc1/.conda/envs/py36/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/suc1/.conda/envs/py36/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/suc1/.conda/envs/py36/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/suc1/.conda/envs/py36/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/suc1/.conda/envs/py36/lib/python3.6/site-packages/pairtools/__init__.py", line 113, in wrapper
    return func(*args, **kwargs)
  File "/home/suc1/.conda/envs/py36/lib/python3.6/site-packages/pairtools/pairtools_parse.py", line 156, in parse
    output_parsed_alignments, output_stats, **kwargs)
  File "/home/suc1/.conda/envs/py36/lib/python3.6/site-packages/pairtools/pairtools_parse.py", line 206, in parse_py
    chromsizes = [(chrom, sam_chromsizes[chrom]) for chrom in chromosomes],
  File "/home/suc1/.conda/envs/py36/lib/python3.6/site-packages/pairtools/pairtools_parse.py", line 206, in <listcomp>
    chromsizes = [(chrom, sam_chromsizes[chrom]) for chrom in chromosomes],
KeyError: 'chr1'

I double checked the @SQ header in bam file and chrom.size file chromosome. They are matched and show the same order.

(py36) [suc1@l-1-01 hicup]$ samtools view -h --no-PG merged.dedup.bam | head -n 26
@HD VN:1.0  SO:unsorted
@SQ SN:chr1 LN:249250621
@SQ SN:chr2 LN:243199373
@SQ SN:chr3 LN:198022430
@SQ SN:chr4 LN:191154276
@SQ SN:chr5 LN:180915260
@SQ SN:chr6 LN:171115067
@SQ SN:chr7 LN:159138663
@SQ SN:chr8 LN:146364022
@SQ SN:chr9 LN:141213431
@SQ SN:chr10    LN:135534747
@SQ SN:chr11    LN:135006516
@SQ SN:chr12    LN:133851895
@SQ SN:chr13    LN:115169878
@SQ SN:chr14    LN:107349540
@SQ SN:chr15    LN:102531392
@SQ SN:chr16    LN:90354753
@SQ SN:chr17    LN:81195210
@SQ SN:chr18    LN:78077248
@SQ SN:chr19    LN:59128983
@SQ SN:chr20    LN:63025520
@SQ SN:chr21    LN:48129895
@SQ SN:chr22    LN:51304566
@SQ SN:chrX LN:155270560
@SQ SN:chrY LN:59373566
@SQ SN:chrM LN:16571

chrom_hg19.sizes is attached here

pairtools parse worked appropriately with MATalpha test data and its corresponding chrom.size

Could you help solve this problem? Thank you!!

golobor commented 4 years ago

Dear Chun, many thanks for this detailed bug report!! I suspect that the issue lies in the fact that you are missing a -h flag in your cmd-in, i.e. it should be "samtools view -h --no-PG". If it's indeed the problem, pairtools is partially to blam - we should tell the user when the header is missing. I'll introduce a short error message. Again, many thanks for reaching out! Anton.

sckinta commented 4 years ago

Hi Anton,

Thank you for quick reply! I tried your suggestion pairtools parse -c chrom_hg19.sizes --drop-sam --cmd-in "samtools view -h --no-PG" dedup.bam. It reports new error this time.

Traceback (most recent call last):
  File "/home/suc1/.conda/envs/py36/bin/pairtools", line 11, in <module>
    sys.exit(cli())
  File "/home/suc1/.conda/envs/py36/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/suc1/.conda/envs/py36/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/suc1/.conda/envs/py36/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/suc1/.conda/envs/py36/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/suc1/.conda/envs/py36/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/suc1/.conda/envs/py36/lib/python3.6/site-packages/pairtools/__init__.py", line 113, in wrapper
    return func(*args, **kwargs)
  File "/home/suc1/.conda/envs/py36/lib/python3.6/site-packages/pairtools/pairtools_parse.py", line 156, in parse
    output_parsed_alignments, output_stats, **kwargs)
  File "/home/suc1/.conda/envs/py36/lib/python3.6/site-packages/pairtools/pairtools_parse.py", line 213, in parse_py
    header = _headerops.append_new_pg(header, ID=UTIL_NAME, PN=UTIL_NAME)
  File "/home/suc1/.conda/envs/py36/lib/python3.6/site-packages/pairtools/_headerops.py", line 184, in append_new_pg
    new_samheader = _add_pg_to_samheader(samheader, ID, PN, VN, CL, force)
  File "/home/suc1/.conda/envs/py36/lib/python3.6/site-packages/pairtools/_headerops.py", line 246, in _add_pg_to_samheader
    pg_chains = _parse_pg_chains(samheader, force=force)
  File "/home/suc1/.conda/envs/py36/lib/python3.6/site-packages/pairtools/_headerops.py", line 281, in _parse_pg_chains
    for l in header
  File "/home/suc1/.conda/envs/py36/lib/python3.6/site-packages/pairtools/_headerops.py", line 282, in <listcomp>
    if l.startswith('@PG')
ValueError: dictionary update sequence element #0 has length 1; 2 is required

It looks like the problems lies at@PG handling to me. However, if I do not use --no-PG, I have issue withsamtools view instead.

$ samtools view merged.dedup.bam
[E::sam_hrecs_error] Malformed key:value pair at line 32: "@PG  HiCUP Deduplicator  VN:0.5.9"
samtools view: failed to add PG line to the header

This has to do with samtools version. When I installed pairtools in a conda environment, samtools (htslib 1.11) was installed as accessory (I think). I tested that samtools (htslib 1.7) works fine with my bam file without --no-PG. The error comes from samtools (htslib 1.11) only.

I am wondering whether it is necessarily to keep PG group in pairtools parse. Does pairtools have to compare with samtools version 1.11. Is it possible to work around it?

Thank you!!!

golobor commented 4 years ago

ah, okay, so the issue is that the HiCUP Deduplicator produces .sam files that do not adhere to the SAM format. According to the official documentation, "In the header, each line is TAB-delimited and, apart from @CO lines, each data field follows a format ‘TAG:VALUE’ where TAG is a two-character string that defines the format and content of VALUE". The @PG line that you showed doesn't follow this recommendation. :( I'll commit a fix in a few minutes!

sckinta commented 4 years ago

Aha, Thank you! I just found some bam files I generated using newer version of HiCUP. They fixed "TAG:VALUE" for @PG. I just tested pairtools on those new bam files, it works well. Thank you!

golobor commented 4 years ago

oh, nice!! i did introduce the fix too, just in case!