open2c / cooler

A cool place to store your Hi-C
https://open2c.github.io/cooler
BSD 3-Clause "New" or "Revised" License
196 stars 52 forks source link

testing cooler on capture Hi-C data #251

Open nservant opened 3 years ago

nservant commented 3 years ago

Hi @nvictus As quickly discussed, I tried to build a cooler object on a 3Mb Hi-C region from capture Hi-C data. I'm facing several issues according to the test I run ...

I have a bed file with genomic intervals of 1kb, from chrX:150125000-153125000 Accordinly, I extracted my pairs within the same genomic range ;

>>zcat contacts.txt.gz | head -1
NB501764:1043:HM5FKBGXF:2:23302:4878:10125  chrX    150125462   chrX    150163625   +   +
>>zcat contacts.txt.gz | tail -1
NB501764:1043:HM5FKBGXF:3:21505:23621:19037 chrX    153121783   chrX    153123912   -   -

Then, I simply try to ingest the data with cload pairs

>>cooler cload pairs -c1 2 -p1 3 -c2 4 -p2 5 target.bed contacts.txt.gz test.cool
INFO:cooler.create:Writing chunk 0: /data/kdi_prod/.kdi/project_workspace_0/1309/acl/01.00/downstreamAnalysis/scripts/tmpenmswn78.multi.cool::0
INFO:cooler.create:Creating cooler at "/data/kdi_prod/.kdi/project_workspace_0/1309/acl/01.00/downstreamAnalysis/scripts/tmpenmswn78.multi.cool::/0"
INFO:cooler.create:Writing chroms
INFO:cooler.create:Writing bins
INFO:cooler.create:Writing pixels
INFO:cooler.create:Writing indexes
INFO:cooler.create:Writing info
INFO:cooler.create:Done
INFO:cooler.create:Merging into test.cool
INFO:cooler.create:Creating cooler at "test.cool::/"
INFO:cooler.create:Writing chroms
INFO:cooler.create:Writing bins
INFO:cooler.create:Writing pixels
INFO:cooler.reduce:nnzs: [377101]
INFO:cooler.reduce:current: [0]
Traceback (most recent call last):
  File "/data/users/nservant/projects_analysis/kdi_home/conda/nf-core-hic/bin/cooler", line 10, in <module>
    sys.exit(cli())
  File "/data/users/nservant/projects_analysis/kdi_home/conda/nf-core-hic/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/data/users/nservant/projects_analysis/kdi_home/conda/nf-core-hic/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/data/users/nservant/projects_analysis/kdi_home/conda/nf-core-hic/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/data/users/nservant/projects_analysis/kdi_home/conda/nf-core-hic/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/data/users/nservant/projects_analysis/kdi_home/conda/nf-core-hic/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "conda/nf-core-hic/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "conda/nf-core-hic/lib/python3.7/site-packages/cooler/cli/cload.py", line 492, in pairs
    ordered=False
  File "conda/nf-core-hic/lib/python3.7/site-packages/cooler/create/_create.py", line 944, in create_cooler
    max_merge=max_merge)
  File "conda/nf-core-hic/lib/python3.7/site-packages/cooler/create/_create.py", line 687, in create_from_unordered
    **kwargs)
  File "conda/nf-core-hic/lib/python3.7/site-packages/cooler/create/_create.py", line 577, in create
    file_path, target, meta.columns, iterable, h5opts, lock)
  File "conda/nf-core-hic/lib/python3.7/site-packages/cooler/create/_create.py", line 213, in write_pixels
    for i, chunk in enumerate(iterable):
  File "conda/nf-core-hic/lib/python3.7/site-packages/cooler/reduce.py", line 163, in __iter__
    ignore_index=True)
  File "conda/nf-core-hic/lib/python3.7/site-packages/pandas/core/reshape/concat.py", line 255, in concat
    sort=sort,
  File "conda/nf-core-hic/lib/python3.7/site-packages/pandas/core/reshape/concat.py", line 304, in __init__
    raise ValueError("No objects to concatenate")
ValueError: No objects to concatenate

Then, I had a try with cooler pairx

cooler csort -c1 2 -p1 3 -s1 6 -c2 4 -p2 5 -s2 7 -o contacts.csort.txt.gz contacts.txt.gz chrX.sizes
cooler cload pairix target.bed contacts.csort.txt.gz test.cool       
INFO:cooler.cli.csort:Enumerating requested chromosomes...
INFO:cooler.cli.csort:chrX  1
INFO:cooler.cli.csort:Input: '../Jarid_NPC_B1/cooler/Jarid_NPC_B1_contacts.txt.gz'
INFO:cooler.cli.csort:Output: '../Jarid_NPC_B1/cooler/Jarid_NPC_B1_contacts.csort.txt.gz'
INFO:cooler.cli.csort:Reordering pair mates and sorting pair records...
INFO:cooler.cli.csort:Sort order: block (chrom1, chrom2, pos1, pos2)
INFO:cooler.cli.csort:sort -k2,2 -k4,4 -k3,3n -k5,5n --parallel=8 --buffer-size=50%
INFO:cooler.cli.csort:Indexing...
INFO:cooler.cli.csort:Indexer: pairix
INFO:cooler.cli.csort:pairix -f -s2 -d4 -b3 -e3 -u5 -v5 ../Jarid_NPC_B1/cooler/Jarid_NPC_B1_contacts.csort.txt.gz
INFO:cooler.cli.cload:Using 8 cores
INFO:cooler.create:Creating cooler at "../Jarid_NPC_B1/cooler/Jarid_NPC_B1_1000.cool::/"
INFO:cooler.create:Writing chroms
INFO:cooler.create:Writing bins
INFO:cooler.create:Writing pixels
INFO:cooler.create:Binning chrX:150125000-151625000|*
INFO:cooler.create:Binning chrX:151625000-153125000|*
INFO:cooler.create:Finished chrX:151625000-153125000|*
INFO:cooler.create:Finished chrX:150125000-151625000|*
Traceback (most recent call last):
  File "conda/nf-core-hic/bin/cooler", line 10, in <module>
    sys.exit(cli())
  File "conda/nf-core-hic/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "conda/nf-core-hic/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "conda/nf-core-hic/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "conda/nf-core-hic/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "conda/nf-core-hic/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "conda/nf-core-hic/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "conda/nf-core-hic/lib/python3.7/site-packages/cooler/cli/cload.py", line 238, in pairix
    ordered=True)
  File "conda/nf-core-hic/lib/python3.7/site-packages/cooler/create/_create.py", line 925, in create_cooler
    lock=lock)
  File "conda/nf-core-hic/lib/python3.7/site-packages/cooler/create/_create.py", line 577, in create
    file_path, target, meta.columns, iterable, h5opts, lock)
  File "conda/nf-core-hic/lib/python3.7/site-packages/cooler/create/_create.py", line 213, in write_pixels
    for i, chunk in enumerate(iterable):
  File "conda/nf-core-hic/lib/python3.7/site-packages/cooler/create/_ingest.py", line 292, in _validate_pixels
    "Found a bin ID that exceeds the declared number of bins. "
cooler.create._ingest.BadInputError: Found a bin ID that exceeds the declared number of bins. Check whether your bin table is correct.

The csort command works (although I put here the entire chrX size ? not sure what to put otherwise ...), but the cload pairix also crashed ...

Of note, I also reported the same error in cooltools earlier, when trying to bin a small genome https://github.com/open2c/cooltools/issues/237

cooler -V cooler, version 0.8.6.post0

nvictus commented 3 years ago

Hi @nservant,

I have a bed file with genomic intervals of 1kb, from chrX:150125000-153125000

Bin tables must start at 0 and should end at the chromosome length, even if the data is restricted to a smaller region. The first bin could simply be [0, 150125000) and the last [153125000, chrX_size).

Let me know if that works.

nservant commented 3 years ago

ok. I'll try and let you know if it works. What would be the best way to go for this type of data according to you ? Should I add the first/last bins as you just suggested, or should I continue to use the whole chromosome ? I'm a bit concerned about the impact on balancing and downstream analysis (insulation, dots, compartments with cooltools) ? Thanks

Phlya commented 3 years ago

I support the idea we should look more into this btw in general. More and more people do this kind of analysis.

I've just been using regular genome-wide equal binning for this sort of data, this way you also get a view of the contacts that are only captured on one end, and they are not completely useless potentially.

nservant commented 3 years ago

Hi @Phlya, I agree with you. The only point is that I frequently have some balancing issues using the contacts that are only captured on one end ... while focusing on the targeted region usually works well. Though I never really deeply investigate the reason. If you want to make additional tests, let me know, I'll be happy to exchange on that.

Phlya commented 3 years ago

You don't even need to use them for balancing, just storing them in the cooler so you can see them if you want... In my limited experience balancing small regions is not 100% reliable in general unfortunately, need to modify filtering sometimes. But would be good to assemble some test data to check how the tools work with it.

nservant commented 3 years ago

One interesting feature which could help on that would simply be to have a tool which could extract a sub-matrix from a cooler object. Then you could imagine to generate a genome-wide object and restrict the downstream analysis such as TADs calling ...

nservant commented 3 years ago

Hi,

An update of this topic. @nvictus, I tried to add the first/last bins.

It is now working with csort + cload pairix. However, it's still crashing with cload pairs

current_peek = inbuffer.peek()
TypeError: peek() missing 1 required positional argument: 'n'

Then, using the cool files generated by csort + cload pairix ...

N

aakashsur commented 2 years ago

I've run into the same zoomify issue.

Phlya commented 2 years ago

I just wanted to say that I always use whole-genome binning, as if it was whole-genome Hi-C, and never had any issues like that. Just provide a blacklist to balancing, which would make it ignore most of the genome.