nservant opened 3 years ago
Hi @nservant,
I have a bed file with genomic intervals of 1kb, from chrX:150125000-153125000
Bin tables must start at 0 and should end at the chromosome length, even if the data is restricted to a smaller region. The first bin could simply be [0, 150125000) and the last [153125000, chrX_size).
Let me know if that works.
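For concreteness, a minimal sketch of building such a padded bin table in plain Python (the chrX length below is an assumed hg38 value — substitute the size from your own chromsizes file):

```python
# Build a bin table for a capture region (chrX:150125000-153125000 at 1 kb),
# padded with one big bin before and one after, so the table starts at 0 and
# ends at the chromosome length as suggested above.

CHROM_SIZE = 156_040_895  # assumed hg38 chrX length; use your assembly's value


def make_bins(chrom, chrom_size, region_start, region_end, binsize):
    bins = [(chrom, 0, region_start)]  # one big bin covering [0, region_start)
    for start in range(region_start, region_end, binsize):
        bins.append((chrom, start, min(start + binsize, region_end)))
    bins.append((chrom, region_end, chrom_size))  # one big bin to chrom end
    return bins


bins = make_bins("chrX", CHROM_SIZE, 150_125_000, 153_125_000, 1_000)
# Write as a 3-column BED to feed to cooler:
bed = "\n".join(f"{c}\t{s}\t{e}" for c, s, e in bins)
```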
ok. I'll try and let you know if it works.
In your opinion, what would be the best way to handle this type of data?
Should I add the first/last bins as you just suggested, or should I continue to use the whole chromosome?
I'm a bit concerned about the impact on balancing and downstream analysis (insulation, dots, compartments with cooltools).
Thanks
By the way, I support the idea that we should look into this more in general. More and more people are doing this kind of analysis.
I've just been using regular genome-wide equal binning for this sort of data. This way you also get a view of the contacts that are captured only on one end, and they are potentially not completely useless.
Hi @Phlya, I agree with you. The only point is that I frequently have balancing issues when using the contacts that are captured only on one end ... while focusing on the targeted region usually works well. Though I never really deeply investigated the reason. If you want to run additional tests, let me know, I'll be happy to discuss it.
You don't even need to use them for balancing, just store them in the cooler so you can see them if you want... In my limited experience, balancing small regions is unfortunately not 100% reliable in general; I sometimes need to modify the filtering. But it would be good to assemble some test data to check how the tools work with it.
One interesting feature that could help with this would simply be a tool to extract a sub-matrix from a cooler object. You could then generate a genome-wide object and restrict the downstream analysis, such as TAD calling, to the captured region.
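As a sketch of what such an extraction amounts to (cooler's Python API already supports region fetches via `clr.matrix().fetch("chrX:150125000-153125000")`; the toy code below just illustrates the underlying indexing against a bin table, using made-up data):

```python
import numpy as np

def fetch_submatrix(matrix, bins, chrom, start, end):
    """Slice the square sub-matrix of all bins overlapping chrom:start-end.

    bins is a list of (chrom, start, end) tuples in matrix row order,
    assumed contiguous per chromosome.
    """
    idx = [i for i, (c, s, e) in enumerate(bins)
           if c == chrom and s < end and e > start]
    lo, hi = idx[0], idx[-1] + 1
    return matrix[lo:hi, lo:hi]

# Toy example: 10 bins of 1 kb on chrX and a 10x10 "contact matrix".
bins = [("chrX", s, s + 1000) for s in range(0, 10_000, 1000)]
mat = np.arange(100).reshape(10, 10)
sub = fetch_submatrix(mat, bins, "chrX", 2000, 5000)  # bins 2, 3, 4
```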
Hi,
An update on this topic. @nvictus, I tried adding the first/last bins.
It is now working with `csort` + `cload pairix`.
However, it's still crashing with `cload pairs`:

```
current_peek = inbuffer.peek()
TypeError: peek() missing 1 required positional argument: 'n'
```
Then, using the `.cool` files generated by `csort` + `cload pairix`:

- `balance` is OK
- `zoomify` crashes: `TypeError: cannot convert the series to <class 'float'>`
- `diamond-insulation` crashes: `TypeError: unsupported operand type(s) for *: 'int' and 'NoneType'`
I've run into the same zoomify issue.
I just wanted to say that I always use whole-genome binning, as if it were whole-genome Hi-C, and have never had any issues like that. Just provide a blacklist to balancing, which makes it ignore most of the genome.
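A sketch of generating such a blacklist, masking everything outside the capture region (chromosome sizes here are assumed placeholders; recent cooler versions expose a `--blacklist` option on `cooler balance`, but check `cooler balance --help` for your version):

```python
# Build a 3-column BED blacklist covering everything outside the captured
# region, so balancing only considers the enriched part of the genome.

capture = ("chrX", 150_125_000, 153_125_000)
chromsizes = {"chr1": 248_956_422, "chrX": 156_040_895}  # assumed hg38 values


def blacklist_lines(chromsizes, capture):
    cchrom, cstart, cend = capture
    lines = []
    for chrom, size in chromsizes.items():
        if chrom != cchrom:
            lines.append(f"{chrom}\t0\t{size}")  # mask whole chromosome
        else:
            if cstart > 0:
                lines.append(f"{chrom}\t0\t{cstart}")  # before the region
            if cend < size:
                lines.append(f"{chrom}\t{cend}\t{size}")  # after the region
    return lines


lines = blacklist_lines(chromsizes, capture)
# with open("blacklist.bed", "w") as f:
#     f.write("\n".join(lines) + "\n")
```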
Hi @nvictus, as quickly discussed, I tried to build a cooler object on a 3 Mb Hi-C region from capture Hi-C data. I'm facing several issues depending on the tests I run ...
I have a bed file with genomic intervals of 1 kb, from `chrX:150125000-153125000`.
Accordingly, I extracted my pairs within the same genomic range; then, I simply tried to ingest the data with `cload pairs`.
Then, I had a try with `csort` + `cload pairix`.
The `csort` command works (although I put the entire chrX size here; not sure what to put otherwise ...), but `cload pairix` also crashed ...
Of note, I also reported the same error in `cooltools` earlier, when trying to bin a small genome: https://github.com/open2c/cooltools/issues/237