Dip-c vs hickit for preprocessing

tarak77 commented 5 years ago

Hi Tan,

Happy Monday! I remember reading in the readme of this repo that hickit is much faster than dip-c. I know that hickit does not implement the 3D imputation step like Dip-c does. But during the read alignment, both use BWA MEM aligner- this aligner is based on BWT/FM-index and gives improved performance but a large index build up time.

1. Why is hickit faster for preprocessing steps?
1. When it comes to alignment, for the FASTQ files of one cell it took less than a day to complete using BWA MEM. Can we use/develop faster methods for good alignment?
1. The 3D modeling steps didn't take much time compared to the preprocessing.

I ask these questions because instead of working on a single cell, I wonder how much time it will take to generate many models from many cells. Generating m number of models will help in downstream analysis. I know that we can generate m models for a given cell explicitly https://github.com/tanlongzhi/dip-c/issues/5#issuecomment-421641544 , and we could do this in parallel for many cells?

Thanks again!

tanlongzhi commented 5 years ago

Hi Tarak,

Hickit is faster in preprocessing because this repo was not designed to perform.
Currently, @lh3's bwa-mem and minimap2 are two of the fastest alignment tools.
Yes you can easily write a for loop to iterate over many cells.

tanlongzhi commented 5 years ago

Thanks to @lh3's awesome implementation, hickit is especially fast in two steps: (1) hickit.js reads the SNP list into the memory and then does a single, very fast pass over the SAM file; and (2) hickit does very fast dedup by efficiently scanning the contacts. Besides preprocessing, hickit is also fast in imputation and 3D modeling.

tarak77 commented 5 years ago

Thanks @tanlongzhi @lh3. Would it be awesome for hickit to implement the 3D imputation step as well!? When I compare the 3D models between the two repo, dip-c gives more pronounced and contiguous at 20kb resolution.

tanlongzhi commented 5 years ago

The seemingly "noncontiguous" hickit models are in fact a clever design of adaptive genomic resolution, where 3D particles (20 kb each, in this case) that harbor no contacts are automatically merged with nearby particles to form larger particles (40 kb each, or larger). This can be a better way to handle repetitive regions such as the centromeres, and has nothing to do with 3D vs 2D imputation.

This mode can be turned off to produce more dip-c-/nuc_dynamics-like models by the hickit -M option.

tarak77 commented 5 years ago

Sorry, but why is it a better way to handle repetitive regions like centromeres?

ALso, in https://github.com/tanlongzhi/dip-c/issues/16#issuecomment-424486496 you mentioned about more realistic modeling of the chromatin, did you mean how the coarse grained chromatin polymer is defined? Using Strings and Binders Switch models will give more realistic conformations?

tanlongzhi commented 5 years ago

In dip-c/nuc_dynamics, regions that harbor no contacts (and therefore no constraints) will float away from the nucleus, forming large, free-floating loops either outside or on the surface of the nucleus. Although these regions are later deleted (for example, by dip-c clean3), such floating behavior is certainly incorrect.

Your understanding is correct. Recent super-resolution studies sometimes show a variety of complicated chromatin shapes. Perhaps the models you mentioned will help.

tanlongzhi commented 5 years ago

Btw this repo is now updated (a05858fcd58fa959aec49ba98dba22190d40231b) with dip-c vis -a to connect adjacent beads regardless of their genomic separation (bp) to support hickit 3D models.

tarak77 commented 5 years ago

I see, I have some observations from the following steps I used for my new data:

As hickitis more faster in preprocessing steps, I implement hickit before dip-c for modelling
I convert contact.pairs and impute.pairs from hickit to conatcts.con.gz and impute.con.gzrespectively
I apply the different rounds of 3D imputation and modelling given in the readme of this repo with the default resolutions. Note that while implementing dip-c impute3 -3 i use contacts.con.gz instead of clean.con.gz as my second argument.
Finally i convert my .3dg to .ciffile using the default parameters in readme , without the recent modification https://github.com/tanlongzhi/dip-c/issues/17#issuecomment-426632918

Below is the screenshot shown for chr2 mat and pat dip-c

As you can see these extended and free floating loops were also seen in some other chromosomal pairs. Mostly these look like defects and not the inherit structure. Even after all the dip-c clean3 -csteps from readme they were present. Can we remove these defects manually?

The structures differed when we compare them with the modeling from .3dg file of hickit. For getting the .cif from .3dg of hickit, i did:

I generated the .3dg file at 20kb resolution from hickit
Then i used the scripts/hickit_3dg_to_3dg_rescale.sh to convert to dip-c .3dg file.
Then i converted to .ciffile using the default command mentioned in readme. Again i didnt use the modificationdip-c vis -a I get the following screenshot for chr2 mat and pat

From hickit .3dg files, i didnt see these open/floating loopy structure. But i could see these non contiguous particles. And again as you mentioned is that we should find 3D models from hickit as a lower resolution?

But again i have not tried the hickit -M option nor the dip-c vis -a. From what i have, dip-c models look much better(except these defects) than hickit. Probably the updates you mentioned will make hickit models even better?

lh3 commented 5 years ago

dip-c models look much better

This is interesting. Why do you think the dip-c model looks much better? Because the backbone threads look smoother? Does the 3d model of haploid hi-c also look smooth?

For the published pbmc-11, dip-c imputation gives us 494k contacts, while hickit gives 694k. One possibility is that hickit is too aggressive and imputes non-existing contacts which distort the backbone threads.

tarak77 commented 5 years ago

@lh3 Just by looking, yes dip-c looks better, but its these free floating loops @tanlongzhi was talking about- are still present even after dip-c clean3. Is there a way to manually remove them?

Regarding hickit, as @tanlongzhi mentioned in https://github.com/lh3/hickit/issues/5#issuecomment-421670815, finding models at low resolution will help? For haploid cell, dip-c/nuc_dynamics did give a smooth backbone threads but again some free floating loops at 100kb, compared to hickit where there were no free floating loops.

Again, for the diploid cell, in above comments- were the screenshots for chr2, below are the screnshots for chr17 at 20kb:

dip-c dip-c_chr17

hickit hickit_chr17

tanlongzhi commented 5 years ago

It seems to me that in both cases, you're observing aberrantly stretched backbones rather than free-floating loops. (Free-floating loops are soft and bendy; these are very straight.)

This seems to be the result of incorrect imputation, mistakenly assigning intrahomologous contacts to interhomologous contacts.

tanlongzhi commented 5 years ago

Btw please make sure to use the latest scripts/hickit_3dg_to_3dg_rescale.sh because I recently fixed a bug of forgetting to multiply by 2.

tarak77 commented 5 years ago

Ok, i am also trying to use dip-c vis -aon my new rescaled .3dg files from hickit, but get the error:

.
.
M::color] analyzed 171000 particles (99.55%)
[M::color] writing 171776 colors (100.0%)
Traceback (most recent call last):
  File "./dip-c", line 130, in <module>
    main()
  File "./dip-c", line 84, in main
    return_value = vis.vis(sys.argv[1:])
  File "/home/tshisode/dip-c/vis.py", line 58, in vis
    g3d_data = file_to_g3d_data(open(args[0], "rb"))
  File "/home/tshisode/dip-c/classes.py", line 1427, in file_to_g3d_data
    g3d_data.add_g3d_particle(string_to_g3d_particle(g3d_file_line.strip()))
  File "/home/tshisode/dip-c/classes.py", line 1214, in string_to_g3d_particle
    hom_name, ref_locus, x, y, z = g3d_particle_string.split("\t")
ValueError: need more than 3 values to unpack
Traceback (most recent call last):
  File "./dip-c", line 130, in <module>
    main()
  File "./dip-c", line 75, in main
    return_value = color.color(sys.argv[1:])
  File "/home/tshisode/dip-c/color.py", line 246, in color
    sys.stdout.write("\t".join([hom_name, str(ref_locus), str(color_data[(hom_name, ref_locus)])]) + "\n")
IOError: [Errno 32] Broken pipe

Any help?

lh3 commented 5 years ago

Just by looking, yes dip-c looks better

I was not talking about "free floating loops". I know those are probably imputation errors as @tanlongzhi said. I am asking how you know dip-c looks better. Is it only because it looks smoother? Is smoother always more correct?

tarak77 commented 5 years ago

Yes @lh3, its just the smoothness. Its may not be necessary that smoother is always correct. Need to do more analysis on many cells.

tarak77 commented 5 years ago

Hi @tanlongzhi , How to implement dip-c vis -a https://github.com/tanlongzhi/dip-c/issues/17#issuecomment-426632918 ? to connect adjacent beads to support hickit models.

tanlongzhi commented 5 years ago

You can run something like this:

dip-c color -n color/hg19.chr.txt cell.3dg | dip-c vis -a -c /dev/stdin cell.3dg > cell.n.cif

tarak77 commented 5 years ago

cool

tarak77 commented 5 years ago

A quick question: is it possible/valid to run the 3D modeling and imputation steps of dip-c on many threads like bwa mem does?

tanlongzhi commented 5 years ago

Unfortunately multithreaded modeling and imputation are not currently supported.

tarak77 commented 5 years ago

Okay, is it valid to multithread though?

tarak77 commented 5 years ago

Hi @tanlongzhi ,

I recently noticed that you added hickit_3dg_to_3dg_rescale_backbone.sh , hickit_3dg_to_3dg_rescale_unit.sh in scripts. Which should I use to convert hicks.3dg to dip-c.3dg?

tanlongzhi commented 5 years ago

Hi Tarak,

The script hickit_3dg_to_3dg_rescale_unit.sh uses the new unit information that @lh3 kindly added (lh3/hickit@04dd6f04fc0678c06c054715285bdc4914df8332) to the header of hickit .3dg files. This script will set the optimal backbone distance to 2.0, which should be the same setting as nuc_dynamics.

The script hickit_3dg_to_3dg_rescale_backbone.sh instead sets the median of the actual backbone distance to 2.0, and can work on old and new hickit .3dg files.

Neither of the scripts has been tested thoroughly.

tarak77 commented 5 years ago

Okay I see now.

Just thinking out loud, and sorry for going back and forth, but is it valid to say that multithreaded could be used for modeling and 3D imputations for this repo?

tanlongzhi commented 5 years ago

I'm not sure. You can probably ask 3D experts. Currently, multithreading is not my top priority, because for each cell, we typically needs 3-5 replicates. Each replicate is run as a separate thread, which is already a 3-5X speedup.

tarak77 commented 5 years ago

Okay cool

jijibio commented 5 years ago

Hello tanlongzhi,

Can you let me know how you generated fig.2F from the paper (Three-dimensional genome structures of single diploid human cells)

Thanks in advance

tanlongzhi commented 5 years ago

Hi @jijibio,

Below are some of my PyMol codes for one of Fig. 2F's panels. You must start with an mmCIF file colored by dip-c color -l.

First you'll need to select the highlighted domain and name it tad:

tad_chain = "2(pat)"
tad_start = 206840000
tad_end = 209280000

tad_start_string = str(tad_start).rjust(9,'0')
tad_start_resn = tad_start_string[0:3]
tad_start_name = tad_start_string[3:6]
cmd.select("tad_start", "chain \"" + tad_chain + "\" and resn " + tad_start_resn + " and name " + tad_start_name)
tad_start_id = cmd.id_atom("tad_start")

tad_end_string = str(tad_end).rjust(9,'0')
tad_end_resn = tad_end_string[0:3]
tad_end_name = tad_end_string[3:6]
cmd.select("tad_end", "chain \"" + tad_chain + "\" and resn " + tad_end_resn + " and name " + tad_end_name)
tad_end_id = cmd.id_atom("tad_end")

cmd.select("tad", "id " + str(tad_start_id) + "-" + str(tad_end_id))

In the same way, you can select the gray domain and name it old_tad:

tad_chain = "2(pat)"
tad_start = 204880000
tad_end = 221360000

tad_start_string = str(tad_start).rjust(9,'0')
tad_start_resn = tad_start_string[0:3]
tad_start_name = tad_start_string[3:6]
cmd.select("tad_start", "chain \"" + tad_chain + "\" and resn " + tad_start_resn + " and name " + tad_start_name)
tad_start_id = cmd.id_atom("tad_start")

tad_end_string = str(tad_end).rjust(9,'0')
tad_end_resn = tad_end_string[0:3]
tad_end_name = tad_end_string[3:6]
cmd.select("tad_end", "chain \"" + tad_chain + "\" and resn " + tad_end_resn + " and name " + tad_end_name)
tad_end_id = cmd.id_atom("tad_end")

cmd.select("old_tad", "id " + str(tad_start_id) + "-" + str(tad_end_id))

After positioning the camera, you can generate a figure for the highlighted domain:

viewport 400, 400
set ray_shadows,0
hide all
select chr, chain "2(pat)"
color white, old_tad
as sticks, tad
set_bond stick_radius, 0.5, all
spectrum b, rainbow, tad
png tad.png, 800, 800, ray=1

Then another figure with the gray domain, too:

show sticks, old_tad
png old_tad.png, 800, 800, ray=1

Finally, you can overlay the two images (for example, in Illustrator), with old_tad.png on top and half transparent.

jijibio commented 5 years ago

Hello tanlongzhi,

Thanks for your quick response. I am working with bulk HIC data, Is it possible to get the 3D model for the same.

Thanks

tanlongzhi commented 5 years ago

Hi @jijibio,

The 3D modeling part of this repo is primarily geared towards single cells, not bulk (the same as this thread we had in the hickit repo).

For bulk 3D modeling, there have been multiple papers on this subject in the past few years. Please refer to those for your purpose.

liubinnk1 commented 3 years ago

Hi Dr. Tan,

I recently run the dip-c seg and got errors as follows:

Traceback (most recent call last): File "/THL8/home/liubin/software/dip-c-master/dip-c", line 130, in main() File "/THL8/home/liubin/software/dip-c-master/dip-c", line 42, in main return_value = seg.seg(sys.argv[1:]) File "/THL8/home/liubin/software/dip-c-master/seg.py", line 115, in seg seg_data.clean() File "/THL8/home/liubin/software/dip-c-master/classes.py", line 204, in clean for name in self.reads.keys(): RuntimeError: dictionary changed size during iteration

tanlongzhi commented 3 years ago

Hi @liubinnk1,

Just to confirm, did you use Python 2 rather than 3? This repo is only compatible with Python 2.

In addition, our latest workflow (see README) uses hickit for this pre-processing step, which is much faster.

Best, Tan

liubinnk1 commented 3 years ago

Hi Tan,

I solved that problem by using python 2.7. Thank you so much for your reply.

Best,

Bin Liu

发件人：Longzhi Tan notifications@github.com 发送日期：2020-10-10 06:15:27 收件人：tanlongzhi/dip-c dip-c@noreply.github.com 抄送人：liubinnk1 liubinnk@nankai.edu.cn,Mention mention@noreply.github.com 主题：Re: [tanlongzhi/dip-c] Dip-c vs hickit for preprocessing (#17)

tanlongzhi / dip-c

Dip-c vs hickit for preprocessing #17