tanlongzhi / dip-c

Tools to analyze Dip-C (or other 3C/Hi-C) data

Try to understand the analysis #26

Closed kainblue closed 5 years ago

kainblue commented 5 years ago

Hi Longzhi,

I am very interested in learning to construct 3D models from Hi-C data. After a long search, I recently found your fantastic work. I have downloaded the fastq files for GM12878 cell 1 from https://www.ncbi.nlm.nih.gov/sra/SRX4133191 and am now trying to follow the instructions in this repo. However, I have problems with the later imputation steps:

    con_to_ncc.sh impute.con.gz
    nuc_dynamics.sh impute.ncc 0.1
    dip-c impute3 -3 impute.3dg clean.con.gz | gzip -c > impute3.round1.con.gz
    dip-c clean3 -c impute.con.gz impute.3dg > impute.clean.3dg

    con_to_ncc.sh impute3.round1.con.gz
    nuc_dynamics.sh impute3.round1.ncc 0.1
    dip-c impute3 -3 impute3.round1.3dg clean.con.gz | gzip -c > impute3.round2.con.gz
    dip-c clean3 -c impute3.round1.con.gz impute3.round1.3dg > impute3.round1.clean.3dg

    con_to_ncc.sh impute3.round2.con.gz
    nuc_dynamics.sh impute3.round2.ncc 0.1
    dip-c impute3 -3 impute3.round2.3dg clean.con.gz | gzip -c > impute3.round3.con.gz
    dip-c clean3 -c impute3.round2.con.gz impute3.round2.3dg > impute3.round2.clean.3dg
    ...

It seems to me that the cleaned 3dg file produced by clean3 in each round is not used in the next round. I started to question this because of an error I encountered:

    dip-c impute3 -3 GM12878_cell1_dipc_phased.clean.impute.clean.3dg GM12878_cell1_dipc_phased.clean.con.gz | gzip -c > GM12878_cell1_dipc_phased.impute3.round1.con.gz
    [M::impute3] read a 3D structure with 55404 particles at 100000 bp resolution
    [M::impute3] read 612536 contacts (82.47% intra-chromosomal, 8.94% legs phased)
    [M::classes] imputed haplotypes for chromosome pair (13,17): 392 contacts (85.2% phased)
    [M::classes] imputed haplotypes for chromosome pair (5,8): 1679 contacts (97.74% phased)
    [M::classes] imputed haplotypes for chromosome pair (16,17): 216 contacts (66.2% phased)
    [M::classes] imputed haplotypes for chromosome pair (1,20): 1078 contacts (92.76% phased)
    Traceback (most recent call last):
      File "dip-c", line 130, in <module>
        main()
      File "dip-c", line 63, in main
        return_value = impute3.impute3(sys.argv[1:])
      File "impute3.py", line 109, in impute3
        con_data.impute_from_g3d_data(g3d_data, max_impute3_distance, max_impute3_ratio, max_impute3_ratio * g3d_resolution, is_male, par_data, vio_file)
      File "classes.py", line 907, in impute_from_g3d_data
        self.con_lists[ref_name_tuple].impute_from_g3d_data(g3d_data, max_impute3_distance, max_impute3_ratio, min_impute3_separation, is_male, par_data, vio_file)
      File "classes.py", line 757, in impute_from_g3d_data
        con.impute_from_g3d_data(g3d_data, max_impute3_distance, max_impute3_ratio, min_impute3_separation, is_male, par_data, vio_file)
      File "classes.py", line 544, in impute_from_g3d_data
        impute3_ratio = impute3_distance / con_distance_tuples[1][1]
    TypeError: unsupported operand type(s) for /: 'NoneType' and 'NoneType'
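
For readers unfamiliar with this message: the last frame of the traceback divides two values that are both `None`. The sketch below reproduces that failure mode in isolation and shows one defensive pattern. The helper `safe_ratio` is hypothetical and is not part of dip-c, and nothing here establishes why the two distances were `None` inside `classes.py`.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when a value is missing.

    Hypothetical helper for illustration only; it is not dip-c code.
    """
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / denominator


def reproduce_error():
    """Trigger the same kind of TypeError as the traceback and return its message."""
    try:
        None / None  # both operands missing, as in the failing line
    except TypeError as error:
        return str(error)
    return None


if __name__ == "__main__":
    print(reproduce_error())
    print(safe_ratio(None, None))
```

A guard like this only hides the symptom; the thread below reports that the actual fix was on the input side.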

Here are the first lines of the two input files:

    head GM12878_cell1_dipc_phased.clean.impute.clean.3dg
    1(mat) 1200000 7.95772097608 -12.0072914165 6.67592442321
    1(mat) 1300000 8.89210987528 -11.4486456224 6.61131843187
    1(mat) 1400000 8.8277193141 -10.3798272863 6.83290065793
    1(mat) 1500000 8.10570766598 -9.67144265436 6.35097905003
    1(mat) 1600000 7.99275487247 -8.53433974384 6.52683266786
    1(mat) 1700000 6.70429668241 -8.61794012705 5.86833067325
    1(mat) 1800000 5.62631622929 -8.49098630055 5.17833888961
    1(mat) 1900000 4.8879961287 -8.44522282731 3.98121528589
    1(mat) 2000000 3.80732676666 -7.76977419875 3.35459947567
    1(mat) 2100000 3.06260319638 -8.19929825445 4.23444940641

    zcat GM12878_cell1_dipc_phased.clean.con.gz | head
    1,756415,. 1,1095231,.
    1,757502,. 1,1218674,.
    1,815689,. 1,1186165,.
    1,818341,. 1,862101,.
    1,830604,. 1,835996,.
    1,839037,. 1,858631,.
    1,848406,. 1,850417,.
    1,858704,. 1,861316,.
    1,861508,. 1,862932,.
    1,918117,1 1,1231475,.
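
To make the contact listing easier to read, here is a minimal sketch that parses such lines and reports the fraction of phased legs. It assumes each line holds two legs of the form `chromosome,position,haplotype` separated by whitespace, and that a haplotype of `.` means the leg is unphased; the function names are illustrative and are not taken from dip-c.

```python
def parse_leg(text):
    """Split one leg 'chromosome,position,haplotype' into a tuple."""
    chromosome, position, haplotype = text.split(",")
    return chromosome, int(position), haplotype


def phased_leg_fraction(con_lines):
    """Return the fraction of legs whose haplotype field is not '.'."""
    total = 0
    phased = 0
    for line in con_lines:
        for leg_text in line.split():
            _, _, haplotype = parse_leg(leg_text)
            total += 1
            if haplotype != ".":  # assumption: '.' marks an unphased leg
                phased += 1
    return phased / total if total else 0.0


# The ten contacts quoted in the issue.
head_lines = [
    "1,756415,. 1,1095231,.",
    "1,757502,. 1,1218674,.",
    "1,815689,. 1,1186165,.",
    "1,818341,. 1,862101,.",
    "1,830604,. 1,835996,.",
    "1,839037,. 1,858631,.",
    "1,848406,. 1,850417,.",
    "1,858704,. 1,861316,.",
    "1,861508,. 1,862932,.",
    "1,918117,1 1,1231475,.",
]

if __name__ == "__main__":
    print(phased_leg_fraction(head_lines))
```

For these ten lines the result is 0.05, since only one of the 20 legs (`1,918117,1`) carries a haplotype. The whole file is reported by `impute3` as 8.94% legs phased.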

Here are the commands I used to construct the input files:

    seqtk mergepe SRR7226683_1.fastq SRR7226683_2.fastq | lianti trim - | bwa mem -Cp bwa_index_rmchr/Homo_sapiens_assembly19.fasta - | samtools view -uS | sambamba sort -o GM12878_cell1_dipc_rmchr.bam /dev/stdin
    dip-c seg -v snps/NA12878.txt.gz GM12878_cell1_dipc_rmchr.bam | gzip -c > GM12878_cell1_dipc_phased.seg.gz
    dip-c con GM12878_cell1_dipc_phased.seg.gz | gzip -c > GM12878_cell1_dipc_phased.con.gz
    dip-c dedup GM12878_cell1_dipc_phased.con.gz | gzip -c > GM12878_cell1_dipc_phased.dedup.gz
    dip-c reg -p hf GM12878_cell1_dipc_phased.dedup.gz | gzip -c > GM12878_cell1_dipc_phased.reg.con.gz
    dip-c clean GM12878_cell1_dipc_phased.dedup.gz | gzip -c > GM12878_cell1_dipc_phased.clean.con.gz
    dip-c impute GM12878_cell1_dipc_phased.clean.con.gz | gzip -c > GM12878_cell1_dipc_phased.clean.impute.con.gz
    con_to_ncc.sh GM12878_cell1_dipc_phased.clean.impute.con.gz
    nuc_dynamics.sh GM12878_cell1_dipc_phased.clean.impute.ncc 0.1

Thanks a lot! Looking forward to your help!

Bo Zhang

tanlongzhi commented 5 years ago

Hi Bo,

You're right that the intermediate clean.3dg files were not used in any analysis. They just provided intermediate-resolution 3D structures you can look at while waiting for the final, high-resolution structures.
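
The data flow described above can be sketched as a small command generator. It only rebuilds the command strings quoted earlier in this issue and runs nothing; the function name and parameters are illustrative, and the meaning of the `0.1` argument to `nuc_dynamics.sh` is not interpreted here.

```python
def imputation_round_commands(rounds, clean_con="clean.con.gz", nuc_dynamics_arg="0.1"):
    """Build the command strings for each round, following the quoted pattern.

    Round 0 starts from impute.con.gz; round n > 0 starts from
    impute3.round<n>.con.gz.
    """
    all_rounds = []
    for n in range(rounds):
        prefix = "impute" if n == 0 else "impute3.round%d" % n
        next_con = "impute3.round%d.con.gz" % (n + 1)
        all_rounds.append([
            "con_to_ncc.sh %s.con.gz" % prefix,
            "nuc_dynamics.sh %s.ncc %s" % (prefix, nuc_dynamics_arg),
            # impute3 reads the .3dg written by nuc_dynamics.sh, not the cleaned one
            "dip-c impute3 -3 %s.3dg %s | gzip -c > %s" % (prefix, clean_con, next_con),
            # clean3 output is a side product for inspection; no later step reads it
            "dip-c clean3 -c %s.con.gz %s.3dg > %s.clean.3dg" % (prefix, prefix, prefix),
        ])
    return all_rounds


if __name__ == "__main__":
    for round_commands in imputation_round_commands(3):
        print("\n".join(round_commands))
        print()
```

In every round the only file carried forward is the new `impute3.round<n>.con.gz`; each `*.clean.3dg` is a dead end in the dependency chain.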

I'll look into the errors you got.

kainblue commented 5 years ago

Hi Longzhi,

Thank you very much for clarifying the usage of the clean.3dg data. As for the error, I finally fixed it by carefully following everything you describe in this repo exactly. It seems to have been caused by skipping the LIANTI patch:
"Patching LIANTI

For META read preprocessing, LIANTI needs a patch to replace the LIANTI adapters with the META ones:

1. Download the LIANTI source code.
2. Replace LIANTI's trim.c with Dip-C's patch/trim.c.
3. Compile LIANTI."

I went back and did this part. Now I can reproduce what you show in this repo.
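
The patch steps quoted above could be scripted roughly as follows. This is a sketch under stated assumptions: the directory arguments are placeholders for wherever the two source trees were downloaded, `patch_lianti` is a hypothetical helper, and building with a plain `make` in the LIANTI directory is an assumption rather than something stated in the thread.

```python
import shutil
import subprocess
from pathlib import Path


def patch_lianti(dipc_dir, lianti_dir, compile_after=True):
    """Replace LIANTI's trim.c with Dip-C's patch/trim.c, then optionally build."""
    source = Path(dipc_dir) / "patch" / "trim.c"
    target = Path(lianti_dir) / "trim.c"
    if not source.is_file():
        raise FileNotFoundError("patch not found: %s" % source)
    if target.is_file():
        # Keep a copy of the unpatched file so the change can be undone.
        shutil.copyfile(target, target.with_name("trim.c.orig"))
    shutil.copyfile(source, target)
    if compile_after:
        # Assumption: LIANTI builds with `make` run in its source directory.
        subprocess.run(["make"], cwd=str(lianti_dir), check=True)
    return target
```

After patching, make sure the rebuilt `lianti` binary is the one that the `lianti trim -` step of the preprocessing pipeline actually picks up.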

Thank you very much for all your help!

Bo

tanlongzhi commented 5 years ago

Great!