tangerzhang / ALLHiC

ALLHiC: phasing and scaffolding polyploid genomes based on Hi-C data

Multiple restriction enzyme #31

Closed wyim-pgl closed 4 years ago

wyim-pgl commented 4 years ago

Dear Xingtan, do you have any plans to add support for multiple restriction enzymes? I love you. Won

tangerzhang commented 4 years ago

OK, OK. We will need Haibao to revise his 'allhic' program, which is written in Go, and I will revise my Perl script accordingly. Hi @tanghaibao, although Won is likely the only one who is using multiple restriction enzymes in Hi-C scaffolding, we have no reason to deny a request from our beloved friend. Do you have time to update allhic?

wyim-pgl commented 4 years ago

@tangerzhang Xingtan, allhic extract might need an update to do this. Arima uses GATCGATC, GANTGATC, GANTANTC and GATCANTC, which expand to the combinations below. @tanghaibao, the Arima Hi-C data is ready and I will use it for the B. carinata genome assembly. Best, Won

GATCGATC,GAATGATC,GATTGATC,GACTGATC,GAGTGATC,GAATAATC,GAATATTC,GAATACTC,GAATAGTC,GATTAATC,GATTATTC,GATTACTC,GATTAGTC,GACTAATC,GACTATTC,GACTACTC,GACTAGTC,GAGTAATC,GAGTATTC,GAGTACTC,GAGTAGTC,GATCAATC,GATCATTC,GATCACTC,GATCAGTC
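
For readers who want to reproduce this kind of expansion, here is a minimal Python sketch. It assumes that `N` is the only ambiguity code present and that it stands for any of A, T, C or G; the function name `expand_n` is made up for this illustration and is not part of allhic or ALLHiC.

```python
from itertools import product


def expand_n(motif, alphabet="ATCG"):
    """Expand every N in a motif into all concrete bases (hypothetical helper)."""
    # Each position contributes either the full alphabet (for N) or itself.
    choices = [alphabet if base == "N" else base for base in motif.upper()]
    return ["".join(combo) for combo in product(*choices)]


arima_motifs = ["GATCGATC", "GANTGATC", "GANTANTC", "GATCANTC"]

# Flatten the per-motif expansions into one list of concrete 8-mers.
expanded = [seq for motif in arima_motifs for seq in expand_n(motif)]
print(",".join(expanded))
```

A motif with one `N` expands to 4 sequences and a motif with two expands to 16, so the four Arima motifs give 1 + 4 + 16 + 4 = 25 sequences in total.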
tanghaibao commented 4 years ago

@tangerzhang @wyim-pgl

I made the changes here and made new releases. You can now run:

allhic extract --RE="GATCGATC,GANTGATC,GANTANTC,GATCANTC" test.bam test.fasta

wyim-pgl commented 4 years ago

Thank you, dear!!. I will run it right now. Won

sjannielefevre commented 4 years ago

Hi

I have data generated using the Arima HiC+ kit, which has two restriction enzymes: Mbo I (^GATC) and Hinf I (G^ANTC). The ligation motifs are thus: GATCGATC, GANTGATC, GANTANTC, GATCANTC
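
The step from cut sites to ligation motifs can be sketched in Python. This is only an illustration under stated assumptions: both enzymes have palindromic sites that leave 5' overhangs, the overhangs are filled in before blunt-end ligation, and `^` marks the top-strand cut. The helper names `filled_ends` and `ligation_motifs` are invented for this sketch.

```python
from itertools import product


def filled_ends(site):
    """Return (left_end, right_end) of a cut site after 5' overhang fill-in.

    Assumes a palindromic site, so the bottom-strand cut mirrors the
    top-strand cut marked by '^' (hypothetical helper).
    """
    cut = site.index("^")
    seq = site.replace("^", "").upper()
    left = seq[: len(seq) - cut]  # upstream fragment end after fill-in
    right = seq[cut:]             # downstream fragment end after fill-in
    return left, right


def ligation_motifs(sites):
    """Join every filled-in left end with every filled-in right end."""
    ends = [filled_ends(site) for site in sites]
    return [left + right for (left, _), (_, right) in product(ends, repeat=2)]


print(ligation_motifs(["^GATC", "G^ANTC"]))
```

With Mbo I and Hinf I as input, the four junctions produced are the same four motifs listed above, though not necessarily in the same order.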

I have read this thread, which describes how to specify multiple enzymes to 'allhic extract' (i.e. --RE="GATCGATC, GANTGATC, GANTANTC, GATCANTC"). But the manual does not describe how to specify them for the first step in the pipeline (diploid genome), 'ALLHiC_partition'. There the argument is described as follows:

-e: enzyme_sites (HindIII: AAGCTT; MboI: GATC)

Can someone clarify this?

Best, Sjannie

tangerzhang commented 4 years ago

Hi Sjannie, we do not have Arima Hi-C data to test this, but in theory the original command should work, e.g.

ALLHiC_partition -r draft.asm.fasta -b sample.clean.bam -k 4 -e GATCGATC,GANTGATC,GANTANTC,GATCANTC

Please note that there should be no spaces between the enzyme sites.
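
If the site list is assembled in a script, a small guard against stray spaces might look like the Python sketch below. The helpers `normalize_sites` and `partition_command` are hypothetical and not part of ALLHiC; the flags `-r`, `-b`, `-k` and `-e` are simply copied from the command above.

```python
import re


def normalize_sites(raw):
    """Strip whitespace around comma-separated sites and validate them."""
    sites = [s.strip().upper() for s in raw.split(",") if s.strip()]
    for site in sites:
        # Only A, C, G, T and the wildcard N appear in this thread.
        if not re.fullmatch(r"[ACGTN]+", site):
            raise ValueError(f"unexpected characters in enzyme site: {site!r}")
    return ",".join(sites)


def partition_command(ref, bam, k, raw_sites):
    """Build the argument list for the wrapper call (the command is not run here)."""
    return ["ALLHiC_partition", "-r", ref, "-b", bam,
            "-k", str(k), "-e", normalize_sites(raw_sites)]


cmd = partition_command("draft.asm.fasta", "sample.clean.bam", 4,
                        "GATCGATC, GANTGATC, GANTANTC, GATCANTC")
print(" ".join(cmd))
```

Building the command as a list keeps the whole `-e` value in a single argument, so no shell quoting is needed if it is later passed to something like `subprocess.run`.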

sjannielefevre commented 4 years ago

Thank you very much for clarifying!

I noticed that when running the wrapper, ALLHiC_partition, the output from allhic extract becomes GATC_GANTC.txt, while the next command in the pipeline, allhic partition, expects a file named GATC,GANTC.txt (but I guess commas do not work well in file names and that is why it is changed). So I ran the commands manually (which is fine).

I am now rerunning as you suggested (except that I am running extract and partition separately rather than through the wrapper), because I can see I did not specify the sites correctly: I just used GATC,GANTC, which obviously inflated the number of RE sites found, around 1 per 187 bp vs. 1 per 3800 bp.
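
The size of this effect can be sketched in Python. The code below is not how allhic counts sites; it is a stand-alone illustration that treats `N` as any base, counts motif start positions with a regular expression, and estimates spacing under the assumption of uniformly random bases. The function names are invented for this example.

```python
import re


def count_sites(seq, motifs):
    """Count start positions in seq where any motif matches (N = any base)."""
    pattern = "|".join(m.upper().replace("N", "[ACGT]") for m in motifs)
    # The lookahead lets overlapping matches be counted as well.
    return sum(1 for _ in re.finditer(f"(?=(?:{pattern}))", seq.upper()))


def expected_spacing(motifs):
    """Expected bp between sites, assuming uniform random bases."""
    # Each fixed (non-N) base matches with probability 1/4.
    per_base_rate = sum(0.25 ** sum(b != "N" for b in m.upper()) for m in motifs)
    return 1.0 / per_base_rate


short_sites = ["GATC", "GANTC"]
ligation_sites = ["GATCGATC", "GANTGATC", "GANTANTC", "GATCANTC"]

toy = "GATCGATCAAGAATCTT"  # made-up sequence, for illustration only
print(count_sites(toy, short_sites), count_sites(toy, ligation_sites))
print(expected_spacing(short_sites), expected_spacing(ligation_sites))
```

Under the random-base assumption the short sites are expected about once per 128 bp and the 8-bp motifs about once per 2621 bp. Real genomes are not uniformly random, so these figures will not match the numbers reported above, but the gap points in the same direction.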

UPDATE: Worked well, got scaffolds of expected chromosome sizes.

wyim-pgl commented 4 years ago

I ran it with the Arima kit and it worked. I think I merged them all together (a merged BED file of restriction site locations).