tangerzhang / ALLHiC

ALLHiC: phasing and scaffolding polyploid genomes based on Hi-C data

Allele.ctg.table does not eliminate any read in prune step #138

Open edanchin opened 2 years ago

edanchin commented 2 years ago

Dear developers of AllHiC

Thanks a lot for developing this software, which addresses the false-positive contact information between closely related copies that complicates the scaffolding of polyploid genomes.

For the polyploid species I'm working with, no well-assembled monoploid genome is available for a related species. Therefore, I ran MCScanX on my annotated contigs to identify which contigs are copies of one another. I then used this information to generate an Allele.ctg.table with the chromosome and position fields replaced by NA NA, as suggested in issue #9. In the end, I have a tab-separated file like:

NA NA contigx contigy contigz
NA NA contigw contign
etc.

However, when I provide this file to AllHiC for the prune step, the table seems to be read correctly according to the log.txt file, yet no read is eliminated and the file removedb_Allele.txt remains empty. The produced prunning.sam file still contains all the read pairs I want to be eliminated.
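As an illustration of the table-building step described above, here is a minimal Python sketch. It assumes the homology information has already been reduced to a list of contig pairs; the MCScanX parsing itself is not shown, and the function names `group_allelic_contigs` and `write_allele_table` are hypothetical. Grouping pairs into connected components is one plausible way to obtain allelic sets, not necessarily what was done in this thread.

```python
import csv
import io
from collections import defaultdict


def group_allelic_contigs(pairs):
    """Cluster contigs into groups: connected components of the pair graph."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(list)
    for contig in sorted(parent):
        groups[find(contig)].append(contig)
    return sorted(groups.values())


def write_allele_table(groups, handle):
    """Write one row per group: NA, NA, then the contigs, strictly tab-separated."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for contigs in groups:
        writer.writerow(["NA", "NA", *contigs])


# Hypothetical pairs, reusing the placeholder contig names from the post above.
pairs = [("contigx", "contigy"), ("contigy", "contigz"), ("contigw", "contign")]
groups = group_allelic_contigs(pairs)
buffer = io.StringIO()
write_allele_table(groups, buffer)
table_text = buffer.getvalue()
print(table_text, end="")
```

Writing through `csv.writer` with `delimiter="\t"` guarantees that only tabs separate the fields, which matters for the prune step.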

Any clue to solve this issue? Many thanks in advance

wangyibin commented 2 years ago

Hi, I have tested replacing the first two columns with NA, and it worked. Please provide some of your allele table.

edanchin commented 2 years ago

Hi and many thanks for your message,

I double checked my files and there was a problem in my Allele.ctg.table file which contained additional spaces instead of tabs.

I have removed the additional spaces and made sure everything was tab separated and now it works fine.
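A quick check along the following lines can catch this kind of delimiter problem before running the prune step. This is only a sketch: the function name `check_allele_table` is hypothetical, and the rule that each row needs at least two contigs after the two leading fields is an assumption about what makes a row informative, not a documented ALLHiC requirement.

```python
def check_allele_table(lines):
    """Return (line_number, message) tuples for suspicious rows in an allele table."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        if " " in line:
            problems.append((number, "contains a space; fields should be tab-separated"))
        fields = line.split("\t")
        # Assumption: two leading fields (NA NA) plus at least two contigs.
        if len(fields) < 4:
            problems.append((number, "fewer than 4 tab-separated fields"))
        if any(field == "" for field in fields):
            problems.append((number, "empty field (doubled or trailing tab?)"))
    return problems


good = ["NA\tNA\tcontigx\tcontigy\tcontigz\n", "NA\tNA\tcontigw\tcontign\n"]
bad = ["NA NA\tcontigx\tcontigy\n", "NA\tNA\tcontigw \tcontign\n"]

print(check_allele_table(good))
for number, message in check_allele_table(bad):
    print(f"line {number}: {message}")
```

To check a real file, the same function can be called on an open file handle, for example `check_allele_table(open("groups_100.txt"))`.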

I am using AllHiC version 0.9.13 and the following command line:

ALLHiC_prune -i groups_100.txt -b sample.clean.bam -r Minc_v4_shac_genome.fasta

groups_100.txt is the tab-delimited file whose first two columns are NA and whose remaining columns indicate which contigs are copies of one another.

I checked the logs and removedb_Allele.txt and indeed it seems all the reads that corresponded to contacts between allelic contigs have been removed.

Now I'm running the next step (partition).

Just one additional question if I may.

Can I provide two restriction sites in the -e option?

I use the Arima two-enzyme kit, which cuts at both GATC and GANTC.

Can I use -e GATC,GANTC, or is only one enzyme allowed?

Many thanks for your help

Etienne

--


                              Etienne G.J. Danchin
                             http://edanchin.org

Institut Sophia Agrobiotech INRAE - Univ. Côte d'Azur - CNRS

400 route des Chappes, BP 167 06903 Sophia-Antipolis Cedex France

http://www.paca.inra.fr/institut-sophia-agrobiotech Tel. +33 492 386 402 Fax. +33 492 386 587



wangyibin commented 2 years ago

Hi,

> Can I provide two restriction sites in the -e option?

You can use -e Arima in the latest version of ALLHiC_partition.

$ ALLHiC_partition
Usage: ALLHiC_partition -r draft.asm.fasta -e enzyme_sites -k Num of groups
      -h : help and usage.
      -b : prunned bam (optional, default prunning.bam)
      -r : draft.sam.fasta
      -e : enzyme_sites (HindIII: AAGCTT; MboI: GATC, Arima)
      -k : number of groups (user defined K value)
      -m : minimum number of restriction sites (default, 25)
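To get a feel for how the two Arima motifs quoted earlier in the thread (GATC and GANTC) relate to the `-m` threshold, the sketch below counts motif occurrences per contig in plain Python. It is an illustration only: how ALLHiC_partition counts sites internally, and what `-e Arima` expands to, are not stated in this thread, and the names `count_sites` and `contigs_below_minimum` are hypothetical.

```python
import re

# Motifs quoted in the thread for the Arima kit: GATC and GANTC (N = any base).
# The lookahead lets overlapping occurrences be counted.
ARIMA_PATTERN = re.compile(r"(?=(GATC|GA[ACGT]TC))")


def count_sites(sequence, pattern=ARIMA_PATTERN):
    """Count motif start positions in a sequence (case-insensitive)."""
    return sum(1 for _ in pattern.finditer(sequence.upper()))


def contigs_below_minimum(contigs, minimum=25):
    """Return names of contigs with fewer motif hits than `minimum`.

    The default of 25 mirrors the -m default in the usage text above;
    whether ALLHiC counts sites the same way is an assumption.
    """
    return sorted(name for name, seq in contigs.items() if count_sites(seq) < minimum)


# Toy sequences, not real data.
toy = {
    "ctg_rich": "GATCAAGAGTCTTGATTC" * 10,
    "ctg_poor": "ACGTACGTACGTAAAA",
}
print({name: count_sites(seq) for name, seq in toy.items()})
print(contigs_below_minimum(toy))
```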

edanchin commented 2 years ago

Many thanks,

I will re-run everything with the latest version!

One last question, if I may: the species I am studying is triploid, with 3n = 45-47 chromosomes. The closest non-polyploid genome has n = 16 chromosomes. Hence, I am wondering how many groups I should select for the partition step: 47, 16, or 3?

Any advice about this parameter?

Thanks again for your help

Etienne


wangyibin commented 2 years ago

You can try setting 47 groups or more. In our experience, the final grouping of chromosomes can be determined using Hi-C heatmaps. Groups that are too small can be discarded manually, and chromosomes that were split across groups can be merged manually.

edanchin commented 2 years ago

Many thanks, I will try with 50 groups and then manually edit the assembly using the contact map if I notice some problems in the scaffolding.

All the best

Etienne
