tangerzhang / ALLHiC

ALLHiC: phasing and scaffolding polyploid genomes based on Hi-C data

Problem about release3DDNA.pl #128

Closed kedduck closed 2 years ago

kedduck commented 2 years ago

Dear tangerzhang

Thanks for your tool for polyploid genome assembly.

I am currently assembling an autotetraploid genome, following the workflow in #15. After running 3D-DNA, I ran the script release3DDNA.pl, and it reported "Substitution loop at release3DDNA.pl line 12, chunk 2."

Is this due to the long sequences?

I read a previous issue (#8) that explains what this script does. Could I use seq.FINAL.fasta directly as the input to ALLHiC?

Looking forward to your reply.

Thanks!

wangyibin commented 2 years ago

Hi, this seems to be a Perl limitation with regular expressions on very long strings (https://www.nntp.perl.org/group/perl.perl5.porters/2014/10/msg221751.html). You cannot use seq.FINAL.fasta directly as input for ALLHiC, because it is a chromosome-scale assembly. This command outputs the 3D-DNA-corrected assembly directly:

seqtk cutN -n 100 seq.FINAL.fasta | seqkit replace -p "^(\\S+)\\s?" -r 'tig{nr}' --nr-width 7 > tig.HiCcorrected.fasta
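
For readers without seqtk and seqkit at hand, the idea behind the command above can be sketched in Python: cut each scaffold at runs of at least 100 Ns, then rename the pieces tig0000001, tig0000002, and so on. This is a minimal sketch, not the tools' actual implementation; the function names `read_fasta` and `split_at_gaps` are hypothetical, and details such as how seqtk treats short N runs at sequence ends are not reproduced.

```python
import re
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the record name.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def split_at_gaps(records, min_gap: int = 100, width: int = 7) -> List[Tuple[str, str]]:
    """Cut sequences at N runs of length >= min_gap and rename the pieces.

    min_gap=100 mirrors `-n 100`; width=7 mirrors `--nr-width 7`.
    """
    gap = re.compile(r"[Nn]{%d,}" % min_gap)
    contigs: List[Tuple[str, str]] = []
    for _name, seq in records:
        for piece in gap.split(seq):
            if piece:  # skip empty pieces from leading or trailing gaps
                contigs.append((f"tig{len(contigs) + 1:0{width}d}", piece))
    return contigs


# Toy input: one scaffold with a 100-N gap, one with only a 2-N run.
toy = [">chr1 demo", "ACGT" + "N" * 100 + "GGCC", ">chr2", "TTNNAA"]
contigs = split_at_gaps(read_fasta(toy))
for name, seq in contigs:
    print(f">{name}\n{seq}")
```

With a real file, `read_fasta(open("seq.FINAL.fasta"))` would replace the toy list, and the printed output could be redirected to tig.HiCcorrected.fasta.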

kedduck commented 2 years ago

Thanks for your kind reply!

kedduck commented 2 years ago

Dear Developer:

Thanks for your answer. When I continued with ALLHiC, most contigs ended up in group 1 at the partition step. I think there may be a slight difference between release3DDNA.pl and your command.

The Perl script splits the sequence at every "N", whereas the command splits it only at runs of 100 or more "N"s (-n 100).

Am I right? Could you please provide some advice or a new script (although that may be an unreasonable request)?

Yujiaxin419 commented 2 years ago

Hi,

By default, 3DDNA adds 100 Ns between contigs as a gap. Therefore, I think splitting the sequence at every "N" or only at runs of 100 or more "N"s probably won't make a great difference to the result.
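
A small Python sketch can make this concrete. It assumes, as stated above, that gaps between contigs are runs of 100 Ns, and it takes the description of release3DDNA.pl (splitting at every N) from this thread rather than from the script itself. The helper names are hypothetical. The two strategies agree unless a contig carries its own short N run.

```python
import re


def split_every_n(seq):
    """Split at every run of one or more Ns."""
    return [p for p in re.split(r"N+", seq) if p]


def split_long_gaps(seq, min_gap=100):
    """Split only at runs of at least min_gap Ns."""
    return [p for p in re.split(r"N{%d,}" % min_gap, seq) if p]


gap = "N" * 100  # assumed inter-contig gap size

# Contigs without internal Ns: both strategies give the same pieces.
clean = "ACGT" + gap + "GGCC"
same = split_every_n(clean) == split_long_gaps(clean)

# A contig with a short internal N run: the strategies differ.
with_short_run = "ACGTNNTTAA" + gap + "GGCC"
every_n = split_every_n(with_short_run)
long_only = split_long_gaps(with_short_run)

print(same, len(every_n), len(long_only))
```

So any difference between the two approaches would come only from contigs that already contain N runs shorter than the gap size.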

This problem is common when assembling complex polyploid plant genomes. There are many possible causes, such as a contig N50 that is too short, chimeric contigs, and so on.

Here is the pipeline I personally use to assemble complex genomes:
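
To check the first of these causes, contig N50 can be computed from the contig lengths alone. The sketch below uses the usual definition (the length of the contig at which the cumulative sum, taken from longest to shortest, first reaches half of the total assembly length); the function name `n50` is only illustrative.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


# Total length is 54; 10 + 9 + 8 = 27 reaches half of it at the contig of length 8.
example = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])
print(example)
```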

  1. Contig assembly with Canu (see https://github.com/tangerzhang/ALLHiC/issues/15).
  2. PacBio CLR assemblies need polishing; I usually use NextPolish with default settings.
  3. Contig correction; I usually use 3D-DNA or ALLHiC_corrector.
  4. Mapping Hi-C reads using HiC-Pro.
  5. The ALLHiC pipeline. For the partition step, you can adjust the parameters "--minREs", "--maxLinkDensity", and "--nonInformativeRatio" of allhic partition (note: not ALLHiC_partition) until you obtain groups of reasonable length.
  6. In addition, for groups of abnormal length, you can use redundans to remove redundant contigs and ALLHiC_rescue to regroup them.

kedduck commented 2 years ago

Thanks for your great answer! It’s very helpful for me!