mapleforest / HaploMerger2

40 stars 6 forks source link

Haplomerger2 after FALCON run #6

Open a-velt opened 7 years ago

a-velt commented 7 years ago

Dear Haplomerger2 developer,

First of all, thank you for this nice tool, which give me very good results !

But I have one question about on how it works. I have assembled my genome (Grape genome, 500Mb, highly heterozygous) from pacbio reads (120X of depth) with the FALCON tool. So I have two file at the end, one file containing the primary contigs (which can be represented by one allele version) and one file containing the alternative contigs (called haplotigs). Thanks to the length of pacbio reads, my assembly is (normally) phased.

But, after analyzing my assembly, I realized that some "haplotigs" were still present in my primary contigs, due to excessive heterozygosity in this "region". That's why I decided to use haplomerger2 to fix this.

I put my primary contigs and my haplotigs in the same file and I launched haplomerger2, after creating my own score matrix.

I have very good results, but I have a question. Is my assembly always phased? In haplomerger2's way of working, is it possible that he could have changed something in my variations? More clearly, did he make any changes in my sequences or did he "just" separate my two alleles.

I thank you again for this super tool which allowed to clean very effectively my primary contigs!

Best, Amandine

mapleforest commented 7 years ago

Dear Amandine,

Only the "gap filling" stage will add new nucleatides to the assembly.

And the "tandem removal" stage may remove some sequences from the assembly.

And the "HaploMerger" stage may connect two super contigs together if they have suffciently long overlap.

In your case, since you have a very good diploid and well-phased diploid assemby,

if you do not want HM2 to connect any supper contigs (to avoid HM2-induced phase switches),

you may set minOverlap to a very large value, such as 99,999,999 (see details in page 15 of the manual),

in order to forbid contig joining.

Besides, you may try to understand the output file hm.new_scaffolds, which contains the

layout used to create the reference haploid assembly from the original diploid assembly.

You can actually create you own choice of haploid assembly based on the data of this file.

Hope these will help you.

Best regards,

在 2017/8/7 17:08, Amandine Velt 写道:

Dear Haplomerger2 developer,

First of all, thank you for this nice tool, which give me very good results !

But I have one question about on how it works. I have assembled my genome from pacbio reads (120X of depth) with the FALCON tool. So I have two file at the end, one file containing the primary contigs (which can be represented by one allele version) and one file containing the alternative contig (called haplotig). Thanks to the length of pacbio reads, my assembly is phased.

But, after analyzing my assembly, I realized that some "haplotigs" were still present in my primary contigs. That's why I decided to use haplomerger2 to fix this.

I put my primary contigs and my haplotigs in the same file and I launched haplomerger2, after creating my own score matrix.

I have very good results, but I have a question. Is my assembly always phased? In haplomerger2's way of working, is it possible that he could have changed something in my SVs? More clearly, did he make any changes in my sequences or did he "just" separate my two alleles.

I thank you again for this super tool which allowed to clean very effectively my primary contigs!

Best, Amandine

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/mapleforest/HaploMerger2/issues/6, or mute the thread https://github.com/notifications/unsubscribe-auth/AOtnAB9iY0_UmY4lElbyywfu3Y3jvWxTks5sVtQEgaJpZM4OvLUc.

--

best regards,

黄盛丰 Shengfeng Huang 中山大学生命科学学院 School of life sciences, Sun Yat-sen university hshengf2@mail.sysu.edu.cn http://sklbc.sysu.edu.cn/Team/User/info.aspx?typeid=283&pid=46


本邮件及其附件含有发送给特定个人和用于特定目的的保密信息。如果您不是预期的收件人,请立即删除本邮件并通知发件人。严禁任何非预期的收件人使用、传播、分发或复制本邮件或其附件。 This email and its attachments may contain confidential information intended for a specific individual and purpose. If you are not the intended recipient, you should delete this email and notify the sender immediately. Any use, dissemination, distribution, or copying of this email or its attachments by persons other than the intended recipient(s), is strictly prohibited.

a-velt commented 7 years ago

Thank you for your advices !

In addition to creating my own score matrix, I also set the maskFilter parameter to 85 and the redundantFilter parameter to 85 as I was losing too many sequences during the first HM2 run.

Now, I remove the step 4 (remove tandem errors from haploid assemblies) and I set the minOverlap parameter to 99999999.

During the _A3.pathFinder_preparation step, I got the following error : 'Species included: corrected_assembly_windowmasker corrected_assembly_windowmaskerx Set the scoring scheme to score, and set the filter score/ali_len to 100000 . Set to OVER-WRITING mode! Set to DELETING mode! Produce tsc/qsc_ids, total id are 4733 ! Read the zeroMinSpace.rbest.net.gz file and do some basic node filtering ... Modification of non-creatable array value attempted, subscript -1 at HaploMerger2_20161205/bin/HM_pathFinder_preparation.pl line 252.'

With the command : "HaploMerger2_20161205/bin/HM_pathFinder_preparation.pl --Species corrected_assembly_windowmasker corrected_assembly_windowmaskerx --scoreScheme=score --filter=100000 --Force --Delete"

And in A3.pathFinder step I got "Can not open corrected_assembly_windowmasker.corrected_assembly_windowmaskerx.result/hm.scaffolds!".

Have you an idea about it ?

Best, Amandine

EDIT : I forgot to copy the ".ctl" files to my working folder, everything works fine. Thanks again.

mapleforest commented 7 years ago

Dear Amandine,

  1. I think there are fewer tandem errors in a PacBio-based assembly.

  2. If many sequences are sent to the unpaired sequence file, it means the polymorphism rate is low and many contigs do not have a corresponding alleles in the raw assembly. What you need to do is to merge the unpaired sequences and the main assembly file by using hm.batchB4-5. So what is the size and the polymorphism rate in the diploid assembly?

  3. hm.batchA does not remove any sequences, it just break up the potential misjoin, you do not need to set the minOverlap in this stage.

  4. About the error it appears the chainNet result is not complete. Is it possible the open file handles are not sufficient? How many contigs do you have? (If the raw diploid assembly has >2000 scaffold sequences,bmake sure to lift your system’s openable file handle limit by invocating “ulimit –n 655350” before running HM2).

  5. A very relevant point I forgot to mentioned: The algorithms of de novo assemblers (and Haplomerger, too) never gurantee phased assemblies. So I recommend first to create a good reference haploid assembly which is focusing on correctness, continuity and representativity; then to use this haploid assembly and a professional phasing tool to create good phased assemblies for this sequenced individuals or other resequencing indidivuals. In this case you do not need to re-set minOvelap, after all HM2 many connect different contigs,but will not tamper the phased sequences.

Best regards,

在 2017/8/7 21:46, Amandine Velt 写道:

Thank you for your advices !

In addition to creating my own score matrix, I also set the maskFilter parameter to 85 and the redundantFilter parameter to 85 as I was losing too many sequences during the first HM2 run.

Now, I remove the step 4 (remove tandem errors from haploid assemblies) and I set the minOverlap parameter to 99999999.

During the _A3.pathFinder_preparation step, I got the following error : 'Species included: corrected_assembly_windowmasker corrected_assembly_windowmaskerx Set the scoring scheme to score, and set the filter score/ali_len to 100000 . Set to OVER-WRITING mode! Set to DELETING mode! Produce tsc/qsc_ids, total id are 4733 ! Read the zeroMinSpace.rbest.net.gz file and do some basic node filtering ... Modification of non-creatable array value attempted, subscript -1 at HaploMerger2_20161205/bin/HM_pathFinder_preparation.pl line 252.'

With the command : "HaploMerger2_20161205/bin/HM_pathFinder_preparation.pl --Species corrected_assembly_windowmasker corrected_assembly_windowmaskerx --scoreScheme=score --filter=100000 --Force --Delete"

Have you an idea about it ?

Best, Amandine

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/mapleforest/HaploMerger2/issues/6#issuecomment-320667914, or mute the thread https://github.com/notifications/unsubscribe-auth/AOtnANBDzfEnCWnoIh6AdvB2xehqCEnRks5sVxU7gaJpZM4OvLUc.

--

best regards,

黄盛丰 Shengfeng Huang 中山大学生命科学学院 School of life sciences, Sun Yat-sen university hshengf2@mail.sysu.edu.cn http://sklbc.sysu.edu.cn/Team/User/info.aspx?typeid=283&pid=46


本邮件及其附件含有发送给特定个人和用于特定目的的保密信息。如果您不是预期的收件人,请立即删除本邮件并通知发件人。严禁任何非预期的收件人使用、传播、分发或复制本邮件或其附件。 This email and its attachments may contain confidential information intended for a specific individual and purpose. If you are not the intended recipient, you should delete this email and notify the sender immediately. Any use, dissemination, distribution, or copying of this email or its attachments by persons other than the intended recipient(s), is strictly prohibited.

a-velt commented 7 years ago
  1. Yes, I seen few differences between my _A_ref.fa and my ref_D.fa files, so few tandem errors.

  2. After HM2 run, with minOverlap=99999999 or the default value, I have a size genome (from ref.fa) of 515 Mb. How can I determine the polymorphism rate ?

  3. Ok, I will re-run HM2 without changing minOverlap parameter for hm.batchA.

  4. The error was because I forgot to copy the ".ctl" files to my working folder. Everything is working fine now.

  5. In the manual of Falcon assembly tool, it is said that Falcon phase heterozygous SNPs, so I want to keep this phase after HM2 run. I have primary contigs and alternative contigs after Falcon, and I want to keep the same sequence. With HM2, I just want to put the remaining alternative contigs in the primary contigs file to the alternative contigs file. And I don't know if HM2 change something in my input sequences. Before launching HM2, I put my primary contigs and alternative contigs in one file.

Before changing the minOverlap option, I had good statistics on my reference assembly after HM2 : Number of contigs : 1537 (before HM2 I had 1825 contigs) Genome size : 515 Mb (before HM2 I had 588 Mb) L50 : 85 contigs (before HM2 I had 110 contigs)

I launched BUSCO on the assembly after HM2 and I found more complete genes than before HM2, so I'm very happy of these results, but I don't know if my assembly has kept the same phase.

After changing the minOverlap option, I have worse statistics on my reference assembly : Number of contigs : 2335 (more than before HM2 because I added the haplotigs to my primary contigs, this did not cause a problem before changing the minOverlap option) Genome size : 515 Mb (before HM2 I had 588 Mb) L50 : 154 contigs (before HM2 I had 110 contigs)

So I think the results are better by not changing the minOverlap option, but I don't know how to check if the phase has not changed compared to the assembly of Falcon.

Sorry for all these questions and the time it takes you.

Best, Amandine

mapleforest commented 6 years ago

Dear Amandine, If you have two contigs of different phases, then HaploMerger might join these two contigs together.

I still recommend to compute a better phased genome assembly based on the initial reference genome coming off from the falcon-HaploMerger2 pipeline.

Anyway, if you want to know in the reference assembly, which scaffolds from which contigs, you can refer to the output file named "hm.new_scaffolds" from the batchB*. This file shows how HM2 chooses the raw contigs to form the final scaffolds.

Best regards, Shengfeng.