mapleforest / HaploMerger2


HaploMerger2 Multi thread implementation #7

Open JFsanchezherrero opened 6 years ago

JFsanchezherrero commented 6 years ago

Dear @mapleforest, I am working on a genome assembly project for a fairly large genome (1.8 Gbp) whose assembly is highly fragmented, possibly because of its level of heterozygosity. There are nearly 3 million contigs/scaffolds, with an N50 of around 800 bp.

After reading your manuscript and the manual, I decided to use HaploMerger2 on this project. I knew the N50 was not suitable, but I tried anyway.

I installed HaploMerger2 and ran the test examples, and I was delighted by what it might mean for my data. I then followed the manual to set the parameters, variables and threads for my genome project, and launched the run using the run_all.batch file.

I was astonished when I realised how long the initiation.pl step alone might take! Because of the current implementation, that step is not multi-threaded, so it reads and processes the sequences in the provided FASTA file one by one. After running for a day and a half it had converted only around 400k sequences into nib files, so I calculated it would take around 7-8 days just to finish the initiation step.

I realised that the initiation step recurs throughout the HaploMerger2 workflow, so I decided to look further into the code and try implementing threads.

And so I did! I tested and debugged it on a small test dataset, and then launched the process on my big genome project. The initiation step, which was going to take 7-8 days, ran in 9 hours using 60 CPUs.
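As a rough illustration of the general idea only (this is not the code in the repository linked below, and not HaploMerger2's own Perl), a per-sequence step can be spread over a pool of workers like this. The function names `read_fasta`, `write_record` and `convert_parallel` are hypothetical, and `write_record` just writes each record to its own file as a stand-in for whatever conversion the real step performs.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def read_fasta(path):
    """Yield (name, sequence) pairs from a multi-FASTA file."""
    name, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                # Keep only the first word of the header as the record name.
                name, chunks = line[1:].split()[0], []
            elif line:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def write_record(task):
    """Hypothetical per-sequence step: write one record to its own file.

    A real pipeline would call its converter here instead.
    """
    out_dir, name, seq = task
    out_path = os.path.join(out_dir, name + ".fa")
    with open(out_path, "w") as out:
        out.write(f">{name}\n{seq}\n")
    return out_path


def convert_parallel(fasta_path, out_dir, workers=4):
    """Run write_record on every record, using a pool of worker threads."""
    os.makedirs(out_dir, exist_ok=True)
    tasks = ((out_dir, name, seq) for name, seq in read_fasta(fasta_path))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input records.
        return list(pool.map(write_record, tasks))
```

Threads are used here on the assumption that the per-sequence work is dominated by file I/O or by an external command; for CPU-bound work in pure Python, a process pool would be the more usual choice.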

It is still running (hm.batchB2 right now), so I am keeping my fingers crossed that it finishes successfully and sheds some light on my genome assembly project!

You can find the information and details of the implementation on my GitHub profile: https://github.com/JFsanchezherrero/Haplomerger2_Multi-threads

I will give you further details on whether the code works, and also on whether it worked for my genome.

Regards,

Jose F.

mapleforest commented 6 years ago

Dear Jose,

HM2 creates files for each contig. When you have 3 million contigs, the Linux system takes forever to finish creating these files.

HM2 will never work on an assembly with an N50 of 800 bp; you have 3 million contigs/scaffolds because your assembly is too fragmented.
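For readers unfamiliar with the metric, N50 is the length L such that sequences of length L or longer contain at least half of the total assembled bases. A minimal sketch of that calculation follows; the function name `n50` is illustrative and is not part of HM2.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running total reaches half
        # of all bases is the N50.
        if running >= half:
            return length
    raise ValueError("no sequence lengths given")


n50([2, 2, 2, 3, 3, 4, 8, 8])  # 8
```

An N50 of 800 bp therefore means that half of the assembled bases sit in contigs of 800 bp or shorter.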

I am very sorry to say that HM2 is not suitable for this situation.

You may want to look for a way to increase the contiguity of the raw assembly before trying HM2.

Best regards,

Shengfeng.

On 2017/9/8 16:39, Jose Francisco Sanchez-Herrero wrote:


--

best regards,

Shengfeng Huang (黄盛丰)
School of Life Sciences, Sun Yat-sen University
hshengf2@mail.sysu.edu.cn
http://sklbc.sysu.edu.cn/Team/User/info.aspx?typeid=283&pid=46

