mahulchak / quickmerge

A simple and fast metassembler and assembly gap filler designed for long molecule based assemblies.
GNU General Public License v3.0

Confusion about argument ```-l``` and ```-ml``` #62

Open neptuneyt opened 3 years ago

neptuneyt commented 3 years ago

Dear quickmerge team, I have installed the latest quickmerge, which supports MUMmer 4, but I am confused by the arguments -l and -ml. According to the manual:

-l LENGTH_CUTOFF, --length_cutoff LENGTH_CUTOFF, which means the minimum seed contig length to be merged (default=0)
-ml MERGING_LENGTH_CUTOFF, --merging_length_cutoff MERGING_LENGTH_CUTOFF, which means setting the merging length cutoff necessary for use in quickmerge (default 5000)

Do they mean the same as described in the picture below?

[image: https://user-images.githubusercontent.com/39893798/105624681-a00bd980-5e5e-11eb-951b-cf8ba71b3926.png]

Thanks a lot!

mahulchak commented 3 years ago

Hi,

-l represents the minimum length of the seed contig. In the figure, the large blue circle is the seed contig and -l would determine its minimum length. In your highlighted example, the orange one is the seed contig and -l would determine its minimum length.

The description of -ml is a little off. -ml determines the minimum alignment length that will be included in the merging process. Any alignment shorter than -ml will not be merged.

I hope this helps. Let me know if you have any other questions.
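To make the distinction in the reply above concrete, here is a minimal Python sketch of the two cutoffs. It is not quickmerge's actual code: the `Alignment` record, the function names, the contig names and the sample numbers are all hypothetical, and treating a value exactly equal to the cutoff as passing is an assumption. It only illustrates that -l acts on the length of a contig considered as a seed, while -ml acts on the length of an alignment.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    ref: str
    query: str
    length: int  # aligned length in bp


def seed_candidates(contig_lengths, length_cutoff=0):
    """Contigs long enough to act as a seed (the role described for -l)."""
    return [name for name, n in contig_lengths.items() if n >= length_cutoff]


def usable_alignments(alignments, merging_length_cutoff=5000):
    """Alignments long enough to take part in merging (the role described for -ml)."""
    return [a for a in alignments if a.length >= merging_length_cutoff]


# Hypothetical data, for illustration only.
contigs = {"ctgA": 12000, "ctgB": 800}
alns = [Alignment("ctgA", "q1", 6500), Alignment("ctgA", "q2", 1200)]

print(seed_candidates(contigs, length_cutoff=1000))
print([a.query for a in usable_alignments(alns)])
```

Under this reading, lowering -ml lets shorter alignments into the merging step, whereas changing -l only changes which contigs may start a merge.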


-- Mahul Chakraborty Department of Ecology and Evolutionary Biology University of California-Irvine Phone: 949 824 9559 Fax: 949 824 9559 Website: https://mahulchakraborty.wordpress.com/ Github: https://github.com/mahulchak

neptuneyt commented 3 years ago

Thanks for your kind and timely reply, but I still fail to understand -ml. I have tested -l and -ml, with the results below:

image

In the first table, I varied -l from 10 to 5000, but the merged sequences did not improve compared with the two raw contig sets. In the second table, I varied -ml from 1 to 5000, and the quality of the merged sequences was affected by its value. How should I understand these results? Looking forward to your reply, thanks a lot.

neptuneyt commented 3 years ago

Sorry to disturb you. I have done another clean test: I extracted 10k contigs from each of the two assemblies and then ran this command:

nohup merge_wrapper.py    -l 10 -ml 10 -t 50 -v C1_10k.fa C2_10k.fa  &>log&

From param_summary_out.txt, I counted 205 pairs of overlapping contigs, so the overlap rate is only 0.0205 (205/10000).

REF QUERY REF_START REF_END Q_START Q_END ORIENTATION INNIE(1/0) OVERLAP_LEN OVERLAP_PROP NO_OVERLAP_AT_ENDS OVERHANG
1 Cluster2_k141_213238 Cluster1_k141_1058634 2107 2631 5332 4809 R 0 523 0.263343 1986 3787
...
Cluster2_k141_15754305 Cluster1_k141_1064426 57 3426 4010 642 L 0 3368 4.82521 698 154
205 Cluster2_k141_11926444 Cluster1_k141_109005 2257 3942 2199 3893 L 1 1694 0.6776 2500 0
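A small arithmetic check on the rows above, not taken from the quickmerge documentation: in all three rows shown, OVERLAP_PROP equals OVERLAP_LEN divided by NO_OVERLAP_AT_ENDS. The Python sketch below only verifies this on the posted numbers. Whether quickmerge actually defines the column this way is an assumption, and `overlap_ratio` is a hypothetical helper name.

```python
# (OVERLAP_LEN, NO_OVERLAP_AT_ENDS, OVERLAP_PROP) copied from the three rows shown above.
rows = [
    (523, 1986, 0.263343),
    (3368, 698, 4.82521),
    (1694, 2500, 0.6776),
]


def overlap_ratio(overlap_len, no_overlap_at_ends):
    """Ratio of overlapping to non-overlapping bases (assumed meaning of OVERLAP_PROP)."""
    return overlap_len / no_overlap_at_ends


for overlap_len, no_overlap, reported in rows:
    ratio = overlap_ratio(overlap_len, no_overlap)
    print(f"{overlap_len}/{no_overlap} = {ratio:.6g} (reported {reported})")
```

If this reading is right, a value above 1 (as in the second row) would mean the overlap is longer than the part of the contig that does not overlap.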

The two raw 10k contig sets total 113M (113248645 bp), but merged_out.fasta totals only 59M (59258534 bp), which does not make sense given the low overlap rate (2%). So I checked one of the overlapping pairs; its overlap relationship is shown below:

REF QUERY REF_START REF_END Q_START Q_END ORIENTATION INNIE(1/0) OVERLAP_LEN OVERLAP_PROP NO_OVERLAP_AT_ENDS OVERHANG
Cluster2_k141_6817205 Cluster1_k141_1166759 1020 4705 1 3684 R 0 3683 3683 1 257

image
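For anyone who wants to reproduce the total sizes quoted above (113248645 bp for the raw sets, 59258534 bp for merged_out.fasta), a minimal Python sketch for summing sequence lengths in a FASTA file could look like this. The function name `fasta_lengths` and the toy records are hypothetical; the sketch assumes plain, uncompressed FASTA and uses the first word of each header as the sequence ID.

```python
def fasta_lengths(lines):
    """Map each sequence ID (first word of the header) to its length in bp."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


# Toy records, for illustration only.
example = [">ctg1 some description", "ACGTAC", "GT", ">ctg2", "AAAA"]
lengths = fasta_lengths(example)
print(sum(lengths.values()), "bp in", len(lengths), "sequences")
```

The same function accepts an open file handle, for example `with open("merged_out.fasta") as fh: lengths = fasta_lengths(fh)`, so the raw and merged totals can be compared directly.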

I found a sequence named Cluster2_k141_6817205 in merged_out.fasta. It seems the merged sequence is named after the longer of the two overlapping contigs, and the merge itself looks correct. That is what I find strange.

Then I checked the sequence IDs in merged_out.fasta:

Source Numbers
from Cluster1 9941
from Cluster2 58
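A tally like the one above can be produced by grouping sequence IDs on their prefix. The Python sketch below is one possible way to do it. `count_by_source` is a hypothetical helper, and it assumes the source name is everything before the first underscore, as in the IDs posted in this thread.

```python
from collections import Counter


def count_by_source(ids, sep="_"):
    """Count sequence IDs by the text before the first separator."""
    return Counter(seq_id.split(sep, 1)[0] for seq_id in ids)


# Three IDs taken from the listings in this thread, for illustration.
ids = ["Cluster1_k141_1025", "Cluster1_k141_1026", "Cluster2_k141_1067037"]
counts = count_by_source(ids)
print(counts["Cluster1"], counts["Cluster2"])
```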

For the 9941 sequences in merged_out.fasta that come from Cluster1, all merged contig lengths seem to be the same as the raw lengths:

Source contig_length
Raw.tsv:Cluster1_k141_1025 3698
Merge.tsv:Cluster1_k141_1025 3698
Raw.tsv:Cluster1_k141_1026 3852
Merge.tsv:Cluster1_k141_1026 3852
Raw.tsv:Cluster1_k141_1040 8359
Merge.tsv:Cluster1_k141_1040 8359
Raw.tsv:Cluster1_k141_1057577 8707
Merge.tsv:Cluster1_k141_1057577 8707
Raw.tsv:Cluster1_k141_1057886 3968
Merge.tsv:Cluster1_k141_1057886 3968
Raw.tsv:Cluster1_k141_1057955 3078
Merge.tsv:Cluster1_k141_1057955 3078
Raw.tsv:Cluster1_k141_1058039 3038
Merge.tsv:Cluster1_k141_1058039 3038
Raw.tsv:Cluster1_k141_1058096 4079
Merge.tsv:Cluster1_k141_1058096 4079
Raw.tsv:Cluster1_k141_1058151 3719
Merge.tsv:Cluster1_k141_1058151 3719
Raw.tsv:Cluster1_k141_1058269 3248
Merge.tsv:Cluster1_k141_1058269 3248
Raw.tsv:Cluster1_k141_1058399 7611
Merge.tsv:Cluster1_k141_1058399 7611

For the 58 sequences in merged_out.fasta that come from Cluster2, the merged contig lengths are larger than, or in some cases equal to, the raw lengths:

Raw.tsv:Cluster2_k141_10429771 4993
Merge.tsv:Cluster2_k141_10429771 5069
Raw.tsv:Cluster2_k141_10436849 10727
Merge.tsv:Cluster2_k141_10436849 12696
Raw.tsv:Cluster2_k141_10643446 5615
Merge.tsv:Cluster2_k141_10643446 7713
Raw.tsv:Cluster2_k141_1067037 6430
Merge.tsv:Cluster2_k141_1067037 6430
Raw.tsv:Cluster2_k141_1067215 11431
Merge.tsv:Cluster2_k141_1067215 20071
Raw.tsv:Cluster2_k141_1067595 11140
Merge.tsv:Cluster2_k141_1067595 11140
Raw.tsv:Cluster2_k141_10859382 4492
Merge.tsv:Cluster2_k141_10859382 4492
Raw.tsv:Cluster2_k141_11711522 6219
Merge.tsv:Cluster2_k141_11711522 7268
Raw.tsv:Cluster2_k141_11713665 3653
Merge.tsv:Cluster2_k141_11713665 5739
Raw.tsv:Cluster2_k141_12137628 6638
Merge.tsv:Cluster2_k141_12137628 7152
Raw.tsv:Cluster2_k141_1279455 28667
Merge.tsv:Cluster2_k141_1279455 29290
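The raw-versus-merged comparison in the listings above can be automated once both sets of lengths are in dictionaries. The Python sketch below is a hedged illustration: `length_changes` is a hypothetical helper, and the four entries are copied from the listings in this thread.

```python
def length_changes(raw, merged):
    """Return {id: (raw_len, merged_len)} for IDs present in both whose length differs."""
    return {
        seq_id: (raw[seq_id], merged[seq_id])
        for seq_id in raw.keys() & merged.keys()
        if raw[seq_id] != merged[seq_id]
    }


# Lengths copied from the listings above.
raw = {
    "Cluster1_k141_1025": 3698,
    "Cluster2_k141_10429771": 4993,
    "Cluster2_k141_1067037": 6430,
    "Cluster2_k141_1067215": 11431,
}
merged = {
    "Cluster1_k141_1025": 3698,
    "Cluster2_k141_10429771": 5069,
    "Cluster2_k141_1067037": 6430,
    "Cluster2_k141_1067215": 20071,
}

for seq_id, (before, after) in sorted(length_changes(raw, merged).items()):
    print(seq_id, before, "->", after, f"({after - before:+d} bp)")
```

Only the IDs whose length changed are reported, which makes it easy to see which contigs were actually extended by a merge and which were passed through unchanged.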

So how can I explain the results above? As I understand it, does quickmerge's final merged genome consist of the extended contigs from the overlapping pairs plus the non-overlapping contigs from each set?
Looking forward to your reply. Thanks a lot!