How to identify merged contigs in output

adadiehl commented 4 months ago

I am attempting to use quickmerge in an attempt to merge contigs from a long-read assembly into chromosome-length scaffolds using the hg38 reference genome. Please see the example command below:

merge_wrapper.py GM18519-ONT-hg38-R9-LSK110-guppy-sup-5mC.hapdup_dual_1.fasta hg38.fa -hco 5 -c 1.5 -l 500000 -v -t 24

The long-read assembly is from (here)[https://s3.amazonaws.com/1000g-ont/ALIGNMENT_AND_ASSEMBLY_DATA/FIRST_100/NAPU_PIPELINE/HG38/GM18519-ONT-hg38-R9-LSK110-guppy-sup-5mC/GM18519-ONT-hg38-R9-LSK110-guppy-sup-5mC.hapdup_dual_1.fasta] and hg38.fa is straight from UCSC.

Looking at the results, I am unable to tell which contigs in the output result from merging operations, so cannot compare to input contigs/chromosomes to ensure the output is correct. How do I identify merged contigs in the output fasta, and how do I tell what editing operations were done to generate the merged contigs? My goal here is to end up with the same number of contigs as hg38, excepting unmapped contigs, which appear to be written to the output unchanged. (Can you verify this is the case?)

Thank you in advance for your help!

mahulchak commented 4 months ago

The merged contigs will have names from the 'reference' assembly. Unchanged ones will have names of the 'query' assembly contigs.

On Thu, Jul 18, 2024, 2:08 PM Adam Diehl @.***> wrote:

I am attempting to use quickmerge in an attempt to merge contigs from a long-read assembly into chromosome-length scaffolds using the hg38 reference genome. Please see the example command below:

merge_wrapper.py GM18519-ONT-hg38-R9-LSK110-guppy-sup-5mC.hapdup_dual_1.fasta hg38.fa -hco 5 -c 1.5 -l 500000 -v -t 24

The long-read assembly is from (here)[ https://s3.amazonaws.com/1000g-ont/ALIGNMENT_AND_ASSEMBLY_DATA/FIRST_100/NAPU_PIPELINE/HG38/GM18519-ONT-hg38-R9-LSK110-guppy-sup-5mC/GM18519-ONT-hg38-R9-LSK110-guppy-sup-5mC.hapdup_dual_1.fasta https://urldefense.com/v3/__https://s3.amazonaws.com/1000g-ont/ALIGNMENT_AND_ASSEMBLY_DATA/FIRST_100/NAPU_PIPELINE/HG38/GM18519-ONT-hg38-R9-LSK110-guppy-sup-5mC/GM18519-ONT-hg38-R9-LSK110-guppy-sup-5mC.hapdup_dual_1.fasta__;!!CzAuKJ42GuquVTTmVmPViYEvSg!OulPU3IsW7KYoQ9kJ-hHD1xwvGCXqkrb1HH4SFWUps_4VxtxZY__3THKzB6R02T2zQqhO5FokVw9G_hf5WfOHsf5$] and hg38.fa is straight from UCSC.

Looking at the results, I am unable to tell which contigs in the output result from merging operations, so cannot compare to input contigs/chromosomes to ensure the output is correct. How do I identify merged contigs in the output fasta, and how do I tell what editing operations were done to generate the merged contigs? My goal here is to end up with the same number of contigs as hg38, excepting unmapped contigs, which appear to be written to the output unchanged. (Can you verify this is the case?)

Thank you in advance for your help!

— Reply to this email directly, view it on GitHub https://urldefense.com/v3/__https://github.com/mahulchak/quickmerge/issues/74__;!!CzAuKJ42GuquVTTmVmPViYEvSg!OulPU3IsW7KYoQ9kJ-hHD1xwvGCXqkrb1HH4SFWUps_4VxtxZY__3THKzB6R02T2zQqhO5FokVw9G_hf5U0oGwbS$, or unsubscribe https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/ABZQH2CTYQJU4QQCUGTCFCDZNAHCXAVCNFSM6AAAAABLDIOUISVHI2DSMVQWIX3LMV43ASLTON2WKOZSGQYTOMJYGQ4TAMY__;!!CzAuKJ42GuquVTTmVmPViYEvSg!OulPU3IsW7KYoQ9kJ-hHD1xwvGCXqkrb1HH4SFWUps_4VxtxZY__3THKzB6R02T2zQqhO5FokVw9G_hf5YX2N8ag$ . You are receiving this because you are subscribed to this thread.Message ID: @.***>

adadiehl commented 4 months ago

Thank you for the fast response!

I took another look at the output and am finding no evidence of merging. See the following experiment:

merge_wrapper.py ../../GM18519-ONT-hg38-R9-LSK110-guppy-sup-5mC.hapdup_dual_1.fasta hg38.chr1.fa -hco 1 -c 1 -l 0 -v -t 24
awk -F'>' 'NR==FNR{ids[$0]; next} NF>1{f=($2 in ids)} f' <(echo "chr1") merged_out.fasta > tmp.txt
md5sum tmp.txt
c7f5ee27a469297fa051ecb355aa2d40  tmp.txt
md5sum hg38.chr1.fa
c7f5ee27a469297fa051ecb355aa2d40  hg38.chr1.fa

(where hg38.chr1.fa is simply the chr1 record from the full hg38 assembly, extracted into its own file)

So it seems chromosome 1 (plus all the contigs from the query assembly) are present in the output, but there is no evidence that chr1 was used to merge anything. It seems like quickmerge is simply passing all the contigs from both files through unchanged. I have tried setting the -hco, -c, and -l params unrealistically low to see if this affects merging but there was no change to the results. Why would this happen?

To make sure I understand correctly, the query assembly is the first argument to merge_wrapper.py and the second argument is the reference. Is that right?

mahulchak / quickmerge

How to identify merged contigs in output #74