Mapping Paired-End Reads with PHG Takes Too Long

sidevshiy commented 1 month ago

Description

I have extracted the hVCF files and built the k-mer index. Now, I want to map paired-end reads. When I run the following command:

export JAVA_OPTS="-Xmx800g" phg map-kmers \ --hvcf-dir /fastscratch/liakh/phg2/data/exported_hvcf \ --kmer-index /fastscratch/liakh/phg2/output/vcf_files/kmerIndex.txt \ --key-file /fastscratch/liakh/phg2/data/key_files/read_mapping_data.txt \ --output-dir /fastscratch/liakh/phg2/output/read_mappings \ --threads 30

the log shows that everything is okay and it seems to build the graph, but it has been running for almost 20 hours and is still not done. I assume that there might be a problem somewhere or is it normal?

Jul 23, 2024 10:03:35 AM net.maizegenetics.phgv2.pathing.MapKmers run INFO: The --min-proportion-same-reference-range option is not used unless --limit-single-ref-range is used. Ignoring --min-proportion-same-reference-range. Jul 23, 2024 10:03:35 AM net.maizegenetics.phgv2.pathing.MapKmers run INFO: Begin mapping reads to the pangenome kmer index. [DefaultDispatcher-worker-1] INFO net.maizegenetics.phgv2.api.HaplotypeGraph 2024-07-23 10:03:35,219: processFiles: /fastscratch/liakh/phg2/data/exported_hvcf/sample1.h.vcf [DefaultDispatcher-worker-1] INFO net.maizegenetics.phgv2.api.HaplotypeGraph 2024-07-23 10:03:37,008: processFiles: /fastscratch/liakh/phg2/data/exported_hvcf/sample2.h.vcf [DefaultDispatcher-worker-1] INFO net.maizegenetics.phgv2.api.HaplotypeGraph 2024-07-23 10:03:38,713: processFiles: /fastscratch/liakh/phg2/data/exported_hvcf/sample3.h.vcf [DefaultDispatcher-worker-1] INFO net.maizegenetics.phgv2.api.HaplotypeGraph 2024-07-23 10:03:40,504: processFiles: /fastscratch/liakh/phg2/data/exported_hvcf/sample4.h.vcf [DefaultDispatcher-worker-1] INFO net.maizegenetics.phgv2.api.HaplotypeGraph 2024-07-23 10:03:42,011: processFiles: /fastscratch/liakh/phg2/data/exported_hvcf/sample5.h.vcf [DefaultDispatcher-worker-1] INFO net.maizegenetics.phgv2.api.HaplotypeGraph 2024-07-23 10:03:43,916: processFiles: /fastscratch/liakh/phg2/data/exported_hvcf/sample6.h.vcf [DefaultDispatcher-worker-1] INFO net.maizegenetics.phgv2.api.HaplotypeGraph 2024-07-23 10:03:45,761: processFiles: /fastscratch/liakh/phg2/data/exported_hvcf/sample7.h.vcf [DefaultDispatcher-worker-1] INFO net.maizegenetics.phgv2.api.HaplotypeGraph 2024-07-23 10:03:47,713: processFiles: /fastscratch/liakh/phg2/data/exported_hvcf/sample8.h.vcf [DefaultDispatcher-worker-1] INFO net.maizegenetics.phgv2.api.HaplotypeGraph 2024-07-23 10:03:49,433: processFiles: /fastscratch/liakh/phg2/data/exported_hvcf/sample9.h.vcf [DefaultDispatcher-worker-1] INFO net.maizegenetics.phgv2.api.HaplotypeGraph 2024-07-23 10:03:51,242: processFiles: /fastscratch/liakh/phg2/data/exported_hvcf/sample10.h.vcf [main] INFO net.maizegenetics.phgv2.api.HaplotypeGraph 2024-07-23 10:03:58,314: rangeToSampleToChecksum: 173177 x 10 [main] INFO net.maizegenetics.phgv2.api.HaplotypeGraph 2024-07-23 10:03:58,468: numOfSamples: 10 [main] INFO net.maizegenetics.phgv2.api.HaplotypeGraph 2024-07-23 10:03:58,581: numOfRanges: 173177

Expected behavior

No response

PHG version

phg version 2.3.7.144

sidevshiy commented 1 month ago

Just wanted to mention that HaplotypeGraph built during kmer index command much faster. I just not sure if it is right behavior with map-kmers command

[main] INFO net.maizegenetics.phgv2.api.HaplotypeGraph 2024-07-23 09:46:38,975: rangeToSampleToChecksum: 173177 x 10 [main] INFO net.maizegenetics.phgv2.api.HaplotypeGraph 2024-07-23 09:46:38,975: numOfSamples: 10 [main] INFO net.maizegenetics.phgv2.api.HaplotypeGraph 2024-07-23 09:46:38,976: numOfRanges: 173177 [main] INFO net.maizegenetics.phgv2.pathing.BuildKmerIndex 2024-07-23 09:46:39,015: HaplotypeGraph built in 23123.255454 ms

zrm22 commented 1 month ago

Hello,

It looks like map-kmers step is taking a long time for either loading the Graph as you mention or loading in the Kmer index. Another user has reported this and I am working over the next few days to track down why its so slow. How big is the current Kmer index file you are loading in?

I am going to try to reproduce this on our end and see where the time sink is.

sidevshiy commented 1 month ago

Thank you for the quick reply! My kmerIndex is 4.2GB.

zrm22 commented 1 month ago

Ok so not too bad size wise. This should not take more than a few minutes to load. Our other report of this issue does mention that after its loaded in the actual mapping takes a few minutes per sample, but we definitely have a bug/inefficiency somewhere. I will keep you in the loop as we figure this out. Thanks!

sidevshiy commented 1 month ago

Thank you for your quick response! I think the issue might be that I built the k-mer index using AnchorWave output MAF files, which I reformatted to sample1.h.vcf.gz and sample1.g.vcf.gz as mentioned in docs. Your documentation mentions that I should load h.vcf.gz to tileDB and then for mapping I have to export these files, but I skipped the exporting step since I thought that my h.vcf already generated.

I am currently rebuilding the k-mer index using the exported files. I'll let you know once the k-mer index is completed, and I have tried running the map command again.

zrm22 commented 1 month ago

I think I have an idea of what might be causing this and if correct it is a very recent bug that I added in. I am in the process of verifying but need an up to date index file which I am in the process of building so I can get a baseline and pinpoint the exact cause.

That being said can you try running your files through a release from version 2.2? This should still work with your files but does not have the code which I think is causing the bug. If it doesn't work out with your current files no worries, but it could give me some more info about what is going on.

sidevshiy commented 1 month ago

Thank you for quick response! I am going to try it right now!

sidevshiy commented 1 month ago

Yes, seems like it started to work.

[main] INFO net.maizegenetics.phgv2.api.HaplotypeGraph 2024-07-23 15:15:15,101: rangeToSampleToChecksum: 173177 x 10 [main] INFO net.maizegenetics.phgv2.api.HaplotypeGraph 2024-07-23 15:15:15,357: numOfSamples: 10 [main] INFO net.maizegenetics.phgv2.api.HaplotypeGraph 2024-07-23 15:15:16,063: numOfRanges: 173177 [main] INFO net.maizegenetics.phgv2.pathing.AlignmentUtils 2024-07-23 15:16:24,295: reading records from the fastq file(s): /fastscratch/liakh/phg2/data/short_reads/sampleread1_R1.fastq, /fastscratch/liakh/phg2/data/short_reads/sampleread2_R2.fastq

zrm22 commented 1 month ago

I made some changes and the build will be triggered(a new release should be available in about 10 minutes). I was unable to fully replicate your issue on my end, but did find a few improvements that I implemented. Can you test out this release and see if things are better. I also put in some more logging so it should give us some more information on pressure points.

sidevshiy commented 1 month ago

Yes, I will check it today or tomorrow and will let you know. Thank you!

sidevshiy commented 1 month ago

Now, it works! Thank you!

zrm22 commented 1 month ago

Just out of curiosity, what was the new runtime for your dataset with the new code?

maize-genetics / phg_v2