Problem with memory usage during KAGE indexing

isaamael commented 1 month ago

Hi！

thanks for your advanced software！ : )

How can I determine the memory size required for indexing? I always get “not enough memory” when I run the kage index on a VCF with 2.5 million loci and two samples. Is this reasonable?

kage index -r $TSref -v $GPvcf -o $outputdir/KAGEindex/index -k 31 -t 1

erro message

numpy.core._exceptions._ArrayMemoryError: Unable to allocate 6.43 GiB for an array with shape (863346693,) and data type int64

isaamael commented 1 month ago

Hi！ I have completed indexing by allocating 100GB of memory. A new issue has arisen when I attempted to perform genotyping on one of my test samples, resulting in an error:

start genotype
Tue May 14 11:58:13 CST 2024
INFO:root:Read coverage is set to 0.100
INFO:root:Reading all indexes from an index bundle
INFO:root:Will do imputation with glimpse
INFO:root:Will count kmers.
INFO:root:N bytes of reads: 0
INFO:root:Approx number of chunks of 10000000 bytes: 0
Traceback (most recent call last):
  File "/public/home/xuruiqiang/software/python/Python-3.10.14/bin/kage", line 33, in <module>
    sys.exit(load_entry_point('kage-genotyper==2.0.6', 'console_scripts', 'kage')())
  File "/public/home/xuruiqiang/software/python/Python-3.10.14/lib/python3.10/site-packages/kage/command_line_interface.py", line 52, in main
    run_argument_parser(sys.argv[1:])
  File "/public/home/xuruiqiang/software/python/Python-3.10.14/lib/python3.10/site-packages/kage/command_line_interface.py", line 552, in run_argument_parser
    args.func(args)
  File "/public/home/xuruiqiang/software/python/Python-3.10.14/lib/python3.10/site-packages/kage/command_line_interface.py", line 98, in genotype
    node_counts = get_kmer_counts(kmer_index, args.kmer_size, args.reads, config.n_threads, args.gpu)
  File "/public/home/xuruiqiang/software/python/Python-3.10.14/lib/python3.10/site-packages/kage/command_line_interface.py", line 58, in get_kmer_counts
    return NodeCounts(map_bnp(Namespace(
  File "/public/home/xuruiqiang/software/python/Python-3.10.14/lib/python3.10/site-packages/kmer_mapper/command_line_interface.py", line 112, in map_bnp
    file = bnp.open(args.reads)
  File "/public/home/xuruiqiang/software/python/Python-3.10.14/lib/python3.10/site-packages/bionumpy/io/files.py", line 172, in bnp_open
    suffix = path.suffixes[-1]
IndexError: list index out of range
Tue May 14 11:59:10 CST 2024

I ran the following code:

echo "start genotype"
date
kage genotype \
-i $KAGEindex  \
-r <(zcat $fq1 $fq2) \
-s 1_A01 \
-o $outdir/1x/1_A01_1x.vcf \
--average-coverage 1 -k 31 -t 10 --glimpse $GPvcf
date

Do you have any ideas about this error? Thank you in advance for your advice!

ivargr commented 1 month ago

Thanks for reaching out!

I think the error is because KAGE needs a file as input for the reads. So using <(zcat $fq1 $fq2) will unfortunately not work. Could you try to make a file for the reads, that ends in either .fa, fa.gz, fq or fq.gz and use that after -r ?

The error message should have been better, though, I will fix that :)

isaamael commented 1 month ago

Hi ！ Any thanks are not unwarranted! However, sad, a new issue has arisen. :(

After activating --glimpse vcf.gz, kage genotype will automatically download GLIMPSE_chunk_static,

....
INFO:root:Computing GLs took 0.263 sec
INFO:root:Creating genotype strings took 55.672 sec
INFO:root:Writing to vcf took 88.757 sec
INFO:root:Downloading GLIMPSE_chunk_static
INFO:root:Downloading GLIMPSE_phase_static

urllib.error.URLError: <urlopen error [Errno -2] Name or service not known>

even though I have already deployed GLIMPSE2 locally.

[usr@login1 GLIMPSE2]$ GLIMPSE2_chunk_static

[GLIMPSE2] Split chromosomes into chunks
  * Authors              : Simone RUBINACCI & Olivier DELANEAU, University of Lausanne
  * Contact              : simone.rubinacci@unil.ch & olivier.delaneau@unil.ch
  * Version              : GLIMPSE2_chunk v2.0.0 / commit = 8ce534f / release = 2023-06-29
  * Citation             : Nature Genetics 55, 1088–1090 (2023). https://doi.org/10.1038/s41588-023-01438-3
  *                      : Nature Genetics 53, 120–126 (2021). DOI: https://doi.org/10.1038/s41588-020-00756-0
  * Run date             : 14/05/2024 - 17:16:20

ERROR: You must specify one input file using --input

I am confused about whether kage download every time it is run? can i skip the download because the network between HPC nodes may vary. Or perhaps I should actually install GLIMPSE v1 locally? need help

isaamael commented 1 month ago

I downloaded version 1, and now the code is working fine, i get vcf However, whether it's after inputting or in the original VCF, all the positions are 0/0. What's going on?

no erro happen

ivargr commented 1 month ago

KAGE only works with GLIMPSE 1. I think as long as you have the GLIMPSE binaries in your path, that should work if the download fails. But I could add an option to skip downloading the glimpse binaries (I think it will not download again if they already exist, so if you make sure you have the binaries downloaded once it should be fine).

Are you sure all positions are 0/0? Typically, most variants will be genotyped as 0/0 (e.g. 99%), so if you only look at part of the output it may seem like most variants are 0/0). But if you have any reads supporting other genotypes, there will usually be some that are not.

isaamael commented 1 month ago

I've read through all the issues, helpful for me.

GLIMPSE is indeed only downloaded once, and everything works fine when I place them in the corresponding path. I'm considering switching to v2 because I'm conducting some tests with ultra-low coverage (0.1x). Can I simply modify a script somewhere to achieve this?

I've tested the data with both 0.1x and 10x , and yes, the original VCF files are all 0/0, but after imputing, they become 1/1. The same data, when processed with vg call, produces variation as expected.

I'm not sure where the problem lies. Here's the scenario: I assembled two plant genomes, A and B, each about 800Mb. Then, I used the minigraph-cactuspipeline to generate GFA and VCF files, with A as the reference. The VCF files generated by the MC pipeline have GT as 0or 1, so I manually rewrote the VCF files, replacing genotype 1with 1|1. These files were then used to create indexes for KAGE and GLIMPSE. Finally, resequencing data from some population samples was used for KAGE genotype, resulting in the situation I described.

bgzip -c $MCdir/GP.phasing.vcf > $MCdir/GP.phasing.vcf.gz
tabix -p vcf $MCdir/GP.phasing.vcf.gz
GPvcf=$MCdir/GP.phasing.vcf.gz
mkdir -p $outputdir/KAGEindex
kage index -r $TSref -v $GPvcf -o $outputdir/KAGEindex/index -k 31 -t 10
kage glimpse_index -p ${GPvcf} -o $outputdir/KAGEindex

Could there be an issue with the indexing? Or perhaps I've made some beginner mistakes?

index log here log.txt

ivargr commented 1 month ago

It is a bit difficult to know exactly why all genotype are 0/0 without looking at the data you used. However, I think your approach with only using 2 haplotypes to build the graph might not work so well. Both KAGE and GLIMPSE really need a bit more than that (ideally 5-10 genomes or more) in order for the methodology to work well, I think. If you only have two genomes or two haplotypes, you are 1) likely to not have most of the variants that a new sample has and 2) you will have very little information about correlations between variants (which is what GLIMPSE need).

Also, KAGE needs haplotypes in the vcf that cover the alleles of your variants. It seems maybe from the log you provided that very many variants are ignored because no haplotypes cover them.

Regarding the use of GLIMPSE 2, I think that might require some changes in the KAGE codebase, but that is something I could look into if wanted :)

kage-genotyper / kage

Problem with memory usage during KAGE indexing #13