kage-genotyper / kage

Alignment-free genotyper for SNPs and short indels, implemented in Python.
GNU General Public License v3.0

Error when running kage index #7

Open unavailable-2374 opened 11 months ago

unavailable-2374 commented 11 months ago

Hi,

I encountered an error while running kage index, which seems to occur while parsing the VCF file. My input files were a single chromosome from the genome and a VCF file from PGGB that had been normalized with BCFtools.

Here is my log file. log.txt

Best regards!

ivargr commented 11 months ago

Hi!

Thanks for sharing! I'm a bit unsure what causes this problem. Does your VCF contain genotypes? Would you be able to share a small subset of the VCF, e.g. the header and the first 10 variants or so?

unavailable-2374 commented 10 months ago

Hi,

Sorry for the delay, here is a subset of the VCF.

Hope it's useful! demo.vcf.txt

Sincerely, Shuo

ivargr commented 10 months ago

Thanks for providing the vcf!

I think the genotype columns in your vcf are invalid:

PN1 5147    >4637497>4637500    A   T   60  .   AC=2;AF=0.1;AN=20;AT=>4637497>4637499>4637500,>4637497>4637498>4637500;NS=20;LV=0   GT  0   .   0   0   .   .   1   0   0   0   0   0   0   0   0   1   .   .   .   .   .   .   .   .   .   .   .   .   0   .   .   .   .   .   0   0   0   0   0   0   .   .

Valid genotypes would be e.g. 0|1 or 0/1, not just 0 or 1. Could something have gone wrong in your preprocessing of the VCF file?
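If it helps, a quick way to inspect just the genotype columns (assuming bcftools is installed; demo.vcf stands in for your file) is:

bcftools query -f '[%GT\t]\n' demo.vcf | head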

unavailable-2374 commented 10 months ago

Hello,

This VCF was produced by PGGB, which uses assemblies to construct the graph genome. Therefore there is only one genotype per sample. Or can we think of it as a purebred?

Best, Shuo

ivargr commented 10 months ago

Ah, I see. So do I understand you correctly that the individuals are actually diploid, but they are purebred so the genotype is always homozygous, meaning that 0 is in principle 0/0?

If that is the case, this should work with KAGE I think, but I will need to push a minor fix for that :)

ivargr commented 10 months ago

I've added a small tool in kage now that lets you convert the vcf so that genotypes are diploid before running KAGE. I think that might be an okay solution for now. You do this by running kage convert_purebread_vcf before running kage index, like this:

kage convert_purebread_vcf -v demo.vcf -o converted.vcf
kage index -v converted.vcf .....

The first command simply replaces haploid genotypes with diploid so that they are compatible with kage.
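For reference, conceptually the conversion does something like the following (a minimal sketch, not KAGE's actual implementation; it assumes the FORMAT column contains only GT, as in your VCF):

```python
# Minimal sketch of a haploid -> diploid VCF conversion (illustration only,
# not KAGE's actual code). Assumes the FORMAT column is just "GT".
def haploid_to_diploid(line: str) -> str:
    if line.startswith("#"):
        return line
    fields = line.rstrip("\n").split("\t")
    # Columns 10 and onwards (index 9+) hold the per-sample genotypes;
    # a lone "0", "1" or "." becomes "0/0", "1/1" or "./.".
    fields[9:] = [gt if "/" in gt or "|" in gt else f"{gt}/{gt}" for gt in fields[9:]]
    return "\t".join(fields) + "\n"

with open("demo.vcf") as fin, open("converted.vcf", "w") as fout:
    for line in fin:
        fout.write(haploid_to_diploid(line))
```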

You will need to update to version 2.0.1 first with pip install kage-genotyper==2.0.1.

Note that KAGE will then genotype this as diploid, which should be fine, except that it may output heterozygous genotypes when it thinks those are likely. This can easily be fixed in a postprocessing step by converting such calls to homozygous, picking the most likely homozygous genotype instead; I can provide a solution for this if you need it.
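As an illustration of what such a postprocessing step could look like (a hypothetical sketch, not something shipped with KAGE; it assumes biallelic records and a FORMAT field containing GT plus GL with log10 likelihoods ordered 0/0, 0/1, 1/1 as in the VCF spec):

```python
# Hypothetical postprocessing sketch: force heterozygous calls to the most
# likely homozygous genotype. Assumes biallelic records and a FORMAT field
# containing GT and GL (log10 likelihoods ordered 0/0, 0/1, 1/1).
def het_to_hom(sample_field: str, format_keys: list[str]) -> str:
    rec = dict(zip(format_keys, sample_field.split(":")))
    if rec.get("GT") in ("0/1", "1/0", "0|1", "1|0"):
        gl = [float(x) for x in rec["GL"].split(",")]
        rec["GT"] = "0/0" if gl[0] >= gl[2] else "1/1"
    return ":".join(rec[k] for k in format_keys)

# Example: FORMAT "GT:GL"; here 1/1 is the more likely homozygous genotype
print(het_to_hom("0/1:-10.0,-0.5,-2.0", ["GT", "GL"]))  # -> "1/1:-10.0,-0.5,-2.0"
```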

Let me know if I've misunderstood something, or if the solution does not work :)

unavailable-2374 commented 10 months ago

Thank you for getting back to me so quickly!

There is one issue that may require further clarification. The mutation information for each sample is obtained from the assembly. However, since each individual is split into hap1 and hap2 for assembly, each assembly can be treated as a haploid. I am unsure whether KAGE can handle this situation and report the genotype by treating each haploid as a pure diploid.

Best~

ivargr commented 10 months ago

I see! I think I maybe understand the VCF now. So I assume for instance BG_hap1_chr01 and BG_hap2_chr01 are the two haplotypes from the same individual? If that is the case, the best thing to do would be to group the haplotypes two and two into a single diploid individual in the VCF before running KAGE.

If you can confirm that I've understood this correctly, I can fix a preprocessing script that handles this.
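To illustrate what I mean (a rough, hypothetical sketch, not an existing KAGE tool; it assumes sample names of the form <individual>_hap1_<chrom> / <individual>_hap2_<chrom>, a GT-only FORMAT column, and it drops unpaired samples):

```python
# Hypothetical sketch: merge paired haplotype samples (e.g. BG_hap1_chr01 and
# BG_hap2_chr01) into single diploid samples. Reads a VCF on stdin, writes the
# merged VCF to stdout. Assumes FORMAT is "GT" only; unpaired samples are dropped.
import re
import sys

pairs = {}          # individual -> [column index of hap1, column index of hap2]
sample_start = 9    # first sample column in a VCF line

for line in sys.stdin:
    if line.startswith("##"):
        sys.stdout.write(line)
        continue
    fields = line.rstrip("\n").split("\t")
    if line.startswith("#CHROM"):
        for i, name in enumerate(fields[sample_start:], start=sample_start):
            m = re.match(r"(.+)_hap([12])_(.+)", name)
            if m:
                key = f"{m.group(1)}_{m.group(3)}"
                pairs.setdefault(key, [None, None])[int(m.group(2)) - 1] = i
        pairs = {k: v for k, v in pairs.items() if None not in v}
        print("\t".join(fields[:sample_start] + list(pairs)))
        continue
    genotypes = [f"{fields[i1]}|{fields[i2]}" for i1, i2 in pairs.values()]
    print("\t".join(fields[:sample_start] + genotypes))
```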

However, I see there are also some sample names where it is less clear whether they are from the same individual or separate ones, for example:

Vcs201  Ved01   Ves01   Vme01   Vmu101  Vmu201  Vri01   Vsc01   Vsp01

Are these just single haploid assemblies?

unavailable-2374 commented 10 months ago

Yes! That's it! But unfortunately, not all individuals were haplotype-resolved, so we get some samples such as Ved and Ves that are just primary assemblies.

unavailable-2374 commented 10 months ago

May I kindly suggest that being compatible with the VCFs generated by PGGB or Minigraph-Cactus could be important. We usually use the set of variants they produce for downstream analyses 🤟.

unavailable-2374 commented 10 months ago

Hi,

I replaced 1 in the VCF with 1/1, 0 with 0/0, and '.' with './.', so that all loci are treated as homozygous, but it still resulted in this error:

```
Estimating global kmer counts: 12%|█████████▋ | 1/8 [00:23<02:46, 23.85s/haplotype]
Traceback (most recent call last):
  File "/public/tools/python/bin/kage", line 8, in <module>
    sys.exit(main())
  File "/public/tools/python/lib/python3.10/site-packages/kage/command_line_interface.py", line 51, in main
    run_argument_parser(sys.argv[1:])
  File "/public/tools/python/lib/python3.10/site-packages/kage/command_line_interface.py", line 546, in run_argument_parser
    args.func(args)
  File "/public/tools/python/lib/python3.10/site-packages/kage/indexing/main.py", line 297, in make_index_cli
    r = make_index(args.reference, args.vcf, args.out_base_name,
  File "/public/tools/python/lib/python3.10/site-packages/kage/indexing/main.py", line 156, in make_index
    scorer = make_kmer_scorer_from_random_haplotypes(graph, haplotype_matrix, k, n_haplotypes=8, modulo=modulo * 40 + 11)
  File "/public/tools/python/lib/python3.10/site-packages/kage/indexing/kmer_scoring.py", line 140, in make_kmer_scorer_from_random_haplotypes
    for subkmers in kmers:
  File "/public/tools/python/lib/python3.10/site-packages/kage/indexing/graph.py", line 228, in get_haplotype_kmers
    sequence = self.sequence(haplotype, reverse_complement=reverse_complement).ravel()
  File "/public/tools/python/lib/python3.10/site-packages/kage/indexing/graph.py", line 129, in sequence
    result = zip_sequences(ref_sequence, variant_sequences, encoding)
  File "/public/tools/python/lib/python3.10/site-packages/kage/util.py", line 69, in zip_sequences
    new[1:-1:2] = b
  File "/public/tools/python/lib/python3.10/site-packages/npstructures/raggedarray/indexablearray.py", line 102, in __setitem__
    self._set_data_range(index, value.ravel())
  File "/public/tools/python/lib/python3.10/site-packages/bionumpy/encoded_array.py", line 175, in _set_data_range
    super()._set_data_range(idx, as_encoded_array(data, self._encoding).raw())
  File "/public/agis/zhouyongfeng_group/caoshuo/tools/python/lib/python3.10/site-packages/bionumpy/encoded_array.py", line 597, in as_encoded_array
    return target_encoding.encode(s)
  File "/public/tools/python/lib/python3.10/site-packages/bionumpy/encoded_array.py", line 60, in encode
    out = EncodedArray(self._encode(data), self)
  File "/public/tools/python/lib/python3.10/site-packages/bionumpy/encodings/alphabet_encoding.py", line 33, in _encode
    raise EncodingError(f"Error when encoding {''.join(chr(c) for c in byte_array.ravel()[0:100])} "
bionumpy.encodings.exceptions.EncodingError: ("Error when encoding CAAAAACCCGAACCTATCACTGCCCCAGTCAACTCACCAATCCTCAAAACCCCATCTGCAGAGTATTCAGAAGAAATTAATTAAGAACAGTGATCTTCTA to AlphabetEncoding. Invalid character(s): ['N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N'][78, 78, 78, 78, 78, 78, 78, 78, 78, 78]", 174333)
```

Best~

ivargr commented 10 months ago

It seems the problem now is that some of your variants contain Ns, which is a problem because KAGE is not able to compute kmers for them. Unfortunately, the error message was not very descriptive.

I've pushed an update now that can ignore the Ns so that KAGE doesn't crash when there are Ns in variant alleles, but ideally one should filter away variants with Ns in their alleles beforehand, I think.

But if you try the latest version (pip install kage-genotyper==2.0.2), it should hopefully not crash anymore.

I'm still thinking about the best way to deal with individuals that are not haplotype-resolved. I think maybe it makes most sense to group individuals that are haplotype-resolved into diploid genotypes and ignore the rest. But if you get KAGE index to run for now, we can think more about it after that :)

unavailable-2374 commented 10 months ago

Thank you for the timely update. Regarding the mutation information, does ignoring Ns mean disregarding the variant entirely? That may not be appropriate here, since ignoring the whole variant may result in the loss of important information. During assembly, except for the latest T2T genomes, it is necessary to connect multiple contigs with Ns when going from contig to scaffold level.

What if only the N characters within a variant were ignored, rather than the entire variant? Do you think that would make a bit more sense?

ivargr commented 10 months ago

The fix does not ignore variants with Ns completely; instead it just encodes Ns as A as a "hack" to be able to compute kmers. From my experience this is, although a bit dirty, usually fine, as there are few variants that have alleles with Ns in them. The alternative would be to skip these variants, i.e. filter them away before running kage, for example as sketched below. I guess you have quite few variants where the alleles contain Ns, since Ns are mostly added when scaffolding, and those Ns will end up on the reference between variants and will not affect anything. So I don't think these Ns should be much of a problem in practice.
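A minimal sketch of such a pre-filter (hypothetical, not part of KAGE; file names are placeholders):

```python
# Minimal sketch: drop variants whose REF or ALT allele contains an N
# before running kage index. File names are placeholders.
with open("variants.vcf") as fin, open("variants.no_n.vcf", "w") as fout:
    for line in fin:
        if line.startswith("#"):
            fout.write(line)
            continue
        ref, alt = line.split("\t")[3:5]
        if "N" not in ref.upper() and "N" not in alt.upper():
            fout.write(line)
```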

unavailable-2374 commented 10 months ago

Thank you so much! I will test both options.

unavailable-2374 commented 10 months ago

Hi

I split the VCF file by chromosome. All chromosomes were successfully indexed except for chromosome 14, which reported the following error message:

```
(find_signatures_for_chunk_wrapper pid=82840) INFO:root:0/3 signatures removed because they had score above 254.
Traceback (most recent call last):
  File "/public/tools/python/bin/kage", line 8, in <module>
    sys.exit(main())
  File "/public/tools/python/lib/python3.10/site-packages/kage/command_line_interface.py", line 52, in main
    run_argument_parser(sys.argv[1:])
  File "/public/tools/python/lib/python3.10/site-packages/kage/command_line_interface.py", line 552, in run_argument_parser
    args.func(args)
  File "/public/tools/python/lib/python3.10/site-packages/kage/indexing/main.py", line 298, in make_index_cli
    r = make_index(args.reference, args.vcf, args.out_base_name,
  File "/public/tools/python/lib/python3.10/site-packages/kage/indexing/main.py", line 186, in make_index
    signatures = get_signatures(k, paths, scorer, chunk_size=signatures_chunk_size, spacing=0,
  File "/public/tools/python/lib/python3.10/site-packages/kage/indexing/signatures.py", line 1068, in get_signatures
    all_signatures = ray.get(all_signatures)
  File "/public/tools/python/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/public/tools/python/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/public/tools/python/lib/python3.10/site-packages/ray/_private/worker.py", line 2624, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TypeError): ray::find_signatures_for_chunk_wrapper() (pid=82849, ip=192.168.10.248)
  File "/public/tools/python/lib/python3.10/site-packages/awkward/operations/ak_sum.py", line 210, in sum
    return _impl(array, axis, keepdims, mask_identity, highlevel, behavior, attrs)
  File "/public/tools/python/lib/python3.10/site-packages/awkward/operations/ak_sum.py", line 277, in _impl
    out = ak._do.reduce(
  File "/public/tools/python/lib/python3.10/site-packages/awkward/_do.py", line 333, in reduce
    next = layout._reduce_next(
  File "/public/tools/python/lib/python3.10/site-packages/awkward/contents/listoffsetarray.py", line 1618, in _reduce_next
    outcontent = trimmed._reduce_next(
  File "/public/tools/python/lib/python3.10/site-packages/awkward/contents/listoffsetarray.py", line 1618, in _reduce_next
    outcontent = trimmed._reduce_next(
  File "/public/tools/python/lib/python3.10/site-packages/awkward/contents/listoffsetarray.py", line 1502, in _reduce_next
    ) = self._rearrange_prepare_next(outlength, parents)
  File "/public/tools/python/lib/python3.10/site-packages/awkward/contents/listoffsetarray.py", line 1693, in _rearrange_prepare_next
    distincts = Index64.empty(outlength * maxcount, index_nplike)
  File "/public/tools/python/lib/python3.10/site-packages/awkward/index.py", line 115, in empty
    return Index(nplike.empty(length, dtype=dtype), nplike=nplike)
  File "/public/tools/python/lib/python3.10/site-packages/awkward/_nplikes/array_module.py", line 111, in empty
    return self._module.empty(shape, dtype=dtype)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 204. GiB for an array with shape (27376795028,) and data type int64

During handling of the above exception, another exception occurred:

ray::find_signatures_for_chunk_wrapper() (pid=82849, ip=192.168.10.248)
  File "/public/tools/python/lib/python3.10/site-packages/kage/indexing/signatures.py", line 1081, in find_signatures_for_chunk_wrapper
    return find_signatures_for_chunk(params)
  File "/public/tools/python/lib/python3.10/site-packages/kage/indexing/signatures.py", line 1105, in find_signatures_for_chunk
    sv_min_size=50 // max(1, spacing)).run(add_dummy_count_to_index)
  File "/public/tools/python/lib/python3.10/site-packages/kage/indexing/signatures.py", line 352, in run
    self._score_signatures(add_dummy_count_to_index=add_dummy_count_to_index)
  File "/public/tools/python/lib/python3.10/site-packages/kage/indexing/signatures.py", line 334, in _score_signatures
    window_scores = np.sum(scores, axis=2)  # or np.min?
  File "<__array_function__ internals>", line 200, in sum
  File "/public/tools/python/lib/python3.10/site-packages/awkward/highlevel.py", line 1527, in __array_function__
    return ak._connect.numpy.array_function(
  File "/public/tools/python/lib/python3.10/site-packages/awkward/_connect/numpy.py", line 102, in array_function
    return function(*args, **kwargs)
  File "/public/tools/python/lib/python3.10/site-packages/awkward/_connect/numpy.py", line 142, in ensure_valid_args
    return function(*args, **kwargs)
  File "/public/tools/python/lib/python3.10/site-packages/awkward/operations/ak_sum.py", line 298, in _nep_18_impl_sum
    return sum(a, axis=axis, keepdims=keepdims)
  File "/public/tools/python/lib/python3.10/site-packages/awkward/_dispatch.py", line 38, in dispatch
    with OperationErrorContext(name, args, kwargs):
  File "/public/tools/python/lib/python3.10/site-packages/awkward/_errors.py", line 85, in __exit__
    self.handle_exception(exception_type, exception_value)
  File "/public/tools/python/lib/python3.10/site-packages/awkward/_errors.py", line 95, in handle_exception
    raise self.decorate_exception(cls, exception)
  File "/public/tools/python/lib/python3.10/site-packages/awkward/_errors.py", line 119, in decorate_exception
    new_exception = cls(self.format_exception(exception))
TypeError: _ArrayMemoryError.__init__() missing 1 required positional argument: 'dtype'
(pid=82839) /public/tools/python/lib/python3.10/site-packages/bionumpy/encodings/vcf_encoding.py:98: RuntimeWarning: invalid value encountered in cast [repeated 12x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(pid=82839) _lookup[[ord(c) for c in ('0', '1', '.')]] = np.array([0, 1, np.nan]) [repeated 12x across cluster]
```

Thanks!

ivargr commented 10 months ago

Hi!

It seems the system is running out of memory: numpy.core._exceptions._ArrayMemoryError: Unable to allocate 204. GiB for an array with shape (27376795028,) and data type int64
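(That size is consistent with the array shape: 27,376,795,028 int64 values × 8 bytes ≈ 219 GB, i.e. about 204 GiB.)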

Can I ask how many variants there are and how many individuals you have in the vcf?

unavailable-2374 commented 10 months ago

Hi !

My VCF contains only structural variants, about 280,000 of them, across 27 samples, and is about 1.8 GB.

Thanks

ivargr commented 10 months ago

I've looked into this now, and I think there might be a problem with increased memory usage in some parts of the indexing process when some variant alleles are very big. I've pushed a fix that addresses that problem and hopefully solves it.

Could you try updating to version 2.0.3 and see if that helps? pip install kage-genotyper==2.0.3

If you still get problems with too much memory being used, I'll look further into it.

unavailable-2374 commented 10 months ago

Hi

It works! Thanks a lot !

Best


ivargr commented 10 months ago

Great! Let me know if you run into any other issues while genotyping :)