tprodanov / locityper

Targeted genotyper for complex polymorphic genes
https://locityper.vercel.app
MIT License

The test_reference is not right #4

Open ld9866 opened 2 weeks ago

ld9866 commented 2 weeks ago

Dear developer: We found that the reference genome linked from the website (https://locityper.vercel.app/test_dataset) is unsuitable for the test analysis.

I think this is caused by mismatched chromosome names. This is very difficult for researchers who study non-human species, because we cannot tell whether the chromosome names match the serial numbers, so hopefully this part can be fixed.
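One way to check for this kind of mismatch before running anything is to compare the contig names used in the loci BED file with those in the reference FASTA. The following is a minimal sketch, not part of Locityper; the function names and the toy data are hypothetical, and the bare names such as `11` in the toy FASTA are an assumption for illustration only, since the thread does not say how the original reference was named.

```python
import io


def fasta_contigs(handle):
    """Collect contig names: the first whitespace-delimited token of each '>' header."""
    names = set()
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def bed_contigs(handle):
    """Collect the contig names found in the first column of a BED file."""
    names = set()
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        names.add(line.split()[0])
    return names


def missing_contigs(fasta_handle, bed_handle):
    """Return BED contig names that do not occur in the FASTA, sorted."""
    return sorted(bed_contigs(bed_handle) - fasta_contigs(fasta_handle))


# Toy data (assumed, for illustration): FASTA without the "chr" prefix, BED with it.
toy_fasta = io.StringIO(">11 toy contig\nACGT\n>19 toy contig\nACGT\n")
toy_bed = io.StringIO("chr11\t100\t200\tlocusA\nchr19\t300\t400\tlocusB\n")
print(missing_contigs(toy_fasta, toy_bed))  # ['chr11', 'chr19']
```

With real files, the same functions can be called on handles opened with `open("genome.fa")` and `open("loci.bed")`; an empty list means every BED contig exists in the reference.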

The second question concerns the first step. We ran:

docker run -v /home/test/Software/test_data:/workdir eichlerlab/locityper:0.15.2 locityper add -d db -v /workdir/hprc.vcf.gz -r /workdir/genome.fa -j /workdir/counts.jf -L /workdir/loci.bed

The screen showed:

[00:09:24 DEBUG] locityper add -d db -v /workdir/hprc.vcf.gz -r /workdir/genome.fa -g human -j /workdir/counts.jf -L /workdir/loci.bed
[00:09:24 DEBUG] locityper v0.15.2 @ 2024-06-18 00:09:24
[00:09:24 INFO] VCF file contains 90 haplotypes
[00:09:24 INFO] Detected jellyfish k-mer size: 25
[00:09:24 INFO] Analyzing MUC6 (chr11:1012824-1036718)
[00:09:24 INFO] Extending locus by 2133 bp left and 48167 bp right -> chr11:1010691-1084885
[00:09:24 INFO] Discarded 10 duplicate haplotypes
[00:09:25 INFO] Calculating sequence divergence for 80 alleles
[00:09:25 INFO] Counting k-mers
[00:09:31 INFO] Analyzing MUC16 (chr19:8848845-8981342)
[00:09:31 ERROR] Error while analyzing locus MUC16 (chr19:8848845-8981342): Runtime error: Cannot expand locus MUC16 to the left due to a long variant overlapping boundary. Try increasing -e/--expand parameter or manually modifying region boundaries.
[00:09:31 WARN] Successfully added 1 loci, failed to add 1 loci
[00:09:31 INFO] Total time: 0:00:07.420

This command did not produce locus haplotypes!

Can you help me and tell me what happened?

Best!

tprodanov commented 2 weeks ago

Hi there!

Thank you for noticing. The link to the reference genome I had in the docs was indeed incompatible. I have now updated it, and you can download an appropriate reference genome here.

As for the second problem: you only ran the first step (building the target locus database). You still need to run the two other steps, WGS preprocessing and genotyping. In addition, you had a partial error during database construction; this seems to be related to the incorrect reference. Can you rerun it with the new reference file?

Best regards, Timofey

ld9866 commented 2 weeks ago

Dear Timofey: I used the new reference genome and all the test data ran fine, which is great and helps us understand the software.

I also have a new question. We built a graph pangenome with Minigraph-Cactus and are trying to genotype a population from second-generation sequencing data, but we do not understand what advantage Locityper has over Pangenie.

Pangenie is also very fast: it takes about 2 hours to complete an analysis on our server. As we all know, building the system is relatively difficult, and it would be great to have your help. If you can, please tell me in which aspects Locityper differs, so that everyone can understand.

Best, Dong

tprodanov commented 2 weeks ago

Dear Dong,

Pangenie and Locityper complement each other: Pangenie runs across the whole genome and outputs variant calls, while Locityper is a targeted method that predicts larger-scale genotypes. A Locityper genotype corresponds to the two closest haplotypes and, while the results can be converted into VCF format, individual short variants may be less accurate than with Pangenie. However, because Locityper finds full-locus haplotypes, its predicted variants are phased by construction. Additionally, by using full reads/read pairs, Locityper better preserves long-range correspondence between variants than Pangenie, which only uses k-mer counts. Of note, Locityper-predicted haplotypes can be used by themselves, without converting them into individual variants, but that obviously depends on what you want to achieve.
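As a toy illustration of why a genotype made of two full-locus haplotypes yields phased variants, consider the sketch below. It is not Locityper's actual conversion code: the `phased_genotypes` function, the haplotype names (`hapA`, `hapB`, `hapC`), and the variant IDs are all hypothetical. It assumes a panel in which each known haplotype carries a known allele at every variant in the locus, so picking two haplotypes fixes both alleles and their order at every site.

```python
def phased_genotypes(panel, hap_first, hap_second):
    """Build phased genotype strings from a chosen pair of panel haplotypes.

    panel maps variant ID -> {haplotype name -> allele index (0 = reference)}.
    The first chosen haplotype always fills the left side of "a|b", so the
    alleles stay consistently phased across all variants in the locus.
    """
    return {
        variant: f"{alleles[hap_first]}|{alleles[hap_second]}"
        for variant, alleles in panel.items()
    }


# Hypothetical panel of three haplotypes over three variants in one locus.
panel = {
    "var1": {"hapA": 0, "hapB": 1, "hapC": 1},
    "var2": {"hapA": 1, "hapB": 0, "hapC": 1},
    "var3": {"hapA": 0, "hapB": 0, "hapC": 2},
}

# Suppose the predicted genotype for a sample is the pair (hapB, hapC).
print(phased_genotypes(panel, "hapB", "hapC"))
# {'var1': '1|1', 'var2': '0|1', 'var3': '0|2'}
```

The flip side, as noted above, is that if the sample's true haplotype differs from the closest panel haplotype at some site, the individual variant call at that site will be wrong even though the overall haplotype choice is the best available.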

In general, Locityper seems to perform better in complex regions and on longer scales, while Pangenie is slower and may have higher accuracy in slightly easier regions and on individual variants. But it is actually quite nice that Pangenie runs so fast on your data.

In any case, we have not tried Locityper on non-human genomes, so it is hard to say how the Locityper-Pangenie comparison will turn out there. Please write to me if you have any other questions or concerns!

ld9866 commented 2 weeks ago

Dear Timofey: My work is on the pig genome, and software designed for humans generally works very well in pigs.

As we all know, pigs are not only an important domestic animal, but also one of the most actively studied animals for human organ transplantation.

I will test Locityper with our pan-genome and see how it is used in real work.

Best, Dong