tprodanov / locityper

Targeted genotyper for complex polymorphic genes
https://locityper.vercel.app
MIT License

Invalid data: Locus alleles are too short #7

Open yeeus opened 3 weeks ago

yeeus commented 3 weeks ago

Nice tool! However, when I used locityper to genotype HGDP reads at my loci of interest (5' UTRs, which are usually shorter than 500 bp) with the command:

locityper add -d db -v vcf/chr21.vcf.gz -r chr21.fa -j counts.jf -L test.bed -g GRCh38_chr21 -w 1 -e 0

it reports:

[13:54:25 DEBUG] locityper add -d db -v vcf/chr21.vcf.gz -r chr21.fa -j counts.jf -L test.bed -g GRCh38_chr21 -w 1 -e 0
[13:54:25 DEBUG] locityper v0.16.12 @ 2024-10-29 13:54:25
[13:54:26  INFO] VCF file contains 540 haplotypes
[13:54:26  INFO] Detected jellyfish k-mer size: 25
[13:54:26  INFO] Analyzing ENST00000651438.1 (chr21:45405165-45405233)
[13:54:26 ERROR] Error while analyzing locus ENST00000651438.1 (chr21:45405165-45405233):
        Invalid data: Locus alleles are too short for locus ENST00000651438.1 (chr21:45405165-45405233) (shortest: 69 bp)
[13:54:26 ERROR] Failed to add 1 loci
[13:54:26  INFO] Total time: 0:00:00.310

Does this mean locityper can't be used on short loci?

Another question: the VCF file I used above was generated by MC and was therefore filtered with vcfbub, which results in fewer variants in my case. Can I use the raw VCF instead? If not, can I provide a FASTA file with all samples for one locus in the BED file? I tried this:

cat test.bed
chr21   45405164    45405233    ENST00000651438.1   ./GRCh38.test.fa
chr21   45405164    45405233    ENST00000651438.1   ./one_sample.test.fa

locityper add -d db -r chr21.fa -j counts.jf -L test.bed
[13:26:28 DEBUG] locityper add -d db -r chr21.fa -j counts.jf -L test.bed
[13:26:28 DEBUG] locityper v0.16.12 @ 2024-10-29 13:26:28
[13:26:28 ERROR] Finished with an error:
Invalid input: Locus name 'ENST00000609664.2' appears at least twice

OMG, it seems I can't provide several samples for the same locus? What can I do? Please help me. Best wishes!

tprodanov commented 2 weeks ago

Hi Quanyu,

Thank you for using our tool, and sorry for the slightly delayed answer.

Regarding short alleles: tomorrow I will publish an update that will not fail immediately if alleles are too short. Nevertheless, alleles should still be reasonably long: at least window size + 2 * boundary size (--window and --boundary in locityper preproc). Basically, if an allele is too short, we don't have enough information to distinguish it from other alleles. In addition, genotyping will probably not be very accurate if some alleles are very short and others have normal size. If you have locus assemblies extending upstream/downstream of the gene of interest, I would recommend using them as well.
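The length rule above (window size + 2 * boundary size) can be checked on a BED file before running locityper add. The following is only a minimal sketch: the function names are hypothetical, and the window and boundary values are placeholders, not locityper defaults. It also looks only at the reference interval, not at every allele.

```python
def min_allele_length(window: int, boundary: int) -> int:
    """Lower bound from the reply above: window size + 2 * boundary size."""
    return window + 2 * boundary


def short_loci(bed_lines, window: int, boundary: int):
    """Return (name, length) for BED records shorter than the lower bound."""
    threshold = min_allele_length(window, boundary)
    flagged = []
    for line in bed_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        start, end = int(fields[1]), int(fields[2])
        # Fall back to coordinates if the BED record has no name column.
        name = fields[3] if len(fields) > 3 else f"{fields[0]}:{start}-{end}"
        length = end - start  # BED intervals are 0-based, half-open
        if length < threshold:
            flagged.append((name, length))
    return flagged


bed = ["chr21\t45405164\t45405233\tENST00000651438.1"]
# Placeholder values: substitute the --window and --boundary you actually use.
print(short_loci(bed, window=200, boundary=100))
# prints [('ENST00000651438.1', 69)]
```

The 69 bp length agrees with the "shortest: 69 bp" message in the log above.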

As for the VCF input: as I understand it, even though vcfbub reduces the number of variants, all the information remains; you simply get longer variants. The problem with a VCF that has not been processed with vcfbub is that it contains overlapping variants, which are not so easy to traverse.

Finally, for the FASTA input: try putting all alleles in a single FASTA file and then providing it.
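One way to build such a single FASTA file is sketched below. This is an assumption-laden helper, not part of locityper: the function name merge_fastas is hypothetical, and the check for duplicate sequence names is a precaution whose necessity for locityper is not confirmed here.

```python
def merge_fastas(inputs, output):
    """Concatenate FASTA files into one, refusing duplicate sequence names.

    Returns the number of sequences written.
    """
    seen = set()
    count = 0
    with open(output, "w") as out:
        for path in inputs:
            with open(path) as handle:
                for line in handle:
                    if line.startswith(">"):
                        # Sequence name = first word after '>'.
                        parts = line[1:].split()
                        name = parts[0] if parts else ""
                        if name in seen:
                            raise ValueError(
                                f"Duplicate sequence name {name!r} in {path}"
                            )
                        seen.add(name)
                        count += 1
                    # Ensure each file ends with a newline so that the
                    # next header does not get glued to a sequence line.
                    out.write(line if line.endswith("\n") else line + "\n")
    return count
```

With the files from the question, this could be called as merge_fastas(["GRCh38.test.fa", "one_sample.test.fa"], "all_alleles.fa") (the output name is arbitrary), after which test.bed would contain a single line for the locus whose last column points to ./all_alleles.fa.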