Extract haplotypes from CSV file

gsc74 commented 4 months ago

@tprodanov, I want to reconstruct the haplotypes from MHC region. I have used the following commands.

jellyfish count --canonical --lower-count 2 --out-counter-len 2 --mer-len 25 --threads 8 --size 3G --output counts.jf ../../MSA/Graphs/hprc_haps/MHC-CHM13.ref.fa
locityper add -d db -v MHC_50-MC.vcf.gz -r ../../MSA/Graphs/hprc_haps/MHC-CHM13.ref.fa -g 0 -j counts.jf -L MHC_GIAB_CHM13.bed --threads 32
locityper preproc -i APD_sim.fq -r ../../MSA/Graphs/hprc_haps/MHC-CHM13.ref.fa -j counts.jf -o bg -b "0:1-492303" -t "illumina"
locityper genotype -i APD_sim.fq -d db -p bg -o gts --threads 32
python3 /home/ghanshyam/apps/locityper/extra/into_csv.py -i ./gts -o gts.csv

How to get reconstructed haloplotypes in fasta format?

I was exploring the following method to get haplotypes:

python3 /home/ghanshyam/apps/locityper/extra/into_vcf.py -i gts.csv -d db -v MHC_50-MC.vcf.gz -g 0 -o temp && mv temp/MHC.vcf.gz temp/APD_LT_genotyping.vcf.gz
bcftools view -i 'GT[*]="alt"' temp/APD_LT_genotyping.vcf.gz | bgzip > temp/APD_LT_genotyping_no_homo.vcf.gz && tabix -p vcf temp/APD_LT_genotyping_no_homo.vcf.gz
bcftools consensus -f ../../MSA/Graphs/hprc_haps/MHC-CHM13.ref.fa -o APD_rec_LT.fasta temp/APD_LT_genotyping_no_homo.vcf.gz

While converting to VCF, I get the following error:

Traceback (most recent call last):
  File "/home/ghanshyam/apps/locityper/extra/into_vcf.py", line 240, in <module>
    main()
  File "/home/ghanshyam/apps/locityper/extra/into_vcf.py", line 212, in main
    preds, samples = load_predictions(f, args.filtering)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ghanshyam/apps/locityper/extra/into_vcf.py", line 27, in load_predictions
    qual=float(row['quality']),
         ^^^^^^^^^^^^^^^^^^^^^
ValueError: could not convert string to float: 'NA'

I see, a error message in log

[04:37:30  INFO] Analyzing MHC35
[04:37:30  INFO]     Mapping reads to db/loci/MHC35/haplotypes.fa.gz
[04:37:30 DEBUG]     ~/.conda/envs/locityper/bin/strobealign --no-progress -M 108 -N 108 -S 0.8 -f 0.001 -k 15 --eqx -r 150 -t 32 db/loci/MHC35/haplotypes.fa.gz gts/loci/MHC35/reads.fq | ~/bin/samtools view -b -o gts/loci/MHC35/aln.tmp.bam
[04:37:30 DEBUG]     Finished in 0:00:00.027
[04:37:30  INFO]     Calculating read alignment probabilities
[04:37:30  INFO]     Loading read alignments
[04:37:30 DEBUG]     6878 k-mers unique to this locus, 21911 k-mers appear off target
[04:37:30 DEBUG]     Use 0 reads. Discard 3 unmapped, 6 out of bounds and 94 with few unique k-mers
[04:37:30 ERROR] [MHC35] No available reads
[04:37:30  INFO]     [MHC35] Successfully finished in 0:00:00.038

The MHC_GIAB_CHM13.bed file is extracted from GIAB CHM13-HG002 small variants benchmark by filtering the coordinates of MHC region with respect to CHM13 v2.0 reference and offsetting to start from 0 and has the following lines:

0   0   202403  MHC1
0   206534  365411  MHC2
0   365753  377948  MHC3
0   378344  422870  MHC4
0   423510  497635  MHC5
0   497737  857668  MHC6
0   858107  1205805 MHC7
0   1205907 1483240 MHC8
0   1483378 1513005 MHC9
0   1513454 1931200 MHC10
0   1931302 1950953 MHC11
0   1951352 1954254 MHC12
0   1954356 2229134 MHC13
0   2229426 2298444 MHC14
0   2299327 2335800 MHC15
0   2336026 2377154 MHC16
0   2377256 2469338 MHC17
0   2470657 2486861 MHC18
0   2486963 2496169 MHC19
0   2496605 2509311 MHC20
0   2513527 2531245 MHC21
0   2531496 2618520 MHC22
0   2618675 2711537 MHC23
0   2712799 2795725 MHC24
0   2795827 2816490 MHC25
0   2817556 2864142 MHC26
0   2864370 2901573 MHC27
0   2906337 2928215 MHC28
0   2928522 2942743 MHC29
0   2942845 3881427 MHC30
0   3882305 3920440 MHC31
0   3921056 3927298 MHC32
0   4056866 4073542 MHC33
0   4074237 4078843 MHC34
0   4106464 4109440 MHC35
0   4109542 4111853 MHC36
0   4112286 4113777 MHC37
0   4113952 4121572 MHC38
0   4129059 4137469 MHC39
0   4137571 4159866 MHC40
0   4160297 4161576 MHC41
0   4162023 4222544 MHC42
0   4222708 4224058 MHC43
0   4224221 4246472 MHC44
0   4246574 4251792 MHC45
0   4251894 4267496 MHC46
0   4267775 4481491 MHC47
0   4481886 4565376 MHC48
0   4565818 4920303 MHC49

tprodanov commented 4 months ago

Dear Ghanshyam,

Thank you for using our tool. I fixed the bug you sent, and in addition wrote into_fasta.py script, which would have similar interface to into_vcf, but would directly produce FASTA output. Hopefully, this will help you, and please write me if you encounter any additional problems.

On a different note: it looks like your regions are quite short (<10 kb). In general, I think it may be helpful to have larger regions, as then Locityper would have more information about corresponding blocks. Optimally, I would take 20-100 kb per region, although even longer regions are possible. In addition, the tool tries to extend region boundaries if there are variants in the VCF right on the edge of the region. So, after this procedure you will probably get significantly overlapping regions. This is not necessarily a problem, but just be aware of that.

gsc74 commented 4 months ago

@tprodanov , I encountered the following error, while using the script into_fasta.py, kindly check

Traceback (most recent call last):
  File "/home/ghanshyam/apps/locityper/extra/into_fasta.py", line 114, in <module>
    main()
  File "/home/ghanshyam/apps/locityper/extra/into_fasta.py", line 86, in main
    preds, samples = into_vcf.load_predictions(f, '1')
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ghanshyam/apps/locityper/extra/into_vcf.py", line 30, in load_predictions
    reads=int(row['total_reads']),
          ^^^^^^^^^^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: '*'

The command used is as follows:

python3 /home/ghanshyam/apps/locityper/extra/into_fasta.py -i gts.csv -d db -o APD_rec_LT.fasta

tprodanov commented 4 months ago

@gsc74 Thank you! I think I fixed this issue, can you check now?

gsc74 commented 4 months ago

Thank you!, I'm able to get haplotypes at regions mentioned in a bed file. I was also wondering, if it's possible to get overall end-to-end diploid haplotypes not just diploid haplotypes at a given loci. For example I just gave loci.bed as input for genotype. which looks as

0   0   4920303 MHC

The gts.csv output I got is

sample  locus   genotype    quality total_reads unexpl_reads    weight_dist warnings
gts MHC 0,HG00438.1 474.8   61736   2228    0.00000 *

The two haplotypes I got from the CSV file are reference genome i.e. MHC-CHM13 and the other haplotype has edit distance of 18 with MHC-HG00438.1 haplotype.

tprodanov commented 4 months ago

Maybe it will be possible in distant future versions, but currently I would advise against trying to predict full diploid haplotypes. By design, Locityper tries to select two input haplotypes that would explain the input dataset. Consequently, for the full region it would only work if there were two haplotypes in the reference panel, that are very similar to the input dataset. On a large scale, and for such a polymorphic region, I am sure we don't have enough assembled haplotypes for this to work properly. And in general, we don't model recombinations, so predicting very long haplotypes only works if recombined haplotypes are already in the input panel.

gsc74 commented 4 months ago

Thank you!

tprodanov / locityper

Extract haplotypes from CSV file #5