Closed gsc74 closed 4 months ago
Dear Ghanshyam,
Thank you for using our tool.
I fixed the bug you reported and, in addition, wrote an into_fasta.py script, which has an interface similar to into_vcf but directly produces FASTA output. Hopefully this will help you; please write to me if you encounter any additional problems.
On a different note: it looks like your regions are quite short (<10 kb). In general, I think it may be helpful to have larger regions, as then Locityper would have more information about corresponding blocks. Optimally, I would take 20-100 kb per region, although even longer regions are possible. In addition, the tool tries to extend region boundaries if there are variants in the VCF right on the edge of the region. So, after this procedure you will probably get significantly overlapping regions. This is not necessarily a problem, but just be aware of that.
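A quick way to see which regions fall below the suggested size is to scan the BED file before genotyping. The snippet below is only a minimal sketch: the 20 kb threshold comes from the advice above, while the function names and the example coordinates are hypothetical and not part of Locityper.

```python
import io

MIN_LEN = 20_000  # lower end of the 20-100 kb range suggested above

def region_lengths(bed_lines):
    """Yield (name, length) per BED record; BED intervals are 0-based, half-open."""
    for line in bed_lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        # Fall back to a coordinate string if the record has no name column.
        name = fields[3] if len(fields) > 3 else f"{chrom}:{start}-{end}"
        yield name, end - start

def short_regions(bed_lines, min_len=MIN_LEN):
    """Return the regions that are shorter than min_len."""
    return [(name, length) for name, length in region_lengths(bed_lines)
            if length < min_len]

# Hypothetical two-region BED, for illustration only.
example = io.StringIO("chr6\t100000\t108500\tlocusA\nchr6\t200000\t250000\tlocusB\n")
print(short_regions(example))
```

To check a real file, pass an open file handle instead of the `io.StringIO` example.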
@tprodanov, I encountered the following error while using the into_fasta.py script. Kindly check:
Traceback (most recent call last):
  File "/home/ghanshyam/apps/locityper/extra/into_fasta.py", line 114, in <module>
    main()
  File "/home/ghanshyam/apps/locityper/extra/into_fasta.py", line 86, in main
    preds, samples = into_vcf.load_predictions(f, '1')
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ghanshyam/apps/locityper/extra/into_vcf.py", line 30, in load_predictions
    reads=int(row['total_reads']),
          ^^^^^^^^^^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: '*'
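The error means that the `total_reads` field contained `*` rather than a number, so `int()` could not parse it. A defensive way to read such a field is sketched below. This is not the actual fix applied in into_vcf.py; the helper name `parse_count` and the assumption that `*` stands for a missing value are mine.

```python
def parse_count(value, missing=("*", "", "NA")):
    """Convert a CSV field to int, returning None for placeholder values.

    Assumption: '*' marks a missing value in the genotyping output.
    """
    value = value.strip()
    if value in missing:
        return None
    return int(value)

# Hypothetical row, shaped like what csv.DictReader would yield.
row = {"total_reads": "*", "unexpl_reads": "2228"}
reads = parse_count(row["total_reads"])
unexpl = parse_count(row["unexpl_reads"])
print(reads, unexpl)
```

Returning `None` instead of raising lets the caller decide whether to skip the locus or report it.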
The command used is as follows:
python3 /home/ghanshyam/apps/locityper/extra/into_fasta.py -i gts.csv -d db -o APD_rec_LT.fasta
@gsc74 Thank you! I think I fixed this issue, can you check now?
Thank you! I'm able to get haplotypes at the regions listed in a BED file. I was also wondering whether it's possible to get overall end-to-end diploid haplotypes, not just diploid haplotypes at given loci. For example, I gave loci.bed as input for genotype, and it looks like this:
0 0 4920303 MHC
The gts.csv output I got is:
sample locus genotype quality total_reads unexpl_reads weight_dist warnings
gts MHC 0,HG00438.1 474.8 61736 2228 0.00000 *
Of the two haplotypes I got from the CSV file, one is the reference genome, i.e. MHC-CHM13, and the other has an edit distance of 18 to the MHC-HG00438.1 haplotype.
Maybe it will be possible in distant future versions, but currently I would advise against trying to predict full diploid haplotypes. By design, Locityper tries to select two input haplotypes that would explain the input dataset. Consequently, for the full region it would only work if there were two haplotypes in the reference panel, that are very similar to the input dataset. On a large scale, and for such a polymorphic region, I am sure we don't have enough assembled haplotypes for this to work properly. And in general, we don't model recombinations, so predicting very long haplotypes only works if recombined haplotypes are already in the input panel.
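The limitation can be illustrated with a toy example. The sketch below is not Locityper's algorithm: the real tool scores haplotype pairs against sequencing reads, whereas this toy compares panel haplotypes directly to known true haplotypes using Hamming distance. All names and sequences are hypothetical. It only shows that when one true haplotype is a recombinant that is absent from the panel, no pair of panel haplotypes matches exactly.

```python
from itertools import combinations_with_replacement

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def pair_cost(pair, truth):
    """Total mismatches between a haplotype pair and the truth, best phasing."""
    (h1, h2), (t1, t2) = pair, truth
    return min(hamming(h1, t1) + hamming(h2, t2),
               hamming(h1, t2) + hamming(h2, t1))

def best_pair(panel, truth):
    """Return (cost, (name1, name2)) for the closest unordered panel pair."""
    names = sorted(panel)
    scored = [(pair_cost((panel[a], panel[b]), truth), (a, b))
              for a, b in combinations_with_replacement(names, 2)]
    return min(scored)

# Toy panel with two haplotypes; the first true haplotype is a recombinant.
panel = {"A": "AAAAAAAA", "B": "CCCCCCCC"}
truth = ("AAAAACCC", "CCCCCCCC")
cost, pair = best_pair(panel, truth)
print(pair, cost)
```

Even the best pair leaves a non-zero cost, because the recombinant haplotype can only be approximated by one of its parental haplotypes.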
Thank you!
@tprodanov, I want to reconstruct the haplotypes from the MHC region. I have used the following commands.
How can I get the reconstructed haplotypes in FASTA format? I was exploring the following method to get haplotypes:
While converting to VCF, I get the following error:
I see an error message in the log:
The MHC_GIAB_CHM13.bed file is extracted from the GIAB CHM13-HG002 small variants benchmark by filtering the coordinates of the MHC region with respect to the CHM13 v2.0 reference and offsetting them to start from 0. It has the following lines: