Closed dandanpeng closed 7 months ago
Hi @dandanpeng here are my thoughts on your questions:
(1) Here you mention 200kb since you use -end 2e5
? It is definitely no problem to use singer_master
directly instead of parallel_singer
. One thing to note is that SINGER needs at least >100 mutations in the region to work, otherwise you probably don't want to trust the results anyways.
(2) The VCF file you provided looks like it is haploid VCF file, instead of diploid. It is supposed to look like Sample 1
with 0|1
to work for SINGER. But here Sample 1
appears only with 0
. I think I will write a new parser for haploid VCF later, but before that happens, you can change the simulation to output diploid VCF.
(3) Ah the number of burn-in and thinning. It is rather subtle, and heavily depends on the sample size (My real data is much larger than Debora's test simulation data). The reason for Debora's setting is mainly the runtime constraints for running ARGweaver. I think for <50 haplotypes it is more than enough for run 2000 iterations with 1000 as burn-in (from my own experience).
Let me know if I mis-interpret any of your questions or you have any follow-up questions!
Thank you for the explanation!
For the second question, sorry I messed up the table with the pipe in MARKDOWN. I did use the diploid data and just fixed the input in the original post. And now you can see that the correspondence between the VCF file and variants in tree sequence.
Can you share the vcf file with me? Or maybe show 20 lines or so in the markdown?
Sure! Here is the vcf file I ran with. Since GitHub doesn't support the VCF file type, I changed the suffix to TXT, hope it still works on your side. singer.txt
Oh I know what is wrong! The program interprets Sample 1
as 2 names, one for Sample
and another for 1
. Could you try naming them Sample_1
, Sample_2
instead?
That solves the issue! Thank you for the help!
Hi, @YunDeng98,
I have three questions about the usage of SINGER:
I have simulated a very short sequence (2Kb) and want to run the ARG inference for the entire sequence instead of a specific region. Would it be OK to call singer_master and set the
-start
and-end
parameters as 0 and 2e5, respectively? My rationale for this is that, for such a short sequence, cutting it into windows and parallelizing the running (as parrallel_singer does) may be unnecessary.My VCF data comprises 10 individuals, each with 2 alleles for one SNP. I expected that each marginal tree in the output tree sequence would have 10 tips (each tip corresponds to 1 individual) or 20 tips (each tip corresponds to 1 chromosome). However, I found that there are 40 tips in each marginal tree. I inspected the variants in the tree sequence and observed that the first 20 alleles correspond to the alleles shown in the VCF file, and all the latter 20 alleles are 0, indicating potential auto-filling with 0. Any insights into what might be causing this discrepancy? Here is the command I used:
Thank you for the help : )