popgenmethods / SINGER

Sampling and inference of genealogies with recombination
MIT License
24 stars 4 forks source link

Different numbers of alleles in input VCF and output treeSequence files #8

Closed dandanpeng closed 7 months ago

dandanpeng commented 7 months ago

Hi, @YunDeng98,

I have three questions about the usage of SINGER:

  1. I have simulated a very short sequence (2Kb) and want to run the ARG inference for the entire sequence instead of a specific region. Would it be OK to call singer_master and set the -start and -end parameters as 0 and 2e5, respectively? My rationale for this is that, for such a short sequence, cutting it into windows and parallelizing the running (as parrallel_singer does) may be unnecessary.

  2. My VCF data comprises 10 individuals, each with 2 alleles for one SNP. I expected that each marginal tree in the output tree sequence would have 10 tips (each tip corresponds to 1 individual) or 20 tips (each tip corresponds to 1 chromosome). However, I found that there are 40 tips in each marginal tree. I inspected the variants in the tree sequence and observed that the first 20 alleles correspond to the alleles shown in the VCF file, and all the latter 20 alleles are 0, indicating potential auto-filling with 0. Any insights into what might be causing this discrepancy? Here is the command I used:

    singer_master -Ne 10000 -m 2e-8 -ratio 1.25 -start 0 -end 2e5 -vcf singer -output singer -n 700 -thin 10 -polar 0.99
    convert_to_tskit -input singer -output singer -start 200 -end 699 -step 10
This is how the VCF and variant in tree sequence look like: #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample 1 Sample 2 Sample 3 Sample 4 Sample 5 Sample 6 Sample 7 Sample 8 Sample 9 Sample 10
1 199680 snp505 A C . . . GT 0|0 1|1 1|1 1|1 0|1 0|1 1|0 1|0 0|0 1|1

Variable site 504 at genome position 199680.0 : ['0', '0', '1', '1', '1', '1', '1', '1', '0', '1', '0', '1', '1', '0', '1', '0', '0', '0', '1', '1', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0']

  1. Could you provide a guidance on the number of iterations and burn-in? Débora Brandt's evaluation shows that running ARGweaver with 1200 iterations with burn-in of the first 200 iterations is sufficient for convergence. Is there a special reason behind running SINGER for 10,000 iterations with burn-in of the first 4,000 iterations?

Thank you for the help : )

YunDeng98 commented 7 months ago

Hi @dandanpeng here are my thoughts on your questions:

(1) Here you mention 200kb since you use -end 2e5? It is definitely no problem to use singer_master directly instead of parallel_singer. One thing to note is that SINGER needs at least >100 mutations in the region to work, otherwise you probably don't want to trust the results anyways.

(2) The VCF file you provided looks like it is haploid VCF file, instead of diploid. It is supposed to look like Sample 1 with 0|1 to work for SINGER. But here Sample 1 appears only with 0. I think I will write a new parser for haploid VCF later, but before that happens, you can change the simulation to output diploid VCF.

(3) Ah the number of burn-in and thinning. It is rather subtle, and heavily depends on the sample size (My real data is much larger than Debora's test simulation data). The reason for Debora's setting is mainly the runtime constraints for running ARGweaver. I think for <50 haplotypes it is more than enough for run 2000 iterations with 1000 as burn-in (from my own experience).

Let me know if I mis-interpret any of your questions or you have any follow-up questions!

dandanpeng commented 7 months ago

Thank you for the explanation!

For the second question, sorry I messed up the table with the pipe in MARKDOWN. I did use the diploid data and just fixed the input in the original post. And now you can see that the correspondence between the VCF file and variants in tree sequence.

YunDeng98 commented 7 months ago

Can you share the vcf file with me? Or maybe show 20 lines or so in the markdown?

dandanpeng commented 7 months ago

Sure! Here is the vcf file I ran with. Since GitHub doesn't support the VCF file type, I changed the suffix to TXT, hope it still works on your side. singer.txt

YunDeng98 commented 7 months ago

Oh I know what is wrong! The program interprets Sample 1 as 2 names, one for Sample and another for 1. Could you try naming them Sample_1, Sample_2 instead?

dandanpeng commented 7 months ago

That solves the issue! Thank you for the help!