gvbarroso / iSMC

The integrated Sequentially Markovian Coalescent
GNU General Public License v3.0
11 stars 3 forks source link

Something went wrong when computing theta #8

Closed xinghua1001 closed 3 years ago

xinghua1001 commented 3 years ago

Hi Gustavo, It's awesome to infer recombination maps from a single pair of genomes. I tried to do so with iSMC using a single unphased vcf file. Sadly it showed something wrong, below is the log:

Parsing options: Parsing file test_opt.bpp for options. WARNING!!! Parameter diploid_indices not specified. Default used instead: (0,1) Opening sequence file: SRR6906436.LG3.snps.vcf.gz Done. Parsing Tab file...done. Parsing VCF file... done. Masking sequences... done. Creating iSMC objects... done. Setting up optimisation. terminate called after throwing an instance of 'bpp::Exception' what(): iSMC::Something went wrong when computing theta :( /opt/ha/gridview/slurm/spool/slurmd/job6516434/slurm_script: line 10: 81459 Aborted (core dumped) ismc params=test_opt.bpp

gvbarroso commented 3 years ago

Hello!

There are a few reasons why this error may appear (thus I couldn't make it more informative so far, unfortunately), but the most common has to do with the of observed states of the model. iSMC expects to see at least one homozygous position, at least one heterozygous position and optionally masked positions, set using some criterion by the researcher. If at least one site is masked in a chromosome, then iSMC will expect to see at least one masked site in each chromosome (or contig or whatever is set in the first column of the VCF). Could you please check whether something related to this could be causing the problem? If more individuals are present, this generalises in that dimension as well, but this should not apply to your case. Other times an extra (empty) line in the Tab file will also result in error.

Please let me know if you manage to dig a little deeper.

jydu commented 3 years ago

Gustavo,

Got the same error here, most likely due to "missing" missing data in at least one pair of chromosomes. Maybe a simple trick would be to systematically set the first and last positions of every chromosome alignments as N, to avoid this issue and with virtually no impact on the results?

J.

gvbarroso commented 3 years ago

Yes, I think this should work!

gvbarroso commented 3 years ago

Added a method to do just that, please let me know if it works!

jydu commented 3 years ago

Hi Gustavo,

Great! I think this will be useful, but the theta issue might be somewhere else. As far as I can see, it occurs when there is no heterozygous site, independently of the presence or not of missing data. We happened to have that exact same message because we mistakenly specified the list of haplotypes by their name instead of their index. The program then converted the characters into integers, which was 0 for all. As a result, we were comparing the first haplotype with itself! I have added a test that sends an error when two identical haplotypes are compared.

Cheers,

Julien.