gvbarroso / iSMC

The integrated Sequentially Markovian Coalescent
GNU General Public License v3.0
11 stars 3 forks source link

Number of observed states in fragment 2 does not match number of observed states in entire sequence #7

Closed A-J-F-Mackintosh closed 3 years ago

A-J-F-Mackintosh commented 3 years ago

Hi Gustavo,

I have been trying out iSMC (the precompiled release, 0.0.23) on some insect data. One dataset ran from start to finish successfully, however with another I can't seem to start the optimisation step.

Pre-processing pair of genomes SP_BI_364...
    creating data structure for fragment #1 of 349.
    creating data structure for fragment #2 of 349.
WARNING!!! Number of observed states in fragment 2 does not match number of observed states in entire sequence!
This usually happens if fragment is small and lacks either homozygous,heterozygous or missing sites.
Fragment size: 1000000
terminate called after throwing an instance of 'bpp::Exception'
    what():  iSMC::Could not create data structure with zipHMM!
Aborted (core dumped)

When I look at the sequences that iSMC has written (SP_BI_364.block.*.fasta.gz), most of them have no missing sites masked even though they all should. So, I think there is a problem with the masking step but I am unsure exactly what might be going wrong. Have you seen this behaviour before?

I did think that compiling a newer iSMC version from this repo would be a good idea, as maybe the problem is exclusive to v0.0.23. Unfortunately I am having trouble installing simple-ziphmm.

git clone https://gitlab.gwdg.de/molsysevol/simple-ziphmm.git cd simple-ziphmm/ cmake -DCMAKE_INSTALL_PREFIX=~/software . make install

Leads to:

[  3%] Building CXX object src/CMakeFiles/simpleziphmm-shared.dir/build_forwarder.cpp.o
In file included from /ceph/users/amackintosh/software/simple-ziphmm/src/forwarder.hpp:4:0,
    from /ceph/users/amackintosh/software/simple-ziphmm/src/build_forwarder.cpp:1:
/ceph/users/amackintosh/software/simple-ziphmm/src/matrix.hpp:19:21: fatal error: cblas.h: No such file or directory
    #include "cblas.h"
        ^
compilation terminated.
src/CMakeFiles/simpleziphmm-shared.dir/build.make:81: recipe for target 'src/CMakeFiles/simpleziphmm-shared.dir/build_forwarder.cpp.o' failed
make[2]: *** [src/CMakeFiles/simpleziphmm-shared.dir/build_forwarder.cpp.o] Error 1
CMakeFiles/Makefile2:954: recipe for target 'src/CMakeFiles/simpleziphmm-shared.dir/all' failed
make[1]: *** [src/CMakeFiles/simpleziphmm-shared.dir/all] Error 2
Makefile:181: recipe for target 'all' failed
make: *** [all] Error 2

I use conda to manage software, and so tried to install cblas through https://anaconda.org/anaconda/openblas as well https://anaconda.org/conda-forge/libcblas but 'make' still can't find a cblas.h file.

Do you have any suggestions on how I might be able to fix the masking problem I'm having? Or maybe it is best to try and compile simple-ziphmm another way and use the most recent iSMC version?

Best wishes,

Alex

gvbarroso commented 3 years ago

Dear Alex,

Indeed, if you cannot see masked sites in ALL the sequences in the fasta files that iSMC outputs, that is most likely where the problem is coming from. This is odd. What type of input files are you using? If VCF, are you providing a mask file (BED of FASTA format)?

I also notice that you have specified many independent fragments in your data. What were your criteria? For example, 1000000 seems like too round of a number to happen randomly, but who knows. Anyway, it could be useful to look at the length distribution of your fragments. If some are too small, they may simply lack heterozygous sites by chance...

It is always a good idea to use the latest version of iSMC, but as for your issue with cblas, I am afraid I cannot help much with that (I am not familiar with conda). Maybe @jydu would have some pointers here?

A-J-F-Mackintosh commented 3 years ago

Dear Gustavo,

After reading your response I realised that my tab file was the cause of the problem. Instead of having one line per chromosome I had split chromosomes up into 1Mb windows. E.g.

ctg_1    0    1000000    0    1000000    0    1000000
ctg_1    1000000    2000000    0    1000000    1000000    2000000

I have fixed this so now iSMC is up and running.

I have noticed that chromosomes cannot be listed in their entirety in the tab file. For example, the following causes a segmentation fault.

ctg_1    0    17344908    0    17344908    0    17344908
ctg_2    0    2117830    0    2117830    0    2117830

Whereas a buffer of 10kb at the start and end is allowed. E.g.

ctg_1    10000    17334908    0    17324908    10000    17334908
ctg_2    10000    2107830    0    2097830    10000    2107830

Is this because the start/end of a block must be with the first/last SNP of that chromosome in the vcf? Or can a block not start with masked sites?

Also, any advice on compiling simple-ziphmm / iSMC with conda dependencies would be really helpful

Best,

Alex

gvbarroso commented 3 years ago

Dear Alex,

I am glad you could spot that. As a rule,you should not split chromosomes unless you have a very good reason for it (in which case the split should be reflected both in the tab file and in the CHROM column of the VCF). As for your tab file, you are guessing it correctly. The 2nd column should be the first SNP in your VCF (not 0) and the 3rd column should be the last SNP in your VCF (not the chr length - 1). This is why your first example fails while the buffer fixes the problem (your VCF most likely starts at a position < 10kb). The mask file, on the other hand, can include more sites.

Unfortunately I am not familar with conda (at all) and cannot help with the installation issue.

Best, Gustavo