insilico / encore

GWAS and biological data analysis tool
BSD 3-Clause "New" or "Revised" License

A bug about function 'readNumFile' #1

Open biosyssun opened 12 years ago

biosyssun commented 12 years ago

Compilation output:

g++ -DHAVE_CONFIG_H -I. -DUNIX -I/share/disk6-2/leihx/sunjy/software/boost-1.45.0/include -DHAVEBOOL -DNOPLUGIN_ -fopenmp -O3 -MT encore.o -MD -MP -MF .deps/encore.Tpo -c -o encore.o encore.cpp
encore.cpp: In function 'int main(int, char*)':
encore.cpp:481:40: error: no matching function for call to 'PlinkHandler::readNumFile(std::string&, bool&)'
/share/disk6-2/leihx/sunjy/software/plink-1.07/include/plink/plinklibhandler.h:32:8: note: candidate is: void PlinkHandler::readNumFile(std::string)
make[1]: *** [encore.o] Error 1
make[1]: Leaving directory `/share/disk6-2/leihx/sunjy/tools/insilico-encore-ad4960d'
make: *** [all] Error 2

What is wrong?

hexhead commented 12 years ago

Looks like PLINK 1.07 is being used instead of our specialized version here:

https://github.com/insilico/plink

biosyssun commented 12 years ago

Thank you very much for your quick response. I will try it now!

biosyssun commented 12 years ago

I succeeded in installing Encore on our compute cluster running CentOS 5.2.

(A) The compilation environment is the following:

  GCC-4.5.0
  Encore1.1: https://github.com/insilico/Encore
  EC: https://github.com/insilico/EC
  Plink1.07-is2: https://github.com/insilico/plink
  Random jungle: https://github.com/insilico/randomjungle
  GNU gsl-1.15: http://mirrors.ustc.edu.cn/gnu/gsl/
  BLAS: http://www.netlib.org/blas/blas.tgz
  LAPACK: http://www.netlib.org/lapack/lapack-3.4.1.tgz
  Armadillo: http://sourceforge.net/projects/arma/files/armadillo-3.4.0.tar.gz
  Boost-1.45.0: http://sourceforge.net/projects/boost/files/boost/1.45.0/

(B) Installation:

    step1: install blas

  1. vi make.inc to change the following:
       OPTS = -O2 -fPIC
       NOOPT = -O0 -fPIC
       BLASLIB = libblas.a
  2. make

    step2: install lapack

  1. cp make.inc.example make.inc
  2. vi make.inc to change the following:
       OPTS = -O2 -fPIC
       NOOPT = -O0 -fPIC
       BLASLIB = /share/disk6-2/leihx/sunjy/tools/BLAS/libblas.a (that is, the path to libblas.a)
  3. make

    step3: install armadillo

  1. export LIBS="-lblas -llapack -lgfortran"
  2. cmake .
  3. make
  4. make install DESTDIR=/share/disk6-2/leihx/sunjy/software/armadillo-3.4.0

    Test of armadillo:

  1. cd examples
  2. g++ -I /share/disk6-2/leihx/sunjy/software/boost-1.45.0/include -L/share/disk6-2/leihx/sunjy/tools/BLAS -L/share/disk6-2/leihx/sunjy/tools/lapack-3.4.1 -L/share/disk6-2/leihx/sunjy/software/armadillo-3.4.0/usr/lib64 -O2 -o example1 example1.cpp -larmadillo -lgfortran -lblas -llapack

    If there is no output, Armadillo was installed successfully.

    step4: install boost

  1. ./bootstrap.sh --prefix=/share/disk6-2/leihx/sunjy/software/boost-1.45.0
  2. ./bjam install

    step5: install plink

  1. export LDFLAGS="-L/share/disk6-2/leihx/sunjy/tools/BLAS -L/share/disk6-2/leihx/sunjy/tools/lapack-3.4.1"
  2. export LIBS="-lblas -llapack -lgfortran"
  3. ./bootstrap.sh
  4. ./configure --prefix=/share/disk6-2/leihx/sunjy/software/plink-1.07 --with-openmp --with-lapack
  5. make
  6. make install

    step6: install random jungle

  1. export CPLUS_INCLUDE_PATH="/share/disk6-2/leihx/sunjy/tools/insilico-randomjungle-ca3da2a/src/library:/share/disk6-2/leihx/sunjy/tools/insilico-randomjungle-ca3da2a/src/lr:/share/disk6-2/leihx/sunjy/software/boost-1.45.0/include:/share/disk6-2/leihx/sunjy/software/gsl-1.15/include:$CPLUS_INCLUDE_PATH"
  2. export C_INCLUDE_PATH="/share/disk6-2/leihx/sunjy/tools/insilico-randomjungle-ca3da2a/src/library:/share/disk6-2/leihx/sunjy/tools/insilico-randomjungle-ca3da2a/src/lr:/share/disk6-2/leihx/sunjy/software/boost-1.45.0/include:/share/disk6-2/leihx/sunjy/software/gsl-1.15/include:$C_INCLUDE_PATH"
  3. export LDFLAGS="-L/share/disk6-2/leihx/sunjy/software/gsl-1.15/lib -L/share/disk6-2/leihx/sunjy/software/boost-1.45.0/lib"
  4. export LIBS="-lgsl -lgslcblas"
  5. ./configure --prefix=/share/disk6-2/leihx/sunjy/software/randomjungle-1.2.362 --with-boost=/share/biosoft/software/boost1.45 --with-gsl-prefix=/share/disk6-2/leihx/sunjy/software/gsl-1.15 --with-xml-prefix=/share/biosoft/software/libxml2
  6. make
  7. make install

    step7: install EC

  1. export CPLUS_INCLUDE_PATH="/share/disk6-2/leihx/sunjy/software/boost-1.45.0/include:/share/disk6-2/leihx/sunjy/tools/insilico-EC-f6a909c/src/library:/share/disk6-2/leihx/sunjy/software/gsl-1.15/include:/share/disk6-2/leihx/sunjy/software/randomjungle-1.2.362/include:$CPLUS_INCLUDE_PATH"
  2. export C_INCLUDE_PATH="/share/disk6-2/leihx/sunjy/software/boost-1.45.0/include:/share/disk6-2/leihx/sunjy/tools/insilico-EC-f6a909c/src/library:/share/disk6-2/leihx/sunjy/software/gsl-1.15/include:/share/disk6-2/leihx/sunjy/software/randomjungle-1.2.362/include:$C_INCLUDE_PATH"
  3. export LDFLAGS="-L/share/disk6-2/leihx/sunjy/software/boost-1.45.0/lib -L/share/disk6-2/leihx/sunjy/tools/insilico-EC-f6a909c/src/library"
  4. ./configure --prefix=/share/disk6-2/leihx/sunjy/software/EC --with-boost=/share/disk6-2/leihx/sunjy/software/boost-1.45.0 --with-gsl-prefix=/share/disk6-2/leihx/sunjy/software/gsl-1.15 --with-xml-prefix=/share/biosoft/software/libxml2
  5. make
  6. make install

    step8: install Encore

  1. export CPLUS_INCLUDE_PATH="/share/disk6-2/leihx/sunjy/software/armadillo-3.4.0/usr/include:/share/disk6-2/leihx/sunjy/software/plink-1.07/include:/share/disk6-2/leihx/sunjy/software/EC/include:/share/disk6-2/leihx/sunjy/software/boost-1.45.0/include:/share/disk6-2/leihx/sunjy/software/gsl-1.15/include:/share/disk6-2/leihx/sunjy/software/randomjungle-1.2.362/include:$CPLUS_INCLUDE_PATH"
  2. export C_INCLUDE_PATH="/share/disk6-2/leihx/sunjy/software/armadillo-3.4.0/usr/include:/share/disk6-2/leihx/sunjy/software/plink-1.07/include:/share/disk6-2/leihx/sunjy/software/EC/include:/share/disk6-2/leihx/sunjy/software/boost-1.45.0/include:/share/disk6-2/leihx/sunjy/software/gsl-1.15/include:/share/disk6-2/leihx/sunjy/software/randomjungle-1.2.362/include:$C_INCLUDE_PATH"
  3. export LIBS="-larmadillo -lblas -llapack -lgfortran -lgsl -lgslcblas -llr -lboost_program_options -lec"
  4. export LDFLAGS="-L/share/disk6-2/leihx/sunjy/software/boost-1.45.0/lib -L/share/disk6-2/leihx/sunjy/tools/BLAS -L/share/disk6-2/leihx/sunjy/tools/lapack-3.4.1 -L/share/disk6-2/leihx/sunjy/software/armadillo-3.4.0/usr/lib64 -L/share/disk6-2/leihx/sunjy/software/EC/lib -L/share/disk6-2/leihx/sunjy/software/plink-1.07/lib -L/share/disk6-2/leihx/sunjy/software/randomjungle-1.2.362/lib -L/share/disk6-2/leihx/sunjy/software/gsl-1.15/lib"
  5. ./bootstrap.sh
  6. ./configure --prefix=/share/disk6-2/leihx/sunjy/software/encore-1.1 --disable-gsltest --with-openmp --with-gsl-prefix=/share/disk6-2/leihx/sunjy/software/gsl-1.15
  7. make
  8. make install

(C) important environment variables in ~/.bashrc

  1. export LD_LIBRARY_PATH="/share/disk6-2/leihx/sunjy/tools/CBLAS/lib:/share/disk6-2/leihx/sunjy/tools/BLAS:/share/disk6-2/leihx/sunjy/tools/lapack-3.4.1:/share/disk6-2/leihx/sunjy/software/armadillo-3.4.0/usr/lib64:/share/disk6-2/leihx/sunjy/software/EC/lib:/share/disk6-2/leihx/sunjy/software/randomjungle-1.2.362/lib:/share/biosoft/software/gcc_install/gmp5.0.4/lib:/share/biosoft/software/gcc_install/ppl0.12/lib:/share/biosoft/software/gcc_install/cloog0.16/lib:/share/biosoft/software/gcc_install/gmp-5.0.1/lib:/share/biosoft/software/gcc_install/mpc-0.8.2/lib:/share/biosoft/software/gcc_install/mpfr-3.0.0/lib:/share/biosoft/software/gcc_install/gcc4.5/gcc/lib64:/share/biosoft/software/gcc_install/gcc4.7/lib64:/share/disk6-2/leihx/sunjy/software/boost-1.45.0/lib:/share/disk6-2/leihx/sunjy/software/gsl-1.15/lib:$LD_LIBRARY_PATH"
  2. export CMAKE_LIBRARY_PATH="$CMAKE_LIBRARY_PATH:/share/disk6-2/leihx/sunjy/tools/CBLAS/lib:/share/disk6-2/leihx/sunjy/tools/BLAS:/share/disk6-2/leihx/sunjy/tools/lapack-3.4.1"
  3. export BOOST_ROOT="/share/disk6-2/leihx/sunjy/software/boost-1.45.0"

Good luck to everyone building Encore!

hexhead commented 12 years ago

Excellent! Glad you got it all working.

argoneus commented 12 years ago

As it just so happens, I updated plink to our latest modified version last night. So Encore and plink hosted on our organization page should build on Linux, Mac, and Windows (via MinGW).

biosyssun commented 12 years ago

When I executed the command "encore -i GWAS_Statistics_plink.bed --ec -o phg000068_ec_out" on Linux, an error occurred: "RandomJungle constructor: Unexpected condition.rj-num-trees should have a default"! Why?

The full output is:

start-time Fri Sep 14 12:39:30 HKT 2012
Reading map (extended format) from [ GWAS_Statistics_plink.bim ]
410969 markers to be included from [ GWAS_Statistics_plink.bim ]
Reading pedigree information from [ GWAS_Statistics_plink.fam ]
1577 individuals read from [ GWAS_Statistics_plink.fam ]
1577 individuals with nonmissing phenotypes
Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)
Missing phenotype value is also -9
799 cases, 778 controls and 0 missing
615 males, 962 females, and 0 of unspecified sex
Reading genotype bitfile from [ GWAS_Statistics_plink.bed ]
Detected that binary PED file is v1.00 SNP-major mode
Before frequency and genotyping pruning, there are 410969 SNPs
1577 founders and 0 non-founders found
31929 heterozygous haploid genotypes; set to missing
Writing list of heterozygous haploid genotypes to [ phg000068_ec_out.hh ]
Total genotyping rate in remaining individuals is 0.994814
0 SNPs failed missingness test ( GENO > 1 )
0 SNPs failed frequency test ( MAF < 0 )
After frequency and genotyping pruning, there are 410969 SNPs
After filtering, 799 cases, 778 controls and 0 missing
After filtering, 615 males, 962 females, and 0 of unspecified sex
20121409 - 12:39:56 - IDs are not needed for this analysis
20121409 - 12:39:56 - Dataset detection for SNP file [GWAS_Statistics_plink.bed]
20121409 - 12:39:56 - Plink binary
20121409 - 12:39:56 - Default SNP nearest neighbors distance metric: gm
20121409 - 12:39:56 - Default continuous distance metric: manhattan
20121409 - 12:39:56 - PlinkBinaryDataset loading
20121409 - 12:39:56 - Plink filename prefix for bim and bed files: GWAS_Statistics_plink
20121409 - 12:39:56 - Reading plink bim/attribute metadata from GWAS_Statistics_plink.bim
20121409 - 12:39:58 - There are 410969 attributes in the dataset
20121409 - 12:39:58 - Detecting class type from file: GWAS_Statistics_plink.fam
20121409 - 12:39:58 - Case-control phenotypes detected
20121409 - 12:39:58 - Reading plink binary fam file from GWAS_Statistics_plink.fam
20121409 - 12:39:58 - 1577 individuals read from the fam file.
20121409 - 12:40:00 - Reading plink attribute data from GWAS_Statistics_plink.bed
20121409 - 12:40:00 - Reading instance data in attribute-major mode
20121409 - 12:40:00 - Reading 395 bytes for each SNP column
20121409 - 12:40:34 - 10%
20121409 - 12:41:09 - 20%
20121409 - 12:41:43 - 30%
20121409 - 12:42:18 - 40%
20121409 - 12:42:52 - 50%
20121409 - 12:43:27 - 60%
20121409 - 12:44:01 - 70%
20121409 - 12:44:35 - 80%
20121409 - 12:45:10 - 90%
20121409 - 12:45:44 - 100%
20121409 - 12:45:44 - 100% decoded data set
20121409 - 12:45:45 - There are 1 instances in the data set
20121409 - 12:45:45 - There are 1 instances in the instance mask
20121409 - 12:45:45 - There are 1 classes in the data set
20121409 - 12:45:45 - Updating all level counts:
20121409 - 12:45:45 - 1/1 done
20121409 - 12:45:45 - Excluding monomorphic SNPs
20121409 - 12:45:45 - 0 SNPs excluded as monomorphic
20121409 - 12:45:45 - 1 instances remain after covariate/phenotype matching
20121409 - 12:45:45 - Dataset has:
20121409 - 12:45:45 - instances: 1
20121409 - 12:45:45 - SNPs: 410969
20121409 - 12:45:45 - classes: 1
20121409 - 12:45:45 - Data Set Class Index
20121409 - 12:45:45 - Index has [1] entries:
20121409 - 12:45:45 - 0: 1
20121409 - 12:45:45 - total elements: 410970
20121409 - 12:45:45 - 0 missing attribute values detected
20121409 - 12:45:45 - Total genotyping rate: 1
20121409 - 12:45:45 - 0 missing numeric values detected
20121409 - 12:45:45 - Evaporative Cooling initialization:
20121409 - 12:45:45 - EC is removing attributes until best 410969 remain
20121409 - 12:45:45 - Running EC in standard mode: Random Jungle + Relief-F
20121409 - 12:45:45 - EC will remove 0 attributes on first iteration
20121409 - 12:45:45 - 8 OpenMP processors available to EC
20121409 - 12:45:45 - EC will use 8 threads
20121409 - 12:45:45 - Initializing Random Jungle with ConfigMap
end-time Fri Sep 14 12:45:45 HKT 2012

hexhead commented 12 years ago

That is strange. Only one instance/sample/individual is being loaded, as if all the samples are being filtered out. Maybe Nick can help us here. He wrote ENCORE. I wrote EC.

biosyssun commented 12 years ago

Yes, it is very strange! Are you sure this error is not caused by EC? Nick, please help me!

hexhead commented 12 years ago

Nick is no longer with the lab, so I will have a look.

hexhead commented 12 years ago

OK, this is a communication issue between the Encore and EC libraries. The easiest fix for now, since I'm not ready to push all my EC changes public, would be to edit the RandomJungle.cpp file in the EC project. Lines 215-219 should read:

if (GetConfigValue(configMap, "rj-num-trees", configValue)) {
    unsigned int numTrees = lexical_cast<unsigned int>(configValue);
    rjParams.ntree = numTrees;
} else {
    cout << Timestamp() << "Setting RandomJungle number of trees to 500" << endl;
    rjParams.ntree = 500;
}

I changed the else part from a fatal error to the default RandomJungle setting of 500. You will need to recompile EC (first) and for good measure recompile Encore. Sorry for the problem. Not sure how we missed this one.
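The fallback idea in this patch can be sketched in isolation. The block below is a minimal illustration and not the EC source: `ConfigMap` here is a hypothetical typedef standing in for EC's own type, and `NumTreesOrDefault` is an invented helper name. It uses only the standard library rather than `lexical_cast`, and it also falls back on malformed values, which goes beyond what the patch above does.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Hypothetical stand-in for EC's configuration map: option name -> string value.
typedef std::map<std::string, std::string> ConfigMap;

// Returns the configured number of trees, or `fallback` when "rj-num-trees"
// is missing, is not made up purely of digits, overflows, or is zero.
// The default of 500 matches the ec help text: --rj-num-trees arg (=500).
unsigned int NumTreesOrDefault(const ConfigMap& config,
                               unsigned int fallback = 500) {
  assert(fallback > 0);  // a zero-tree default would be meaningless
  ConfigMap::const_iterator it = config.find("rj-num-trees");
  if (it == config.end()) {
    return fallback;  // key absent: use the default instead of failing
  }
  const std::string& text = it->second;
  if (text.empty() ||
      text.find_first_not_of("0123456789") != std::string::npos) {
    return fallback;  // reject signs, spaces, and other non-digit input
  }
  std::istringstream in(text);
  unsigned int numTrees = 0;
  if (!(in >> numTrees) || numTrees == 0) {
    return fallback;  // extraction failed (e.g. overflow) or value was zero
  }
  return numTrees;
}
```

With an empty map the helper returns 500, which mirrors the new else branch; a caller that sets "rj-num-trees" to "1000" gets 1000 back.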

biosyssun commented 12 years ago

I have modified RandomJungle.cpp as you advised, and that error is gone. But now I have hit another problem: "/opt/torque/mom_priv/jobs/2407859.big.cluster.cn.SC: line 13: 7385 Segmentation fault encore -i GWAS_Statistics_plink.bed --ec -o phg000068_ec_out"

Maybe it is not caused by Encore but by our limited computer memory (16 GB). Any suggestions will be appreciated! Thank you very much!

hexhead commented 12 years ago

What is the size of your .bed file? We have run data sets on the order of your data, with 4000 individuals and 300k SNPs. Is this the data you are trying to run?

20121409 - 12:39:58 - There are 410969 attributes in the dataset
20121409 - 12:39:58 - Detecting class type from file: GWAS_Statistics_plink.fam
20121409 - 12:39:58 - Case-control phenotypes detected
20121409 - 12:39:58 - Reading plink binary fam file from GWAS_Statistics_plink.fam
20121409 - 12:39:58 - 1577 individuals read from the fam file.

We run with 48 GB of RAM, so I'd say you are probably pushing the limits of 16 GB.
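As a rough cross-check on the .bed size question, the expected file size follows from the PLINK binary format: a SNP-major .bed file has a 3-byte header and stores 2 bits per genotype, so each SNP takes ceil(N/4) bytes for N individuals. That agrees with the "Reading 395 bytes for each SNP column" line in the log for 1577 individuals. The sketch below only does this arithmetic; the function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Bytes used to store one SNP across all individuals in a SNP-major .bed:
// 2 bits per genotype, 4 genotypes per byte, rounded up to a whole byte.
std::uint64_t BytesPerSnp(std::uint64_t individuals) {
  return (individuals + 3) / 4;
}

// Expected .bed size: 3 header bytes (two magic bytes plus one mode byte)
// followed by one fixed-size block per SNP.
std::uint64_t ExpectedBedBytes(std::uint64_t individuals, std::uint64_t snps) {
  assert(individuals > 0 && snps > 0);
  return 3 + BytesPerSnp(individuals) * snps;
}
```

For the data set in this thread, `ExpectedBedBytes(1577, 410969)` gives 162332758 bytes, roughly 155 MiB on disk. The packed file is small; the memory pressure comes from the unpacked in-memory copies.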

hexhead commented 12 years ago

You can always try running the standalone EC from the command line:

billwhite@isaac~/src/cppec$ ec
Allowed options:
  --help  produce help message
  --verbose  verbose output
  --convert  convert data set to data set - no ec
  -T [ --optimize-temp ]  optimize coupling constant T
  -c [ --config-file ] arg  read configuration options from file - command line overrides these
  -s [ --snp-data ] arg  read SNP attributes from genotype filename: txt, ARFF, plink (map/ped, binary, raw)
  --snp-file-type arg  Ignore file extension and use type: textwhitesp, wekaarff, plinkped, plinkbed, plinkraw, mayogeo, birdseed
  -n [ --numeric-data ] arg  read continuous attributes from PLINK-style covar file
  -X [ --numeric-transform ] arg  perform numeric transformation: normalize, standardize, zscore, log, sqrt
  -a [ --alternate-pheno-file ] arg  specifies an alternative phenotype/class label file; one value per line
  -g [ --ec-algorithm-steps ] arg (=all)  EC steps to run (all|me=main effects only|it=interaction effects only)
  --ec-me-algorithm arg (=rj)  Main effects algorithm (rj|deseq)
  --ec-it-algorithm arg (=rf)  Interaction effects algorithm (rf|rfseq)
  --ec-seq-algorithm-mode arg (=snr)  Seq interaction algorithm mode (snr|tstat)
  --ec-seq-algorithm-s0 arg (=0.050000000000000003)  Seq interaction algorithm s0 (0.0 <= s0 <= 1.0)
  -t [ --ec-num-target ] arg (=0)  EC N_target - target number of attributes to keep
  -r [ --ec-iter-remove-n ] arg (=0)  Evaporative Cooling number of attributes to remove per iteration
  -p [ --ec-iter-remove-percent ] arg  Evaporative Cooling precentage of attributes to remove per iteration
  -O [ --out-dataset-filename ] arg  write a new tab-delimited data set with EC filtered attributes
  -o [ --out-files-prefix ] arg (=ec_run)  use prefix for all output files
  --snp-metric arg (=gm)  metric for determining the difference between subjects (gm|am|nca|nca6)
  -B [ --snp-metric-nn ] arg (=gm)  metric for determining the difference between subjects (gm|am|nca|nca6|km)
  -W [ --snp-metric-weights ] arg (=gm)  metric for determining the difference between SNPs (gm|am|nca|nca6)
  -N [ --numeric-metric ] arg (=manhattan)  metric for determining the difference between numeric attributes (manhattan=|euclidean)
  -R [ --rj-run-mode ] arg (=1)  Random Jungle run mode: 1 (default=library call with memory I/O) / 2 (system call)/ 3 (library call with file I/O)
  -j [ --rj-num-trees ] arg (=500)  Random Jungle number of trees to grow
  --rj-mtry arg  Random Jungle size of randomly chosen variable sets, DEFAULT: sqrt(ncol)
  --rj-nimpvar arg (=100)  Random Jungle only necessary if backsel>0. SIZE=[1-...] how many variable should remain
  --rj-impmeasure arg (=1)  Random Jungle importance method (see RJ docs)
  --rj-backsel arg (=0)  Random Jungle backward elimination (see RJ docs)
  -Y [ --rj-tree-type ] arg (=1)  Random Jungle tree type: 1 (default)-5 (see RJ docs)
  -M [ --rj-memory-mode ] arg (=0)  Random Jungle memory mode: 0 (default=double) / 1 (float) / 2 (char)
  --rj-rng-seed arg (=1)  Seed for the random number generator.
  -x [ --snp-exclusion-file ] arg  file of SNP names to be excluded
  -k [ --k-nearest-neighbors ] arg (=10)  set k nearest neighbors
  -m [ --number-random-samples ] arg (=0)  number of random samples (0=all|1 <= n <= number of samples)
  -b [ --weight-by-distance-method ] arg (=equal)  weight-by-distance method (equal|one_over_k|exponential)
  --weight-by-distance-sigma arg (=2)  weight by distance sigma
  -d [ --diagnostic-tests ] arg  performs diagnostic tests and sends output to filename without running EC
  -D [ --diagnostic-levels-file ] arg  write diagnostic attribute level counts to filename
  --dge-counts-data arg  read digital gene expression counts from text file
  --dge-norm-factors arg  read digital gene expression normalization factors from text file
  --birdseed-snps-data arg  read SNP data from a birdseed formatted file
  --birdseed-phenos-data arg  read birdseed subjects phenotypes from a text file
  --birdseed-subjects-labels arg  read subject labels from filename to override names from data file
  --birdseed-include-snps arg  include the SNP IDs listed in the text file
  --birdseed-exclude-snps arg  exclude the SNP IDs listed the text file
  --distance-matrix arg  create a distance matrix for the loaded samples and exit
  --gain-matrix arg  create a GAIN matrix for the loaded samples and exit
  --dump-titv-file arg  file for dumping SNP transition/transversion information

hexhead commented 12 years ago

For example:

ec -s GWAS_Statistics_plink.bed -o phg000068_ec_out

biosyssun commented 12 years ago

Yes, I am running the data set with 1577 individuals and 500K SNPs. Does EC require much more RAM?

hexhead commented 12 years ago

The issue is the doubled RAM requirement when running EC from Encore. Encore and EC were developed independently, so they use different data structures. Encore loads the data set, does any filtering or other manipulations, then passes a filename to EC. EC then loads the data into its own data structures. If you run EC from the command line instead of through Encore, you would halve the RAM requirement and probably be fine. This is an unavoidable consequence of coupling the libraries as loosely as possible. EC exists independently of Encore and vice versa, which is a GOOD THING from our perspective. As a side note, EC depends on the Random Jungle library, which requires yet another copy of the data in RAM. I hope this helps explain the large memory requirements.
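To put a number on "another copy of the data", here is a back-of-the-envelope sketch. It assumes a dense, uncompressed matrix of individuals by SNPs, with the element size taken from the `--rj-memory-mode` values in the ec help text (0 = double, 1 = float, 2 = char). How Encore and EC lay out their own copies is not stated in this thread, so treating them the same way is purely an assumption. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Element size in bytes for each Random Jungle memory mode named in the
// ec help text. The sizes assume a typical platform: 8-byte double,
// 4-byte float, 1-byte char.
std::uint64_t ElementBytes(int memoryMode) {
  assert(memoryMode >= 0 && memoryMode <= 2);
  switch (memoryMode) {
    case 0: return 8;  // double (the default mode)
    case 1: return 4;  // float
    default: return 1; // char
  }
}

// Bytes needed for one dense copy of a rows x columns data matrix.
std::uint64_t DenseMatrixBytes(std::uint64_t rows, std::uint64_t columns,
                               int memoryMode) {
  return rows * columns * ElementBytes(memoryMode);
}
```

For 1577 individuals and 410969 SNPs, one copy in double mode comes to 5184784904 bytes, about 4.8 GiB. Under the dense-double assumption, three such copies (Encore, EC, and Random Jungle) would need roughly 14.5 GiB, which is consistent with a 16 GB machine running out of memory. Char mode would need 648098113 bytes per copy.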


Bill C. White, MS Research Associate Programmer University of Tulsa Tandy School of Computer Science http://insilico.utulsa.edu/

biosyssun commented 12 years ago

Good explanation! I will try to run EC. Thank you very much!

argoneus commented 12 years ago

Bill is correct that I'm no longer with the lab (I completed my PhD back in May), but I am certainly still interested and involved in Encore. I'll help whenever I can with issues.

biosyssun commented 12 years ago

I ran "ec -s GWAS_Statistics_plink.bed -o phg000068_ec_out". The error caused by insufficient RAM is solved, but another issue appeared.

ERROR: ERROR: GetNNearestInstances: N: [10] is larger than the number of neighbors in same class: 0
ERROR: relieff cannot get 10 nearest neighbors
ERROR: RunReliefF: ComputeAttributeScores failed
ERROR: In EC algorithm: ReliefF failed
ERROR: Failed to calculate EC scores

OUTPUT:
start-time Mon Sep 17 10:08:28 HKT 2012
20121709 - 10:08:28 - ec starting
20121709 - 10:08:28 - Processing command line arguments
20121709 - 10:08:28 - Determining analysis type
20121709 - 10:08:28 - SNP-only analysis requested
20121709 - 10:08:28 - Checking for numeric data and/or alternate phenotype files
20121709 - 10:08:28 - Determining the IDs to be read from the dataset
20121709 - 10:08:28 - IDs are not needed for this analysis
20121709 - 10:08:28 - 0 individual IDs read from numeric and/or phenotype file(s)
20121709 - 10:08:28 - Loading and preparing data for EC analysis
20121709 - 10:08:28 - Reading SNPs data set
20121709 - 10:08:28 - Dataset detection for SNP file [GWAS_Statistics_plink.bed]
20121709 - 10:08:28 - Plink binary
20121709 - 10:08:28 - Default SNP nearest neighbors distance metric: gm
20121709 - 10:08:28 - Default continuous distance metric: manhattan
20121709 - 10:08:28 - PlinkBinaryDataset loading
20121709 - 10:08:28 - Plink filename prefix for bim and bed files: GWAS_Statistics_plink
20121709 - 10:08:28 - Reading plink bim/attribute metadata from GWAS_Statistics_plink.bim
20121709 - 10:08:30 - There are 410969 attributes in the dataset
20121709 - 10:08:30 - Detecting class type from file: GWAS_Statistics_plink.fam
20121709 - 10:08:30 - Case-control phenotypes detected
20121709 - 10:08:30 - Reading plink binary fam file from GWAS_Statistics_plink.fam
20121709 - 10:08:30 - 1577 individuals read from the fam file.
20121709 - 10:08:32 - Reading plink attribute data from GWAS_Statistics_plink.bed
20121709 - 10:08:32 - Reading instance data in attribute-major mode
20121709 - 10:08:32 - Reading 395 bytes for each SNP column
20121709 - 10:09:08 - 10%
20121709 - 10:09:45 - 20%
20121709 - 10:10:22 - 30%
20121709 - 10:10:58 - 40%
20121709 - 10:11:35 - 50%
20121709 - 10:12:12 - 60%
20121709 - 10:12:48 - 70%
20121709 - 10:13:25 - 80%
20121709 - 10:14:01 - 90%
20121709 - 10:14:38 - 100%
20121709 - 10:14:38 - 100% decoded data set
20121709 - 10:14:38 - There are 1 instances in the data set
20121709 - 10:14:38 - There are 1 instances in the instance mask
20121709 - 10:14:38 - There are 1 classes in the data set
20121709 - 10:14:38 - Updating all level counts:
20121709 - 10:14:39 - 1/1 done
20121709 - 10:14:39 - Excluding monomorphic SNPs
20121709 - 10:14:39 - 0 SNPs excluded as monomorphic
20121709 - 10:14:39 - 1 instances remain after covariate/phenotype matching
20121709 - 10:14:39 - New SNP distance metric for nearest neighbors: gm
20121709 - 10:14:39 - New continuous distance metric: manhattan
20121709 - 10:14:39 - Dataset has:
20121709 - 10:14:39 - instances: 1
20121709 - 10:14:39 - SNPs: 410969
20121709 - 10:14:39 - classes: 1
20121709 - 10:14:39 - Data Set Class Index
20121709 - 10:14:39 - Index has [1] entries:
20121709 - 10:14:39 - 0: 1
20121709 - 10:14:39 - total elements: 410970
20121709 - 10:14:39 - 0 missing attribute values detected
20121709 - 10:14:39 - Total genotyping rate: 1
20121709 - 10:14:39 - 0 missing numeric values detected
20121709 - 10:14:39 - Running EC
20121709 - 10:14:39 - Evaporative Cooling initialization:
20121709 - 10:14:39 - EC is removing attributes until best 410969 remain
20121709 - 10:14:39 - Running EC in standard mode: Random Jungle + Relief-F
20121709 - 10:14:39 - EC will remove 0 attributes on first iteration
20121709 - 10:14:39 - 8 OpenMP processors available to EC
20121709 - 10:14:39 - EC will use 8 threads
20121709 - 10:14:39 - Initializing Random Jungle with Boost program options
20121709 - 10:14:39 - Using all 8 OpenMP processors available
20121709 - 10:14:39 - Initializing Relief-F
20121709 - 10:14:39 - Relief-F
20121709 - 10:14:39 - ReliefF initialization with boost command line parameters:
20121709 - 10:14:39 - Number of samples: m = 1
20121709 - 10:14:39 - Sampling all instances deterministically
20121709 - 10:14:39 - Number of nearest neighbors: k = 10
20121709 - 10:14:39 - ReliefF SNP weight update metric: gm
20121709 - 10:14:39 - ReliefF continuous distance weight update metric: manhattan
20121709 - 10:14:39 - Weight by distance method: equal
20121709 - 10:14:39 - 8 OpenMP processors available
20121709 - 10:14:39 - 1 OpenMP threads in work team
20121709 - 10:14:39 - -----------------------------------------------------------------------------
20121709 - 10:14:39 - EC algorithm...iteration: 1, working attributes: 410969, target attributes: 410969, temperature: 1
20121709 - 10:14:39 - Ti/Tv: transitions: 410969 transversions: 0 ratio: 410969
20121709 - 10:14:39 - Running Random Jungle
20121709 - 10:14:39 - Computing Random Jungle variable importance scores
20121709 - 10:14:39 - Running Random Jungle using C++ librjungle calls
20121709 - 10:14:40 - Preparing Random Jungle type 1
20121709 - 10:14:40 - Loading RJ DataFrame with double values, 1 rows and 410970 columns.
20121709 - 10:14:40 - 1/1
20121709 - 11:40:38 - Running Random Jungle
20121709 - 11:40:42 - Loading RJ variable importance (VI) scores from [phg000068_ec_out.importance]
20121709 - 11:40:43 - Read [410969] scores from [phg000068_ec_out.importance], min [0.000], max [0.000]
20121709 - 11:40:43 - WARNING: Random Jungle min and max scores are the same
20121709 - 11:40:43 - RJ classification accuracy: 0.000
20121709 - 11:40:43 - Random Jungle finished in 5179.810 secs
20121709 - 11:40:43 - Running ReliefF
20121709 - 11:40:43 - Precomputing instance distances
20121709 - 11:40:43 - Allocating distance matrix done
20121709 - 11:40:43 - 1) Computing instance-to-instance distances...
20121709 - 11:40:43 - 1/1 done
20121709 - 11:40:43 - 2) Calculating same and different class nearest neighbors...
20121709 - 11:40:43 - 1/1 done
20121709 - 11:40:43 - 3) Calculating weight by distance factors for nearest neighbors...
20121709 - 11:40:43 - Freeing distance matrix memory done
20121709 - 11:40:43 - Running Relief-F algorithm
20121709 - 11:40:43 - Averaging factor 1/(m*k): 0.100000
end-time Mon Sep 17 11:40:43 HKT 2012

Why? Thanks!

hexhead commented 12 years ago

It looks to me like EC is not detecting the phenotypes in the .fam file, or is filtering the instances (individuals) out for some other reason. What does your .fam file look like?

On Sun, Sep 16, 2012 at 11:42 PM, biosyssun notifications@github.comwrote:

I run "ec -s GWAS_Statistics_plink.bed -o phg000068_ec_out'. The erroe becaused of RAM was solved, but another issue happened.

ERROR: ERROR: GetNNearestInstances: N: [10] is larger than the number of neighbors in same class: 0 ERROR: relieff cannot get 10 nearest neighbors ERROR: RunReliefF: ComputeAttributeScores failed ERROR: In EC algorithm: ReliefF failed ERROR: Failed to calculate EC scores

OUTPUT: start-time Mon Sep 17 10:08:28 HKT 2012 20121709 - 10:08:28 - ec starting 20121709 - 10:08:28 - Processing command line arguments 20121709 - 10:08:28 - Determining analysis type 20121709 - 10:08:28 - SNP-only analysis requested 20121709 - 10:08:28 - Checking for numeric data and/or alternate phenotype files 20121709 - 10:08:28 - Determining the IDs to be read from the dataset 20121709 - 10:08:28 - IDs are not needed for this analysis 20121709 - 10:08:28 - 0 individual IDs read from numeric and/or phenotype file(s) 20121709 - 10:08:28 - Loading and preparing data for EC analysis 20121709 - 10:08:28 - Reading SNPs data set 20121709 - 10:08:28 - Dataset detection for SNP file [GWAS_Statistics_plink.bed] 20121709 - 10:08:28 - Plink binary 20121709 - 10:08:28 - Default SNP nearest neighbors distance metric: gm 20121709 - 10:08:28 - Default continuous distance metric: manhattan 20121709 - 10:08:28 - PlinkBinaryDataset loading 20121709 - 10:08:28 - Plink filename prefix for bim and bed files: GWAS_Statistics_plink 20121709 - 10:08:28 - Reading plink bim/attribute metadata from GWAS_Statistics_plink.bim 20121709 - 10:08:30 - There are 410969 attributes in the dataset 20121709 - 10:08:30 - Detecting class type from file: GWAS_Statistics_plink.fam 20121709 - 10:08:30 - Case-control phenotypes detected 20121709 - 10:08:30 - Reading plink binary fam file from GWAS_Statistics_plink.fam 20121709 - 10:08:30 - 1577 individuals read from the fam file. 
20121709 - 10:08:32 - Reading plink attribute data from GWAS_Statistics_plink.bed 20121709 - 10:08:32 - Reading instance data in attribute-major mode 20121709 - 10:08:32 - Reading 395 bytes for each SNP column 20121709 - 10:09:08 - 10% 20121709 - 10:09:45 - 20% 20121709 - 10:10:22 - 30% 20121709 - 10:10:58 - 40% 20121709 - 10:11:35 - 50% 20121709 - 10:12:12 - 60% 20121709 - 10:12:48 - 70% 20121709 - 10:13:25 - 80% 20121709 - 10:14:01 - 90% 20121709 - 10:14:38 - 100% 20121709 - 10:14:38 - 100% decoded data set 20121709 - 10:14:38 - There are 1 instances in the data set 20121709 - 10:14:38 - There are 1 instances in the instance mask 20121709 - 10:14:38 - There are 1 classes in the data set 20121709 - 10:14:38 - Updating all level counts: 20121709 - 10:14:39 - 1/1 done 20121709 - 10:14:39 - Excluding monomorphic SNPs 20121709 - 10:14:39 - 0 SNPs excluded as monomorphic 20121709 - 10:14:39 - 1 instances remain after covariate/phenotype matching 20121709 - 10:14:39 - New SNP distance metric for nearest neighbors: gm 20121709 - 10:14:39 - New continuous distance metric: manhattan 20121709 - 10:14:39 - Dataset has: 20121709 - 10:14:39 - instances: 1 20121709 - 10:14:39 - SNPs: 410969 20121709 - 10:14:39 - classes: 1 20121709 - 10:14:39 - Data Set Class Index 20121709 - 10:14:39 - Index has [1] entries: 20121709 - 10:14:39 - 0: 1 20121709 - 10:14:39 - total elements: 410970 20121709 - 10:14:39 - 0 missing attribute values detected 20121709 - 10:14:39 - Total genotyping rate: 1 20121709 - 10:14:39 - 0 missing numeric values detected 20121709 - 10:14:39 - Running EC 20121709 - 10:14:39 - Evaporative Cooling initialization: 20121709 - 10:14:39 - EC is removing attributes until best 410969 remain 20121709 - 10:14:39 - Running EC in standard mode: Random Jungle + Relief-F 20121709 - 10:14:39 - EC will remove 0 attributes on first iteration 20121709 - 10:14:39 - 8 OpenMP processors available to EC 20121709 - 10:14:39 - EC will use 8 threads 20121709 - 10:14:39 - Initializing 
Random Jungle with Boost program options 20121709 - 10:14:39 - Using all 8 OpenMP processors available 20121709 - 10:14:39 - Initializing Relief-F 20121709 - 10:14:39 - Relief-F 20121709 - 10:14:39 - ReliefF initialization with boost command line parameters: 20121709 - 10:14:39 - Number of samples: m = 1 20121709 - 10:14:39 - Sampling all instances deterministically 20121709 - 10:14:39 - Number of nearest neighbors: k = 10 20121709 - 10:14:39 - ReliefF SNP weight update metric: gm 20121709 - 10:14:39 - ReliefF continuous distance weight update metric: manhattan 20121709 - 10:14:39 - Weight by distance method: equal 20121709 - 10:14:39 - 8 OpenMP processors available 20121709 - 10:14:39 - 1 OpenMP threads in work team

20121709 - 10:14:39 -

20121709 - 10:14:39 - EC algorithm...iteration: 1, working attributes: 410969, target attributes: 410969, temperature: 1 20121709 - 10:14:39 - Ti/Tv: transitions: 410969 transversions: 0 ratio: 410969 20121709 - 10:14:39 - Running Random Jungle 20121709 - 10:14:39 - Computing Random Jungle variable importance scores 20121709 - 10:14:39 - Running Random Jungle using C++ librjungle calls 20121709 - 10:14:40 - Preparing Random Jungle type 1 20121709 - 10:14:40 - Loading RJ DataFrame with double values, 1 rows and 410970 columns. 20121709 - 10:14:40 - 1/1 20121709 - 11:40:38 - Running Random Jungle 20121709 - 11:40:42 - Loading RJ variable importance (VI) scores from [phg000068_ec_out.importance] 20121709 - 11:40:43 - Read [410969] scores from [phg000068_ec_out.importance], min [0.000], max [0.000] 20121709 - 11:40:43 - WARNING: Random Jungle min and max scores are the same 20121709 - 11:40:43 - RJ classification accuracy: 0.000 20121709 - 11:40:43 - Random Jungle finished in 5179.810 secs 20121709 - 11:40:43 - Running ReliefF 20121709 - 11:40:43 - Precomputing instance distances 20121709 - 11:40:43 - Allocating distance matrix done 20121709 - 11:40:43 - 1) Computing instance-to-instance distances... 20121709 - 11:40:43 - 1/1 done 20121709 - 11:40:43 - 2) Calculating same and different class nearest neighbors... 20121709 - 11:40:43 - 1/1 done 20121709 - 11:40:43 - 3) Calculating weight by distance factors for nearest neighbors... 20121709 - 11:40:43 - Freeing distance matrix memory done 20121709 - 11:40:43 - Running Relief-F algorithm 20121709 - 11:40:43 - Averaging factor 1/(m*k): 0.100000 end-time Mon Sep 17 11:40:43 HKT 2012


Bill C. White, MS Research Associate Programmer University of Tulsa Tandy School of Computer Science http://insilico.utulsa.edu/

hexhead commented 12 years ago

Does your data set contain cases and controls? Are they coded 1 and 2?
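One quick way to check this is to tally the phenotype column of the .fam file. The sketch below assumes the standard six-column PLINK .fam layout, with the phenotype in the sixth field and the default coding 1 = control, 2 = case. `CountPhenotypes` is a hypothetical helper for illustration, not part of EC.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not EC code): count how often each phenotype code
// appears in the sixth whitespace-delimited field of PLINK .fam lines.
std::map<std::string, int> CountPhenotypes(const std::vector<std::string>& famLines) {
  std::map<std::string, int> counts;
  for (const std::string& line : famLines) {
    std::istringstream fields(line);
    std::string token;
    std::vector<std::string> tokens;
    while (fields >> token) tokens.push_back(token);
    if (tokens.size() < 6) continue;  // skip blank or malformed lines
    ++counts[tokens[5]];              // sixth field holds the phenotype code
  }
  return counts;
}
```

A case-control data set should give non-zero counts for both "1" and "2".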


biosyssun commented 12 years ago

My .fam file is space-delimited, and the phenotypes are coded 1 and 2. Part of the .fam file:

0 1 0 0 2 2
0 3 0 0 2 2
0 4 0 0 2 2
0 5 0 0 2 2
0 6 0 0 1 2
0 7 0 0 2 2
0 8 0 0 1 2
0 9 0 0 1 2
0 10 0 0 1 2
0 11 0 0 1 2
0 12 0 0 2 2
0 13 0 0 2 2
0 14 0 0 2 2
0 15 0 0 2 2
0 16 0 0 2 2
0 17 0 0 2 2
0 18 0 0 1 2
0 19 0 0 2 2
0 20 0 0 1 2
0 21 0 0 2 2
0 22 0 0 1 2
0 23 0 0 2 2
0 24 0 0 2 2
0 25 0 0 2 2
0 26 0 0 1 2
0 2641 0 0 2 1
0 2642 0 0 2 1
0 2644 0 0 1 1
0 2645 0 0 2 1
0 2646 0 0 1 1
0 2647 0 0 2 1
0 2649 0 0 2 1
0 2650 0 0 1 1

hexhead commented 12 years ago

Can you try the attached test files, just to make sure your version of EC is working?


hexhead commented 12 years ago

Oh, I think I see the problem now. EC requires the FID and IID fields (the first two columns) to be unique. For example:

billwhite@isaac~/analysis/dataset_tests/discrete-discrete$ head testSnps.fam
WTCCC125636 WTCCC125636 0 0 0 1
WTCCC125637 WTCCC125637 0 0 0 1
WTCCC125638 WTCCC125638 0 0 0 1
WTCCC125639 WTCCC125639 0 0 0 1
WTCCC125640 WTCCC125640 0 0 0 1
WTCCC125641 WTCCC125641 0 0 0 1
WTCCC125642 WTCCC125642 0 0 0 1
WTCCC125648 WTCCC125648 0 0 0 1
WTCCC125649 WTCCC125649 0 0 0 1
WTCCC125650 WTCCC125650 0 0 0 1
billwhite@isaac~/analysis/dataset_tests/discrete-discrete$
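To see why the uniqueness requirement matters, count the distinct values in each ID column. In the .fam excerpt posted earlier every row has FID 0, so a lookup keyed on the first column would hold a single entry. That would be consistent with the "1 instances" reported in the log, though it is an inference rather than something confirmed from the EC source. A minimal sketch, where `CountDistinctInColumn` is a hypothetical helper and not EC code:

```cpp
#include <cstddef>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not EC code): number of distinct values found in one
// whitespace-delimited column (0-based) across a set of .fam lines.
std::size_t CountDistinctInColumn(const std::vector<std::string>& famLines,
                                  std::size_t column) {
  std::set<std::string> seen;
  for (const std::string& line : famLines) {
    std::istringstream fields(line);
    std::string token;
    std::size_t index = 0;
    while (fields >> token) {
      if (index == column) {
        seen.insert(token);  // duplicates collapse into one set entry
        break;
      }
      ++index;
    }
  }
  return seen.size();
}
```

If the count for column 0 (FID) is smaller than the number of lines while the count for column 1 (IID) equals it, only the IID is usable as a unique key.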


biosyssun commented 12 years ago

A unique FID is not always available. My dataset was downloaded from NCBI dbGaP, and it is formatted the same way it is stored there. So I think that requiring a unique FID is not a good idea.

hexhead commented 12 years ago

The quickest fix is for you to change line 494 in PlinkBinaryDataset.cpp and recompile. To read the IID instead of the FID, it should read:

string ID = tokens[1];

I am not sure of all the ramifications of this change on the whole EC system, so this is only a patch to get it to work for you.


hexhead commented 12 years ago

I agree with you, and I have implemented your suggestion. The library now reads the PLINK-style individual IDs (IID) column rather than family IDs (FID) column and expects it to be unique. This ID scheme is also used to link in numeric attributes (for quantitative traits, alone or in addition to SNPs) and alternative phenotype files (like PLINK). I will push these changes up to GitHub when I get my latest research, which has required some additions to the library, completed and validated. Hopefully, the last patch idea I sent worked for you and will continue to do so until I can release the next updates.
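As a rough illustration of what linking by IID means, the sketch below keeps only the individuals whose IID appears both in the .fam file and in a second file, such as an alternate phenotype file. `MatchIds` and its parameters are hypothetical names chosen for this example. It is not the library's actual matching code.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical helper (not the EC library's implementation): return the IIDs
// from the .fam file that also occur in another ID list, in .fam order.
std::vector<std::string> MatchIds(const std::vector<std::string>& famIids,
                                  const std::vector<std::string>& otherIids) {
  const std::set<std::string> other(otherIids.begin(), otherIids.end());
  std::vector<std::string> matched;
  for (const std::string& iid : famIids) {
    if (other.count(iid) > 0) matched.push_back(iid);
  }
  return matched;
}
```

This kind of matching only works when each IID identifies exactly one individual, which is why the column is expected to be unique.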


biosyssun commented 12 years ago

Thank you very much! I will try recompiling EC as you suggest. I hope EC and Encore keep improving, because the EC algorithm is a great idea for filtering millions of SNPs.

biosyssun commented 12 years ago

An issue with the structure of the source code: a directory named 'ec' must be created in /src/library, containing all the header files from /src/library, because the path '/src/library/ec' is needed when compiling EC.

hexhead commented 12 years ago

That should not be the case. All you need to do is follow the standard GNU autotools approach to building the library in the top-level project directory:

$ ./bootstrap.sh
$ ./configure
$ make
$ sudo make install

No other users on other systems have reported needing such manual intervention.


biosyssun commented 12 years ago

I am sorry that I did not report the issue clearly. It concerns compilation of the examples, which are compiled along with EC. Of course, the compilation errors for the examples can be ignored.

hexhead commented 12 years ago

Oh right. Yes, that is a known issue. Thanks for reminding me. If you 'make install' twice, it should work. The examples expect the header files in /usr/local/include/ec or whatever install PREFIX you used with 'configure'. The examples were a late add-on and never really got tested. I will fix this as soon as I can. Thanks again.


hexhead commented 12 years ago

This issue is fixed and will be in the next release.


biosyssun commented 12 years ago

Does the current version of EC support parallel computing across cluster nodes?

hexhead commented 12 years ago

No. EC only supports OpenMP with multiple cores on shared memory machines.
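For readers unfamiliar with the model, OpenMP parallelism is a loop split across threads that share the same memory within one process. The sketch below is illustrative only and is not taken from EC. It computes a Manhattan distance (the continuous metric named in the logs) with an OpenMP reduction, and `ManhattanDistance` is a hypothetical function name.

```cpp
#include <cmath>
#include <vector>

// Illustrative only (not EC code): sum of absolute differences between two
// vectors. With -fopenmp the loop iterations are divided among threads that
// share a and b. Without it the pragma is ignored and the loop runs serially.
double ManhattanDistance(const std::vector<double>& a, const std::vector<double>& b) {
  const long n = static_cast<long>(a.size() < b.size() ? a.size() : b.size());
  double sum = 0.0;
  #pragma omp parallel for reduction(+ : sum)
  for (long i = 0; i < n; ++i) {
    sum += std::fabs(a[i] - b[i]);  // each thread accumulates a private partial sum
  }
  return sum;
}
```

Because every thread reads the same in-memory vectors, this approach does not extend across cluster nodes without a separate message-passing layer.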


biosyssun commented 11 years ago

I have been running EC on my data (1577 individuals, 500K SNPs) for 120 hours on a computer with 8 processors (Intel(R) Xeon(R) CPU E5440 @ 2.83GHz) and 16 GB of RAM. CPU and RAM usage are 100% and 21%, respectively. How much longer will EC need to run? A function that reports EC's progress would make it easier to monitor the computation.

hexhead commented 11 years ago

You should see updates for every 100 instances/samples processed at various stages of the algorithm's progress, unless you are running as part of a job scheduling system, in which case you will need to find a way to access the stdout of the running job. We have run 4806 samples for approximately 360,000 SNPs on a 12-core system with 48 GB RAM, and it takes about 18 hours.
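The update pattern described here matches the "100/1577 ... 1577/1577 done" lines seen in the logs. A minimal sketch of that reporting scheme follows, assuming one message per `step` instances plus a final "done" message. `ProgressMessages` is a hypothetical helper and not EC's logging code.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not EC code): build the progress lines that would be
// printed while processing `total` instances, reporting every `step` of them.
std::vector<std::string> ProgressMessages(int total, int step) {
  std::vector<std::string> messages;
  if (total <= 0 || step <= 0) return messages;
  for (int i = 1; i <= total; ++i) {
    const std::string fraction = std::to_string(i) + "/" + std::to_string(total);
    if (i == total) {
      messages.push_back(fraction + " done");  // final instance always reports
    } else if (i % step == 0) {
      messages.push_back(fraction);            // periodic update
    }
  }
  return messages;
}
```

With a single instance this yields only "1/1 done", the line that appeared in the earlier failing run.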


biosyssun commented 11 years ago

What are the functions of the three files with the extensions .confusion, .confusion2 and .importance, which are all empty while EC is running?

hexhead commented 11 years ago

Those files are created by Random Jungle and get deleted after RJ finishes.

Can you send me the screen output of EC? It is very difficult to know what might be happening without more information. Are you running from the command line or through a job scheduler? Can you run a small test set? Troubleshooting on a data set this large is going to be difficult.


biosyssun commented 11 years ago

EC was running through a job scheduler. In order to check stdout of ec, I have to stop EC because users can only access the standard output after their jobs finish. I have tested EC using your example3 dataset and no error happened! start-time Wed Sep 19 12:22:19 HKT 2012 20121909 - 12:22:19 - ec starting 20121909 - 12:22:19 - Processing command line arguments 20121909 - 12:22:19 - Determining analysis type 20121909 - 12:22:19 - SNP-only analysis requested 20121909 - 12:22:19 - Checking for numeric data and/or alternate phenotype files 20121909 - 12:22:19 - Determining the IDs to be read from the dataset 20121909 - 12:22:19 - IDs are not needed for this analysis 20121909 - 12:22:19 - 0 individual IDs read from numeric and/or phenotype file(s) 20121909 - 12:22:19 - Loading and preparing data for EC analysis 20121909 - 12:22:19 - Reading SNPs data set 20121909 - 12:22:19 - Dataset detection for SNP file [GWAS_Statistics_plink.bed] 20121909 - 12:22:19 - Plink binary 20121909 - 12:22:19 - Default SNP nearest neighbors distance metric: gm 20121909 - 12:22:19 - Default continuous distance metric: manhattan 20121909 - 12:22:19 - PlinkBinaryDataset loading 20121909 - 12:22:19 - Plink filename prefix for bim and bed files: GWAS_Statistics_plink 20121909 - 12:22:19 - Reading plink bim/attribute metadata from GWAS_Statistics_plink.bim 20121909 - 12:22:21 - There are 410969 attributes in the dataset 20121909 - 12:22:21 - Detecting class type from file: GWAS_Statistics_plink.fam 20121909 - 12:22:21 - Case-control phenotypes detected 20121909 - 12:22:21 - Reading plink binary fam file from GWAS_Statistics_plink.fam 20121909 - 12:22:21 - 1577 individuals read from the fam file. 
20121909 - 12:22:23 - Reading plink attribute data from GWAS_Statistics_plink.bed 20121909 - 12:22:23 - Reading instance data in attribute-major mode 20121909 - 12:22:23 - Reading 395 bytes for each SNP column 20121909 - 12:22:58 - 10% 20121909 - 12:23:33 - 20% 20121909 - 12:24:08 - 30% 20121909 - 12:24:43 - 40% 20121909 - 12:25:19 - 50% 20121909 - 12:25:54 - 60% 20121909 - 12:26:29 - 70% 20121909 - 12:27:04 - 80% 20121909 - 12:27:39 - 90% 20121909 - 12:28:14 - 100% 20121909 - 12:28:14 - 100% decoded data set 20121909 - 12:28:15 - There are 1577 instances in the data set 20121909 - 12:28:15 - There are 1577 instances in the instance mask 20121909 - 12:28:15 - There are 2 classes in the data set 20121909 - 12:28:15 - Updating all level counts: 20121909 - 12:28:42 - 100/1577 20121909 - 12:29:09 - 200/1577 20121909 - 12:29:37 - 300/1577 20121909 - 12:30:06 - 400/1577 20121909 - 12:30:35 - 500/1577 20121909 - 12:31:04 - 600/1577 20121909 - 12:31:32 - 700/1577 20121909 - 12:32:01 - 800/1577 20121909 - 12:32:29 - 900/1577 20121909 - 12:32:58 - 1000/1577 20121909 - 12:33:26 - 1100/1577 20121909 - 12:33:55 - 1200/1577 20121909 - 12:34:23 - 1300/1577 20121909 - 12:34:52 - 1400/1577 20121909 - 12:35:20 - 1500/1577 20121909 - 12:35:42 - 1577/1577 done 20121909 - 12:35:42 - Excluding monomorphic SNPs 20121909 - 12:35:42 - 0 SNPs excluded as monomorphic 20121909 - 12:35:42 - 1577 instances remain after covariate/phenotype matching 20121909 - 12:35:42 - New SNP distance metric for nearest neighbors: gm 20121909 - 12:35:42 - New continuous distance metric: manhattan 20121909 - 12:35:42 - Dataset has: 20121909 - 12:35:42 - instances: 1577 20121909 - 12:35:42 - SNPs: 410969 20121909 - 12:35:42 - classes: 2 20121909 - 12:35:42 - Data Set Class Index 20121909 - 12:35:42 - Index has [2] entries: 20121909 - 12:35:42 - 0: 778 20121909 - 12:35:42 - 1: 799 20121909 - 12:35:42 - total elements: 648099690 20121909 - 12:35:42 - 0 missing attribute values detected 20121909 - 12:35:42 - Total 
genotyping rate: 1 20121909 - 12:35:42 - 0 missing numeric values detected 20121909 - 12:35:42 - Running EC 20121909 - 12:35:42 - Evaporative Cooling initialization: 20121909 - 12:35:42 - EC is removing attributes until best 410969 remain 20121909 - 12:35:42 - Running EC in standard mode: Random Jungle + Relief-F 20121909 - 12:35:42 - EC will remove 0 attributes on first iteration 20121909 - 12:35:42 - 8 OpenMP processors available to EC 20121909 - 12:35:42 - EC will use 8 threads 20121909 - 12:35:42 - Initializing Random Jungle with Boost program options 20121909 - 12:35:42 - Using all 8 OpenMP processors available 20121909 - 12:35:42 - Initializing Relief-F 20121909 - 12:35:42 - Relief-F 20121909 - 12:35:42 - ReliefF initialization with boost command line parameters: 20121909 - 12:35:42 - Number of samples: m = 1577 20121909 - 12:35:42 - Sampling all instances deterministically 20121909 - 12:35:42 - Number of nearest neighbors: k = 10 20121909 - 12:35:42 - ReliefF SNP weight update metric: gm 20121909 - 12:35:42 - ReliefF continuous distance weight update metric: manhattan 20121909 - 12:35:42 - Weight by distance method: equal 20121909 - 12:35:42 - 8 OpenMP processors available 20121909 - 12:35:42 - 1 OpenMP threads in work team 20121909 - 12:35:42 - ----------------------------------------------------------------------------- 20121909 - 12:35:42 - EC algorithm...iteration: 1, working attributes: 410969, target attributes: 410969, temperature: 1 20121909 - 12:35:42 - Ti/Tv: transitions: 410969 transversions: 0 ratio: 410969 20121909 - 12:35:42 - Running Random Jungle 20121909 - 12:35:42 - Computing Random Jungle variable importance scores 20121909 - 12:35:42 - Running Random Jungle using C++ librjungle calls 20121909 - 12:35:43 - Preparing Random Jungle type 1 20121909 - 12:35:43 - Loading RJ DataFrame with double values, 1577 rows and 410970 columns. 20121909 - 12:35:43 -

hexhead commented 11 years ago

Looks like EC ran for about 13 minutes and then stalled while loading data into the RJ library for running Random Jungle. The log should show:

20121909 - 12:35:43 - Loading RJ DataFrame with double values, 1577 rows and 410970 columns.
20121909 - 12:35:43 - 100/1577 200/1577 ....

and so on until the data is loaded, but for some reason it didn't even start loading the data. Not sure what's going on there. I have had some problems with Random Jungle on multicore Macs but not on Linux, which I assume you are using. Let me think about what might be going on. While I'm doing that, can you please try to schedule a small test data set to make sure it can get through your system successfully before we try the big data again? Thanks.

Bill
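As a back-of-the-envelope check while investigating, the raw size of a dense matrix of doubles with the dimensions in that log line can be estimated as below. This is only arithmetic on the reported dimensions. Whether librjungle actually allocates the DataFrame as one dense block is an assumption, and `DenseDoubleMatrixBytes` is a hypothetical helper.

```cpp
#include <cstdint>

// Hypothetical helper: bytes needed for a dense rows x columns matrix of
// doubles (8 bytes each on common platforms). Says nothing about how
// Random Jungle really stores its DataFrame.
std::uint64_t DenseDoubleMatrixBytes(std::uint64_t rows, std::uint64_t columns) {
  return rows * columns * sizeof(double);
}
```

For 1577 rows and 410970 columns this gives 648,099,690 elements, matching the "total elements" line in the log, and 5,184,797,520 bytes, or about 4.8 GiB, which can be compared against the memory the job is allowed to use.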

On Mon, Sep 24, 2012 at 9:14 PM, biosyssun notifications@github.com wrote:

EC was running through a job scheduler. In order to check stdout of ec, I have to stop EC because there is no stdout file unless errors happen. EC seems to do little things during 140 hours! The stdout is following: start-time Wed Sep 19 12:22:19 HKT 2012 20121909 - 12:22:19 - ec starting 20121909 - 12:22:19 - Processing command line arguments 20121909 - 12:22:19 - Determining analysis type 20121909 - 12:22:19 - SNP-only analysis requested 20121909 - 12:22:19 - Checking for numeric data and/or alternate phenotype files 20121909 - 12:22:19 - Determining the IDs to be read from the dataset 20121909 - 12:22:19 - IDs are not needed for this analysis 20121909 - 12:22:19 - 0 individual IDs read from numeric and/or phenotype file(s) 20121909 - 12:22:19 - Loading and preparing data for EC analysis 20121909 - 12:22:19 - Reading SNPs data set 20121909 - 12:22:19 - Dataset detection for SNP file [GWAS_Statistics_plink.bed] 20121909 - 12:22:19 - Plink binary 20121909 - 12:22:19 - Default SNP nearest neighbors distance metric: gm 20121909 - 12:22:19 - Default continuous distance metric: manhattan 20121909 - 12:22:19 - PlinkBinaryDataset loading 20121909 - 12:22:19 - Plink filename prefix for bim and bed files: GWAS_Statistics_plink 20121909 - 12:22:19 - Reading plink bim/attribute metadata from GWAS_Statistics_plink.bim 20121909 - 12:22:21 - There are 410969 attributes in the dataset 20121909 - 12:22:21 - Detecting class type from file: GWAS_Statistics_plink.fam 20121909 - 12:22:21 - Case-control phenotypes detected 20121909 - 12:22:21 - Reading plink binary fam file from GWAS_Statistics_plink.fam 20121909 - 12:22:21 - 1577 individuals read from the fam file. 
20121909 - 12:22:23 - Reading plink attribute data from GWAS_Statistics_plink.bed 20121909 - 12:22:23 - Reading instance data in attribute-major mode 20121909 - 12:22:23 - Reading 395 bytes for each SNP column 20121909 - 12:22:58 - 10% 20121909 - 12:23:33 - 20% 20121909 - 12:24:08 - 30% 20121909 - 12:24:43 - 40% 20121909 - 12:25:19 - 50% 20121909 - 12:25:54 - 60% 20121909 - 12:26:29 - 70% 20121909 - 12:27:04 - 80% 20121909 - 12:27:39 - 90% 20121909 - 12:28:14 - 100% 20121909 - 12:28:14 - 100% decoded data set 20121909 - 12:28:15 - There are 1577 instances in the data set 20121909 - 12:28:15 - There are 1577 instances in the instance mask 20121909 - 12:28:15 - There are 2 classes in the data set 20121909 - 12:28:15 - Updating all level counts: 20121909 - 12:28:42 - 100/1577 20121909 - 12:29:09 - 200/1577 20121909 - 12:29:37 - 300/1577 20121909 - 12:30:06 - 400/1577 20121909 - 12:30:35 - 500/1577 20121909 - 12:31:04 - 600/1577 20121909 - 12:31:32 - 700/1577 20121909 - 12:32:01 - 800/1577 20121909 - 12:32:29 - 900/1577 20121909 - 12:32:58 - 1000/1577 20121909 - 12:33:26 - 1100/1577 20121909 - 12:33:55 - 1200/1577 20121909 - 12:34:23 - 1300/1577 20121909 - 12:34:52 - 1400/1577 20121909 - 12:35:20 - 1500/1577 20121909 - 12:35:42 - 1577/1577 done 20121909 - 12:35:42 - Excluding monomorphic SNPs 20121909 - 12:35:42 - 0 SNPs excluded as monomorphic 20121909 - 12:35:42 - 1577 instances remain after covariate/phenotype matching 20121909 - 12:35:42 - New SNP distance metric for nearest neighbors: gm 20121909 - 12:35:42 - New continuous distance metric: manhattan 20121909 - 12:35:42 - Dataset has: 20121909 - 12:35:42 - instances: 1577 20121909 - 12:35:42 - SNPs: 410969 20121909 - 12:35:42 - classes: 2 20121909 - 12:35:42 - Data Set Class Index 20121909 - 12:35:42 - Index has [2] entries: 20121909 - 12:35:42 - 0: 778 20121909 - 12:35:42 - 1: 799 20121909 - 12:35:42 - total elements: 648099690 20121909 - 12:35:42 - 0 missing attribute values detected 20121909 - 12:35:42 - Total 
genotyping rate: 1 20121909 - 12:35:42 - 0 missing numeric values detected 20121909 - 12:35:42 - Running EC 20121909 - 12:35:42 - Evaporative Cooling initialization: 20121909 - 12:35:42 - EC is removing attributes until best 410969 remain 20121909 - 12:35:42 - Running EC in standard mode: Random Jungle + Relief-F 20121909 - 12:35:42 - EC will remove 0 attributes on first iteration 20121909 - 12:35:42 - 8 OpenMP processors available to EC 20121909 - 12:35:42 - EC will use 8 threads 20121909 - 12:35:42 - Initializing Random Jungle with Boost program options 20121909 - 12:35:42 - Using all 8 OpenMP processors available 20121909 - 12:35:42 - Initializing Relief-F 20121909 - 12:35:42 - Relief-F 20121909 - 12:35:42 - ReliefF initialization with boost command line parameters: 20121909 - 12:35:42 - Number of samples: m = 1577 20121909 - 12:35:42 - Sampling all instances deterministically 20121909 - 12:35:42 - Number of nearest neighbors: k = 10 20121909 - 12:35:42 - ReliefF SNP weight update metric: gm 20121909 - 12:35:42 - ReliefF continuous distance weight update metric: manhattan 20121909 - 12:35:42 - Weight by distance method: equal 20121909 - 12:35:42 - 8 OpenMP processors available 20121909 - 12:35:42 - 1 OpenMP threads in work team

20121909 - 12:35:42 -

20121909 - 12:35:42 - EC algorithm...iteration: 1, working attributes: 410969, target attributes: 410969, temperature: 1 20121909 - 12:35:42 - Ti/Tv: transitions: 410969 transversions: 0 ratio: 410969 20121909 - 12:35:42 - Running Random Jungle 20121909 - 12:35:42 - Computing Random Jungle variable importance scores 20121909 - 12:35:42 - Running Random Jungle using C++ librjungle calls 20121909 - 12:35:43 - Preparing Random Jungle type 1 20121909 - 12:35:43 - Loading RJ DataFrame with double values, 1577 rows and 410970 columns. 20121909 - 12:35:43 -


Bill C. White, MS Research Associate Programmer University of Tulsa Tandy School of Computer Science http://insilico.utulsa.edu/

biosyssun commented 11 years ago

The run on a subset of GWAS_Statistics_plink completed successfully. The subset has 1577 samples and 15929 SNPs.

start-time Tue Sep 25 11:12:12 HKT 2012 20122509 - 11:12:12 - ec starting 20122509 - 11:12:12 - Processing command line arguments 20122509 - 11:12:12 - Determining analysis type 20122509 - 11:12:12 - SNP-only analysis requested 20122509 - 11:12:12 - Checking for numeric data and/or alternate phenotype files 20122509 - 11:12:12 - Determining the IDs to be read from the dataset 20122509 - 11:12:12 - IDs are not needed for this analysis 20122509 - 11:12:12 - 0 individual IDs read from numeric and/or phenotype file(s) 20122509 - 11:12:12 - Loading and preparing data for EC analysis 20122509 - 11:12:12 - Reading SNPs data set 20122509 - 11:12:12 - Dataset detection for SNP file [phg000068_ribo_mito.bed] 20122509 - 11:12:12 - Plink binary 20122509 - 11:12:12 - Default SNP nearest neighbors distance metric: gm 20122509 - 11:12:12 - Default continuous distance metric: manhattan 20122509 - 11:12:12 - PlinkBinaryDataset loading 20122509 - 11:12:12 - Plink filename prefix for bim and bed files: phg000068_ribo_mito 20122509 - 11:12:12 - Reading plink bim/attribute metadata from phg000068_ribo_mito.bim 20122509 - 11:12:12 - There are 15929 attributes in the dataset 20122509 - 11:12:12 - Detecting class type from file: phg000068_ribo_mito.fam 20122509 - 11:12:12 - Case-control phenotypes detected 20122509 - 11:12:12 - Reading plink binary fam file from phg000068_ribo_mito.fam 20122509 - 11:12:12 - 1577 individuals read from the fam file. 
20122509 - 11:12:12 - Reading plink attribute data from phg000068_ribo_mito.bed 20122509 - 11:12:12 - Reading instance data in attribute-major mode 20122509 - 11:12:12 - Reading 395 bytes for each SNP column 20122509 - 11:12:14 - 9% 20122509 - 11:12:15 - 19% 20122509 - 11:12:16 - 29% 20122509 - 11:12:18 - 39% 20122509 - 11:12:19 - 49% 20122509 - 11:12:20 - 59% 20122509 - 11:12:22 - 69% 20122509 - 11:12:23 - 79% 20122509 - 11:12:25 - 90% 20122509 - 11:12:26 - 99% 20122509 - 11:12:26 - 100% decoded data set 20122509 - 11:12:26 - There are 1577 instances in the data set 20122509 - 11:12:26 - There are 1577 instances in the instance mask 20122509 - 11:12:26 - There are 2 classes in the data set 20122509 - 11:12:26 - Updating all level counts: 20122509 - 11:12:26 - 100/1577 20122509 - 11:12:27 - 200/1577 20122509 - 11:12:28 - 300/1577 20122509 - 11:12:29 - 400/1577 20122509 - 11:12:30 - 500/1577 20122509 - 11:12:30 - 600/1577 20122509 - 11:12:31 - 700/1577 20122509 - 11:12:32 - 800/1577 20122509 - 11:12:33 - 900/1577 20122509 - 11:12:34 - 1000/1577 20122509 - 11:12:34 - 1100/1577 20122509 - 11:12:35 - 1200/1577 20122509 - 11:12:36 - 1300/1577 20122509 - 11:12:37 - 1400/1577 20122509 - 11:12:38 - 1500/1577 20122509 - 11:12:38 - 1577/1577 done 20122509 - 11:12:38 - Excluding monomorphic SNPs 20122509 - 11:12:38 - 0 SNPs excluded as monomorphic 20122509 - 11:12:38 - 1577 instances remain after covariate/phenotype matching 20122509 - 11:12:38 - New SNP distance metric for nearest neighbors: gm 20122509 - 11:12:38 - New continuous distance metric: manhattan 20122509 - 11:12:38 - Dataset has: 20122509 - 11:12:38 - instances: 1577 20122509 - 11:12:38 - SNPs: 15929 20122509 - 11:12:38 - classes: 2 20122509 - 11:12:38 - Data Set Class Index 20122509 - 11:12:38 - Index has [2] entries: 20122509 - 11:12:38 - 0: 778 20122509 - 11:12:38 - 1: 799 20122509 - 11:12:38 - total elements: 25121610 20122509 - 11:12:38 - 0 missing attribute values detected 20122509 - 11:12:38 - Total 
genotyping rate: 1 20122509 - 11:12:38 - 0 missing numeric values detected 20122509 - 11:12:38 - Running EC 20122509 - 11:12:38 - Evaporative Cooling initialization: 20122509 - 11:12:38 - EC is removing attributes until best 15929 remain 20122509 - 11:12:38 - Running EC in standard mode: Random Jungle + Relief-F 20122509 - 11:12:38 - EC will remove 0 attributes on first iteration 20122509 - 11:12:38 - 8 OpenMP processors available to EC 20122509 - 11:12:38 - EC will use 8 threads 20122509 - 11:12:38 - Initializing Random Jungle with Boost program options 20122509 - 11:12:38 - Using all 8 OpenMP processors available 20122509 - 11:12:38 - Initializing Relief-F 20122509 - 11:12:38 - Relief-F 20122509 - 11:12:38 - ReliefF initialization with boost command line parameters: 20122509 - 11:12:38 - Number of samples: m = 1577 20122509 - 11:12:38 - Sampling all instances deterministically 20122509 - 11:12:38 - Number of nearest neighbors: k = 10 20122509 - 11:12:38 - ReliefF SNP weight update metric: gm 20122509 - 11:12:38 - ReliefF continuous distance weight update metric: manhattan 20122509 - 11:12:38 - Weight by distance method: equal 20122509 - 11:12:38 - 8 OpenMP processors available 20122509 - 11:12:38 - 1 OpenMP threads in work team 20122509 - 11:12:38 - ----------------------------------------------------------------------------- 20122509 - 11:12:38 - EC algorithm...iteration: 1, working attributes: 15929, target attributes: 15929, temperature: 1 20122509 - 11:12:38 - Ti/Tv: transitions: 15929 transversions: 0 ratio: 15929 20122509 - 11:12:38 - Running Random Jungle 20122509 - 11:12:38 - Computing Random Jungle variable importance scores 20122509 - 11:12:38 - Running Random Jungle using C++ librjungle calls 20122509 - 11:12:38 - Preparing Random Jungle type 1 20122509 - 11:12:38 - Loading RJ DataFrame with double values, 1577 rows and 15930 columns. 
20122509 - 11:12:38 - 100/1577 200/1577 300/1577 400/1577 500/1577 600/1577 700/1577 800/1577 900/1577 1000/1577 20122509 - 12:19:12 - 1100/1577 1200/1577 1300/1577 1400/1577 1500/1577 1577/1577 20122509 - 12:57:30 - Running Random Jungle 20122509 - 12:57:50 - Loading RJ variable importance (VI) scores from [phg000068_ribo_mito_ec.importance] 20122509 - 12:57:50 - Read [15929] scores from [phg000068_ribo_mito_ec.importance], min [0.000], max [0.268] 20122509 - 12:57:50 - RJ classification accuracy: 0.484 20122509 - 12:57:50 - Random Jungle finished in 6409.140 secs 20122509 - 12:57:50 - Running ReliefF 20122509 - 12:57:50 - Precomputing instance distances 20122509 - 12:57:50 - Allocating distance matrix done 20122509 - 12:57:50 - 1) Computing instance-to-instance distances... 20122509 - 12:58:04 - 100/1577 20122509 - 12:58:17 - 200/1577 20122509 - 12:58:29 - 300/1577 20122509 - 12:58:40 - 400/1577 20122509 - 12:58:50 - 500/1577 20122509 - 12:58:59 - 600/1577 20122509 - 12:59:07 - 700/1577 20122509 - 12:59:15 - 800/1577 20122509 - 12:59:21 - 900/1577 20122509 - 12:59:27 - 1000/1577 20122509 - 12:59:31 - 1100/1577 20122509 - 12:59:35 - 1200/1577 20122509 - 12:59:38 - 1300/1577 20122509 - 12:59:40 - 1400/1577 20122509 - 12:59:41 - 1500/1577 20122509 - 12:59:41 - 1577/1577 done 20122509 - 12:59:41 - 2) Calculating same and different class nearest neighbors... 20122509 - 12:59:41 - 100/1577 20122509 - 12:59:41 - 200/1577 20122509 - 12:59:41 - 300/1577 20122509 - 12:59:41 - 400/1577 20122509 - 12:59:41 - 500/1577 20122509 - 12:59:42 - 600/1577 20122509 - 12:59:42 - 700/1577 20122509 - 12:59:42 - 800/1577 20122509 - 12:59:42 - 900/1577 20122509 - 12:59:42 - 1000/1577 20122509 - 12:59:42 - 1100/1577 20122509 - 12:59:42 - 1200/1577 20122509 - 12:59:42 - 1300/1577 20122509 - 12:59:42 - 1400/1577 20122509 - 12:59:43 - 1500/1577 20122509 - 12:59:43 - 1577/1577 done 20122509 - 12:59:43 - 3) Calculating weight by distance factors for nearest neighbors... 
20122509 - 12:59:43 - Freeing distance matrix memory done 20122509 - 12:59:43 - Running Relief-F algorithm 20122509 - 12:59:43 - Averaging factor 1/(m*k): 0.000063 20122509 - 12:59:44 - 100/1577 20122509 - 12:59:45 - 200/1577 20122509 - 12:59:46 - 300/1577 20122509 - 12:59:48 - 400/1577 20122509 - 12:59:49 - 500/1577 20122509 - 12:59:50 - 600/1577 20122509 - 12:59:51 - 700/1577 20122509 - 12:59:53 - 800/1577 20122509 - 12:59:54 - 900/1577 20122509 - 12:59:55 - 1000/1577 20122509 - 12:59:56 - 1100/1577 20122509 - 12:59:58 - 1200/1577 20122509 - 12:59:59 - 1300/1577 20122509 - 13:00:00 - 1400/1577 20122509 - 13:00:01 - 1500/1577 20122509 - 13:00:02 - 1577/1577 done 20122509 - 13:00:02 - Normalizing ReliefF scores to 0-1 20122509 - 13:00:02 - ReliefF finished in 907.3 secs 20122509 - 13:00:02 - Computing free energy 20122509 - 13:00:02 - Free energy calculations complete in 0.0 secs 20122509 - 13:00:02 - Removing the worst attributes 20122509 - 13:00:02 - EC algorithm ran for 1 iterations 20122509 - 13:00:02 - EC done 20122509 - 13:00:02 - Writing EC scores to [phg000068_ribo_mito_ec.ec] 20122509 - 13:00:02 - Clean up and shutdown 20122509 - 13:00:02 - Removing temporary RandomJungle files 20122509 - 13:00:02 - EC elapsed time 7342.5 secs 20122509 - 13:00:02 - ec done end-time Tue Sep 25 13:00:02 HKT 2012

hexhead commented 11 years ago

That is encouraging. So it will run. I wonder if we are hitting the RAM limitation again?


biosyssun commented 11 years ago

If the failure is caused by limited RAM, a warning from ec would be useful. My machine has 16 GB of RAM. What are the time and space complexities of ec with respect to the number of samples and SNPs?

hexhead commented 11 years ago

EC is a combination of ReliefF and random forests, whose space and time complexities are explained elsewhere.

Random forests, Section II E of: https://lirias.kuleuven.be/bitstream/123456789/316661/1/

ReliefF, Section 2.4 of: http://lkm.fri.uni-lj.si/xaigor/slo/clanki/MLJ2003-FinalPaper.pdf


biosyssun commented 11 years ago

Do you have any idea why EC does not complete on the big dataset GWAS_Statistics_plink?

hexhead commented 11 years ago

I can only suggest ramping up the data set size until it fails. I think memory is being used up.


hexhead commented 11 years ago

You could try LD pruning (PLINK can do this) as a pre-processing step to reduce the number of SNPs and hence the memory requirement.
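As a rough sketch of that pre-processing step, the snippet below only builds and prints the two PLINK command lines; it does not run them. The flags `--bfile`, `--indep-pairwise`, `--extract`, `--make-bed` and `--out` are standard PLINK 1.07 options, but the window size, step and r^2 threshold (50, 5, 0.5) and the output prefix `GWAS_Statistics_pruned` are assumptions chosen for illustration, and the insilico PLINK fork is assumed to accept the same flags.

```python
import shlex


def ld_prune_commands(bfile, out_prefix, window=50, step=5, r2=0.5):
    """Return two PLINK argument lists: find the pruned SNP set, then extract it.

    window/step/r2 are illustrative defaults, not values from this thread.
    """
    # Step 1: writes <out_prefix>.prune.in and <out_prefix>.prune.out
    find_pruned = [
        "plink", "--bfile", bfile,
        "--indep-pairwise", str(window), str(step), str(r2),
        "--out", out_prefix,
    ]
    # Step 2: keep only the SNPs listed in .prune.in and write a new bed/bim/fam
    extract = [
        "plink", "--bfile", bfile,
        "--extract", out_prefix + ".prune.in",
        "--make-bed", "--out", out_prefix,
    ]
    return find_pruned, extract


# Hypothetical prefixes; adjust to your own file names.
for command in ld_prune_commands("GWAS_Statistics_plink", "GWAS_Statistics_pruned"):
    print(shlex.join(command))
```

The pruned `.bed` file would then be passed to ec in place of the original. A lower r^2 threshold removes more SNPs and so lowers the memory requirement further.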


biosyssun commented 11 years ago

Maybe the number of SNPs is what mostly determines the required RAM?

hexhead commented 11 years ago

Most of the RAM is taken by the data itself. When Random Jungle loads, it holds a second copy of the data set in memory. In your case you have:

1577 samples * 410969 SNPs * 8 bytes per SNP * 2 copies

1577 * 410969 * 8 * 2 = 10369569808 bytes

This is around 10 GB just for the data set. Other metadata (sample and SNP IDs, etc) takes even more.
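The same back-of-the-envelope estimate can be written as a small function, which makes it easy to try other data set sizes. This is only a sketch of the arithmetic above: the function name is made up for illustration, and the 8 bytes per value and 2 copies are the figures quoted in this comment, not values read from the EC source.

```python
def dataset_bytes(samples, snps, bytes_per_value=8, copies=2):
    """Approximate RAM taken by the genotype matrix alone (no metadata)."""
    return samples * snps * bytes_per_value * copies


full = dataset_bytes(1577, 410969)
print(full)  # 10369569808

# The subset that completed successfully, for comparison.
subset = dataset_bytes(1577, 15929)
print(f"full: {full / 1e9:.2f} GB, subset: {subset / 1e9:.2f} GB")
```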

ReliefF's memory use is quadratic in the number of samples: that is the space needed to store the all-samples-to-all-samples distance matrix used in the algorithmic loops. But this is small compared to the data set itself. There are reasons why we use 8 bytes per SNP that I won't go into here, but that's the situation. Add operating system and runtime overhead, and RAM can be pushed to the limit.

Still, I don't see your jobs exceeding 16 GB. We have run 4806 samples by 360,000 SNPs in less than 16 GB. I can't tell much from here. You need to find some way to monitor your job's RAM usage. I could try the data set myself if it is public or if you can share it with me somehow.
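One possible way to monitor a job's RAM usage is to launch it from a small wrapper and read the peak resident set size once it exits. The sketch below uses Python's standard `resource` and `subprocess` modules; it assumes Linux, where `ru_maxrss` is reported in kilobytes. The helper name is hypothetical, and the demo command is a stand-in, so replace it with your own ec invocation.

```python
import resource
import subprocess
import sys


def run_and_report_peak_rss(command):
    """Run command to completion; return (exit_code, peak_rss_kib).

    RUSAGE_CHILDREN covers every child this process has waited for, so the
    figure is the largest peak among them (kilobytes on Linux).
    """
    completed = subprocess.run(command)
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    return completed.returncode, usage.ru_maxrss


if __name__ == "__main__":
    # Stand-in job that allocates about 50 MB; substitute the real ec command.
    demo = [sys.executable, "-c", "buf = bytearray(50_000_000)"]
    code, peak_kib = run_and_report_peak_rss(demo)
    print(f"exit code {code}, peak RSS {peak_kib / 1024:.1f} MiB")
```

A negative exit code means the job was killed by a signal, which is what an out-of-memory kill usually looks like. GNU `/usr/bin/time -v` reports the same figure as "Maximum resident set size" without any wrapper script.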
