kr-colab / mapNN

neural network for estimating demographic maps from SNPs
MIT License
9 stars 1 forks source link

Error "sorry, Dude. Didn't generate enough snps." #5

Open ruizeliot opened 1 month ago

ruizeliot commented 1 month ago

Hi! Thank you for this new program! Could you explain me why I get the error "sorry, Dude. Didn't generate enough snps." during preprocessing?

The error occurs using the function sample_ts with the following code: geno_mat, locs = training_generator.sample_ts(trees[i], args.seed). The function runs for about 1h and then generates this error message. Here is the command line I am using below: python mapnn.py --preprocess --gpu_index any --seed 123 --num_snps 100 --n 49 --map_width 649 --slim_width 649 --sample_grid 1 --tree_list ../ADLIFISH_eruiz/test1_mapNN_chr_t_top_100_snp/preprocessing_mapNN/sim_mapNN/chr_t_top_100_sim_123_tree_list.txt --target_list ../ADLIFISH_eruiz/test1_mapNN_chr_t_top_100_snp/preprocessing_mapNN/cookie_mapNN/chr_t_top_100_cookie_list.txt --habitat_map ../ADLIFISH_eruiz/test1_mapNN_chr_t_top_100_snp/data_mapNN/depth_map_wio_rgb_cropped.png --empirical ../ADLIFISH_eruiz/test1_mapNN_chr_t_top_100_snp/data_mapNN/Chr_t_unlinked_snp_subset_top_100_loci_dapc_fixed_mapnn --out ../ADLIFISH_eruiz/test1_mapNN_chr_t_top_100_snp/preprocessing_mapNN/preprocess_mapNN

The trees were generated with 2 SliM iterations (for the test as I have a big region): defineConstant("W", 649); defineConstant("L", 4); defineConstant("G", 1e8); defineConstant("FECUN", 1/L); defineConstant("maxgens", 2);

Thank you in advance, ER

chriscrsmith commented 1 month ago

Hi and thank you for posting the issue! Is this using one of the slim recipes from the mapnn repo? Did you run slim without mutations, recapitate with msprime, and add mutations to the tree sequence?

ruizeliot commented 1 month ago

Thank you for your quick answer. I used the benchmark.slim that I just slightly modified to changed the image width and the number of generations. What is the procedure to add mutations if needed as I am not familiar with slim and msprime?

benchmark_custom.zip

chriscrsmith commented 1 month ago

We recommend a workflow where you do relatively few generations in slim (expensive) and then "finish" the simulation backwards in time with a coalescent model—called "recapitation"—and then you can add mutations to the completed tree sequence. This workflow is described in general here: https://tskit.dev/pyslim/docs/latest/tutorial.html

The way I did this is using this function (mentioned briefly in the mapnn readme):

from process_input import recap
recap("tempout/sim_123.trees", "tempout/recap_123.trees", seed=123)

Did you use that recap function? I hope that addresses the problem, here.

ruizeliot commented 1 month ago

Hi again! Thanks for your answer!

I tried to run the recap function yesterday but it has been running since then while I have a small SNP dataset (loci number = 100), even though the geographic area is large (large part of the Western Indian Ocean). I checked in details and the function get stuck when running this row: self.Delta = self._create_incidence_matrix().

I also relaunched the SliM program with 50 generations (without the recap) but the preprocessing function of mapNN also fails: 2024-10-24 14:05:32.251140: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variableTF_ENABLE_ONEDNN_OPTS=0`. 2024-10-24 14:05:32.323373: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags. 2024-10-24 14:05:33.434702: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT starting pre-processing pipeline getting mean from training, on sim 0 getting sd from training, on sim 0

Traceback (most recent call last): File "/home/eruiz/mapNN/mapnn.py", line 1250, in preprocess() File "/home/eruiz/mapNN/mapnn.py", line 938, in preprocess geno_mat, locs = training_generator.sample_ts(trees[i], args.seed) File "/home/eruiz/mapNN/data_generation.py", line 334, in sample_ts new_genotypes = self.unpolarize(geno_mat0[shuffled_indices[s]]) IndexError: index 481 is out of bounds for axis 0 with size 481`

Thank for your help in advance!

chriscrsmith commented 1 month ago

On the one hand, it's normal for the slim step, and the recapitate to be slow, so the slow runtime by itself isn't necessarily a bug. But I'm confused about the detailed steps in your workflow. In particular you mentioned the SNP count in the file you are recapitating—that is different that the order I normally run things (slim -> recapitate -> add mutations).

You need to recapitate to complete the simulation before running the preprocess step with mapnn.

ruizeliot commented 1 month ago

Hi again! Here is the order I followed: vcf2genos (to get the .genos file) -> create_maps to get a training map -> cookie cutter on the training map -> slim -> preprocessing (this step is possibly preceded by the recap but too slow for me). I try to recapitate the SliM file...

chriscrsmith commented 3 weeks ago

Oops sorry I just noticed your comment about the line "self.Delta = self._create_incidence_matrix()". Will you please share (1) your recap() command, (2) the full error, and (3) and your tskit and msprime versions?

Or is there no error and you're saying it continues running? Can you tell what package that line is in (e.g. maybe msprime)? It might be the case that your simulation just takes a super long time to run.

Eliot-RUIZ commented 3 weeks ago

Hi! I wasn't very clear in my previous message, but the problem with create_incidence_matrix concerns the function SpatialGraph that I was trying to use without the recap step. It outputs the error message I pasted in my previous message, which originates from create_incidence_matrix.

Concerning the recap function, there is no error message but it just get stuck running (I stopped it after 4 days). Here is the code to launch the recap function (I have tskit v0.5.5 and msprime v1.2.0): recap("../ADLIFISH_eruiz/test1_mapNN_chr_t_top_100_snp/preprocessing_mapNN/sim_mapNN/chr_t_top_100_sim_123.trees", "../ADLIFISH_eruiz/test1_mapNN_chr_t_top_100_snp/preprocessing_mapNN/recap_mapNN/chr_t_top_100_recap_123.trees", seed=123).

Meanwhile, I could get the result I wanted with EEMS in about one day using one CPU (achieving satisfying convergence) so I think computation time is not the problem...

andrewkern commented 3 weeks ago

seems very strange that a coalescent simulation would take > 4 days. how large is your chromosome? what is the associated recombination rate?