kr-colab / diploSHIC

feature-based deep learning for the identification of selective sweeps
MIT License

nan reported for prob #13

Closed stsmall closed 5 years ago

stsmall commented 5 years ago

Hi @andrewkern, I ran through the test example and everything worked as expected, so the program itself is working correctly. However, when I run the 'predict' step on my own data I get 'nan' for all of my probs. I did not get errors anywhere else in the pipeline.

3R 97501 102500 75001-130000 hard nan nan nan nan nan
3R 102501 107500 80001-135000 hard nan nan nan nan nan
3R 107501 112500 85001-140000 hard nan nan nan nan nan

I realize that I did not give you much information; what would be helpful? I can email the json and hdf5 files if that would help. Thank you! @stsmall

stsmall commented 5 years ago

I tested my model on your example data and there were no 'nan' values. Of course the classification is improper, but it seems to behave. I then tested your model on my data and the 'nan' values appeared again. Hmm, not sure what is going on, but I will try to rebuild the fvec from the vcf and predict again.

stsmall commented 5 years ago

I recalculated the stats using fvecVcf for a short segment and it works! No idea what was wrong the first time, as the fvecVcf output looks the same.

andrewkern commented 5 years ago

awesome. glad it's working

stsmall commented 5 years ago

Hi @andrewkern, the 'nan' error was fixed when I only allowed sites with no missing data. Is this expected behavior that I just overlooked in the documentation, or is my error likely related to something else? Thanks!

stsmall commented 5 years ago

Hi @andrewkern, @dschride, I thought that removing the missing data had fixed the issue with the nan probs, but I was mistaken: I was sampling from the file while it was still being built and running my tests on that. The odd behavior is that if I subsample the fvecVcf file, e.g., head -n100, it works as expected and returns probs, but if I run on the entire chromosome arm file it returns 'nan' for all probs. Maybe there is a problematic line in the fvecVcf output? Can I send you my fvecVcf file via email? Thank you, @stsmall
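One way to look for a problematic line is to scan the feature-vector file for values that are not finite numbers. The following is a minimal sketch, not part of diploSHIC. It assumes the file is whitespace-delimited text with a single header line and that the first four columns identify the window. The function name `split_nonfinite_rows`, the parameter `n_id_cols`, and the column names and values in the example are all made up for illustration.

```python
import math


def split_nonfinite_rows(lines, n_id_cols=4):
    """Split rows into (header, clean_rows, flagged_rows).

    Assumptions (not taken from diploSHIC itself): the first line is a
    header, fields are whitespace-separated, and the first `n_id_cols`
    columns identify the window rather than hold statistics.
    """
    header, clean, flagged = lines[0], [], []
    for line in lines[1:]:
        values = line.split()[n_id_cols:]
        bad = False
        for value in values:
            try:
                # float() accepts "nan" and "inf"; isfinite() rejects both.
                if not math.isfinite(float(value)):
                    bad = True
            except ValueError:
                # A field that is not a number at all is also flagged.
                bad = True
        (flagged if bad else clean).append(line)
    return header, clean, flagged


# Hypothetical rows, invented for illustration only.
example = [
    "chrom start end bigWinRange stat1 stat2",
    "3R 1 5000 1-55000 0.10 0.20",
    "3R 5001 10000 5001-60000 nan 0.30",
    "3R 10001 15000 10001-65000 0.40 inf",
]
header, clean, flagged = split_nonfinite_rows(example)
for row in flagged:
    print("non-finite value in:", row)
```

Writing `header` plus `clean` back out to a new file would give an input without the flagged windows, while `flagged` records which windows were set aside.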

dschride commented 5 years ago

Sure, email me your output.

dschride commented 5 years ago

Okay, so you have a few nans in there. If you remove the following lines I bet it would work:

3R 24627501 24632500
3R 24632501 24637500
3R 24637501 24642500

It seems that in these lines most of our stats based on diploid genotype strings (i.e. our "diplotypes" in the paper) are nan, which I don't think should happen unless there is a fairly small number of polymorphisms in those windows, in which case you wouldn't be losing anything informative by throwing them out anyway. But you may wish to verify this before proceeding.
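To verify the polymorphism count, one option is to count the VCF records that fall inside each of those windows. This is a rough sketch under stated assumptions, not diploSHIC code. It assumes a plain-text, tab-delimited VCF with 1-based positions, and it counts every record in a window without checking whether the site is actually variable in the sample. The function name `count_sites_per_window` and the toy VCF records are invented for illustration; only the three window coordinates come from this thread.

```python
def count_sites_per_window(vcf_lines, windows):
    """Count VCF records inside each (chrom, start, end) window.

    Window bounds are treated as 1-based and inclusive. Header and
    blank lines are skipped. This counts records, not confirmed
    polymorphisms, so treat the result as an upper bound.
    """
    counts = {window: 0 for window in windows}
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        # CHROM and POS are the first two tab-separated VCF columns.
        chrom, pos = line.split("\t", 2)[:2]
        pos = int(pos)
        for window in windows:
            w_chrom, start, end = window
            if chrom == w_chrom and start <= pos <= end:
                counts[window] += 1
    return counts


# The three windows listed above.
windows = [
    ("3R", 24627501, 24632500),
    ("3R", 24632501, 24637500),
    ("3R", 24637501, 24642500),
]

# Toy VCF records, invented for illustration only.
toy_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1",
    "3R\t24627600\t.\tA\tG\t.\tPASS\t.\tGT\t0/1",
    "3R\t24630000\t.\tC\tT\t.\tPASS\t.\tGT\t1/1",
    "3R\t24632501\t.\tG\tA\t.\tPASS\t.\tGT\t0/1",
    "2L\t24627600\t.\tT\tC\t.\tPASS\t.\tGT\t0/1",
]

counts = count_sites_per_window(toy_vcf, windows)
for window, n_sites in counts.items():
    print(window, n_sites)
```

The predict output earlier in the thread shows a larger range next to each classified window (for example 75001-130000), so the same function can be pointed at those wider ranges if the whole span is of interest.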

stsmall commented 5 years ago

yep, that was it. Thanks @dschride !!

One quick note: it seems that linkedSoft is spelled likedSoft in the output.