kr-colab / diploSHIC

feature-based deep learning for the identification of selective sweeps
MIT License

ValueError: too many values to unpack (expected 2) #9

Closed oushujun closed 5 years ago

oushujun commented 6 years ago

CentOS 7, Python 3.6.6 :: Anaconda, Inc.

I was testing fvecSim using the mosquito data and found a bug:

python diploSHIC.py fvecSim diploid hard_0.msOut.gz test_hard_0.diploid.fvec --totalPhysLen 55000 --maskFileName Anopheles-gambiae-PEST_CHROMOSOMES_AgamP3.accessible.fa.gz --chrArmsForMasking 3R

Program output:

file name='hard_0.msOut.gz'
vcfForMaskFileName='None': not masking any genotypes!
reading masking data...reading Anopheles-gambiae-PEST_CHROMOSOMES_AgamP3.accessible.fa.gz
checked genotypes at 0 sites
Traceback (most recent call last):
  File "/opt/software/SHIC/diploSHIC/makeFeatureVecsForSingleMsDiploid.py", line 64, in <module>
    sampleToPopFileName=sampleToPopFileName)
ValueError: too many values to unpack (expected 2)
/opt/software/miniconda/4.4.10--GCC-4.9.4/bin/python /opt/software/SHIC/diploSHIC/makeFeatureVecsForSingleMsDiploid.py hard_0.msOut.gz 55000 11 Anopheles-gambiae-PEST_CHROMOSOMES_AgamP3.accessible.fa.gz None None None 0.75 3R 0.25 None test_hard_0.diploid.fvec

When removing the --maskFileName and --chrArmsForMasking parameters, it runs fine.

BTW, is the program designed to sample the provided mask file randomly (or sequentially?) to mimic true data?

Thanks, Shujun

andrewkern commented 6 years ago

this looks like a bug that was recently introduced to the code. hold on and we will fix it.

stsmall commented 5 years ago

Hi @andrewkern, @oushujun, I just cloned the repo and set it up. I am still getting the same error when I invoke --maskFileName and --chrArmsForMasking on the example data. Is there a workaround? Thanks, @stsmall

andrewkern commented 5 years ago

this was code that @dschride had changed recently but i don't believe he has pushed his patch to github. @dschride did you push the new masking version?

oushujun commented 5 years ago

I used the buggy version without supplying any masking for the simulated data. The genome I used is very good and contains very few Ns and physical gaps, so I figured that not masking would not be too big a problem.

Shujun


dschride commented 5 years ago

Shujun, please pull the latest version and try again. The cause of this issue is that diploid mode did not support masking sites without also masking genotypes, but I have added this functionality. Try running again in the same manner and let me know if this issue is resolved.

stsmall commented 5 years ago

Thank you @dschride! After this update the example data finishes without error. My data using diploid and a mask file also finished without error.

dschride commented 5 years ago

No problem!

lokeyCEU commented 5 years ago

I am having a similar problem. I ran the following:

python diploSHIC.py fvecSim diploid sims/TEST.ms sims/TEST.fvec --totalPhysLen 110000

Got this error:

/anaconda3/bin/python makeFeatureVecsForSingleMsDiploid.py sims/TEST.ms 110000 11 None None None None 0.75 all 0.25 None sims/TEST.fvec
file name='sims/TEST.ms'
Traceback (most recent call last):
  File "makeFeatureVecsForSingleMsDiploid.py", line 17, in <module>
    trainingDataFileObj, sampleSize, numInstances = openMsOutFileForSequentialReading(trainingDataFileName)
  File "/diploSHIC/msTools.py", line 150, in openMsOutFileForSequentialReading
    program, numSamples, numSims = header.strip().split()[:3]
ValueError: not enough values to unpack (expected 3, got 1)

But I don't see anything wrong with my header, do you? Here is the .ms file in question: TEST.ms.txt

dschride commented 5 years ago

In the file you have attached I don't see a header line, or the random seed line which would typically appear right below it in ms-style output. For example, if I run the following command using ms:

ms 10 1 -t 1

My output will look something like this:

ms 10 1 -t 1 
11048 49753 20103

//
segsites: 3
positions: 0.0447 0.0800 0.2977 
111
000
110
110
110
000
110
110
000
110

However, your file starts with:

//
segsites: 8721
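
To see why a file that begins with // produces this particular error, here is a minimal sketch that reuses the unpacking statement quoted in the traceback above (msTools.py, line 150). The function name parse_ms_header and the conversion to int are assumptions made for illustration; this is not diploSHIC's actual function.

```python
def parse_ms_header(header):
    """Split an ms-style header line into program, sample size, and replicate count.

    Hypothetical helper: only the unpacking statement mirrors the line
    quoted in the traceback; the int() conversion is an assumption.
    """
    program, num_samples, num_sims = header.strip().split()[:3]
    return program, int(num_samples), int(num_sims)


# A normal ms header has at least three whitespace-separated fields.
print(parse_ms_header("ms 10 1 -t 1"))

# A file whose first line is "//" yields a single field, so the
# three-way unpacking fails.
try:
    parse_ms_header("//")
except ValueError as err:
    print("ValueError:", err)
```

Because the first line of the attached file is just //, splitting it gives one field instead of three, which matches the "expected 3, got 1" part of the message.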

Some of the information in the header line is needed by diploSHIC (the sample size for each simulation and the number of simulated replicates). Other information (the path to the simulation program, any additional command line arguments, and the entire random seed line) is not explicitly read by diploSHIC, but it does expect these to be present for proper parsing. If I modify the beginning of your simulation output file to the following, then the fvecSim command runs properly:

blah 100 1
blah

//
segsites: 8721
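
For files with many replicates, the same manual edit could be scripted. The sketch below is only an illustration under stated assumptions: the helper names infer_counts and add_placeholder_header are hypothetical, the sample size is inferred by counting the haplotype lines of the first replicate, and the first replicate is assumed to contain a positions: line. The placeholder word "blah" is taken from the example above.

```python
def infer_counts(ms_text):
    """Return (num_samples, num_reps) for headerless ms-style text.

    Hypothetical helper: counts '//' lines as replicates and the non-blank
    lines after the first 'positions:' line as samples.
    """
    num_reps = 0
    num_samples = 0
    in_first_rep = False
    for line in ms_text.splitlines():
        stripped = line.strip()
        if stripped == "//":
            num_reps += 1
            in_first_rep = False
        elif stripped.startswith("positions:") and num_reps == 1:
            in_first_rep = True
        elif in_first_rep:
            if stripped:
                num_samples += 1
            else:
                in_first_rep = False
    return num_samples, num_reps


def add_placeholder_header(ms_text):
    """Prepend a placeholder header and seed line if the text starts with '//'."""
    body = ms_text.lstrip()
    if not body.startswith("//"):
        return ms_text  # a header line appears to be present already
    num_samples, num_reps = infer_counts(body)
    return f"blah {num_samples} {num_reps}\nblah\n\n{body}"


example = "//\nsegsites: 2\npositions: 0.1 0.5\n01\n10\n11\n\n//\nsegsites: 1\npositions: 0.3\n1\n0\n1\n"
print(add_placeholder_header(example))
```

If the first replicate has no segregating sites, or the inferred values look wrong for any other reason, it is safer to write the sample size and replicate count into the header by hand.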

lokeyCEU commented 5 years ago

Fixed it, thanks.