rajanil / fastStructure

A variational framework for inferring population structure from SNP genotype data.
MIT License
--cv failed #21

Open joqb opened 9 years ago

joqb commented 9 years ago

Hi there,

I'm trying fastStructure on a dataset with relatively few individuals (25) but a large number of loci (10000 SNPs from GBS), in .str format. When I ran it with --cv=5, thinking it would be equivalent to running repetitions in regular Structure, I only got FAILED {1,} printed to the screen while Structure kept running. When I tried the same with the test data it worked fine. Running on my data without --cv also works fine, but it is crazy fast with the simple prior (4 seconds, which leaves me wondering...) and much slower with the logistic prior (it didn't update the log file in an hour...).

Any suggestions?

Thanks, Nath

LaureneAlicia commented 8 years ago

Hi Nath,

I ran into the exact same problem as you while using fastStructure. I have a dataset with 48 individuals and 800 SNPs in a .str file. When I use the --cv option, I get a "Failed" message, and without it the run only takes 2-3 seconds. Did you ever find out what the issue was?

Thanks, Laurène

rajanil commented 8 years ago

Hi Laurene, the --cv option makes the software run slower (e.g., --cv=5 makes it run 5 times slower, since it runs 5-fold cross-validation and reports ancestry proportions aggregated over these 5 runs). However, I have not encountered the Failed error message before. Could you please copy-paste the error or provide a screenshot of it? If you could share the dataset so I can replicate and fix the error, that would be really helpful!

thanks!
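To illustrate why the runtime grows with the number of folds, here is a minimal Python 3 sketch of k-fold cross-validation over genotype entries. It is not fastStructure's own partitioning code: the function name `kfold_masks`, the random splitting scheme, and the use of 3 as the missing-genotype code are assumptions made for illustration (the code 3 is suggested by the snippet from fastStructure.pyx quoted later in this thread).

```python
import numpy as np

def kfold_masks(G, cv, missing=3, seed=0):
    """Split the observed genotype entries into `cv` disjoint held-out sets.

    Returns a list of (rows, cols) index-array pairs, one per fold.
    Hypothetical helper; not part of fastStructure.
    """
    rng = np.random.default_rng(seed)
    rows, cols = np.nonzero(G != missing)   # observed entries only
    order = rng.permutation(rows.size)      # shuffle before splitting
    folds = np.array_split(order, cv)       # near-equal fold sizes
    return [(rows[f], cols[f]) for f in folds]

# 3 individuals x 4 loci, one missing genotype (coded 3)
G = np.array([[0, 1, 2, 3],
              [1, 1, 0, 2],
              [2, 0, 1, 1]], dtype="int8")

masks = kfold_masks(G, cv=5)
# The model would be refit once per fold with that fold's entries hidden,
# so the total cost is roughly `cv` times that of a single run.
```

Each fold hides a different subset of the observed genotypes, and the held-out entries are then used to score the fit.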

LaureneAlicia commented 8 years ago

Hi Anil,

Thank you very much for your answer!

Since the software only takes 2 or 3 seconds to run on my dataset (48 individuals, 800 SNPs) for each K, it would be no problem if the --cv option made it run several times slower. My understanding is that this option sets the number of replicates for each K; correct me if I'm wrong. The runs produce the same results (same output files) with and without the --cv option, except that the last line of the .log file says "CV error = 0.2362436, 0.0097023" and the terminal shows several "Failed" messages:

python ./structure.py -K 2 --input=structure --output=output/test --cv=3 --full --format=str
Failed Failed Failed Failed Failed Failed Failed Failed Failed Failed Failed Failed Failed

Some people have reported the same problem before but I haven't seen any explanation or solution so far: https://groups.google.com/forum/#!search/faststructure$20cv/structure-software/cXyfoWXsOe4/Mix0Fo4nDAAJ

I carried out the analysis (without the --cv option) using chooseK.py and distruct.py, and the final plot gives meaningful results, which are nearly identical to the results I got from the classic Structure software. Running fastStructure is much faster (which is the whole point), but I would like to have replicates for each K (as in Structure), which would then be used by chooseK.py to choose K more reliably.

I attached my input file so that you can have a look at the issue (I had to zip it since GitHub wouldn't accept a file with a .str extension): structure.str.zip

Thank you very much for your time, Laurène

elinck commented 8 years ago

Hi Anil (and others),

I encountered the same error today using fastStructure v1.0 and the following command:

python /home/elinck/bin/fastStructure/structure.py -K 2 --input /home/elinck/atlapetes/atlapetes --output /home/elinck/atlapetes/atlapetes_output --format str --cv 3

My .str file is zipped and attached. Curious if you ever figured out what was causing the issue. Thanks in advance!

atlapetes.str.zip

atcg commented 8 years ago

I'm also getting these errors. It looks like they could be coming from lines 293-305 of fastStructure.pyx:

        # test to ensure that for all partitions, the loci are all variant
        newmasks = []
        for mask in masks:
            G = Gtrue.copy()
            Gmask = -1*np.ones((N,L), dtype='int8')
            Gmask[mask[0],mask[1]] = G[mask[0],mask[1]]
            G[mask[0],mask[1]] = 3
            if not (((G==1)+(G==2)).sum(0)==0).any():
                newmasks.append(mask)

        if not len(newmasks)>=cv:
            wellmasked = False
            print "Failed"

I do not have any invariant columns in my dataset, and I get the error even if I remove all tri-allelic sites from my input. I'm calling fastStructure as follows:

python fastStructure/structure.py -K 2 --input=inputFile --output=outputFile --cv=5 --format=str

I'm using Ubuntu 14.04.04 LTS, 64 bit.
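One possible reading of the check quoted above: a mask is accepted only if, after its held-out entries are hidden, every locus still has at least one genotype coded 1 or 2, and "Failed" is printed when fewer than cv masks pass. Under that reading, a locus whose few carriers all fall in the held-out set can cause a rejection even though the column is variant in the full data. The Python 3 sketch below mirrors the condition on a toy matrix; the function name `mask_is_usable` and the interpretation of 3 as the missing code are assumptions, and this is not a diagnosis confirmed by the author.

```python
import numpy as np

def mask_is_usable(Gtrue, mask, missing=3):
    """Mirror of the quoted condition: hide the held-out entries, then
    require every locus to keep at least one genotype coded 1 or 2.
    Hypothetical helper; not part of fastStructure."""
    G = Gtrue.copy()
    G[mask[0], mask[1]] = missing
    carriers_per_locus = ((G == 1) | (G == 2)).sum(axis=0)
    return not (carriers_per_locus == 0).any()

# 4 individuals x 3 loci; locus 2 has a single carrier (individual 0)
Gtrue = np.array([[1, 0, 1],
                  [0, 2, 0],
                  [2, 1, 0],
                  [0, 0, 0]], dtype="int8")

hides_singleton = (np.array([0]), np.array([2]))  # holds out the only carrier at locus 2
hides_other = (np.array([3]), np.array([0]))      # holds out a genotype-0 entry

print(mask_is_usable(Gtrue, hides_singleton))  # False
print(mask_is_usable(Gtrue, hides_other))      # True
```

If this reading is right, datasets with many rare alleles or heavy missingness would be the ones most likely to print "Failed" repeatedly.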

atcg commented 8 years ago

I can confirm that I no longer get these errors if I convert my data to plink .bed format and remove any sites with over 90% missing data, as well as any sites with allele frequencies greater than 99% or lower than 1% (that is, a minor allele frequency below 1%).
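For readers who want to apply the same kind of filter to a genotype matrix directly, here is a minimal Python 3 sketch. It is not plink's implementation: the function name `filter_loci` is hypothetical, and the coding (0/1/2 allele copies for diploid individuals, 3 for missing) is an assumption based on the snippet quoted earlier in the thread. The default thresholds follow the values given in the comment above.

```python
import numpy as np

def filter_loci(G, max_missing=0.9, min_maf=0.01, missing=3):
    """Return a boolean array marking loci (columns) that pass both filters.
    Hypothetical helper; assumes diploid genotypes coded 0/1/2."""
    observed = G != missing
    n_obs = observed.sum(axis=0)
    miss_rate = 1.0 - n_obs / G.shape[0]
    # allele copies summed over observed genotypes only
    alt_count = np.where(observed, G, 0).sum(axis=0)
    with np.errstate(divide="ignore", invalid="ignore"):
        freq = alt_count / (2.0 * n_obs)    # NaN when a locus has no data
    maf = np.minimum(freq, 1.0 - freq)
    return (miss_rate <= max_missing) & (n_obs > 0) & (maf >= min_maf)

# 4 individuals x 3 loci: locus 0 is invariant, locus 2 is 75% missing
G = np.array([[0, 1, 3],
              [0, 2, 3],
              [0, 0, 3],
              [0, 1, 1]], dtype="int8")

keep = filter_loci(G, max_missing=0.5)   # stricter threshold for the toy data
G_filtered = G[:, keep]                  # only locus 1 survives here
```

The example call uses a stricter missingness threshold than the 90% mentioned above only so that the toy matrix shows both filters in action.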

xiekunwhy commented 8 years ago

Hi @atcg, I got the same error when I used plink .bed format as input!

Failed Failed Failed Failed Failed Failed Failed Failed Failed Failed Failed Failed Failed Failed Failed

vmkalbskopf commented 3 years ago

Has this been addressed? I am running into the same error. I'm using a plink bed file as the input file.