Closed gaow closed 5 years ago
@jma7 sorry I was having #118 in mind when I submitted this ticket thinking of genotype related issues. But maybe I was mistaken : --samples RACE=1
will lead to what seems very simple operation on genotype database, that is, to compute counts for variants in subset of data. It sounds simpler than #118 and probabily there is a simpler fix -- is that the case?
Checking your VAT example now. Note that we are deprecating sourceforge and the latest documentation is at https://vatlab.github.io/vat-docs/.
@BoPeng thanks! Sorry just to clarify -- you are checking it with version 3.0.x and see if there is a quick fix? Because I've built the docker image for version 2.7.2 and i am ready to switch to that version to put up on the Juptyer Hub. But it would be nice if there might be a fix for 3.0.x for that feature. I'm holding off and keeping the jupyter Hub running 3.0.x for you to check out.
I am working through the bug list but I do not think genotype query could be fixed any time soon. In the meantime you should be able to use 2.7.x.
I see -- then I'll go reset that jupyterhub server to using version 2.7.x. It means it will be shutdown for a few minutes then up with 2.7.x. But the data to reproduce the problem with 3.0.x is the same.
@BoPeng i confirm version 2.7.2 works for my tutorial (see beginning of the thread) with the only issue that vtools liftover hg19 --flip
command takes very long time (compared to version 3.0.x). But everything else to be working so far.
How long is "very" long? It is possible that you were downloading liftover tools and related chain file the first time, and you could use it directly the second time around.
How long is "very" long?
About 2 min for the small test file. I see the CPU usage is very low and there is the journal
file constantly updated for the variant database -- that might be the reason.
It is possible that you were downloading liftover tools and related chain file
I already have them in cache so it wont be the case. But we can worry about this later when we fix up the 3.0.x version and revisit that example.
vtools phenotype --from_stat 'CEU_totalGD10=#(GT)' 'CEU_numGD10=#(alt)' --genotypes 'DP_geno>10' --samples "RACE=1"
gives the same error
Unable to open/create file 'tmp_1_90_genotypes.h5'
HDF5 error back trace
File "H5F.c", line 509, in H5Fopen
unable to open file
File "H5Fint.c", line 1567, in H5F_open
unable to lock the file
File "H5FD.c", line 1640, in H5FD_lock
driver lock request failed
File "H5FDsec2.c", line 959, in H5FD_sec2_lock
unable to lock file, errno = 35, error message = 'Resource temporarily unavailable'
End of HDF5 error back trace
This might not be the problem of the command as the file could be left locked from some previous command.
Well, this is a different error, though. I didn't run into it in my test (the error reported in the first post of this ticket was the only error I had) but maybe it was because I did this in my docker:
# https://community.paperspace.com/t/storage-and-h5py-pytables-e-g-keras-save-weights-issues-heres-why-and-how-to-solve-it/430
# HDF5 locking issues
ENV HDF5_USE_FILE_LOCKING FALSE
which is what I copied from some of @jma7 's recipe. But I guess we need to somehow code it into vtools rather than relying on env.
That post talks about NFS mount but I am running vtools locally... I will look into it.
The command works if I use 1 worker (-j1
), so this is a problem with multi-processing. From what I read online (e.g. https://github.com/PyTables/PyTables/blob/master/examples/multiprocess_access_queues.py) , hdf5 files cannot be opened by multiple processes. Is this true?
Ref:
a
mode. It might help if open in r
or r+
mode if multi-processing is allowed in readonly mode. https://github.com/vatlab/varianttools/blob/master/src/variant_tools/accessor.py#L101@jma7 Please clarify.
According to my experiment, the command
vtools remove genotypes "DP_geno<10"
corrupted the genotype database and caused the following commands to fail. Commenting out this line of code avoids the file corruption.
I could run through the notebook so I don't really have problem with this command. How did you resolve the other error? Are there some version issues with the HDF5 library?
Could you try the notebook on another environment, such as a Linux vm?
To answer the questions:
a
mode is not allowed.The patch opens hdf5 files in read_only mode for calculating genotype statistics, and fixes the problem of running
vtools phenotype --from_stat 'CEU_totalGD10=#(GT)' 'CEU_numGD10=#(alt)' --genotypes 'DP_geno>10' --samples "RACE=1"
in multiprocessing mode.
The remove genotype error happened because of multi-processing was used to process each h5 file, which is ok, but the way the processes are started are wrong because the store is opened in the master process and passed to subprocesses...
The patch fixes this by creating store handlers in each subprocess.
@gaow I cannot run through the notebook here because my version of KING and plink might be wrong, but I have released variant tools 3.0.6 so that you can check if it works on your side.
I noticed that you updated KING.pipeline, let me know if I should upload it to the repository.
Thanks @BoPeng just wrapped up other stuff and found a moment to check it out. It seems installation methods have changed -- the old docker recipe no longer works and conda release is not there yet. I have used apt-get install -y libboost-dev libgsl-dev
before i install the package but it did not work. I can of course dig further but if there are updated instructions (for Linux) or a conda release out there that'd be a lot better. Currently bioconda does not work:
PackagesNotFoundError: The following packages are not available from current channels:
- variant_tools
Current channels:
- https://conda.anaconda.org/bioconda/linux-64
- https://conda.anaconda.org/bioconda/noarch
- https://conda.anaconda.org/conda-forge/linux-64
- https://conda.anaconda.org/conda-forge/noarch
- https://repo.anaconda.com/pkgs/main/linux-64
- https://repo.anaconda.com/pkgs/main/noarch
- https://repo.anaconda.com/pkgs/r/linux-64
- https://repo.anaconda.com/pkgs/r/noarch
To search for alternate channels that may provide the conda package you're
looking for, navigate to
https://anaconda.org
and use the search bar at the top of the page.
I noticed that you updated KING.pipeline, let me know if I should upload it to the repository.
It worked for version 2.7.2 after I updated. I'll let you know if it works on 3.0.6 before we include it. But it looks like you've got version 3.0.6 working for my notebook example, that's great! Have you noticed the slowness in vtools liftover --flip
or it is just on my end?
I am still working on a bioconda recipe. Will let you know when it is merged.
@jma7 @gaow variant tools is now available at bioconda so the recommended installation command would be
conda install variant_tools -c bioconda
Thank you @BoPeng . Shall we update KING.pipeline
as well? I'm getting a warning for that file being out of date . Also, did you notice the slowness with vtools liftover --flip
when you were testing out the commands from the notebook? I'm now running it on my desktop (with a slow hard drive); it took 15 minutes for that command to finish.
I am on a SSD drive and did not notice any problem with vtools liftover --flip
. Let me check.
Thank you @BoPeng . Yes on cloud server it was not an issue. On the SoS live server it is a regular disk (but a relatively new disk) it takes 2 to 3 min. On my desktop with an old disk it is 15 minutes. The reason I think it is worth noticing is that when fields are updated from text annotations, the performance is consistently good. I take that liftover is like an annotation on *.proj
so I dont understand why it is a lot slower.
Has the original issue in this ticket been resolved? Maybe it is not relevant to #118? at least using version 3.0.9 I was able to run the notebook in the original ticket without a problem. If so, we can close the issue.
I agree.
Separated from https://github.com/vatlab/varianttools/issues/80#issuecomment-539294504
Here is a an example running 3.0.3 release:
http://128.135.144.117/vat.html
The particular command that failed is under section 2.3:
vtools select variant "CEU_mafGD10>=0.05" --samples "RACE=1" -t common_ceu
Additionally, The line
INFO: NumExpr defaulting to 1 threads.
keeps showing up -- shall we move it to DEBUG?