vatlab / varianttools

software tool for the manipulation, annotation, selection, and analysis of variants in the context of next-gen sequencing analysis
https://vatlab.github.io/vat-docs/
GNU General Public License v3.0
31 stars 4 forks source link

`vtools select` doesnt work when selecting variants for given samples #120

Closed gaow closed 5 years ago

gaow commented 5 years ago

Separated from https://github.com/vatlab/varianttools/issues/80#issuecomment-539294504

Here is a an example running 3.0.3 release:

http://128.135.144.117/vat.html

The particular command that failed is under section 2.3:

INFO: NumExpr defaulting to 1 threads.
HDF5 error back trace

  File "H5Dio.c", line 199, in H5Dread
    can't read data
  File "H5Dio.c", line 601, in H5D__read
    can't read data
  File "H5Dchunk.c", line 2259, in H5D__chunk_read
    unable to read raw data chunk
  File "H5Dchunk.c", line 3610, in H5D__chunk_lock
    unable to read raw data chunk
  File "H5Fio.c", line 118, in H5F_block_read
    read through page buffer failed
  File "H5PB.c", line 732, in H5PB_read
    read through metadata accumulator failed
  File "H5Faccum.c", line 260, in H5F__accum_read
    driver read request failed
  File "H5FDint.c", line 200, in H5FD_read
    addr overflow, addr = 1839271, size = 19228, eoa = 1510399

End of HDF5 error back trace

Problems reading the array data.
cannot unpack non-iterable NoneType object
HDF5 error back trace ...

Additionally, The lineINFO: NumExpr defaulting to 1 threads. keeps showing up -- shall we move it to DEBUG?

gaow commented 5 years ago

@jma7 sorry I was having #118 in mind when I submitted this ticket thinking of genotype related issues. But maybe I was mistaken : --samples RACE=1 will lead to what seems very simple operation on genotype database, that is, to compute counts for variants in subset of data. It sounds simpler than #118 and probabily there is a simpler fix -- is that the case?

BoPeng commented 5 years ago

Checking your VAT example now. Note that we are deprecating sourceforge and the latest documentation is at https://vatlab.github.io/vat-docs/.

gaow commented 5 years ago

@BoPeng thanks! Sorry just to clarify -- you are checking it with version 3.0.x and see if there is a quick fix? Because I've built the docker image for version 2.7.2 and i am ready to switch to that version to put up on the Juptyer Hub. But it would be nice if there might be a fix for 3.0.x for that feature. I'm holding off and keeping the jupyter Hub running 3.0.x for you to check out.

BoPeng commented 5 years ago

I am working through the bug list but I do not think genotype query could be fixed any time soon. In the meantime you should be able to use 2.7.x.

gaow commented 5 years ago

I see -- then I'll go reset that jupyterhub server to using version 2.7.x. It means it will be shutdown for a few minutes then up with 2.7.x. But the data to reproduce the problem with 3.0.x is the same.

gaow commented 5 years ago

@BoPeng i confirm version 2.7.2 works for my tutorial (see beginning of the thread) with the only issue that vtools liftover hg19 --flip command takes very long time (compared to version 3.0.x). But everything else to be working so far.

BoPeng commented 5 years ago

How long is "very" long? It is possible that you were downloading liftover tools and related chain file the first time, and you could use it directly the second time around.

gaow commented 5 years ago

How long is "very" long?

About 2 min for the small test file. I see the CPU usage is very low and there is the journal file constantly updated for the variant database -- that might be the reason.

It is possible that you were downloading liftover tools and related chain file

I already have them in cache so it wont be the case. But we can worry about this later when we fix up the 3.0.x version and revisit that example.

BoPeng commented 5 years ago
 vtools phenotype --from_stat 'CEU_totalGD10=#(GT)' 'CEU_numGD10=#(alt)' --genotypes 'DP_geno>10' --samples "RACE=1"

gives the same error

Unable to open/create file 'tmp_1_90_genotypes.h5'
HDF5 error back trace

  File "H5F.c", line 509, in H5Fopen
    unable to open file
  File "H5Fint.c", line 1567, in H5F_open
    unable to lock the file
  File "H5FD.c", line 1640, in H5FD_lock
    driver lock request failed
  File "H5FDsec2.c", line 959, in H5FD_sec2_lock
    unable to lock file, errno = 35, error message = 'Resource temporarily unavailable'
End of HDF5 error back trace

This might not be the problem of the command as the file could be left locked from some previous command.

gaow commented 5 years ago

Well, this is a different error, though. I didn't run into it in my test (the error reported in the first post of this ticket was the only error I had) but maybe it was because I did this in my docker:

# https://community.paperspace.com/t/storage-and-h5py-pytables-e-g-keras-save-weights-issues-heres-why-and-how-to-solve-it/430
# HDF5 locking issues
ENV HDF5_USE_FILE_LOCKING FALSE

which is what I copied from some of @jma7 's recipe. But I guess we need to somehow code it into vtools rather than relying on env.

BoPeng commented 5 years ago

That post talks about NFS mount but I am running vtools locally... I will look into it.

BoPeng commented 5 years ago

The command works if I use 1 worker (-j1), so this is a problem with multi-processing. From what I read online (e.g. https://github.com/PyTables/PyTables/blob/master/examples/multiprocess_access_queues.py) , hdf5 files cannot be opened by multiple processes. Is this true?

Ref:

  1. Phenotype multi-processing code: https://github.com/vatlab/varianttools/blob/master/src/variant_tools/phenotype.py#L158
  2. HDF5 is always opened in a mode. It might help if open in r or r+ mode if multi-processing is allowed in readonly mode. https://github.com/vatlab/varianttools/blob/master/src/variant_tools/accessor.py#L101

@jma7 Please clarify.

BoPeng commented 5 years ago

According to my experiment, the command

vtools remove genotypes "DP_geno<10"

corrupted the genotype database and caused the following commands to fail. Commenting out this line of code avoids the file corruption.

junma80 commented 5 years ago

I could run through the notebook so I don't really have problem with this command. How did you resolve the other error? Are there some version issues with the HDF5 library?

BoPeng commented 5 years ago

Could you try the notebook on another environment, such as a Linux vm?

BoPeng commented 5 years ago

To answer the questions:

  1. According to PyTables FAQ, "Concurrent reads are no problem at all."
  2. Concurrent write, or open in a mode is not allowed.

The patch opens hdf5 files in read_only mode for calculating genotype statistics, and fixes the problem of running

 vtools phenotype --from_stat 'CEU_totalGD10=#(GT)' 'CEU_numGD10=#(alt)' --genotypes 'DP_geno>10' --samples "RACE=1"

in multiprocessing mode.

BoPeng commented 5 years ago

The remove genotype error happened because of multi-processing was used to process each h5 file, which is ok, but the way the processes are started are wrong because the store is opened in the master process and passed to subprocesses...

The patch fixes this by creating store handlers in each subprocess.

BoPeng commented 5 years ago

@gaow I cannot run through the notebook here because my version of KING and plink might be wrong, but I have released variant tools 3.0.6 so that you can check if it works on your side.

I noticed that you updated KING.pipeline, let me know if I should upload it to the repository.

gaow commented 5 years ago

Thanks @BoPeng just wrapped up other stuff and found a moment to check it out. It seems installation methods have changed -- the old docker recipe no longer works and conda release is not there yet. I have used apt-get install -y libboost-dev libgsl-dev before i install the package but it did not work. I can of course dig further but if there are updated instructions (for Linux) or a conda release out there that'd be a lot better. Currently bioconda does not work:


PackagesNotFoundError: The following packages are not available from current channels:

  - variant_tools

Current channels:

  - https://conda.anaconda.org/bioconda/linux-64
  - https://conda.anaconda.org/bioconda/noarch
  - https://conda.anaconda.org/conda-forge/linux-64
  - https://conda.anaconda.org/conda-forge/noarch
  - https://repo.anaconda.com/pkgs/main/linux-64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/r/linux-64
  - https://repo.anaconda.com/pkgs/r/noarch

To search for alternate channels that may provide the conda package you're
looking for, navigate to

    https://anaconda.org

and use the search bar at the top of the page.
gaow commented 5 years ago

I noticed that you updated KING.pipeline, let me know if I should upload it to the repository.

It worked for version 2.7.2 after I updated. I'll let you know if it works on 3.0.6 before we include it. But it looks like you've got version 3.0.6 working for my notebook example, that's great! Have you noticed the slowness in vtools liftover --flip or it is just on my end?

BoPeng commented 5 years ago

I am still working on a bioconda recipe. Will let you know when it is merged.

BoPeng commented 5 years ago

@jma7 @gaow variant tools is now available at bioconda so the recommended installation command would be

conda install variant_tools -c bioconda
gaow commented 5 years ago

Thank you @BoPeng . Shall we update KING.pipeline as well? I'm getting a warning for that file being out of date . Also, did you notice the slowness with vtools liftover --flip when you were testing out the commands from the notebook? I'm now running it on my desktop (with a slow hard drive); it took 15 minutes for that command to finish.

BoPeng commented 5 years ago

I am on a SSD drive and did not notice any problem with vtools liftover --flip. Let me check.

gaow commented 5 years ago

Thank you @BoPeng . Yes on cloud server it was not an issue. On the SoS live server it is a regular disk (but a relatively new disk) it takes 2 to 3 min. On my desktop with an old disk it is 15 minutes. The reason I think it is worth noticing is that when fields are updated from text annotations, the performance is consistently good. I take that liftover is like an annotation on *.proj so I dont understand why it is a lot slower.

gaow commented 5 years ago

Has the original issue in this ticket been resolved? Maybe it is not relevant to #118? at least using version 3.0.9 I was able to run the notebook in the original ticket without a problem. If so, we can close the issue.

BoPeng commented 5 years ago

I agree.