@jma7 I believe we were able to do this in Version 2.0.x. Would you mind checking what's going on with the new genotype data structure -- are we still allowing for genotype level annotations?
9 days already? Time just flies when working from home.
@jma7 Do you know if your new importer supports this option? My understanding is that it should but I could be wrong. Please feel free to let me know if it will take more than a few minutes for you to figure out because you are now officially off the hook from this project.
I tried importing a similar file with the same geno information (GT:AD:DP:GQ:PL) and it works. Would you mind telling me what you mean by 'genotype level annotation'? (I'm a newbie in this and get confused with terms quite often.) The file where I'm facing the issue has 4 million variants in total.
@enigmargs AD:DP:GQ:PL is my terminology for genotype-level annotation, as opposed to what you will find in INFO for variant-level annotations.
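To make the distinction concrete, here is a minimal sketch that splits one made-up VCF record into its variant-level values (the INFO column) and its genotype-level values (the FORMAT column paired with one sample column). The record values and the helper name parse_sample are illustrative assumptions, not taken from the files in this thread or from vtools itself.

```python
def parse_sample(format_field: str, sample_field: str) -> dict:
    """Pair FORMAT keys with one sample's colon-separated values."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    # VCF allows trailing fields to be dropped; pad them with the missing value "."
    values += ["."] * (len(keys) - len(values))
    return dict(zip(keys, values))


# Made-up record pieces (assumption, for illustration only).
info = "AC=1;AF=0.5;AN=2;DP=15"          # variant-level: one value per site
fmt = "GT:AD:DP:GQ:PL"                   # names of the genotype-level fields
sample = "0/1:10,5:15:99:120,0,255"      # genotype-level: one value per sample

# INFO entries are KEY=VALUE pairs; a flag without "=" gets an empty value.
variant_level = {k: v for k, _, v in (item.partition("=") for item in info.split(";"))}
genotype_level = parse_sample(fmt, sample)

print(variant_level)
print(genotype_level)
```

Note that DP appears on both levels: the INFO DP is the depth across all samples at the site, while the FORMAT DP is the depth for one individual, which is why the import commands in this thread name the genotype-level one separately (DP_geno).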
Thanks for the clarification.
I apologize as I closed the ticket by mistake!
I tried importing a similar file with same geno information (GT:AD:DP:GQ:PL) and it works.
So one file works and another did not? Could you send the first 100 lines of the file that works?
Hi Bo, here (test2.txt) is the file where it works. 3 subjects show 560+ variants each when I use vtools show genotypes. In contrast, my test.txt (attached in the OP, a part of the larger file I'm working on) shows zero genotypes for all subjects (and those numbers are only in 3 digits when I import the file with 4 million+ variants).
Your VCF file contains lines such as
1 664499 . TTGAG *,T 16761.1 PASS
which has *, which I currently do not know what it means. After removing these lines (I use :g/\*,/d and :g/,\*/d with vi, but you should be able to use other tools), the file could be loaded
$ vtools init test -f
$ vtools import test.vcf --var_info AC AF AN DP --build hg19 --geno_info DP_geno
INFO: Importing variants from test.vcf (1/1)
test.vcf: 100% [==================] 483 20.5K/s in 00:00:00
INFO: 466 new variants (437 SNVs, 8 insertions, 21 deletions) from 483 lines are imported.
Importing genotypes: 100% [=========] 466 11.6K/s in 00:00:00
$ vtools show genotypes
sample_name filename num_genotypes sample_genotype_fields
GBRUNL2A00006001 test.vcf 399 DP,GT
GBRUNL2A00006003 test.vcf 343 DP,GT
GBRUNL2A00006004 test.vcf 388 DP,GT
GBRUNL2A00007002 test.vcf 317 DP,GT
GBRUNL2A00009001 test.vcf 355 DP,GT
GBRUNL2A00009011 test.vcf 403 DP,GT
GBRUNL2A00030003 test.vcf 347 DP,GT
GBRUNL2A00030005 test.vcf 391 DP,GT
GBRUNL2A00030007 test.vcf 403 DP,GT
GBRUNL2A00030009 test.vcf 401 DP,GT
GBRUNL2A00030010 test.vcf 401 DP,GT
GBRUNL2A00030014 test.vcf 392 DP,GT
GBRUNL2A00030015 test.vcf 306 DP,GT
GBRUNL2A00030016 test.vcf 407 DP,GT
GBRUNL2A00030017 test.vcf 381 DP,GT
GBRUNL2A00030019 test.vcf 399 DP,GT
I will have to investigate the role of * and see how vtools should handle it.
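For readers who prefer a script to the vi commands above, here is a minimal sketch that drops VCF records whose ALT column lists the * allele. For context, the VCF specification (v4.2 and later) reserves * for an allele that is missing because of an upstream (overlapping) deletion. The function name drop_star_alleles and the second data record are assumptions made for illustration. Unlike the vi patterns, which match `*,` or `,*` anywhere on the line, this sketch inspects only the ALT column and also catches a lone * there.

```python
def drop_star_alleles(lines):
    """Yield VCF lines, skipping records whose ALT column contains '*'."""
    for line in lines:
        if line.startswith("#"):
            yield line  # keep meta-information and header lines untouched
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 5 (index 4) is ALT; multiple alleles are comma-separated.
        if len(fields) > 4 and "*" in fields[4].split(","):
            continue
        yield line


# Small in-memory demo. The first record mirrors the line quoted above;
# the second record is made up.
records = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t664499\t.\tTTGAG\t*,T\t16761.1\tPASS\t.\n",
    "1\t700000\t.\tA\tG\t50\tPASS\t.\n",
]
kept = list(drop_star_alleles(records))
print("".join(kept), end="")
```

With a real file, the same generator can be fed an open file handle and its output written to a new file. Be aware that removing the whole line also discards any other ALT allele on it (T in the quoted record), so this is a workaround rather than a proper treatment of *.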
Thank you for pointing this out, I much appreciate your help. I went through the other columns like INFO, FILTER etc. before posting my question and didn't expect any problem with this one. I will try to dig out the reasons for these special characters in my file.
It's working for a section of the file. I will replicate on the actual file and update.
Hi, could you please have a look at this follow-up question? I have imported phenotype information with
vtools phenotype --from_file dummy_phenotype.csv --delimiter ","
Running
vtools phenotype --from_stat 'total_sample=#(GT)'
shows the following stdout.
And I can see that it is not calculating for all samples (the same is the case with other fields like max(DP_geno), #(alt) etc.).
vtools phenotype --output total_sample
I'm not able to think of any reason for this.
Could you PM a dataset that could reproduce this problem? As a starter, could you use -j1 (use a single processor) and see if this can avoid the deadlock?
Hi, using the -j1 parameter is working for my sample file (not tried on my main file yet; I hope it will work there as well). I had a quick look at the -j parameter here. What exactly does it mean by "number of jobs"? Does it have anything to do with the number of processors?
I'm asking this because I'm also not able to export one of the variant tables in vcf format. A screenshot is shown below.
It just doesn't progress at all. I tried inserting -j1 here without any luck.
Number of jobs roughly means the number of processor cores used to analyze the data; using more than one could trigger a race condition that we are not aware of. This is a bug on our end and I will try to fix it.
The export command is designed to export a small number of variants (up to thousands) with their annotations. It tends to be slow because it needs to retrieve pieces of information from databases. Text-processing tools would be much more efficient at converting an input vcf file with a large number of variants to another.
The table that I'm exporting has only 295 variants and it is still hanging. I'm holding on to this as it is linked to the next step that I'm doing:
I'm using vtools version 3.0.2 and KING 2.2.4 to run vtools execute KING (my version of vtools doesn't show any pipeline when I use vtools show pipelines, but it executes one anyway). It stops at king_30, showing an empty .tped file, with a return code 3. I tried to debug the pipeline code that I got from here, wherein I can see it uses export in the king_20 step.
Supposing that I have to use text processing tools (like awk, sed etc.) to export variants, do you suggest that I use vtools output common chr, pos, mut_type etc. as a starting point for the export?
I have seen the problem with KING, which was caused by version incompatibility (@gaow should have updated the pipeline, but I am not sure).
The export hang should be a bug, but please update to the latest version of vtools to make sure we are on the same page.
I updated my vtools to v3.1.2, python 3.7.6, King 2.2.4
As I see in this discussion, @gaow commented that KING is updated. But I'm getting the error exactly as mentioned in that issue.
Meanwhile, the issue with the export command remains the same in my latest version. Could you please help with it? (I tried exporting to .tped as well without any luck.)
Could you PM (email Bo.Peng@bcm.edu) a portion of the data and the commands you used so that I could reproduce the problem you have? I can also try to reproduce your problem with the use of KING.
I have sent an email with required files. Much appreciate your help with this.
Thanks for sending me the files. I can now reproduce the hanging issue with vtools export and I am investigating what is happening.
Note that this is a bug with the hdf5 storage engine and you can get around it for now with
vtools init rgs -f --store sqlite
Update: seems to work all right for the output of the master variant table.
Yes. --store sqlite is working and so is KING.
Does sqlite consume more memory than hdf5? Seems like that to me.
Yes, sqlite is the traditional way, which uses more RAM and disk space; the new storage model is more efficient but less mature for now. I have identified the source of the problem and am trying to fix it.
I have traced the problem to the function at https://github.com/vatlab/varianttools/blob/master/src/variant_tools/exporter_reader.py#L401. What is happening here is that the problem occurs when the variant has ID 58, but the genotype side returns 2, 3, 4, ..., 57, and then 59, ..., so no match is found.
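The mismatch described above can be illustrated with a small, self-contained sketch. This is not the code in exporter_reader.py and not the fix that was eventually released; the function name match_genotypes and the ID lists are hypothetical. It shows one way to walk two ascending ID streams in lock-step so that a variant ID absent from the genotype side is reported as a gap instead of being waited for indefinitely.

```python
def match_genotypes(variant_ids, genotype_ids):
    """Merge two ascending ID streams.

    Returns (matched, missing): variant IDs that have a genotype row,
    and variant IDs the genotype side never produced.
    """
    matched, missing = [], []
    geno = iter(genotype_ids)
    current = next(geno, None)
    for vid in variant_ids:
        # Skip genotype rows that belong to variants not being exported.
        while current is not None and current < vid:
            current = next(geno, None)
        if current == vid:
            matched.append(vid)
        else:
            # The genotype side jumped past vid (or ran out of rows):
            # record the gap rather than wait for an ID that never arrives.
            missing.append(vid)
    return matched, missing


# Scenario from the thread: the variant side asks for ID 58, but the
# genotype side yields 2, 3, ..., 57 and then 59 onward.
genotype_ids = [i for i in range(2, 70) if i != 58]
matched, missing = match_genotypes([10, 58, 60], genotype_ids)
print(matched, missing)
```

A reader that only advances when the two IDs are equal would stall at 58 in this scenario, which is consistent with the hang reported above, though the actual cause in the hdf5 reader may differ in detail.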
@jma7 Do you have on the top of your mind why this happens? I can dig deeper if you cannot recall the details here.
@enigmargs I have fixed the vtools export bug and released variant tools 3.1.3. Please update and feel free to let us know if you encounter any other problem.
I created a new bug report for the KING pipeline, #147, because it is related to an external pipeline.
Hi, I'm trying to import a vcf file (test.txt) with 400+ variants. Genotype field has the following information GT:AD:DP:GQ:PL.
I used the following versions of the command (where I changed the --geno_info inputs),
vtools import test.vcf --var_info AC AF AN DP MLEAC MLEAF QD filter --geno_info DP_geno --build hg19
vtools import test.vcf --var_info AC AF AN DP MLEAC MLEAF QD filter --geno_info DP --build hg19
While processing, it shows a ValueError('Cannot import field {} from input file.'.format(fld.name)) error, and vtools show genotypes shows no genotype information per individual, as seen in the screenshot below. Could you please help me see where I am going wrong? How do I import/parse the information as it is in my VCF? Also, when I add AD to --geno_info, it throws a ValueError: Cannot import field AD from input file error!