DP_geno is not parsing genotype fields

enigmargs commented 4 years ago

Hi, I'm trying to import a vcf file (test.txt) with 400+ variants. Genotype field has the following information GT:AD:DP:GQ:PL.

I used the following versions of codes (where I changed --geno_info inputs), vtools import test.vcf --var_info AC AF AN DP MLEAC MLEAF QD filter --geno_info DP_geno --build hg19

vtools import test.vcf --var_info AC AF AN DP MLEAC MLEAF QD filter --geno_info DP --build hg19

While processing it shows, ValueError('Cannot import field {} from input file.'.format(fld.name)) error and vtools show genotypes shows no genotype information per individual - as seen in the screehshot below.

Could you please help me where am I going wrong with it? How do I import/parse the information as in my VCF? Also, when I add AD in --geno_info it throws, ValueError: Cannot import field AD from input file error!

gaow commented 4 years ago

@jma7 I believe we were able to do this in Version 2.0.x. Would you mind checking what's going on with the new genotype data structure -- are we still allowing for genotype level annotations?

BoPeng commented 4 years ago

9 days already? Time with working at home just flies.

@jma7 Do you know if your new importer supports this option? My understanding is that it should but I could be wrong. Please feel free to let me know if it will take more than a few minutes for you to figure out because you are now officially off the hook from this project.

enigmargs commented 4 years ago

@jma7 I believe we were able to do this in Version 2.0.x. Would you mind checking what's going on with the new genotype data structure -- are we still allowing for genotype level annotations?

I tried importing a similar file with same geno information (GT:AD:DP:GQ:PL) and it works. Would you mind me telling what do you mean by 'genotype level annotation'? (I'm a newbie in this and get confused with terms quite often). The file where I'm facing issue has 4 million variants in total.

gaow commented 4 years ago

@enigmargs AD:DP:GQ:PL is my terminology for genotype -level annotation as opposed to what you will find in INFO for variant level annotations.

enigmargs commented 4 years ago

@enigmargs AD:DP:GQ:PL is my terminology for genotype -level annotation as opposed to what you will find in INFO for variant level annotations.

Thanks for the clarification.

enigmargs commented 4 years ago

I apologize as I closed the ticket by mistake!

BoPeng commented 4 years ago

I tried importing a similar file with same geno information (GT:AD:DP:GQ:PL) and it works.

So one file works and another did not? Could you send the first 100 lines of the file that works?

enigmargs commented 4 years ago

Hi Bo, Here (test2.txt) is the file where it works. 3 subjects show 560+ show variants each when I use vtools show genotypes. Whereas, in my test.txt (attached in OP - a part of the larger file I'm working on) shows zero genotypes for all subjects (and those numbers are only in 3 digits when I import file with 4 million+ variants)

BoPeng commented 4 years ago

Your VCF file contains lines such as

1   664499  .   TTGAG   *,T 16761.1 PASS

which has *, which I currently do not know what it means. After removing these lines (I use :g/\*,/d and :g/,\*/d with vi, but you should be able to use other tools), the file could be loaded

$ vtools init test -f
$ vtools import test.vcf --var_info AC AF AN DP  --build hg19 --geno_info DP_geno
INFO: Importing variants from test.vcf (1/1)
test.vcf: 100% [==================] 483 20.5K/s in 00:00:00
INFO: 466 new variants (437 SNVs, 8 insertions, 21 deletions) from 483 lines are imported.
Importing genotypes: 100% [=========] 466 11.6K/s in 00:00:00

$ vtools show genotypes
sample_name         filename    num_genotypes   sample_genotype_fields
GBRUNL2A00006001    test.vcf    399             DP,GT
GBRUNL2A00006003    test.vcf    343             DP,GT
GBRUNL2A00006004    test.vcf    388             DP,GT
GBRUNL2A00007002    test.vcf    317             DP,GT
GBRUNL2A00009001    test.vcf    355             DP,GT
GBRUNL2A00009011    test.vcf    403             DP,GT
GBRUNL2A00030003    test.vcf    347             DP,GT
GBRUNL2A00030005    test.vcf    391             DP,GT
GBRUNL2A00030007    test.vcf    403             DP,GT
GBRUNL2A00030009    test.vcf    401             DP,GT
GBRUNL2A00030010    test.vcf    401             DP,GT
GBRUNL2A00030014    test.vcf    392             DP,GT
GBRUNL2A00030015    test.vcf    306             DP,GT
GBRUNL2A00030016    test.vcf    407             DP,GT
GBRUNL2A00030017    test.vcf    381             DP,GT
GBRUNL2A00030019    test.vcf    399             DP,GT

I will have to investigate the role of * and see how vtools should handle it.

enigmargs commented 4 years ago

Thank you for pointing this out - much appreciate your help. I went through other columns link INFO, filter etc before posting my question and didn't expect any problem with this. I will try to dig out reasons for these special characters in my file.

It's working for a section of the file. I will replicate on the actual file and update.

enigmargs commented 4 years ago

Hi, could you please have a look at this follow up question? I have imported phenotype information vtools phenotype --from_file dummy_phenotype.csv --delimiter ","

vtools phenotype --from_stat 'total_sample=#(GT)' shows the following stdout.

And I can see that it is not calculating for all samples (same is the case with other fields like max(DP_geno), #(alt) etc. vtools phenotype --output total_sample

I'm not able to think of any reason for this.

BoPeng commented 4 years ago

Could you PM a dataset that could reproduce this problem? As a starter, could you use -j1 (use a single processor) and see if this can avoid the deadlock?

enigmargs commented 4 years ago

Hi, using -j1 parameter is working for my sample file (not tried in my main file yet. I hope it will work there as well.). I had a quick look at -j parameter here. What does it exactly mean by "number of jobs" ? Does it have anything to do with number of processers?

I'm asking this because, further I'm not able to export one of the variant table in vcf format. A screenshot shown below. Capture

It just doesn't progress at all. I tried inserting -j1 here without any luck.

BoPeng commented 4 years ago

Number of jobs roughly means number of processor cores to analyze the data, which could cause race condition that we are not aware of. This is a bug on our end and I will try to fix it.

The export command is designed to export a small number of variants (up to thousands) with their annotations. It tends to be slow because it needs to retrieve pieces of information from databases. A text-processor based tools would be much more efficient in converting an input vcf file with a large number of variants to another.

enigmargs commented 4 years ago

The table that I'm exporting has only 295 variants and it is still hanging. I'm holding on to this as it is linked to the next step that I'm doing;

I'm using vtools version 3.0.2 and king 2.2.4 to vtools execute KING (though my version of vtools doesn't show any pipeline when I use vtools show pipelines, but it executes one). It stops at _king_30 showing empty .tped file_ with a return code 3. I tried to debug the pipeline codes that I got from here wherein I can see it uses export in _king_20_ step.

Supposing that I have to use text processing tools (like awk, sed etc) to export variant, do you suggest me to use vtools output common chr, pos, mut_type etc. as a starting point to export?

BoPeng commented 4 years ago

I have seen the problem with KING, which was caused by version incompatibility (@gaow should have updated the pipeline, but I am not sure).

The export hang should be a bug, but please update to the latest version of vtools to make sure we are on the same page.

enigmargs commented 4 years ago

I updated my vtools to v3.1.2, python 3.7.6, King 2.2.4

As I see in this disussion, @gaow commented that KING is updated. But I'm getting the error exactly as mentioned in that ISSUE.
- I checked the changes in KING made by @gaow through this link - it shows .txt output in step _KING41 as opposed to .ped in an older version. But my error still shows, .ped file does not exist.
- I have updated my software using bioconda, so I am not sure whether it has been updated there or not. Can that be causing this error?
Meanwhile, the issue with EXPORT command remain the same in my latest version. Could you please help with it? (I tried exporting to .tped as well without any luck)

BoPeng commented 4 years ago

Could you PM (email Bo.Peng@bcm.edu) a portion of the data and the commands you used so that I could reproduce the problem you have? I can also try to reproduce your problem with the use of KING.

enigmargs commented 4 years ago

I have sent an email with required files. Much appreciate your help with this.

BoPeng commented 4 years ago

Thanks for sending me the files. I can now reproduce the hanging issue with vtools export and I am investigating what is happening.

BoPeng commented 4 years ago

Note that this is a bug with the hdf5 storage engine and you can get around it for now with

vtools init rgs -f --store sqlite

BoPeng commented 4 years ago

Update: seems to work all right for the output of master variant table.

enigmargs commented 4 years ago

Yes. --store sqlite is working and so is KING. Does sqlite consume more memory than hdf5? Seems like that to me.

BoPeng commented 4 years ago

Yes, sqlite is the traditional way which uses more RAM and diskspace, the new storage model is more efficient but less mature now. I have identified the source of problem and is trying to fix it.

BoPeng commented 4 years ago

I have traced the problem to https://github.com/vatlab/varianttools/blob/master/src/variant_tools/exporter_reader.py#L401 function. What is happening here is that

The first reader retrieves the IDs of the table, in this case the IDs are not consecutive because we are not exporting all variants. E.g. variants 8, 20, 44
The rest of the readers get all genotypes, and ignores records with unmatched ID. When the ID is 8, it would ignores genotypes with IDs 1, 2, 3..., 7, and returns the record with ID 8.

The problem happens here when the variant has ID 58, but the genotype side returns 2, 3, 4, ..., 57, and then 59, ... so no match is found.

@jma7 Do you have on the top of your mind why this happens? I can dig deeper if you cannot recall the details here.

BoPeng commented 4 years ago

@enigmargs I have fixed the vtools export bug and released variant tools 3.1.3. Please update and feel free to let us know if you encounter any other problem.

BoPeng commented 4 years ago

I created a new bug report for the KING pipeline. #147 because it is related to an external pipeline.

vatlab / varianttools

DP_geno is not parsing genotype fields #141