sbslee / pypgx

A Python package for pharmacogenomics (PGx) research
https://pypgx.readthedocs.io
MIT License
64 stars 11 forks source link

Issue while running snp chip tutorial. #68

Closed ngslearner closed 2 years ago

ngslearner commented 2 years ago

Hi, I am new to this and trying to run tutorial for snp chip pipeline but the attached error is occuring. Can you look into this.? Thanks in advanced.

sbslee commented 2 years ago

@ngslearner,

Thanks for the question! That's strange, as you can see below I'm able to run the same command without any issues:

(fuc) sbslee@Seung-beens-MacBook-Air Desktop % mkdir coriell-affy-tutorial
(fuc) sbslee@Seung-beens-MacBook-Air Desktop % cd coriell-affy-tutorial
(fuc) sbslee@Seung-beens-MacBook-Air coriell-affy-tutorial % wget https://raw.githubusercontent.com/sbslee/pypgx-data/main/coriell-affy-tutorial/variants.vcf.gz 
--2022-08-03 17:58:19--  https://raw.githubusercontent.com/sbslee/pypgx-data/main/coriell-affy-tutorial/variants.vcf.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84576 (83K) [application/octet-stream]
Saving to: ‘variants.vcf.gz’

variants.vcf.gz     100%[===================>]  82.59K  --.-KB/s    in 0.04s   

2022-08-03 17:58:20 (1.88 MB/s) - ‘variants.vcf.gz’ saved [84576/84576]

(fuc) sbslee@Seung-beens-MacBook-Air coriell-affy-tutorial % wget https://raw.githubusercontent.com/sbslee/pypgx-data/main/coriell-affy-tutorial/variants.vcf.gz.tbi
--2022-08-03 17:58:24--  https://raw.githubusercontent.com/sbslee/pypgx-data/main/coriell-affy-tutorial/variants.vcf.gz.tbi
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4171 (4.1K) [application/octet-stream]
Saving to: ‘variants.vcf.gz.tbi’

variants.vcf.gz.tbi 100%[===================>]   4.07K  --.-KB/s    in 0s      

2022-08-03 17:58:25 (16.4 MB/s) - ‘variants.vcf.gz.tbi’ saved [4171/4171]

(fuc) sbslee@Seung-beens-MacBook-Air coriell-affy-tutorial % pypgx run-chip-pipeline \
CYP3A5 \
CYP3A5-pipeline \
variants.vcf.gz \
--impute \
--force
Saved VcfFrame[Imported] to: CYP3A5-pipeline/imported-variants.zip
Saved VcfFrame[Phased] to: CYP3A5-pipeline/phased-variants.zip
Saved VcfFrame[Consolidated] to: CYP3A5-pipeline/consolidated-variants.zip
Saved SampleTable[Alleles] to: CYP3A5-pipeline/alleles.zip
Saved SampleTable[Genotypes] to: CYP3A5-pipeline/genotypes.zip
Saved SampleTable[Phenotypes] to: CYP3A5-pipeline/phenotypes.zip
Saved SampleTable[Results] to: CYP3A5-pipeline/results.zip

One thing I noticed from the screenshot is the error message: TypeError: sort_values() got an unexpected keyword argument 'ignore_index'. This seems to indicate that you're running a lower version of the pandas package, which is one of the required dependencies of PyPGx. This is because the pandas.DataFrame.sort_values method does have the argument called ignore_index (see the docs). Therefore, the error message makes sense if you are running a lower version of pandas that does not have the ignore_index yet.

It also appears that you are not using any virtual environment. It's strongly recommended to create a virtual environment for setting up PyPGx (or for any other bioinformatics tool for that matter). You can do this with python3 -m venv myenv for Python or with conda create -n myenv for Anaconda. After activating the virtual environment, you can then install your packages.

In any case, please try to install the latest version of pandas and run PyPGx again. Also, and this is very important, please make sure you are also using the latest version of the fuc and pypgx packages (0.35.0 and 0.17.0, respectively). If re-installing pandas doesn't solve the problem, and you are sure the fuc and pypgx packages are up to date, please let me know.

ngslearner commented 2 years ago

Wow, its working now. Appreciate the work you guys are doing. Thank you so much for your quick response.

ngslearner commented 2 years ago

Hi, Sorry to bother you again. I am trying to run Illumina gsa chip data vcf file. But it was showing error as:

~/myenv/trial_pyogx$ pypgx run-chip-pipeline CYP2D6 CYP2D6-pipeline NMCG213404.vcf.gz --force --impute
Saved VcfFrame[Imported] to: CYP2D6-pipeline/imported-variants.zip
ERROR: REF field is not a sequence of A, C, T, G, or N characters at 22:42524203 [I]
Traceback (most recent call last):
  File "/home/genique/.local/bin/pypgx", line 8, in <module>
    sys.exit(main())
  File "/home/genique/.local/lib/python3.8/site-packages/pypgx/__main__.py", line 33, in main
    commands[args.command].main(args)
  File "/home/genique/.local/lib/python3.8/site-packages/pypgx/cli/run_chip_pipeline.py", line 92, in main
    pipeline.run_chip_pipeline(
  File "/home/genique/.local/lib/python3.8/site-packages/pypgx/api/pipeline.py", line 64, in run_chip_pipeline
    phased_variants = utils.estimate_phase_beagle(
  File "/home/genique/.local/lib/python3.8/site-packages/pypgx/api/utils.py", line 861, in estimate_phase_beagle
    subprocess.run(command, check=True, stdout=subprocess.DEVNULL)
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['java', '-Xmx2g', '-jar', '/home/genique/.local/lib/python3.8/site-packages/pypgx/api/beagle.28Jun21.220.jar', 'gt=/tmp/tmpjtcdd9l3/input.vcf', 'chrom=22:42512500-42551883', 'ref=/home/genique/pypgx-bundle/1kgp/GRCh37/CYP2D6.vcf.gz', 'out=/tmp/tmpjtcdd9l3/output', 'impute=false']' returned non-zero exit status 1.

So, I tried to remove the I/D alleles from the data using command:

awk '$1 ~ /^#/ {print $0;next} {if ($4 ~ /A|C|T|G/ && $5 ~ /.|A|C|T|G/) print $0}' NMCG213404.vcf > filtered.vcf

It removes all the I/D alleles present in vcf file. But it is again encountering error:

~/myenv/trial_pyogx$ pypgx run-chip-pipeline CYP2D6 CYP2D6-pipeline filtered.vcf.gz --impute --force 
Saved VcfFrame[Imported] to: CYP2D6-pipeline/imported-variants.zip
Traceback (most recent call last):
  File "/home/genique/.local/bin/pypgx", line 8, in <module>
    sys.exit(main())
  File "/home/genique/.local/lib/python3.8/site-packages/pypgx/__main__.py", line 33, in main
    commands[args.command].main(args)
  File "/home/genique/.local/lib/python3.8/site-packages/pypgx/cli/run_chip_pipeline.py", line 92, in main
    pipeline.run_chip_pipeline(
  File "/home/genique/.local/lib/python3.8/site-packages/pypgx/api/pipeline.py", line 64, in run_chip_pipeline
    phased_variants = utils.estimate_phase_beagle(
  File "/home/genique/.local/lib/python3.8/site-packages/pypgx/api/utils.py", line 861, in estimate_phase_beagle
    subprocess.run(command, check=True, stdout=subprocess.DEVNULL)
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['java', '-Xmx2g', '-jar', '/home/genique/.local/lib/python3.8/site-packages/pypgx/api/beagle.28Jun21.220.jar', 'gt=/tmp/tmpm_yxdqw6/input.vcf', 'chrom=22:42512500-42551883', 'ref=/home/genique/pypgx-bundle/1kgp/GRCh37/CYP2D6.vcf.gz', 'out=/tmp/tmpm_yxdqw6/output', 'impute=false']' returned non-zero exit status 1.

Can you please look into this. PS. This error is occuring for some of the genes like CYP2D6, CYP2C19, CYP2C8, ABCB1, SLCO1B1, UGT1A1 etc. Rest are working fine.

sbslee commented 2 years ago

@ngslearner,

No worries! Is there any chance you could send me the NMCG213404.vcf.gz file (sbstevenlee@gmail.com)? In order for me to solve the issue, I need at least a minimal example that reproduces the problem. Actually, it doesn't even have to be the original VCF file; a imported-variants.zip file from one of the PyPGx runs that gave you the error will also suffice.

ngslearner commented 2 years ago

Thank you for your response. I have sent you the file on your mail.

sbslee commented 2 years ago

@ngslearner,

Thank you so much for sending the files! I have been able to pinpoint the problem.

Even though your VCF contains many variants, most of them are I/D alleles as you have pointed out already or they are missing ALT alleles ('.'). Normally, this isn't an issue for PyPGx.

In this case, it appears you ran into multiple situations where, for a given gene, there were 0 overlapping variants between input VCF and reference haplotype panel. This is similar to the bug reported in #63 where the Beagle program threw error when there was only single overlapping variant between input VCF and reference panel.

Note that this is different from the situation where input VCF is completely empty for the given gene to begin with. In your case, the input VCF does contain many variants but they simply don't overlap with the reference panel at all, causing the Beagle program to throw an error.

Therefore, I fixed the problem by explicitly handling above situation. Basically, for such cases PyPGx will skip statistical phasing and fall back to using the phase-extension algorithm. This also means your --impute argument has no effect because imputation requires statistical phasing.

Please see below to see how the updated version works with your unfiltered, original VCF file:

(fuc) sbslee@Seung-beens-MacBook-Air for-sky % pypgx run-chip-pipeline \
CYP2D6 \
CYP2D6-pipeline \
sample.vcf.gz \
--force \
--impute
Saved VcfFrame[Imported] to: CYP2D6-pipeline/imported-variants.zip
/Users/sbslee/Desktop/pypgx/pypgx/api/utils.py:846: UserWarning: 0 overlapping variants, skip statistical phasing
  warnings.warn("0 overlapping variants, skip statistical phasing")
Saved VcfFrame[Phased] to: CYP2D6-pipeline/phased-variants.zip
Saved VcfFrame[Consolidated] to: CYP2D6-pipeline/consolidated-variants.zip
Saved SampleTable[Alleles] to: CYP2D6-pipeline/alleles.zip
/Users/sbslee/Desktop/pypgx/pypgx/api/genotype.py:665: UserWarning: The user did not provide CNV calls even though the target gene is known to have SV. PyPGx will assume all of the samples do not have SV.
  warnings.warn(message)
Saved SampleTable[Genotypes] to: CYP2D6-pipeline/genotypes.zip
Saved SampleTable[Phenotypes] to: CYP2D6-pipeline/phenotypes.zip
Saved SampleTable[Results] to: CYP2D6-pipeline/results.zip

I will send the files to your email. Thanks again for reporting the issue. Please let me know if you have additional questions.

P.S. In order to use the updated PyPGx version, please follow the instruction below -- i.e. you need to install the dev version locally. This update has been implemented to the 0.18.0-dev branch and it will be officially released in 1-2 weeks.

$ git clone https://github.com/sbslee/pypgx
$ cd pypgx
$ git checkout 0.18.0-dev
$ pip install .
sbslee commented 2 years ago

Copied and pasted user reply:

Thank you so much for your response and handling the issue. I checked the newer version for all the genes present in the pypgx bundle out of all some of the genes were encountering issues. For genes ABCB1 and CYP2C19 it is showing the same issue as before. For genes CACNA1S, CFTR, CYP2C9, CYP3A5, UGT2B7, UGT2B15, DPYD, NAT1 it worked after filtering. Gene G6PD has error invalid contig 'X' Gene POR has an error saying Attribute Error: 'float' object has no attribute 'split'. Once again thank you for resolving my doubts.

sbslee commented 2 years ago

For genes CACNA1S, CFTR, CYP2C9, CYP3A5, UGT2B7, UGT2B15, DPYD, NAT1 it worked after filtering.

It turns out the GSA-specific alleles (N, I, D, .) still need to be dealt with. Therefore, I updated PyPGx to handle those alleles if the input VCF is from chip data. After this update, it worked just fine as you can see below:

(fuc) sbslee@Seung-beens-Mac-mini for-sky % pypgx run-chip-pipeline \
CYP2C9 \
CYP2C9-pipeline \
sample.vcf.gz \
--force \
--impute
/Users/sbslee/Desktop/pypgx/pypgx/api/utils.py:1016: UserWarning: Input VCF contains duplicate variants sharing the same ['CHROM', 'POS', 'REF', 'ALT'] values. Will drop duplicates except for the first occurrence:
CHROM      POS                                            ID REF ALT QUAL FILTER INFO FORMAT 1_206544490050_R04C02
   10 96701970 NC_000010.11:g.94942213_94942222delAGAAATGGAA   I   .    .      .   PR     GT                   0/0
   10 96701970                                    rs72558188   I   .    .      .   PR     GT                   0/0
   10 96741058                                    rs28371686   C   .    .      .   PR     GT                   0/0
   10 96741058                                  rs28371686.1   C   .    .      .   PR     GT                   0/0
   10 96741058                                  rs28371686.2   C   .    .      .   PR     GT                   0/0
  warnings.warn("Input VCF contains duplicate variants sharing the "
Saved VcfFrame[Imported] to: CYP2C9-pipeline/imported-variants.zip
Saved VcfFrame[Phased] to: CYP2C9-pipeline/phased-variants.zip
Saved VcfFrame[Consolidated] to: CYP2C9-pipeline/consolidated-variants.zip
Saved SampleTable[Alleles] to: CYP2C9-pipeline/alleles.zip
Saved SampleTable[Genotypes] to: CYP2C9-pipeline/genotypes.zip
Saved SampleTable[Phenotypes] to: CYP2C9-pipeline/phenotypes.zip
Saved SampleTable[Results] to: CYP2C9-pipeline/results.zip

For genes ABCB1 and CYP2C19 it is showing the same issue as before.

As you pointed out, even after removing the GSA-specific alleles, the two genes still gave an error produced from Beagle. I still don't know the cause of the problem, but upgrading Beagle from v5.2 to v5.4 fixed the problem. So starting pypgx-0.18.0, PyPGx will use v5.4 Beagle:

(fuc) sbslee@Seung-beens-Mac-mini for-sky % pypgx run-chip-pipeline \
ABCB1 \
ABCB1-pipeline \
sample.vcf.gz \
--force \
--impute
/Users/sbslee/Desktop/pypgx/pypgx/api/utils.py:1016: UserWarning: Input VCF contains duplicate variants sharing the same ['CHROM', 'POS', 'REF', 'ALT'] values. Will drop duplicates except for the first occurrence:
CHROM      POS          ID REF ALT QUAL FILTER INFO FORMAT 1_206544490050_R04C02
    7 87179809   rs2229109   G   .    .      .   PR     GT                   0/0
    7 87179809 rs2229109.2   G   .    .      .   PR     GT                   0/0
    7 87230107  7:87230107   A   .    .      .   PR     GT                   0/0
    7 87230107  rs28381801   A   .    .      .   PR     GT                   0/0
  warnings.warn("Input VCF contains duplicate variants sharing the "
Saved VcfFrame[Imported] to: ABCB1-pipeline/imported-variants.zip
Saved VcfFrame[Phased] to: ABCB1-pipeline/phased-variants.zip
Saved VcfFrame[Consolidated] to: ABCB1-pipeline/consolidated-variants.zip
Saved SampleTable[Alleles] to: ABCB1-pipeline/alleles.zip
Saved SampleTable[Genotypes] to: ABCB1-pipeline/genotypes.zip
Saved SampleTable[Phenotypes] to: ABCB1-pipeline/phenotypes.zip
Saved SampleTable[Results] to: ABCB1-pipeline/results.zip

Gene G6PD has error invalid contig 'X'

This problem was caused because your VCF file uses the contig name 23 for denoting the X chromosome. There's nothing PyPGx can do about it, so users must rename the contig to X or chrX instead of 23. For instance, below I manually replaced 23 to X in your VCF file (this new file was named sample-X.vcf.gz) and PyPGx ran just fine:

(fuc) sbslee@Seung-beens-Mac-mini for-sky % pypgx run-chip-pipeline \
G6PD \
G6PD-pipeline \
sample-X.vcf.gz \
--force \
--impute
[W::vcf_parse_info] INFO 'PR' is not defined in the header, assuming Type=String
[W::vcf_parse_format] FORMAT 'GT' at 0:0 is not defined in the header, assuming Type=String
[W::vcf_parse_info] INFO 'PR' is not defined in the header, assuming Type=String
[W::vcf_parse_format] FORMAT 'GT' at X:153759644 is not defined in the header, assuming Type=String
/Users/sbslee/Desktop/pypgx/pypgx/api/utils.py:1016: UserWarning: Input VCF contains duplicate variants sharing the same ['CHROM', 'POS', 'REF', 'ALT'] values. Will drop duplicates except for the first occurrence:
CHROM       POS              ID REF ALT QUAL FILTER INFO FORMAT 1_206544490050_R04C02
    X 153760472      rs72554664   G   .    .      .   PR     GT                   0/0
    X 153760472       VGXS34761   G   .    .      .   PR     GT                   0/0
    X 153760484      rs72554665   C   .    .      .   PR     GT                   0/0
    X 153760484    rs72554665.1   C   .    .      .   PR     GT                   0/0
    X 153760626     rs137852317   G   .    .      .   PR     GT                   0/0
    X 153760626 X:153760626-C-T   G   .    .      .   PR     GT                   0/0
    X 153760654     kgp31004879   A   .    .      .   PR     GT                   0/0
    X 153760654       rs2230037   A   .    .      .   PR     GT                   0/0
    X 153760909     rs137852321   G   .    .      .   PR     GT                   0/0
    X 153760909 X:153760909-C-T   G   .    .      .   PR     GT                   0/0
    X 153760980     rs137852329   C   .    .      .   PR     GT                   0/0
    X 153760980 X:153760980-G-T   C   .    .      .   PR     GT                   0/0
    X 153762634      exm1666749   G   .    .      .   PR     GT                   0/0
    X 153762634       rs5030868   G   .    .      .   PR     GT                   0/0
    X 153762655      exm1666751   A   .    .      .   PR     GT                   0/0
    X 153762655 JHU_X.153762654   A   .    .      .   PR     GT                   0/0
    X 153762655       rs5030872   A   .    .      .   PR     GT                   0/0
    X 153762704     rs137852331   A   .    .      .   PR     GT                   0/0
    X 153762704 X:153762704-T-C   A   .    .      .   PR     GT                   0/0
    X 153762710     rs137852314   G   .    .      .   PR     GT                   0/0
    X 153762710 X:153762710-C-T   G   .    .      .   PR     GT                   0/0
    X 153763492 JHU_X.153763491   A   .    .      .   PR     GT                   0/0
    X 153763492       rs1050829   A   .    .      .   PR     GT                   0/0
    X 153764217      exm1666767   G   .    .      .   PR     GT                   0/0
    X 153764217 JHU_X.153764216   G   .    .      .   PR     GT                   0/0
    X 153764217     kgp30626516   G   .    .      .   PR     GT                   0/0
    X 153764217       rs1050828   G   .    .      .   PR     GT                   0/0
    X 153764247     rs137852315   G   .    .      .   PR     GT                   0/0
    X 153764247 X:153764247-C-T   G   .    .      .   PR     GT                   0/0
  warnings.warn("Input VCF contains duplicate variants sharing the "
Saved VcfFrame[Imported] to: G6PD-pipeline/imported-variants.zip
/Users/sbslee/Desktop/pypgx/pypgx/api/utils.py:852: UserWarning: 0 overlapping variants, skip statistical phasing
  warnings.warn("0 overlapping variants, skip statistical phasing")
Saved VcfFrame[Phased] to: G6PD-pipeline/phased-variants.zip
Saved VcfFrame[Consolidated] to: G6PD-pipeline/consolidated-variants.zip
Saved SampleTable[Alleles] to: G6PD-pipeline/alleles.zip
/Users/sbslee/Desktop/pypgx/pypgx/api/genotype.py:665: UserWarning: The user did not provide CNV calls even though the target gene is known to have SV. PyPGx will assume all of the samples do not have SV.
  warnings.warn(message)
Saved SampleTable[Genotypes] to: G6PD-pipeline/genotypes.zip
Saved SampleTable[Phenotypes] to: G6PD-pipeline/phenotypes.zip
Saved SampleTable[Results] to: G6PD-pipeline/results.zip

Gene POR has an error saying Attribute Error: 'float' object has no attribute 'split'.

This problem was caused because your VCF file has many duplicate variants. So I updated PyPGx to automatically detect duplicates and only keep the first occurrence. This solved the issue:

(fuc) sbslee@Seung-beens-Mac-mini for-sky % pypgx run-chip-pipeline \
POR \
POR-pipeline \
sample.vcf.gz \
--force \
--impute
/Users/sbslee/Desktop/pypgx/pypgx/api/utils.py:1016: UserWarning: Input VCF contains duplicate variants sharing the same ['CHROM', 'POS', 'REF', 'ALT'] values. Will drop duplicates except for the first occurrence:
CHROM      POS           ID REF ALT QUAL FILTER INFO FORMAT 1_206544490050_R04C02
    7 75544212 rs12537282.1   G   .    .      .   PR     GT                   0/0
    7 75544212 rs12537282.2   G   .    .      .   PR     GT                   0/0
    7 75609677     26367G>A   A   G    .      .   PR     GT                   0/1
    7 75609677    rs1135612   A   G    .      .   PR     GT                   0/1
  warnings.warn("Input VCF contains duplicate variants sharing the "
Saved VcfFrame[Imported] to: POR-pipeline/imported-variants.zip
/Users/sbslee/Desktop/pypgx/pypgx/api/utils.py:855: UserWarning: Only 1 overlapping variant, skip statistical phasing
  warnings.warn("Only 1 overlapping variant, skip statistical phasing")
Saved VcfFrame[Phased] to: POR-pipeline/phased-variants.zip
Saved VcfFrame[Consolidated] to: POR-pipeline/consolidated-variants.zip
Saved SampleTable[Alleles] to: POR-pipeline/alleles.zip
Saved SampleTable[Genotypes] to: POR-pipeline/genotypes.zip
Saved SampleTable[Phenotypes] to: POR-pipeline/phenotypes.zip
Saved SampleTable[Results] to: POR-pipeline/results.zip

If you want to try this yourself, please install the latest 0.18.0-dev branch of pypgx locally again. ALSO, you need to install the latest 0.36.0-dev branch of fuc locally (pypgx depends on fuc):

$ git clone https://github.com/sbslee/fuc
$ cd fuc
$ git checkout 0.36.0-dev
$ pip install .
ngslearner commented 2 years ago

Thank you so much @sbslee for your patient replies and resolving my queries. I ran it on my end and its working properly now.