hringbauer / ancIBD

Detecting IBD within low coverage ancient DNA data. Development Repository for software package that contains code for manuscript.
GNU General Public License v3.0
9 stars 3 forks source link

Merging in LD Map.. results in error #8

Closed batelz closed 1 year ago

batelz commented 1 year ago

Hi again. This issue can be merged with #5.

I've used GLIMPSE to impute two samples, and resulted with the following vcf file:

(base) batel:~/glimpse/Levant_BA$ bcftools view merged.vcf.gz | head -20
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=01/08/2023 - 11:43:29
##source=GLIMPSE_phase v2.0.0
##contig=<ID=chr22>
##INFO=<ID=RAF,Number=A,Type=Float,Description="ALT allele frequency in the reference panel">
##INFO=<ID=AF,Number=A,Type=Float,Description="ALT allele frequency computed from DS/GP field across target samples">
##INFO=<ID=INFO,Number=A,Type=Float,Description="IMPUTE info quality score for diploid samples">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Phased genotypes">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Genotype dosage">
##FORMAT=<ID=GP,Number=3,Type=Float,Description="Genotype posteriors">
##NMAIN=15
##FPLOIDY=2
##bcftools_mergeVersion=1.10.2+htslib-1.10.2-3ubuntu0.1
##bcftools_mergeCommand=merge -l GLIMPSE_ligate/samples_merge.txt -o merged.vcf.gz; Date=Tue Aug  1 13:03:00 2023
##bcftools_viewVersion=1.10.2+htslib-1.10.2-3ubuntu0.1
##bcftools_viewCommand=view merged.vcf.gz; Date=Tue Aug  1 13:04:29 2023
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  S10769.SN   S10770.SN
chr22   10519276    22:10519276:G:C G   C   .   .   RAF=0.0377889;AF=0.0572749;INFO=1   GT:DS:GP    0|0:0.106:0.896,0.101,0.003 0|0:0.115:0.888,0.108,0.004
chr22   10519389    22:10519389:T:C T   C   .   .   RAF=0.0376327;AF=0.0567751;INFO=1   GT:DS:GP    0|0:0.104:0.898,0.098,0.004 0|0:0.114:0.889,0.107,0.004

When trying to convert to HD5 using

from ancIBD.IO.prepare_h5 import vcf_to_1240K_hdf
ch = 22
#base_path = f"/n/groups/reich/hringbauer/git/hapBLOCK"
vcf_to_1240K_hdf(#in_vcf_path = f"./data/vcf.raw/example_hazelton_chr{ch}.vcf.gz",
    in_vcf_path = f"./data/vcf.raw/merged_chr22_GT_GP.vcf.gz",
    path_vcf = f"./data/vcf.1240k/example_hazelton_chr{ch}.vcf.gz",
                 path_h5 = f"./data/hdf5/example_hazelton_chr{ch}.h5",
                 marker_path = f"./data/filters/snps_bcftools_ch{ch}.csv",
                 map_path = f"./data/v51.1_1240k.snp",
                 af_path = f"./data/afs/v51.1_1240k_AF_ch{ch}.tsv",
                 col_sample_af = "",
                 buffer_size=20000, chunk_width=8, chunk_length=20000,
                 ch=ch)

I get the following error:

Print downsampling to 1240K...
Finished BCF tools filtering.
Converting to HDF5...
Finished conversion to hdf5!
Merging in LD Map..
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
/tmp/ipykernel_7983/2495295663.py in <module>
      2 ch = 22
      3 #base_path = f"/n/groups/reich/hringbauer/git/hapBLOCK"
----> 4 vcf_to_1240K_hdf(#in_vcf_path = f"./data/vcf.raw/example_hazelton_chr{ch}.vcf.gz",
      5     in_vcf_path = f"./data/vcf.raw/merged_chr22_GT_GP.vcf.gz",
      6     path_vcf = f"./data/vcf.1240k/example_hazelton_chr{ch}.vcf.gz",

~/anaconda3/lib/python3.9/site-packages/ancIBD/IO/prepare_h5.py in vcf_to_1240K_hdf(in_vcf_path, path_vcf, path_h5, marker_path, map_path, af_path, col_sample_af, chunk_length, chunk_width, buffer_size, ch)
    125     print("Merging in LD Map..")
    126     if len(map_path)>0:
--> 127         merge_in_ld_map(path_h5=path_h5, 
    128                     path_snp1240k=map_path,
    129                     chs=[ch])

~/anaconda3/lib/python3.9/site-packages/ancIBD/IO/h5_modify.py in merge_in_ld_map(path_h5, path_snp1240k, chs, write_mode)
    115     chs: Which Chromosomes to merge in HDF5 [list].
    116     write_mode: Which mode to use on hdf5. a: New field. r+: Change Field"""
--> 117     with h5py.File(path_h5, "r") as f:
    118         print("Lifting LD Map from eigenstrat to HDF5...")
    119         print("Loaded %i variants." % np.shape(f["calldata/GT"])[0])

~/anaconda3/lib/python3.9/site-packages/h5py/_hl/files.py in __init__(self, name, mode, driver, libver, userblock_size, swmr, rdcc_nslots, rdcc_nbytes, rdcc_w0, track_order, fs_strategy, fs_persist, fs_threshold, fs_page_size, page_buf_size, min_meta_keep, min_raw_keep, locking, alignment_threshold, alignment_interval, **kwds)
    531                                  fs_persist=fs_persist, fs_threshold=fs_threshold,
    532                                  fs_page_size=fs_page_size)
--> 533                 fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
    534 
    535             if isinstance(libver, tuple):

~/anaconda3/lib/python3.9/site-packages/h5py/_hl/files.py in make_fid(name, mode, userblock_size, fapl, fcpl, swmr)
    224         if swmr and swmr_support:
    225             flags |= h5f.ACC_SWMR_READ
--> 226         fid = h5f.open(name, flags, fapl=fapl)
    227     elif mode == 'r+':
    228         fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/h5f.pyx in h5py.h5f.open()

FileNotFoundError: [Errno 2] Unable to open file (unable to open file: name = './data/hdf5/example_hazelton_chr22.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)

The directory does exist, as I get no error when running: os.listdir("./data/hdf5/") and I've also tried writing the absolute path.

Moreover, when running the example you provided -- just by switching the in_vcf it runs perfectly fine. So it must be something in the vcf file iteslf.

Any ideas?

Batel

hringbauer commented 1 year ago

It seems like the output hdf5 file is never produced, and then the last steps (adding map position and allele frequencies to this file) fail because the program cannot find this file.

So the error occurs before that - either when producing the 1240k-filtered VCF or the initial output hdf5 file.

Is the intermediate 1240k VCF produced? It should be at f"./data/vcf.1240k/example_hazelton_chr{ch}.vcf.gz"

batelz commented 1 year ago

The file is there, but it's empty:

(base):~/ancIBD/data/vcf.1240k$ bcftools view example_hazelton_chr22.vcf.gz
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=01/08/2023 - 11:43:29
##source=GLIMPSE_phase v2.0.0
##contig=<ID=chr22>
##INFO=<ID=RAF,Number=A,Type=Float,Description="ALT allele frequency in the reference panel">
##INFO=<ID=AF,Number=A,Type=Float,Description="ALT allele frequency computed from DS/GP field across target samples">
##INFO=<ID=INFO,Number=A,Type=Float,Description="IMPUTE info quality score for diploid samples">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Phased genotypes">
##FORMAT=<ID=GP,Number=3,Type=Float,Description="Genotype posteriors">
##NMAIN=15
##FPLOIDY=2
##bcftools_mergeVersion=1.10.2+htslib-1.10.2-3ubuntu0.1
##bcftools_mergeCommand=merge -l GLIMPSE_ligate/samples_merge.txt -o merged.vcf.gz; Date=Tue Aug  1 13:03:00 2023
##bcftools_annotateVersion=1.10.2+htslib-1.10.2-3ubuntu0.1
##bcftools_annotateCommand=annotate -x ^FORMAT/GT,FORMAT/GP -o GLIMPSE_ligate/merged_chr22_GT_GP.vcf.gz merged.vcf.gz; Date=Tue Aug  1 13:04:43 2023
##bcftools_viewVersion=1.10.2+htslib-1.10.2-3ubuntu0.1
##bcftools_viewCommand=view -Oz -o ./data/vcf.1240k/example_hazelton_chr22.vcf.gz -T ./data/filters/snps_bcftools_ch22.csv -M2 -v snps ./data/vcf.raw/merged_chr22_GT_GP.vcf.gz; Date=Tue Aug  1 13:28:53 2023
##bcftools_viewCommand=view example_hazelton_chr22.vcf.gz; Date=Mon Aug 21 13:20:39 2023
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  S10769.SN       S10770.SN
hringbauer commented 1 year ago

Then I would recommend manually running the bcftools command that creates that intermediary 1240k VCF - and trouble-shoot what goes wrong (nothing matches the SNP filter is the obvious first thing to look into).

Indeed, I see from the above that you have chr22notation, but the filter file has 22 only (without the chr). You should transform your VCF accordingly to match the latter notation.