heathsc / gemBS

gemBS is a bioinformatics pipeline designed for high throughput analysis of DNA methylation from Whole Genome Bisulfite Sequencing data (WGBS).
GNU General Public License v3.0

snp extract "Invalid format" #98

Open xiaoqiwang19 opened 1 year ago

xiaoqiwang19 commented 1 year ago

When I first ran gemBS extract, I did not see any variants in the _snps.txt.gz files; the _snps.txt.gz file is empty. I ran the following command to extract SNP information for my WGBS samples:

gemBS extract --snp-db /public/backup/users/wangxq/software/annovar/humandb/dbsnp_138.hg38.vcf.gz.idx -S -c -N -B -t 32

But I get the following error:

"Loading dbSNP header from /public/backup/users/wangxq/software/annovar/humandb/dbsnp_138.hg38.vcf.gz.idx Invalid format"

This happens whether I use a dbSNP file downloaded from NCBI or the dbSNP database downloaded from humandb. In both cases I first build the index with gemBS, and the format error then appears when I run the gemBS extract command. I don't know what format is required or why the error is reported. I need your help, thank you.

heathsc commented 1 year ago

The dbsnp index used by gemBS has its own format which is quite unlike the formats from the public databases. You need to download the VCF or BED files from dbSNP and then use the index subcommand in gemBS to generate the index.

Simon


xiaoqiwang19 commented 1 year ago

I downloaded dbSNP from NCBI and built the index using gemBS index. Can you give me an example dbSNP file or a URL? By the way, do the SNP results include homozygous reference SNP calls?

heathsc commented 1 year ago

I would recommend downloading the VCF format files from dbSNP, for example https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.40.gz

It is simplest to then add the following to the config file (and re-run gemBS prepare)

dbSNP_files = /path/to/file/GCF_000001405.40.gz

and then run gemBS index
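Before building the index, it can help to confirm that the downloaded file really is a VCF, since the "Invalid format" error above suggests gemBS was handed something it could not parse. The following is a minimal sketch in Python, not part of gemBS: the function name `looks_like_vcf` is hypothetical, and the only fact it relies on is that a VCF file begins with a `##fileformat=VCF` line.

```python
import gzip
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip/bgzip file


def looks_like_vcf(path):
    """Return True if the first line of the file is a '##fileformat=VCF...' header.

    Hypothetical helper for illustration; handles both plain and
    gzip/bgzip-compressed files.
    """
    path = Path(path)
    with open(path, "rb") as fh:
        magic = fh.read(2)
    # Pick the right opener depending on whether the file is compressed.
    opener = gzip.open if magic == GZIP_MAGIC else open
    with opener(path, "rt", encoding="utf-8", errors="replace") as fh:
        first_line = fh.readline()
    return first_line.startswith("##fileformat=VCF")


# Example (the path is a placeholder taken from the config line above):
# looks_like_vcf("/path/to/file/GCF_000001405.40.gz")
```

If this returns False for the file listed in the config, the problem is likely the input file rather than the gemBS index step.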

To get homozygous reference SNPs it is necessary to rerun the calling step after you have configured gemBS for dbSNP, otherwise only variant SNPs (or SNPs with C or G alleles) will be included. If you have already generated the BCF files you should move or remove them so that gemBS will redo the calling step.
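Moving the existing BCF files aside, as described above, could be scripted roughly as follows. This is a sketch under assumptions: the function name `stash_bcf_files` and both directory arguments are hypothetical, the actual location of the BCF files depends on your gemBS configuration, and treating every `*.bcf*` file (for example an index next to the BCF) as something to move is a guess rather than documented gemBS behaviour.

```python
import shutil
from pathlib import Path


def stash_bcf_files(call_dir, backup_dir):
    """Move every file matching *.bcf* under call_dir into backup_dir.

    Hypothetical helper: backup_dir is assumed to lie outside call_dir.
    The relative directory layout is preserved so files can be restored.
    Returns the list of new locations.
    """
    call_dir = Path(call_dir)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    # sorted() materialises the listing before anything is moved.
    for src in sorted(call_dir.rglob("*.bcf*")):
        if not src.is_file():
            continue
        dest = backup_dir / src.relative_to(call_dir)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dest))
        moved.append(dest)
    return moved
```

Moving rather than deleting keeps the old calls available for comparison with the new ones once the calling step has been redone.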

Simon


xiaoqiwang19 commented 1 year ago

Thank you very much. I followed your advice and reran the calling step after configuring gemBS for dbSNP. Then I ran the gemBS extract command, but I still do not see any variants in the _snps.txt.gz files; the _snps.txt.gz file is still empty. The snpxtr_sample.err file shows that the task finished, and the BCF file looks normal. I want to extract the SNP results; what should I do?
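To make "empty" precise when reporting this kind of problem, one option is to count the lines in the output file directly. The sketch below is illustrative only: `count_snp_records` is a hypothetical name, and because the layout of the _snps.txt.gz file is not described in this thread, it simply counts every non-blank line (any header lines included).

```python
import gzip


def count_snp_records(path):
    """Count non-blank lines in a gzip-compressed text file.

    Hypothetical helper; it makes no assumption about the column layout
    of the file and therefore also counts header lines, if any.
    """
    n = 0
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if line.strip():
                n += 1
    return n
```

A count of zero confirms that the file holds no records at all, which distinguishes a truly empty output from one that contains only a few calls.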