sbslee / pypgx

A Python package for pharmacogenomics (PGx) research
https://pypgx.readthedocs.io
MIT License
66 stars 13 forks source link

create-input-vcf error #101

Closed sharkLoc closed 1 year ago

sharkLoc commented 1 year ago

data type: Targeted

command: pypgx create-input-vcf --max-depth 5000 --assembly GRCh38 out.vcf.gz ../ref_hg38/part.fa Sample_JZ23.bam

error info:

Traceback (most recent call last):
  File "/home/biotools/conda/miniconda3/bin/pypgx", line 8, in <module>
    sys.exit(main())
  File "/home/biotools/conda/miniconda3/lib/python3.9/site-packages/pypgx/__main__.py", line 33, in main
    commands[args.command].main(args)
  File "/home/biotools/conda/miniconda3/lib/python3.9/site-packages/pypgx/cli/create_input_vcf.py", line 90, in main
    utils.create_input_vcf(
  File "/home/biotools/conda/miniconda3/lib/python3.9/site-packages/pypgx/api/utils.py", line 678, in create_input_vcf
    pyvcf.call(fasta=fasta, bams=bams, regions=bf, path=vcf, gap_frac=0,
  File "/home/biotools/conda/miniconda3/lib/python3.9/site-packages/fuc/api/pyvcf.py", line 318, in call
    bcftools.mpileup(*(args + bams), catch_stdout=False)
  File "/home/biotools/conda/miniconda3/lib/python3.9/site-packages/pysam/utils.py", line 83, in __call__
    raise SamtoolsError(
pysam.utils.SamtoolsError: 'bcftools returned with error 1: stdout=None, stderr=[E::mpileup] the sequence "chr1" not found: Sample_JZ23.bam\n'

any suggestion?

lexotero commented 1 year ago

Hello @sharkLoc,

I just came across your question. I don't know anything about genomics, but I can make an uneducated guess based on reading the code.

It would seem that create-input-vcf creates a BED file with all the regions used by PyPGx. The regions in the BED file are identified by chromosome. I'm assuming the generation of the VCF file from the FASTA and the BAM will use the information in the BED file to constrain the search. Following this assumption, if the BED file has a region in chromosome 1 (i.e. chr1), but your BAM file does not have chromosome 1 information, the process will fail.

the sequence "chr1" not found: Sample_JZ23.bam

The error message you posted seems to say something on those lines.

I hope this plus your knowledge of the input files and the field will help you track the error!

sbslee commented 1 year ago

Hi @sharkLoc,

I believe I'll require a bit more information to address the issue.

  1. What's the output of the command $ fuc bam-head Sample_JZ23.bam? The fuc package should be already installed in your PyPGx environment because it is one of the dependencies required to run PyPGx.
  2. What's the output of the command $ cat ref_hg38/part.fa | grep ">"

And a special thank you to @lexotero for joining the discussion! In theory, the command should not fail just because the input BAM file is missing a certain chromosome, as not all BAM files are generated by whole genome sequencing (e.g. targeted sequencing).

sharkLoc commented 1 year ago

1、After running the command, output nothing。 2、cat ref_hg38/part.fa | grep ">"

>chr5
>chr14
>chr22

my data is Amplicon sequencing. I was aligned reads to hg38(just chr5,14,22)

sharkLoc commented 1 year ago

你好 ,

我刚刚遇到了你的问题。我对基因组学一无所知,但我可以根据阅读代码做出未经教育的猜测。

似乎会创建一个包含 PyPGx 使用的所有区域的 BED 文件。BED文件中的区域由染色体标识。我假设从 FASTA 生成 VCF 文件,BAM 将使用 BED 文件中的信息来限制搜索。按照此假设,如果 BED 文件在染色体 1 中有一个区域(即),但您的 BAM 文件没有染色体 1 信息,则该过程将失败。create-input-vcf``chr1

the sequence "chr1" not found: Sample_JZ23.bam

您发布的错误消息似乎在这些行上说明了一些内容。

我希望这加上您对输入文件和字段的了解将帮助您跟踪错误!

Hi, @lexotero
chr1 is not included in my bed file

sbslee commented 1 year ago

@sharkLoc,

Thanks for sharing the outputs.

1、After running the command, output nothing。

This is concerning because it implies that your BAM file is lacking its header, which could potentially indicate corruption or improper formatting. How was this file generated?

If the file isn't too large, could you please share the BAM file with me at sbstevenlee@gmail.com? I need to examine the file to determine if there's a possibility of recovering it.

sharkLoc commented 1 year ago

This is concerning because it implies that your BAM file is lacking its header, which could potentially indicate corruption or improper formatting. How was this file generated?

I misunderstood you, pypgx create-input-vcf --max-depth 5000 --assembly GRCh38 out.vcf.gz .. /ref_hg38/part.fa Sample_JZ23.bam The output of this command is empty. My bam file is fine, it is generated by the bwa mem module

sbslee commented 1 year ago

@sharkLoc,

Then can you show me the output of the command $ fuc bam-head Sample_JZ23.bam? I need to see its header, specifically its contigs.

sharkLoc commented 1 year ago

@sharkLoc, Then can you show me the output of the command $ fuc bam-head Sample_JZ23.bam? I need to see its header, specifically its contigs.

output :

@HD VN:1.6  SO:coordinate
@SQ SN:chr5 LN:181538259
@SQ SN:chr14    LN:107043718
@SQ SN:chr22    LN:50818468
@RG ID:Sample_JZ23  PL:illumina SM:Sample_JZ23  LB:Sample_JZ23
@PG ID:bwa  PN:bwa  VN:0.7.17-r1188 CL:bwa mem -M -t 4 -R @RG\tID:Sample_JZ23\tPL:illumina\tSM:Sample_JZ23\tLB:Sample_JZ23 /home/sharkloc/work/ref38/part.fa /home/sharkloc/work/result/0.trim/Sample_JZ23/Sample_JZ23_clean_r1.fq /home/sharkloc/work/result/0.trim/Sample_JZ23/Sample_JZ23_clean_r2.fq
@PG ID:samtools PN:samtools PP:bwa  VN:1.14 CL:samtools view -Sb -o Sample_JZ23.unsrt.bam -
@PG ID:samtools.1   PN:samtools PP:samtools VN:1.14 CL:samtools sort -@ 4 -o Sample_JZ23.bam Sample_JZ23.unsrt.bam
sbslee commented 1 year ago

@sharkLoc,

Hmmm... I've never encountered this issue before. Would it be possible for you to share the BAM and FASTA files with me? I think I will need to tweak around the parameters of variant calling to solve this issue. Please send the files to my email above if that's OK with you. If you can't send the files, we will still try to find some ways. Please let me know.

sharkLoc commented 1 year ago

Hmmm... I've never encountered this issue before. Would it be possible for you to share the BAM and FASTA files with me? I think I will need to tweak around the parameters of variant calling to solve this issue. Please send the files to my email above if that's OK with you. If you can't send the files, we will still try to find some ways. Please let me know.

@sbslee , I am sorry that I am unable to provide my sequencing data. My data only contains capture amplification data on these three chromosomes.

part.fa:

samtools faidx hg38.fa chr5 > part.fa
samtools faidx hg38.fa chr14 >> part.fa
samtools faidx hg38.fa chr22 >> part.fa
sbslee commented 1 year ago

@sharkLoc,

No worries, I completely understand. In that case, I will use my own BAM file and try to filter out reads that were aligned to chr1 and see if I can reproduce the issue from my end. It may take some time though. I will get back to you ASAP!

sbslee commented 1 year ago

@sharkLoc,

OK, I was able to reproduce the error you reported.

The error was caused (probably) because bcftools was trying to access contigs (e.g. chr1) that are not present in your BAM file's header which only contains three contigs (chr5, chr14, and chr22).

I think this can be easily resolved by updating the header to include the other contigs using the samtools reheader command:

$ samtools reheader new_header.sam Sample_JZ23.bam > Sample_JZ23.new_header.bam
$ samtools index Sample_JZ23.new_header.bam

where new_header.sam looks like below:

@HD VN:1.6  SO:coordinate
@SQ SN:chr1 LN:248956422
@SQ SN:chr2 LN:242193529
@SQ SN:chr3 LN:198295559
@SQ SN:chr4 LN:190214555
@SQ SN:chr5 LN:181538259
@SQ SN:chr6 LN:170805979
@SQ SN:chr7 LN:159345973
@SQ SN:chr8 LN:145138636
@SQ SN:chr9 LN:138394717
@SQ SN:chr10    LN:133797422
@SQ SN:chr11    LN:135086622
@SQ SN:chr12    LN:133275309
@SQ SN:chr13    LN:114364328
@SQ SN:chr14    LN:107043718
@SQ SN:chr15    LN:101991189
@SQ SN:chr16    LN:90338345
@SQ SN:chr17    LN:83257441
@SQ SN:chr18    LN:80373285
@SQ SN:chr19    LN:58617616
@SQ SN:chr20    LN:64444167
@SQ SN:chr21    LN:46709983
@SQ SN:chr22    LN:50818468
@SQ SN:chrX LN:156040895
@SQ SN:chrY LN:57227415
@SQ SN:chrM LN:16569
@RG ID:Sample_JZ23  PL:illumina SM:Sample_JZ23  LB:Sample_JZ23
@PG ID:bwa  PN:bwa  VN:0.7.17-r1188 CL:bwa mem -M -t 4 -R @RG\tID:Sample_JZ23\tPL:illumina\tSM:Sample_JZ23\tLB:Sample_JZ23 /home/sharkloc/work/ref38/part.fa /home/sharkloc/work/result/0.trim/Sample_JZ23/Sample_JZ23_clean_r1.fq /home/sharkloc/work/result/0.trim/Sample_JZ23/Sample_JZ23_clean_r2.fq
@PG ID:samtools PN:samtools PP:bwa  VN:1.14 CL:samtools view -Sb -o Sample_JZ23.unsrt.bam -
@PG ID:samtools.1   PN:samtools PP:samtools VN:1.14 CL:samtools sort -@ 4 -o Sample_JZ23.bam Sample_JZ23.unsrt.bam

Please let me know if this doesn't work.

sharkLoc commented 1 year ago

@sharkLoc, OK, I was able to reproduce the error you reported.

Please let me know if this doesn't work.

it's worked, thanks