sbslee / pypgx

A Python package for pharmacogenomics (PGx) research
https://pypgx.readthedocs.io
MIT License
66 stars, 13 forks

Working with Large BAM files #141

Closed: abheda24 closed this issue 1 month ago

abheda24 commented 1 month ago

I am trying to test PyPGx on the 1000 Genomes database. The CRAM file is around 13 GB, and after conversion to BAM it is around 34 GB. I am trying to use create-input-vcf to generate a VCF file; the input files are sorted and indexed. I don't know what the issue is, but I was only able to generate a file of just 370K, which seems very small, and the process completed in 12 secs. I installed PyPGx via pip and also downloaded the pypgx bundle. Could you please guide me?

sbslee commented 1 month ago

Hi @abheda24,

  1. Is the CRAM file you used a high-coverage (e.g., 30x) WGS sample?
  2. When running the CLI, did you make sure that the genome build is correct (GRCh37 vs. GRCh38)?
  3. Which PyPGx version are you using?
  4. I recently published a paper where I applied PyPGx to the entire 2,504 samples from 1KGP.
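
For the genome build question above, one quick check is to compare the length of chromosome 1 in the reference FASTA index (`.fai`) against the known lengths for each assembly. The following is a minimal sketch, not part of PyPGx: the function name `guess_build_from_fai` is hypothetical, and it assumes a standard tab-separated `.fai` file whose first two columns are the sequence name and length.

```python
# Known chromosome 1 lengths for the two assemblies.
CHR1_LENGTHS = {
    249250621: "GRCh37",
    248956422: "GRCh38",
}


def guess_build_from_fai(lines):
    """Return 'GRCh37', 'GRCh38', or None based on chr1's length in .fai lines.

    Hypothetical helper for illustration; `lines` is any iterable of
    tab-separated .fai records (name, length, offset, ...).
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, length = fields[0], fields[1]
        if name in ("1", "chr1"):
            return CHR1_LENGTHS.get(int(length))
    return None


# Usage with a real index file (the path is illustrative):
# with open("reference.fa.fai") as handle:
#     print(guess_build_from_fai(handle))
```

This only confirms which assembly the reference belongs to; the alignments in the CRAM file still need to have been made against that same reference.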

abheda24 commented 1 month ago

  1. Yes, it's a high-coverage WGS sample (30x) from the 1000 Genomes database, around 13 GB.
  2. Yes, I used GRCh38, which is correct.
  3. The version I am using is 0.25.0.
  4. I will check the implementation, thanks.

sbslee commented 1 month ago

Can you share the exact CLI you used for creating the VCF and also the exact terminal output?

abheda24 commented 1 month ago

    pypgx create-input-vcf \
      ~/pgx_pipeline/input/NA06991-variants.vcf.gz \
      ~/pgx_pipeline/input/GRCh38_full_analysis_set_plus_decoy_hla.fa \
      ~/pgx_pipeline/input/NA06991.cram \
      --assembly GRCh38

The command produced a 369 KB output file, and the process completed in 0.12 minutes.

sbslee commented 1 month ago

Could you send me the output VCF?

abheda24 commented 1 month ago

NA06991-variants.vcf.gz

sbslee commented 1 month ago

Thanks. The VCF file looks fine to me. Have you tried running PyPGx on it?
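
If, as the PyPGx documentation describes, create-input-vcf calls variants only in the target pharmacogene regions rather than genome-wide, a file of a few hundred KB from a 30x WGS sample would not be surprising. One way to sanity-check such a file before running the pipeline is to count its records per contig. This is a minimal sketch using only the Python standard library; the function name `count_vcf_records` is hypothetical and not part of PyPGx.

```python
import gzip
from collections import Counter


def count_vcf_records(path):
    """Count data records per contig in a gzip- or bgzip-compressed VCF.

    Hypothetical helper for illustration. Header lines (starting with '#')
    and blank lines are skipped; the contig is the first tab-separated field.
    """
    counts = Counter()
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue
            counts[line.split("\t", 1)[0]] += 1
    return counts


# Usage (the file name is taken from the thread; adjust the path as needed):
# for contig, n in sorted(count_vcf_records("NA06991-variants.vcf.gz").items()):
#     print(contig, n)
```

A non-empty result spread across several chromosomes suggests that variant calling ran over multiple regions, though it does not by itself confirm that every target gene was covered.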

abheda24 commented 1 month ago

I will run the NGS pipeline and let you know. You can close the issue. Thanks for your response.