sbslee / pypgx

A Python package for pharmacogenomics (PGx) research
https://pypgx.readthedocs.io
MIT License
65 stars 12 forks source link

Error finding file in ngs run #105

Closed muhligs closed 1 year ago

muhligs commented 1 year ago

I first want to thank you for a great job with building and documenting the pypgx software. Really amazing and appreciated!!

I write because I get an error when running the ngs pipeline:

#################################

pypgx run-ngs-pipeline \
CYP2D6 \
test1_CYP2D6-pipeline \
--variants test1_variants.vcf.gz \
--depth-of-coverage test1_depth-of-coverage.zip \
--control-statistics test1_control-statistics-VDR.zip \
--assembly GRCh38

results in: FileNotFoundError: [Errno 2] No such file or directory: 'test1_CYP2D6-pipeline/allele-fraction-profile//faststorage/project/pharmgene/get-rm/bam/hg38/ERR1955330/ERR1955330.aligned.sorted.markdup.bam.png' please note the double '//' in this path. #################################

The preceding commands to construct the inputs are: #################################

conda activate pypgx # v. 0.21.0
# make an input vcf
pypgx create-input-vcf \
test1_variants.vcf.gz \
/faststorage/project/MomaReference/BACKUP/hg38/reference_genome/GCA_000001405.15_GRCh38_no_alt_analysis_set.GRC_exclusions_masked.fna \
/faststorage/project/pharmgene/get-rm/bam/hg38/ERR1955330/ERR1955330.aligned.sorted.markdup.bam \
--assembly GRCh38
# compute depth of coverage:
pypgx prepare-depth-of-coverage \
test1_depth-of-coverage.zip \
/faststorage/project/pharmgene/get-rm/bam/hg38/ERR1955330/ERR1955330.aligned.sorted.markdup.bam \
--assembly GRCh38 \
--genes GSTT1 \
--exclude

# compute control statistics for VDR gene
pypgx compute-control-statistics \
VDR \
test1_control-statistics-VDR.zip \
/faststorage/project/pharmgene/get-rm/bam/hg38/ERR1955330/ERR1955330.aligned.sorted.markdup.bam \
--assembly GRCh38

#################################

I hope you are able to help. My first thought was that maybe the path to the bam file is somehow not parseable by the software, given the '//' in the not found file path.

Thanks for your time.

sbslee commented 1 year ago

@muhligs,

Thanks for the kind words!

As for your issue, yes, it probably has to do with the // business. It appears that the entire BAM path was set as the sample name because the BAM header was missing the sample name tag (SM). This issue was recently raised by another user who replied in #96:

@hongsheng-lai,

Thanks for sending the files. The problem was caused because the sample name in your VCF file hprc.HG002.index_hs37d5.bwa_pypgx.vcf.gz (Name = /volume/pgxdata/hprc/bam/hprc.HG002.index_hs37d5.bwa.bam) is different from that of the copy-number.zip file (Name = hprc.HG002.index_hs37d5.bwa).

I've never seen this issue before, but my guess is that the sample name tag (SM) in your BAM is not properly set and that's why the file path or file name was used to represent sample name.

What's the output of:

$ fuc bam-head /volume/pgxdata/hprc/bam/hprc.HG002.index_hs37d5.bwa.bam

The fuc package should have been installed when you installed PyPGx. You could also simply use samtools:

$ samtools view -H /volume/pgxdata/hprc/bam/hprc.HG002.index_hs37d5.bwa.bam

Q1. Could you please share with me via email the three input files so that we can be sure the sample name is indeed causing the problem?

test1_variants.vcf.gz
test1_depth-of-coverage.zip
test1_control-statistics-VDR.zip

Q2. Could you show me the output of the below command (i.e. BAM header)?

$ fuc bam-head /faststorage/project/pharmgene/get-rm/bam/hg38/ERR1955330/ERR1955330.aligned.sorted.markdup.bam

The fuc package should have been installed when you installed PyPGx. You could also simply use samtools:

$ samtools view -H /faststorage/project/pharmgene/get-rm/bam/hg38/ERR1955330/ERR1955330.aligned.sorted.markdup.bam
muhligs commented 1 year ago

Hi again, I will send the files. Here is the output of the command:

@HD VN:1.6  SO:coordinate
@SQ SN:chr1 LN:248956422
@SQ SN:chr2 LN:242193529
@SQ SN:chr3 LN:198295559
@SQ SN:chr4 LN:190214555
@SQ SN:chr5 LN:181538259
@SQ SN:chr6 LN:170805979
@SQ SN:chr7 LN:159345973
@SQ SN:chr8 LN:145138636
@SQ SN:chr9 LN:138394717
@SQ SN:chr10    LN:133797422
@SQ SN:chr11    LN:135086622
@SQ SN:chr12    LN:133275309
@SQ SN:chr13    LN:114364328
@SQ SN:chr14    LN:107043718
@SQ SN:chr15    LN:101991189
@SQ SN:chr16    LN:90338345
@SQ SN:chr17    LN:83257441
@SQ SN:chr18    LN:80373285
@SQ SN:chr19    LN:58617616
@SQ SN:chr20    LN:64444167
@SQ SN:chr21    LN:46709983
@SQ SN:chr22    LN:50818468
@SQ SN:chrX LN:156040895
@SQ SN:chrY LN:57227415
@SQ SN:chrM LN:16569
@SQ SN:chr1_KI270706v1_random   LN:175055
@SQ SN:chr1_KI270707v1_random   LN:32032
@SQ SN:chr1_KI270708v1_random   LN:127682
@SQ SN:chr1_KI270709v1_random   LN:66860
@SQ SN:chr1_KI270710v1_random   LN:40176
@SQ SN:chr1_KI270711v1_random   LN:42210
@SQ SN:chr1_KI270712v1_random   LN:176043
@SQ SN:chr1_KI270713v1_random   LN:40745
@SQ SN:chr1_KI270714v1_random   LN:41717
@SQ SN:chr2_KI270715v1_random   LN:161471
@SQ SN:chr2_KI270716v1_random   LN:153799
@SQ SN:chr3_GL000221v1_random   LN:155397
@SQ SN:chr4_GL000008v2_random   LN:209709
@SQ SN:chr5_GL000208v1_random   LN:92689
@SQ SN:chr9_KI270717v1_random   LN:40062
@SQ SN:chr9_KI270718v1_random   LN:38054
@SQ SN:chr9_KI270719v1_random   LN:176845
@SQ SN:chr9_KI270720v1_random   LN:39050
@SQ SN:chr11_KI270721v1_random  LN:100316
@SQ SN:chr14_GL000009v2_random  LN:201709
@SQ SN:chr14_GL000225v1_random  LN:211173
@SQ SN:chr14_KI270722v1_random  LN:194050
@SQ SN:chr14_GL000194v1_random  LN:191469
@SQ SN:chr14_KI270723v1_random  LN:38115
@SQ SN:chr14_KI270724v1_random  LN:39555
@SQ SN:chr14_KI270725v1_random  LN:172810
@SQ SN:chr14_KI270726v1_random  LN:43739
@SQ SN:chr15_KI270727v1_random  LN:448248
@SQ SN:chr16_KI270728v1_random  LN:1872759
@SQ SN:chr17_GL000205v2_random  LN:185591
@SQ SN:chr17_KI270729v1_random  LN:280839
@SQ SN:chr17_KI270730v1_random  LN:112551
@SQ SN:chr22_KI270731v1_random  LN:150754
@SQ SN:chr22_KI270732v1_random  LN:41543
@SQ SN:chr22_KI270733v1_random  LN:179772
@SQ SN:chr22_KI270734v1_random  LN:165050
@SQ SN:chr22_KI270735v1_random  LN:42811
@SQ SN:chr22_KI270736v1_random  LN:181920
@SQ SN:chr22_KI270737v1_random  LN:103838
@SQ SN:chr22_KI270738v1_random  LN:99375
@SQ SN:chr22_KI270739v1_random  LN:73985
@SQ SN:chrY_KI270740v1_random   LN:37240
@SQ SN:chrUn_KI270302v1 LN:2274
@SQ SN:chrUn_KI270304v1 LN:2165
@SQ SN:chrUn_KI270303v1 LN:1942
@SQ SN:chrUn_KI270305v1 LN:1472
@SQ SN:chrUn_KI270322v1 LN:21476
@SQ SN:chrUn_KI270320v1 LN:4416
@SQ SN:chrUn_KI270310v1 LN:1201
@SQ SN:chrUn_KI270316v1 LN:1444
@SQ SN:chrUn_KI270315v1 LN:2276
@SQ SN:chrUn_KI270312v1 LN:998
@SQ SN:chrUn_KI270311v1 LN:12399
@SQ SN:chrUn_KI270317v1 LN:37690
@SQ SN:chrUn_KI270412v1 LN:1179
@SQ SN:chrUn_KI270411v1 LN:2646
@SQ SN:chrUn_KI270414v1 LN:2489
@SQ SN:chrUn_KI270419v1 LN:1029
@SQ SN:chrUn_KI270418v1 LN:2145
@SQ SN:chrUn_KI270420v1 LN:2321
@SQ SN:chrUn_KI270424v1 LN:2140
@SQ SN:chrUn_KI270417v1 LN:2043
@SQ SN:chrUn_KI270422v1 LN:1445
@SQ SN:chrUn_KI270423v1 LN:981
@SQ SN:chrUn_KI270425v1 LN:1884
@SQ SN:chrUn_KI270429v1 LN:1361
@SQ SN:chrUn_KI270442v1 LN:392061
@SQ SN:chrUn_KI270466v1 LN:1233
@SQ SN:chrUn_KI270465v1 LN:1774
@SQ SN:chrUn_KI270467v1 LN:3920
@SQ SN:chrUn_KI270435v1 LN:92983
@SQ SN:chrUn_KI270438v1 LN:112505
@SQ SN:chrUn_KI270468v1 LN:4055
@SQ SN:chrUn_KI270510v1 LN:2415
@SQ SN:chrUn_KI270509v1 LN:2318
@SQ SN:chrUn_KI270518v1 LN:2186
@SQ SN:chrUn_KI270508v1 LN:1951
@SQ SN:chrUn_KI270516v1 LN:1300
@SQ SN:chrUn_KI270512v1 LN:22689
@SQ SN:chrUn_KI270519v1 LN:138126
@SQ SN:chrUn_KI270522v1 LN:5674
@SQ SN:chrUn_KI270511v1 LN:8127
@SQ SN:chrUn_KI270515v1 LN:6361
@SQ SN:chrUn_KI270507v1 LN:5353
@SQ SN:chrUn_KI270517v1 LN:3253
@SQ SN:chrUn_KI270529v1 LN:1899
@SQ SN:chrUn_KI270528v1 LN:2983
@SQ SN:chrUn_KI270530v1 LN:2168
@SQ SN:chrUn_KI270539v1 LN:993
@SQ SN:chrUn_KI270538v1 LN:91309
@SQ SN:chrUn_KI270544v1 LN:1202
@SQ SN:chrUn_KI270548v1 LN:1599
@SQ SN:chrUn_KI270583v1 LN:1400
@SQ SN:chrUn_KI270587v1 LN:2969
@SQ SN:chrUn_KI270580v1 LN:1553
@SQ SN:chrUn_KI270581v1 LN:7046
@SQ SN:chrUn_KI270579v1 LN:31033
@SQ SN:chrUn_KI270589v1 LN:44474
@SQ SN:chrUn_KI270590v1 LN:4685
@SQ SN:chrUn_KI270584v1 LN:4513
@SQ SN:chrUn_KI270582v1 LN:6504
@SQ SN:chrUn_KI270588v1 LN:6158
@SQ SN:chrUn_KI270593v1 LN:3041
@SQ SN:chrUn_KI270591v1 LN:5796
@SQ SN:chrUn_KI270330v1 LN:1652
@SQ SN:chrUn_KI270329v1 LN:1040
@SQ SN:chrUn_KI270334v1 LN:1368
@SQ SN:chrUn_KI270333v1 LN:2699
@SQ SN:chrUn_KI270335v1 LN:1048
@SQ SN:chrUn_KI270338v1 LN:1428
@SQ SN:chrUn_KI270340v1 LN:1428
@SQ SN:chrUn_KI270336v1 LN:1026
@SQ SN:chrUn_KI270337v1 LN:1121
@SQ SN:chrUn_KI270363v1 LN:1803
@SQ SN:chrUn_KI270364v1 LN:2855
@SQ SN:chrUn_KI270362v1 LN:3530
@SQ SN:chrUn_KI270366v1 LN:8320
@SQ SN:chrUn_KI270378v1 LN:1048
@SQ SN:chrUn_KI270379v1 LN:1045
@SQ SN:chrUn_KI270389v1 LN:1298
@SQ SN:chrUn_KI270390v1 LN:2387
@SQ SN:chrUn_KI270387v1 LN:1537
@SQ SN:chrUn_KI270395v1 LN:1143
@SQ SN:chrUn_KI270396v1 LN:1880
@SQ SN:chrUn_KI270388v1 LN:1216
@SQ SN:chrUn_KI270394v1 LN:970
@SQ SN:chrUn_KI270386v1 LN:1788
@SQ SN:chrUn_KI270391v1 LN:1484
@SQ SN:chrUn_KI270383v1 LN:1750
@SQ SN:chrUn_KI270393v1 LN:1308
@SQ SN:chrUn_KI270384v1 LN:1658
@SQ SN:chrUn_KI270392v1 LN:971
@SQ SN:chrUn_KI270381v1 LN:1930
@SQ SN:chrUn_KI270385v1 LN:990
@SQ SN:chrUn_KI270382v1 LN:4215
@SQ SN:chrUn_KI270376v1 LN:1136
@SQ SN:chrUn_KI270374v1 LN:2656
@SQ SN:chrUn_KI270372v1 LN:1650
@SQ SN:chrUn_KI270373v1 LN:1451
@SQ SN:chrUn_KI270375v1 LN:2378
@SQ SN:chrUn_KI270371v1 LN:2805
@SQ SN:chrUn_KI270448v1 LN:7992
@SQ SN:chrUn_KI270521v1 LN:7642
@SQ SN:chrUn_GL000195v1 LN:182896
@SQ SN:chrUn_GL000219v1 LN:179198
@SQ SN:chrUn_GL000220v1 LN:161802
@SQ SN:chrUn_GL000224v1 LN:179693
@SQ SN:chrUn_KI270741v1 LN:157432
@SQ SN:chrUn_GL000226v1 LN:15008
@SQ SN:chrUn_GL000213v1 LN:164239
@SQ SN:chrUn_KI270743v1 LN:210658
@SQ SN:chrUn_KI270744v1 LN:168472
@SQ SN:chrUn_KI270745v1 LN:41891
@SQ SN:chrUn_KI270746v1 LN:66486
@SQ SN:chrUn_KI270747v1 LN:198735
@SQ SN:chrUn_KI270748v1 LN:93321
@SQ SN:chrUn_KI270749v1 LN:158759
@SQ SN:chrUn_KI270750v1 LN:148850
@SQ SN:chrUn_KI270751v1 LN:150742
@SQ SN:chrUn_KI270752v1 LN:27745
@SQ SN:chrUn_KI270753v1 LN:62944
@SQ SN:chrUn_KI270754v1 LN:40191
@SQ SN:chrUn_KI270755v1 LN:36723
@SQ SN:chrUn_KI270756v1 LN:79590
@SQ SN:chrUn_KI270757v1 LN:71251
@SQ SN:chrUn_GL000214v1 LN:137718
@SQ SN:chrUn_KI270742v1 LN:186739
@SQ SN:chrUn_GL000216v2 LN:176608
@SQ SN:chrUn_GL000218v1 LN:161147
@SQ SN:chrEBV   LN:171823
@PG ID:bwa  PN:bwa  VN:0.7.17-r1188 CL:bwa mem -p -Y -K 100000000 -t 24 /faststorage/project/MomaReference/BACKUP/hg38/software_indexes/bwa/GCA_000001405.15_GRCh38_no_alt_analysis_set.GRC_exclusions_masked.fna -
@PG ID:samtools PN:samtools PP:bwa  VN:1.14 CL:samtools sort -o ERR1955330.bam -
@PG ID:MarkDuplicates   VN:Version:2.27.0   CL:MarkDuplicates --INPUT ERR1955330.bam --OUTPUT ERR1955330.aligned.sorted.markdup.bam --METRICS_FILE ERR1955330.markdup_metrics.txt --ASSUME_SORTED true --OPTICAL_DUPLICATE_PIXEL_DISTANCE 2500 --VALIDATION_STRINGENCY SILENT --MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP 50000 --MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 8000 --SORTING_COLLECTION_SIZE_RATIO 0.25 --TAG_DUPLICATE_SET_MEMBERS false --REMOVE_SEQUENCING_DUPLICATES false --TAGGING_POLICY DontTag --CLEAR_DT true --DUPLEX_UMI false --ADD_PG_TAG_TO_READS true --REMOVE_DUPLICATES false --DUPLICATE_SCORING_STRATEGY SUM_OF_BASE_QUALITIES --PROGRAM_RECORD_ID MarkDuplicates --PROGRAM_GROUP_NAME MarkDuplicates --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --MAX_OPTICAL_DUPLICATE_SET_SIZE 300000 --VERBOSITY INFO --QUIET false --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false   PN:MarkDuplicates
sbslee commented 1 year ago

@muhligs,

Thanks for sending the header. As you can see, there is no read group tag (i.e. a line that starts with @RG) in the header, which should contain the sample name field (SM). You usually specify read group during the alignment step (when running bwa mem in your case). Additionally, I checked your attached files and, as speculated, the original BAM path was set as sample name.

Please add read group to your BAM to give a proper sample name and then run PyPGx on the new BAM. You can find more examples of read group in the Internet, but below is one example:

@RG ID:H0164.2  PL:illumina PU:H0164ALXX140820.2    LB:Solexa-272222    PI:0    DT:2014-08-20T00:00:00-0400 SM:NA12878  CN:BI
muhligs commented 1 year ago

I will do that. Thanks for your quick and thorough reply.

sbslee commented 1 year ago

FYI, you can easily add read group to existing BAM using existing tools like samtools addreplacerg.

muhligs commented 1 year ago

Thanks again. That was (also) useful :)