BAM Identifiers/Filenames Are Alphanumeric Strings When Viewing Results

erichards52 commented 1 year ago

Hello,

For reference I am using GeT-RM BAM files directly from the ENA: https://www.ebi.ac.uk/ebisearch/search?query=PRJEB19931&requestFrom=ebi_index&db=allebi

As an example, the data is stored like this: /data/bam_files/ERR195/ERR1955341/NA11993.bam

I ran the commands leading up to the pipeline and then called the pipeline itself as follows:

pypgx run-ngs-pipeline CYP2D6 grch37-CYP2D6-pipeline_get_rm_1 --variants grch37-variants_get_rm_1.vcf.gz --depth-of-coverage grch37-depth-of-coverage_get_rm_1.zip --control-statistics grch37-control-statistics-RYR1_get_rm_1.zip

I have followed the tutorial and no warnings/issues seem to be present, but when viewing the output of pypgx print-data grch37-CYP2D6-pipeline_get_rm_1/results.zip | head, I get the following:

Genotype        Phenotype       Haplotype1      Haplotype2      AlternativePhase        VariantData     CNV
20b87673c1224e9db8bdbbe82899309c        *4/*5   Poor Metabolizer        *4;*10;*74;*2;  *4;*10;*74;*2;  ;       *4:22-42524947-C-T:0.95;*10:22-42526694-G-A,22-42523943-A-G:1.0,1.0;*74:22-42525821-G-T:1.0;*2:default;     WholeDel1

Could you please tell me how I can get these names to show up in an informative/consistent way?

Thank you.

sbslee commented 1 year ago

@erichards52,

I'm not 100% sure I understand your question. When you say 'names', do you mean sample names such as 20b87673c1224e9db8bdbbe82899309c in the example you provided (and you want it to be displayed as HG00276 instead)? If that's the case, then you have to change the SM (sample name) tag in the RG (read group) section of the BAM file header. Basically, 20b87673c1224e9db8bdbbe82899309c is the sample name GeT-RM has chosen to assign to HG00276. Changing the SM tag is simple, but it can be tricky if you are not familiar with BAM manipulation. My advice is to leave the BAM files as they are and just manually change the sample names after you have generated diplotype calls. At least that's what I did for my publication. See the attached file for mapping sample names: TableS1.xlsx.
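Both options can be sketched in a few lines of Python. This is only an illustration, not part of pypgx: the function names and the example header are hypothetical, and the only ID-to-name pair used is the one mentioned in this thread. The first function renames sample IDs in the tab-separated text that pypgx print-data writes (assuming the sample ID is the first column and the first line is the header). The second rewrites the SM field in the text of a SAM header, which I assume you would dump with samtools view -H and apply back with samtools reheader.

```python
def rename_samples(table_text, id_to_name):
    """Replace sample IDs in the first column of a tab-separated table.

    The first line is treated as the header and left untouched.
    IDs missing from id_to_name are kept as they are.
    """
    renamed = []
    for i, line in enumerate(table_text.splitlines()):
        if i == 0:
            renamed.append(line)
            continue
        sample_id, sep, rest = line.partition("\t")
        renamed.append(id_to_name.get(sample_id, sample_id) + sep + rest)
    return "\n".join(renamed) + "\n"


def set_sample_name(header_text, new_name):
    """Set the SM field of every @RG line in SAM header text to new_name.

    Only existing SM fields are replaced; other lines and fields are kept.
    """
    out = []
    for line in header_text.splitlines():
        if line.startswith("@RG"):
            fields = line.split("\t")
            fields = [
                "SM:" + new_name if f.startswith("SM:") else f for f in fields
            ]
            line = "\t".join(fields)
        out.append(line)
    return "\n".join(out) + "\n"


# Mapping taken from this thread; a full mapping would come from TableS1.xlsx.
mapping = {"20b87673c1224e9db8bdbbe82899309c": "HG00276"}

# Shortened stand-in for the print-data output (hypothetical, three columns).
table = (
    "\tGenotype\tPhenotype\n"
    "20b87673c1224e9db8bdbbe82899309c\t*4/*5\tPoor Metabolizer\n"
)
print(rename_samples(table, mapping), end="")

# Hypothetical header; a real one would have more lines and fields.
header = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@RG\tID:rg1\tSM:20b87673c1224e9db8bdbbe82899309c\tPL:ILLUMINA\n"
)
print(set_sample_name(header, "HG00276"), end="")
```

The first function is the lower-risk route because it never touches the BAM files. The second only edits header text, so the BAM itself still has to be reheadered with a tool such as samtools.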

Let me know if this is not what you meant.

sbslee commented 1 year ago

BTW, HG00276 does have CYP2D6*4/*5 (*5 is gene deletion) so you can be assured that the pipeline ran fine :)

[Attached images: GRCh37-CYP2D6-1, gene-model-CYP2D6-2]

erichards52 commented 1 year ago

Thank you!

You've been a great help!

This is exactly what I needed :)

sbslee / pypgx

BAM Identifiers/Filenames Are Alphanumeric Strings When Viewing Results #87