morinlab / GAMBLR.data

Collection of Curated Data for Genomic Analysis of Mature B-cell Lymphomas in R
https://morinlab.github.io/GAMBLR.data/
MIT License

Duplicate sample ids in the metadata #81

Closed vladimirsouza closed 5 months ago

vladimirsouza commented 5 months ago

I don't know whether this is really a problem, but some sample IDs appear more than once in the bundled metadata. The duplicated rows come from different cohorts.

> my_meta_genome_capture = get_gambl_metadata(seq_type_filter = c("capture", "genome"))
Using the bundled metadata in GAMBLR.data...
> 
> duplicate_sample_ids <- duplicated(my_meta_genome_capture$sample_id) %>% 
+   my_meta_genome_capture$sample_id[.] %>% 
+   unique
> duplicate_sample_ids
[1] "05-32150T" "08-15460T" "09-33003T" "15-13383T" "17-36275T"
> 
> filter(my_meta_genome_capture, sample_id %in% duplicate_sample_ids) %>% 
+   split(.$sample_id)
$`05-32150T`
  patient_id sample_id Tumor_Sample_Barcode seq_type sex COO_consensus lymphgen genetic_subgroup EBV_status_inf       cohort pathology
1   05-32150 05-32150T            05-32150T   genome   F           ABC      MCD              dFL           <NA>    FL_Dreval     DLBCL
6   05-32150 05-32150T            05-32150T   genome   F           ABC      MCD             <NA>           <NA> DLBCL_Hilton     DLBCL
  reference_PMID genome_build pairing_status age_group compression bam_available pathology_rank DHITsig_consensus ffpe_or_frozen fl_grade
1       37084389         <NA>           <NA>      <NA>        <NA>            NA             NA              <NA>           <NA>     <NA>
6       37319384       grch37        matched     Other         bam          TRUE             19        DHITsigNeg         frozen     <NA>
  hiv_status lymphgen_cnv_noA53 lymphgen_no_cnv lymphgen_with_cnv lymphgen_wright molecular_BL normal_sample_id time_point
1       <NA>               <NA>            <NA>              <NA>            <NA>         <NA>             <NA>       <NA>
6       <NA>                MCD             MCD               MCD           Other         <NA>        05-32150N          A

$`08-15460T`
  patient_id sample_id Tumor_Sample_Barcode seq_type sex COO_consensus lymphgen genetic_subgroup EBV_status_inf       cohort pathology
2   08-15460 08-15460T            08-15460T   genome   F       UNCLASS      BN2              dFL           <NA>    FL_Dreval     DLBCL
7   08-15460 08-15460T            08-15460T   genome   F       UNCLASS      BN2             <NA>           <NA> DLBCL_Hilton     DLBCL
  reference_PMID genome_build pairing_status age_group compression bam_available pathology_rank DHITsig_consensus ffpe_or_frozen fl_grade
2       37084389         <NA>           <NA>      <NA>        <NA>            NA             NA              <NA>           <NA>     <NA>
7       37319384       grch37        matched     Other         bam          TRUE             19        DHITsigNeg         frozen     <NA>
  hiv_status lymphgen_cnv_noA53 lymphgen_no_cnv lymphgen_with_cnv lymphgen_wright molecular_BL normal_sample_id time_point
2       <NA>               <NA>            <NA>              <NA>            <NA>         <NA>             <NA>       <NA>
7        NEG                BN2             BN2               BN2           Other         <NA>        08-15460N          A

$`09-33003T`
  patient_id sample_id Tumor_Sample_Barcode seq_type sex COO_consensus lymphgen genetic_subgroup EBV_status_inf       cohort pathology
3   09-33003 09-33003T            09-33003T   genome   M           GCB      BN2              dFL           <NA>    FL_Dreval     DLBCL
8   09-33003 09-33003T            09-33003T   genome   M           GCB      BN2             <NA>           <NA> DLBCL_Hilton     DLBCL
  reference_PMID genome_build pairing_status age_group compression bam_available pathology_rank DHITsig_consensus ffpe_or_frozen fl_grade
3       37084389         <NA>           <NA>      <NA>        <NA>            NA             NA              <NA>           <NA>     <NA>
8       37319384       grch37        matched     Other        cram          TRUE             19        DHITsigNeg         frozen     <NA>
  hiv_status lymphgen_cnv_noA53 lymphgen_no_cnv lymphgen_with_cnv lymphgen_wright molecular_BL normal_sample_id time_point
3       <NA>               <NA>            <NA>              <NA>            <NA>         <NA>             <NA>       <NA>
8       <NA>                BN2             BN2               BN2            <NA>         <NA>  09-33003_normal          A

$`15-13383T`
  patient_id sample_id Tumor_Sample_Barcode seq_type sex COO_consensus lymphgen genetic_subgroup EBV_status_inf       cohort pathology
4   15-13383 15-13383T            15-13383T   genome   F           ABC      BN2              dFL           <NA>    FL_Dreval     DLBCL
9   15-13383 15-13383T            15-13383T   genome   F           ABC      BN2             <NA>           <NA> DLBCL_Hilton     DLBCL
  reference_PMID genome_build pairing_status age_group compression bam_available pathology_rank DHITsig_consensus ffpe_or_frozen fl_grade
4       37084389         <NA>           <NA>      <NA>        <NA>            NA             NA              <NA>           <NA>     <NA>
9       37319384       grch37        matched     Other         bam          TRUE             19        DHITsigNeg         frozen     <NA>
  hiv_status lymphgen_cnv_noA53 lymphgen_no_cnv lymphgen_with_cnv lymphgen_wright molecular_BL normal_sample_id time_point
4       <NA>               <NA>            <NA>              <NA>            <NA>         <NA>             <NA>       <NA>
9        NEG                BN2           Other               BN2            <NA>         <NA>        15-13383N          A

$`17-36275T`
   patient_id sample_id Tumor_Sample_Barcode seq_type sex COO_consensus lymphgen genetic_subgroup EBV_status_inf       cohort pathology
5    17-36275 17-36275T            17-36275T   genome   M           GCB    Other              dFL           <NA>    FL_Dreval     DLBCL
10   17-36275 17-36275T            17-36275T   genome   M           GCB    Other             <NA>           <NA> DLBCL_Hilton     DLBCL
   reference_PMID genome_build pairing_status age_group compression bam_available pathology_rank DHITsig_consensus ffpe_or_frozen fl_grade
5        37084389         <NA>           <NA>      <NA>        <NA>            NA             NA              <NA>           <NA>     <NA>
10       37319384       grch37        matched     Other        cram          TRUE             19        DHITsigNeg         frozen     <NA>
   hiv_status lymphgen_cnv_noA53 lymphgen_no_cnv lymphgen_with_cnv lymphgen_wright molecular_BL normal_sample_id time_point
5        <NA>               <NA>            <NA>              <NA>            <NA>         <NA>             <NA>       <NA>
10        NEG              Other           Other             Other            <NA>         <NA>  17-36275_normal          A

Kdreval commented 5 months ago

This is expected because the same sample can be part of different studies.
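
For anyone who needs one row per sample downstream, here is a minimal sketch of how the duplicates could be checked and collapsed. It uses base R on a toy data frame shaped like the output above. The toy values, the `99-00000T` ID, and the "keep the row with the fewest missing fields" rule are assumptions for illustration, not something GAMBLR.data does or recommends.

```r
# Toy metadata mirroring the shape of the rows in this issue.
# Values are illustrative; "99-00000T" is a made-up, non-duplicated sample.
meta <- data.frame(
  sample_id    = c("05-32150T", "05-32150T", "08-15460T", "08-15460T", "99-00000T"),
  cohort       = c("FL_Dreval", "DLBCL_Hilton", "FL_Dreval", "DLBCL_Hilton", "DLBCL_Hilton"),
  seq_type     = "genome",
  genome_build = c(NA, "grch37", NA, "grch37", "grch37"),
  stringsAsFactors = FALSE
)

# sample_id alone is not a unique key here.
dup_ids <- unique(meta$sample_id[duplicated(meta$sample_id)])

# If duplicates only arise across cohorts, sample_id + seq_type + cohort is unique.
key_is_unique <- !any(duplicated(meta[, c("sample_id", "seq_type", "cohort")]))

# Assumed rule: per sample, keep the row with the fewest NA fields.
n_missing      <- rowSums(is.na(meta))
ordered        <- meta[order(meta$sample_id, n_missing), ]
one_per_sample <- ordered[!duplicated(ordered$sample_id), ]
```

With the real metadata, the same idea would apply to the data frame returned by `get_gambl_metadata()`. Whether dropping the sparser row is appropriate depends on the analysis, since each row carries study-specific annotations (for example `genetic_subgroup` in the `FL_Dreval` rows).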