icgc-argo / argo-clinical

Clinical data submission for ARGO programs.
GNU Affero General Public License v3.0

🐛 Sample registration does not ignore casing for IDs #542

Open hknahal opened 4 years ago

hknahal commented 4 years ago

Describe the bug

If a data submitter submits the same donor, specimen, or sample ID with different casing (e.g. pat_01 vs Pat_01), each variant is assigned a different ARGO ID (e.g. two different DO IDs). Shouldn't casing be ignored? How will the end user (e.g. someone searching the Portal later) know what casing to use to search for a donor?

Steps To Reproduce

  1. Go to "Sample Registration"
  2. Upload the following TSV file:
    program_id | submitter_donor_id | gender | submitter_specimen_id | specimen_tissue_source | tumour_normal_designation | specimen_type | submitter_sample_id | sample_type
    -- | -- | -- | -- | -- | -- | -- | -- | --
    HNAHAL-CA | pat_01 | Male | pat_01_sp_01 | Solid tissue | Tumour | Primary tumour | pat_01_sa_01 | Total DNA
    HNAHAL-CA | pat_01 | Male | pat_01_sp_02 | Solid tissue | Normal | Normal | pat_01_sa_02 | Total DNA
    HNAHAL-CA | pat_01 | Male | Pat_01_sp_01 | Solid tissue | Tumour | Primary tumour | Pat_01_sa_01 | Total DNA
    HNAHAL-CA | pat_01 | Male | Pat_01_sp_02 | Solid tissue | Normal | Normal | Pat_01_sa_02 | Total DNA
    HNAHAL-CA | Pat_01 | Male | PAT_01_SP_01 | Solid tissue | Tumour | Primary tumour | PAT_01_SA_01 | Total DNA
    HNAHAL-CA | Pat_01 | Male | PAT_01_SP_02 | Solid tissue | Normal | Normal | PAT_01_SA_02 | Total DNA
  3. Register the samples
  4. Go to "Dashboard". The submitted donor IDs Pat_01 and pat_01 are treated as two different donors with separate DO IDs (DO259139 and DO259138 respectively). Likewise, the specimen IDs for donor pat_01 differ only in the capitalization of the first letter (e.g. Pat_01_sp_01 vs pat_01_sp_01), yet they appear as two separate tumour specimens.
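The effect of the steps above can be checked before upload with a small script. The following is a minimal sketch, not part of argo-clinical: the function name `find_case_variants` is hypothetical, the sample TSV is trimmed to the ID columns from the file above, and it assumes IDs are scoped per `program_id`.

```python
import csv
import io
from collections import defaultdict

# ID columns taken from the TSV in the reproduction steps.
ID_COLUMNS = ("submitter_donor_id", "submitter_specimen_id", "submitter_sample_id")


def find_case_variants(tsv_text):
    """Return, per ID column, the groups of spellings that differ only by case.

    Hypothetical helper for illustration; IDs are grouped per program_id.
    """
    seen = {col: defaultdict(set) for col in ID_COLUMNS}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        for col in ID_COLUMNS:
            # casefold() gives a case-insensitive comparison key.
            key = (row["program_id"], row[col].casefold())
            seen[col][key].add(row[col])
    return {
        col: sorted(sorted(spellings) for spellings in groups.values() if len(spellings) > 1)
        for col, groups in seen.items()
    }


# Trimmed version of the file above (ID columns only).
SAMPLE_TSV = "\n".join([
    "program_id\tsubmitter_donor_id\tsubmitter_specimen_id\tsubmitter_sample_id",
    "HNAHAL-CA\tpat_01\tpat_01_sp_01\tpat_01_sa_01",
    "HNAHAL-CA\tpat_01\tPat_01_sp_01\tPat_01_sa_01",
    "HNAHAL-CA\tPat_01\tPAT_01_SP_01\tPAT_01_SA_01",
])

variants = find_case_variants(SAMPLE_TSV)
for column, groups in variants.items():
    for group in groups:
        print(f"{column}: {' / '.join(group)} differ only by case")
```

Any group reported here would currently be registered as separate entities.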

[Screenshot: same_donor_different_case]

Expected behaviour

Upper/lower casing should be ignored. For example, the following specimen IDs should be considered the same:

Pat_01_sp_01 and pat_01_sp_01

If a data submitter is using Excel to put together the sample_registration.tsv file, Excel will sometimes automatically capitalize the first letter of an ID. The data submitter may have meant to submit pat_01, but it was changed to Pat_01, and now the two spellings are registered as two different donors.

blabadi commented 4 years ago

As far as I remember, we agreed that submitter IDs are case sensitive.

For completeness, the reason was that we either don't know whether submitters' systems are case sensitive, or we know that some are, so treating IDs as case sensitive is the lowest common denominator.

rosibaj commented 4 years ago

Reference: https://github.com/icgc-argo/argo-clinical/issues/139

rosibaj commented 4 years ago

In the past I suggested a per-program config, case_sensitive_ids, with the default being false.

We could (at our discretion, based on the submitting system) allow case-sensitive IDs for some submitters, while preventing data duplication caused by formatting issues in the other submitting programs.
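A rough sketch of how such a per-program setting might behave is shown below. It is an illustration under stated assumptions, not the argo-clinical implementation: `ProgramConfig`, `DonorRegistry`, the program `STRICT-CA`, and the `DO1`, `DO2` style IDs are all hypothetical placeholders; only the setting name `case_sensitive_ids` and its default of false come from the suggestion above.

```python
from dataclasses import dataclass, field


@dataclass
class ProgramConfig:
    # Setting name from the suggestion above; default is case-insensitive matching.
    case_sensitive_ids: bool = False


@dataclass
class DonorRegistry:
    """Hypothetical in-memory registry mapping submitter donor IDs to donor IDs."""

    configs: dict = field(default_factory=dict)  # program_id -> ProgramConfig
    _ids: dict = field(default_factory=dict)     # (program_id, key) -> assigned ID
    _next: int = 1

    def _key(self, program_id, submitter_id):
        cfg = self.configs.get(program_id, ProgramConfig())
        # Only programs that opted in keep the exact casing as the lookup key.
        return submitter_id if cfg.case_sensitive_ids else submitter_id.casefold()

    def register(self, program_id, submitter_id):
        key = (program_id, self._key(program_id, submitter_id))
        if key not in self._ids:
            # Placeholder ID format, not the real ARGO ID scheme.
            self._ids[key] = f"DO{self._next}"
            self._next += 1
        return self._ids[key]


# STRICT-CA is a made-up program that opted in to case-sensitive IDs.
registry = DonorRegistry(configs={"STRICT-CA": ProgramConfig(case_sensitive_ids=True)})

same = registry.register("HNAHAL-CA", "pat_01") == registry.register("HNAHAL-CA", "Pat_01")
different = registry.register("STRICT-CA", "pat_01") != registry.register("STRICT-CA", "Pat_01")
print(same, different)
```

With the default, pat_01 and Pat_01 resolve to one donor; a program that opts in keeps them distinct. A real implementation would also need to decide which spelling to store and display.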