ncihtan / data-models

Schema.org Data Models for HTAN
MIT License
14 stars 7 forks source link

Improve validation for HTAN IDs #268

Closed adamjtaylor closed 9 months ago

adamjtaylor commented 1 year ago

In light of the revised HTAN identifier SOP, we might want to improve our validation rule for HTAN ID fields: https://github.com/ncihtan/data-models/blob/main/HTAN.model.csv#L11-L16

Currently, these are not validated. We used to validate with regex match HTA_*, but this was removed in Jan 2022 as "the regex would need to be updated"

Propose we update this to below HTAN Data File / Biospecimen ID: regex match ^(HTA([1-9]|1[0-5]))_((EXT)?([1-9]\d*|0000))_([1-9]\d*|0000)$ warning HTAN Parent Data File / Biospecimen ID: list like :: regex match ^(HTA([1-9]|1[0-5]))_((EXT)?([1-9]\d*|0000))_([1-9]\d*|0000)$ warning

HTAN Participant ID: regex match ^(HTA([1-9]|1[0-5]))_((EXT)?([1-9]\d*|0000))$ warning

clarisse-lau commented 1 year ago

Suggest modifying rule to ^(HTA([1-9]|1[0-5]))_((EXT)?([0-9]\d*|0000))_([0-9]\d*|0000) to capture IDs such as HTA10_01_00491395202451674766778335220084 submitted by Stanford

adamjtaylor commented 1 year ago

@inodb found that the case of spaces in parent IDs was causing files to be missing from Release 4. Another use case why we should plan for this in the September spring @aclayton555 @elv-sb

aclayton555 commented 1 year ago

Consider: if we implement this, which IDs will this then fail; should these pass or be fixed?

Work through this through Sept sprint ( @clarisse-lau appreciate you help on this). Will bring to an ops call mid sprint to ensure alignment on team, include checks on BQ and dashboard, etc.

aclayton555 commented 11 months ago

Close out 23-09 sprint with this in staging for testing purposes. If okay, HTAN Parent ID also needs to be updated. But we need to see how this works and what breaks

adamjtaylor commented 10 months ago

Testing for HTAN Data File ID with the following query

SELECT 
  HTAN_Center, 
  REGEXP_CONTAINS(HTAN_Data_File_ID, '^(HTA([1-9]|1[0-5]))_((EXT)?([0-9]\\d*|0000))_([0-9]\\d*|0000)$') AS regex,
  COUNT(REGEXP_CONTAINS(HTAN_Data_File_ID, '^(HTA([1-9]|1[0-5]))_((EXT)?([0-9]\\d*|0000))_([0-9]\\d*|0000)$'))
FROM `htan-dcc.combined_assays.*` 
WHERE (Component LIKE '%Level%') OR (Component LIKE '%Level%')
GROUP BY HTAN_Center, REGEXP_CONTAINS(HTAN_Data_File_ID, '^(HTA([1-9]|1[0-5]))_((EXT)?([0-9]\\d*|0000))_([0-9]\\d*|0000)$') 

More than 21k HTAPP files fail this validation 32 SRRS files fail

HTAN_Center regex count
HTAN HTAPP false 21087
HTAN SRRS false 32
adamjtaylor commented 10 months ago

SRRS issues as follows

Row HTAN_Center Filename HTAN_Data_File_ID regex  
1 HTAN SRRS DFCI_imaging_level_2/HTA5_1664_1003-7_Scan1.ome.tiff HTA5_1664_1003-7 false  
2 HTAN SRRS DFCI_imaging_level_2/HTA5_1664_1003-6_Scan1.ome.tiff HTA5_1664_1003-6 false  
3 HTAN SRRS DFCI_imaging_level_2/HTA5_1664_1003-5_Scan1.ome.tiff HTA5_1664_1003-5 false  
4 HTAN SRRS DFCI_imaging_level_2/HTA5_1664_1003-4_Scan1.ome.tiff HTA5_1664_1003-4 false  
5 HTAN SRRS DFCI_imaging_level_2/HTA5_1664_1003-3_Scan1.ome.tiff HTA5_1664_1003-3 false  
6 HTAN SRRS DFCI_imaging_level_2/HTA5_1664_1003-2_Scan1.ome.tiff HTA5_1664_1003-2 false  
7 HTAN SRRS DFCI_imaging_level_2/HTA5_1664_1003-1_Scan1.ome.tiff HTA5_1664_1003-1 false  
8 HTAN SRRS DFCI_imaging_level_2/HTA5_1348_1003-5_Scan1.ome.tiff HTA5_1348_1003-5 false  
9 HTAN SRRS DFCI_imaging_level_2/HTA5_1348_1003-4_Scan1.ome.tiff HTA5_1348_1003-4 false  
10 HTAN SRRS DFCI_imaging_level_2/HTA5_1348_1003-3_Scan1.ome.tiff HTA5_1348_1003-3 false  
11 HTAN SRRS DFCI_imaging_level_2/HTA5_1348_1003-2_Scan1.ome.tiff HTA5_1348_1003-2 false  
12 HTAN SRRS DFCI_imaging_level_2/HTA5_039_40XScan1.ome.tiff HTA5_039_40XScan1.ome.tiff false  
13 HTAN SRRS DFCI_imaging_level_2/HTA5_039_A3_40X_Scan1.ome.tiff HTA5_039_A3 false  
14 HTAN SRRS DFCI_imaging_level_2/BPC-21-NC-1390-19.ome.tiff BPC-21-NC-1390-19.ome.tiff false  
15 HTAN SRRS DFCI_imaging_level_2/BPC-21-NC-1390-18.ome.tiff BPC-21-NC-1390-18.ome.tiff false  
16 HTAN SRRS DFCI_imaging_level_2/BPC-21-NC-1390-17.ome.tiff BPC-21-NC-1390-17.ome.tiff false  
17 HTAN SRRS DFCI_imaging_level_2/BPC-21-NC-1390-16.ome.tiff BPC-21-NC-1390-16.ome.tiff false  
18 HTAN SRRS DFCI_imaging_level_2/BPC-21-NC-1390-15.ome.tiff BPC-21-NC-1390-15.ome.tiff false  
19 HTAN SRRS DFCI_imaging_level_2/BPC-21-NC-1390-14.ome.tiff BPC-21-NC-1390-14.ome.tiff false  
20 HTAN SRRS DFCI_imaging_level_2/BPC-21-NC-1390-13.ome.tiff BPC-21-NC-1390-13.ome.tiff false  
21 HTAN SRRS DFCI_imaging_level_2/BPC-21-NC-1390-12.ome.tiff BPC-21-NC-1390-12.ome.tiff false  
22 HTAN SRRS DFCI_imaging_level_2/BPC-21-NC-1390-11.ome.tiff BPC-21-NC-1390-11.ome.tiff false  
23 HTAN SRRS DFCI_imaging_level_2/BPC-21-NC-1390-10.ome.tiff BPC-21-NC-1390-10.ome.tiff false  
24 HTAN SRRS DFCI_imaging_level_2/BPC-21-NC-1390-9.ome.tiff BPC-21-NC-1390-9.ome.tiff false  
25 HTAN SRRS DFCI_imaging_level_2/BPC-21-NC-1390-8.ome.tiff BPC-21-NC-1390-8.ome.tiff false  
26 HTAN SRRS DFCI_imaging_level_2/BPC-21-NC-1390-7.ome.tiff BPC-21-NC-1390-7.ome.tiff false  
27 HTAN SRRS DFCI_imaging_level_2/BPC-21-NC-1390-6.ome.tiff BPC-21-NC-1390-6.ome.tiff false  
28 HTAN SRRS DFCI_imaging_level_2/BPC-21-NC-1390-5.ome.tiff BPC-21-NC-1390-5.ome.tiff false  
29 HTAN SRRS DFCI_imaging_level_2/BPC-21-NC-1390-4.ome.tiff BPC-21-NC-1390-4.ome.tiff false  
30 HTAN SRRS DFCI_imaging_level_2/BPC-21-NC-1390-3.ome.tiff BPC-21-NC-1390-3.ome.tiff false  
31 HTAN SRRS DFCI_imaging_level_2/BPC-21-NC-1390-2.ome.tiff BPC-21-NC-1390-2.ome.tiff false  
adamjtaylor commented 10 months ago

Here are the first 5 rows of the HTAPP errors

These do not match as they contain a 4th _ separated group

@clarisse-lau do you think these should be valid?

Row HTAN_Center Filename HTAN_Data_File_ID regex  
1 HTAN HTAPP exseq_level_1/htapp_982/htapp982-F077_round007_ch03SHIFT_affine.h5 HTA1_982_7629305_02131 false  
2 HTAN HTAPP exseq_level_1/htapp_982/htapp982-F077_round007_ch02SHIFT_affine.h5 HTA1_982_7629305_02130 false  
3 HTAN HTAPP exseq_level_1/htapp_982/htapp982-F077_round007_ch01SHIFT_affine.h5 HTA1_982_7629305_02129 false  
4 HTAN HTAPP exseq_level_1/htapp_982/htapp982-F077_round007_ch00_affine.h5 HTA1_982_7629305_02128 false  
5 HTAN HTAPP exseq_level_1/htapp_982/htapp982-F077_round006_ch03SHIFT_affine.h5 HTA1_982_7629305_02127 false
adamjtaylor commented 10 months ago

@clarisse-lau coming back to this issue. Do we think the HTAPP ids above should be considered valid?

adamjtaylor commented 10 months ago

@elv-sb and I discussed today. As we currently will only generate a warning we suggest leaving as is and allowing failures for the above noted HTAN IDs

clarisse-lau commented 10 months ago

Apologies @adamjtaylor, missed this the first time around!

I agree with the proposed approach (leaving as is). The center ID and participant components are the most crucial, and having an extra "_ group" should not cause issues downstream.

clarisse-lau commented 10 months ago

It sounds like we will be supporting an additional TNP soon that will use HTA16. Suggest modifying the rule for first group to accommodate this?