ncihtan / hdash_air

MIT License

Review ID format check #11

Closed clarisse-lau closed 5 months ago

clarisse-lau commented 1 year ago

During the release process, our internal release scripts flagged some HTAN IDs that did not match the HTAN ID Format SOP. However, these errors were not listed on hdash in the "Primary IDs follow the HTAN ID Spec" section.

For example, WashU had submitted parent biospecimen IDs such as CE336E1-S1, HT128B1-S1H4, and P5296-1N2. (These have now been resolved, but see example non-conforming IDs in the previous version (v6) of the snATAC-Seq_level_1_atac_tumor manifest: https://www.synapse.org/#!Synapse:syn52257214.6)

Currently the regex rule used for ID validation is ^(HTA([1-9]|1[0-5]))_((EXT)?([0-9]\d*|0000))_([0-9]\d*|0000)$.
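
As a quick sanity check, here is a minimal sketch that runs the regex quoted above against the three WashU parent IDs. The helper name `is_valid_htan_id` is made up for illustration and is not taken from the hdash code; only the pattern itself comes from this issue.

```python
import re

# Regex copied verbatim from this issue.
HTAN_ID_PATTERN = re.compile(
    r"^(HTA([1-9]|1[0-5]))_((EXT)?([0-9]\d*|0000))_([0-9]\d*|0000)$"
)


def is_valid_htan_id(candidate: str) -> bool:
    """Hypothetical helper: True if the candidate matches the quoted regex."""
    return HTAN_ID_PATTERN.match(candidate) is not None


# Parent biospecimen IDs reported in this issue.
parent_ids = ["CE336E1-S1", "HT128B1-S1H4", "P5296-1N2"]

# Collect every ID the regex rejects.
non_conforming = [pid for pid in parent_ids if not is_valid_htan_id(pid)]
print(non_conforming)
```

All three IDs are rejected by the pattern, so the rule itself would catch them if it were applied to these values.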

clarisse-lau commented 1 year ago

This may be because the check is only being applied to primary IDs and not 'parent' IDs?

clarisse-lau commented 1 year ago

One more note: the above regex works for HTAN Data File IDs and HTAN Biospecimen IDs (e.g. HTA11_120_1211), but not for participant IDs (e.g. HTA11_120).
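
To illustrate the difference, a rough sketch comparing the current regex with a participant-only variant. `PARTICIPANT_SKETCH` is an assumption made here by dropping the final segment of the current regex; it is not the official participant rule.

```python
import re

# Current regex, as quoted earlier in this issue.
CURRENT_PATTERN = re.compile(
    r"^(HTA([1-9]|1[0-5]))_((EXT)?([0-9]\d*|0000))_([0-9]\d*|0000)$"
)

# Hypothetical participant-only pattern: the same regex without the last
# "_<number>" segment. An assumption for illustration, not the official rule.
PARTICIPANT_SKETCH = re.compile(
    r"^(HTA([1-9]|1[0-5]))_((EXT)?([0-9]\d*|0000))$"
)

# Map each example ID to (matches current regex, matches participant sketch).
results = {
    htan_id: (
        CURRENT_PATTERN.match(htan_id) is not None,
        PARTICIPANT_SKETCH.match(htan_id) is not None,
    )
    for htan_id in ["HTA11_120_1211", "HTA11_120"]
}

for htan_id, (current_ok, participant_ok) in results.items():
    print(htan_id, current_ok, participant_ok)
```

The biospecimen-style ID matches only the current regex, and the participant-style ID matches only the sketch, which suggests the ID types need separate rules.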

clarisse-lau commented 10 months ago

Updated HTAN ID regex rules (distinct for file, biospecimen, and participant IDs) can be found here: https://github.com/ncihtan/data-models/issues/268#issue-1808171944

ecerami commented 5 months ago

This is now fixed and deployed. I relaxed the leading zero filter.