Context
Historically, vidrl_upload needed manual updates to row and column coordinates here:
https://github.com/nextstrain/fauna/blob/e58b42458da543be3037fe1754f7117a594cec3a/tdb/vidrl_upload.py#L94-L112
These coordinates could vary by subtype (H1N1pdm, H3N2, Vic), assay type (HI, FRA, MNT), and submission week (perhaps because different individuals fill in the worksheet). Updating them by hand quickly becomes tedious when processing a large batch of files.
The above is also true for crick_upload and niid_upload, but those will be addressed in separate PRs to better scope the handling of data variation across submitting organizations.
Proposed changes
To address the above, this PR refactors the vidrl_upload function to automatically detect titer information from VIDRL Excel (.xlsx) files, importing auto-detect functions from titer_block.py. The approach to auto-detecting row and column coordinates is documented in these slides and in this Slack discussion, and has been validated against many past VIDRL files using this pipeline. The auto-detection performs well on much of the past data, and importantly on the 2023 and 2024 data, but I expect that further refinements will be needed as new data is submitted.
If new data deviates too far from the auto-detect regex patterns, I've maintained the ability to manually update row and column coordinates here (to avoid having to define a new regex pattern when speed is paramount):
https://github.com/nextstrain/fauna/blob/2d0b991a0c3b0fd269f5abe6eefa2a106c7dfc50/tdb/vidrl_upload.py#L121-L131
If the auto-detect serum mapping is out of order, I've maintained the ability to manually define the serum mapping dictionary here:
https://github.com/nextstrain/fauna/blob/2d0b991a0c3b0fd269f5abe6eefa2a106c7dfc50/tdb/vidrl_upload.py#L106-L112
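For reviewers who want the general shape of the idea without opening the linked lines, below is a minimal sketch of the "auto-detect with a regex, but let a manual value win" pattern described above. It is not the code in titer_block.py or vidrl_upload.py: the names detect_row and resolve_row, the strain-name regex, the min_matches threshold, and the sample worksheet (including its titer values) are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical pattern for a cell that looks like an influenza strain name,
# e.g. "A/Victoria/2570/2019". This is an assumption for illustration,
# not the pattern used in titer_block.py.
STRAIN_PATTERN = re.compile(r"^[AB]/[^/]+/[^/]+/\d{2,4}", re.IGNORECASE)


def detect_row(grid, pattern=STRAIN_PATTERN, min_matches=3):
    """Return the index of the row with the most matching cells, or None
    if no row reaches min_matches."""
    best_idx, best_count = None, 0
    for idx, row in enumerate(grid):
        count = sum(
            1
            for cell in row
            if isinstance(cell, str) and pattern.match(cell.strip())
        )
        if count > best_count:
            best_idx, best_count = idx, count
    return best_idx if best_count >= min_matches else None


def resolve_row(grid, manual_row=None):
    """Prefer a manually supplied coordinate; otherwise auto-detect."""
    if manual_row is not None:
        return manual_row
    detected = detect_row(grid)
    if detected is None:
        raise ValueError("could not auto-detect the row; supply manual_row")
    return detected


# Made-up worksheet: a title row, a row of serum strain names, one antigen row.
worksheet = [
    ["VIDRL HI results", None, None, None],
    [None, "A/Victoria/2570/2019", "A/Sydney/5/2021", "A/Wisconsin/67/2022"],
    ["A/Victoria/4897/2022", 1280, 640, 320],
]

serum_row = resolve_row(worksheet)               # auto-detected
forced_row = resolve_row(worksheet, manual_row=2)  # manual value wins
```

Keeping the manual argument as an explicit override means a worksheet with an unexpected layout can still be uploaded right away, and the regex can be refined afterwards.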
Related issue(s)
Checklist