nextstrain / fauna

RethinkDB database to support real-time virus analysis
GNU Affero General Public License v3.0

Refactor `vidrl_upload` to auto-detect titer info #157

Closed by j23414 4 months ago

j23414 commented 4 months ago

Context

Historically, vidrl_upload needed manual updates to row and column coordinates here:

https://github.com/nextstrain/fauna/blob/e58b42458da543be3037fe1754f7117a594cec3a/tdb/vidrl_upload.py#L94-L112

These coordinates could vary by subtype (H1N1pdm, H3N2, Vic), assay type (HI, FRA, MNT), and submission week (perhaps because different individuals fill in the worksheet). Updating them by hand quickly becomes tedious when processing a large batch of files.

The same is true for crick_upload and niid_upload, but those will be addressed in separate PRs to better scope the handling of data variation across submitting organizations.

Proposed changes

To address the above, this PR refactors the vidrl_upload function to automatically detect titer information from VIDRL Excel (.xlsx) files, importing auto-detect functions from titer_block.py. The approach to auto-detecting row and column coordinates is documented in these slides and in this Slack discussion, and has been validated against many past VIDRL files using this pipeline. The auto-detection appears to perform well on much of the past data, and importantly on the 2023 and 2024 data, but I expect that further refinements will be needed as new data is submitted.
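As a rough illustration of the idea (not the code in titer_block.py), the sketch below scans a worksheet that has already been read into a list of rows and uses regex matches to guess where the titer block and the virus-name column sit. The function name `detect_coordinates`, the two patterns, and the returned keys are all hypothetical and chosen only for this example.

```python
import re

# Hypothetical patterns for illustration only; the real ones live in titer_block.py.
TITER_RE = re.compile(r"^<?\s*\d+$")           # e.g. "160", "<40"
STRAIN_RE = re.compile(r"^[AB]/.+/\d{2,4}$")   # e.g. "A/Darwin/9/2021"


def _matches(pattern, cell):
    """True if the cell, read as a stripped string, matches the pattern."""
    return cell is not None and pattern.match(str(cell).strip()) is not None


def detect_coordinates(grid, min_hits=2):
    """Guess titer block bounds (end-exclusive) and the virus-name column.

    `grid` is assumed to be a list of rows, each a list of cell values.
    Returns a dict of coordinates, or None if no titer-like block is found.
    """
    n_rows = len(grid)
    n_cols = max((len(row) for row in grid), default=0)

    def cell(r, c):
        return grid[r][c] if c < len(grid[r]) else None

    # Columns that contain at least `min_hits` titer-like cells.
    titer_cols = [
        c for c in range(n_cols)
        if sum(_matches(TITER_RE, cell(r, c)) for r in range(n_rows)) >= min_hits
    ]
    if not titer_cols:
        return None

    # Rows that contain at least `min_hits` titer-like cells in those columns.
    titer_rows = [
        r for r in range(n_rows)
        if sum(_matches(TITER_RE, cell(r, c)) for c in titer_cols) >= min_hits
    ]
    if not titer_rows:
        return None

    # The virus-name column is the one with the most strain-like cells
    # within the detected titer rows.
    strain_counts = {
        c: sum(_matches(STRAIN_RE, cell(r, c)) for r in titer_rows)
        for c in range(n_cols)
    }
    virus_col = max(strain_counts, key=strain_counts.get)
    if strain_counts[virus_col] == 0:
        virus_col = None

    return {
        "row_start": titer_rows[0],
        "row_end": titer_rows[-1] + 1,
        "col_start": titer_cols[0],
        "col_end": titer_cols[-1] + 1,
        "virus_names_col": virus_col,
    }


# Made-up worksheet contents, not taken from a real VIDRL file.
example_grid = [
    ["Influenza HI worksheet", "", "", ""],
    ["", "", "A/Darwin/9/2021", "A/Sydney/5/2021"],
    ["", "", "F1234", "F5678"],
    ["A/Darwin/9/2021", "E3", "1280", "160"],
    ["A/Sydney/5/2021", "E2", "320", "<40"],
    ["A/Perth/1/2022", "E1", "640", "80"],
]
example_coordinates = detect_coordinates(example_grid)
```

A count-based heuristic like this would also pick up unrelated numeric columns (row indices, dates), which is one reason a manual override remains useful.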

If new data deviates too far from the auto-detect regex patterns, I've kept the ability to manually set row and column coordinates here (this avoids having to define a new regex pattern when speed is paramount):

https://github.com/nextstrain/fauna/blob/2d0b991a0c3b0fd269f5abe6eefa2a106c7dfc50/tdb/vidrl_upload.py#L121-L131

If the auto-detect serum mapping is out of order, I've maintained the ability to manually define the serum mapping dictionary here:

https://github.com/nextstrain/fauna/blob/2d0b991a0c3b0fd269f5abe6eefa2a106c7dfc50/tdb/vidrl_upload.py#L106-L112
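The two manual fallbacks can be pictured as overrides layered on top of whatever was auto-detected. The sketch below is only an assumption about the general shape of such logic; `resolve_settings` and its parameters are hypothetical names, and the linked lines in vidrl_upload.py may do this differently.

```python
def resolve_settings(detected_coordinates, detected_serum_mapping,
                     manual_coordinates=None, manual_serum_mapping=None):
    """Combine auto-detected values with optional manual overrides.

    Hypothetical helper: manually supplied coordinates replace detected
    ones key by key, and a manually supplied serum mapping replaces the
    detected mapping entirely.
    """
    coordinates = dict(detected_coordinates)
    if manual_coordinates:
        coordinates.update(manual_coordinates)

    if manual_serum_mapping:
        serum_mapping = dict(manual_serum_mapping)
    else:
        serum_mapping = dict(detected_serum_mapping)

    return coordinates, serum_mapping


# Made-up values for illustration only.
coords, sera = resolve_settings(
    detected_coordinates={"row_start": 3, "row_end": 6, "col_start": 2, "col_end": 4},
    detected_serum_mapping={"F5678": "A/Darwin/9/2021", "F1234": "A/Sydney/5/2021"},
    manual_coordinates={"row_start": 4},
    manual_serum_mapping={"F1234": "A/Darwin/9/2021", "F5678": "A/Sydney/5/2021"},
)
```

Replacing the serum mapping wholesale rather than merging it is a deliberate choice in this sketch: if the detected order is wrong, a partial merge could leave stale pairings behind.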

Related issue(s)

Checklist