ucam-department-of-psychiatry / crate

Create and use de-identified research databases. Preprocess, extract text, anonymise/de-identify, link, apply natural language processing, query for research, manage consent for contact.
GNU General Public License v3.0
19 stars 7 forks source link

SQL Server report fix #141

Closed RudolfCardinal closed 4 months ago

RudolfCardinal commented 5 months ago

The SQL Alchemy distinct() syntax used by the researcher report function worked fine on MySQL, but was combined wrongly with the limit() syntax (DISTINCT TOP versus TOP DISTINCT) under SQL Server. Fixed, but also created tests/test_workflow.py and associated Docker files to semi-automate this (and potentially other) tests under MySQL, SQL Server and PostgreSQL (the last not quite finished because it has an additional database + schema layout to think through). Currently this is not fully automatic, but one command will start a Docker container for the database engine in question and build empty specimen databases, while in a separate shell with the same engine parameters, another copy of the command can build a demo database and run a report on it. (Anonymisation and NLP steps not yet built.)

RudolfCardinal commented 5 months ago

Docs failing to build but for an unrelated Sphinx reason:

Sphinx version error: The sphinxcontrib.applehelp extension used by this project needs at least Sphinx v5.0; it therefore cannot be built with this version.

... is that fixed in another branch?

martinburchell commented 5 months ago

Docs failing to build but for an unrelated Sphinx reason:

Sphinx version error: The sphinxcontrib.applehelp extension used by this project needs at least Sphinx v5.0; it therefore cannot be built with this version.

... is that fixed in another branch?

Try

git cherry-pick 5c7a8826cf18932d5be04e53f10b73fb07552dde

You'll also want:

git cherry-pick 2b97515349cfdcde0bc7dbe02819994fdc9ec6c6 git cherry-pick ca091e8a25aed4fcda78bfd1335630180933a78b

RudolfCardinal commented 5 months ago

Fab; thanks!

martinburchell commented 5 months ago

I've independently added database-level testing to the pytest setup on https://github.com/ucam-department-of-psychiatry/crate/tree/fix-optout-col-values-typeerror with optional MySQL support.

I'm wondering if we can combine our efforts here to include the other database backends (we could run them all on GitHub).

It may also be that we can use python-on-whales instead of maintaining our own python wrappers to Docker. We currently use this for the Docker-based installer.

RudolfCardinal commented 5 months ago

I may have made it more awkward than waiting for the port; the startengine command waits for the database to be available (in MySQL that was when its port opened; in SQL Server that was too soon and it waits for ready text via tee too) and then creates the databases, so the databases won't be there instantly when the port opens. Can wait-for-it wait for text, which we could emit from the Dockerfile or the Python?

martinburchell commented 4 months ago

@RudolfCardinal you can see my changes at https://github.com/ucam-department-of-psychiatry/crate/compare/0246db5f99d41b2be8c0809cabf8a6ce900c47af..e504a5867281372d787675e279f94a1c42afe154

I'm calling your end-to-end tests in test_workflow.py "integration tests" and I've merged in my own database tests that are run with pytest. I think there is value in having both sorts of test. The only thing the integration tests are really testing is "can we run the whole workflow without error?" where as the lower level tests can assert for expected results. At present pytest only uses one database.

I've added GitHub workflows to run the integration tests and pytest tests with the different database engines.

I've refactored make_demo_database.py to use Factory Boy. The demo Patient, Note etc are now in testing/models.py and the test factories are in testing/factories.py. I've created Faker custom providers to do most of the demo content creation and these are in testing/providers.py.

I've renamed some of the unittest.TestCase derived classes, which were not being picked up by the pytest runner.

I've added my own simple test for the researcher report with assertions. It's still quite a high level test as it scrapes the generated PDF. It may be better to assume wkhtmltopdf is operating properly and just check the expected values are going to the template by mocking render_to_string(). The test is also probably not very realistic in that it is not using anonymised data. We could create factories for anonymised-looking data if we felt it was more authentic.

I've changed test_workflow.py to use python-on-whales instead of our own python wrappers to Docker. Sorry!

I've added a general test database to the dockerfiles and test_workflow.py for use with pytest.

The startengine command now waits for the message from the container to say the databases have been created.

I've also merged in the changes from #143. You might want to review those separately in that PR to reduce the clutter in this one.

RudolfCardinal commented 4 months ago

Dear Martin,

This looks excellent – thank you! I am mildly confused about which branch it's on, but it looks good. The only thing I spotted is that the alcohol test cases (previously crate_anon/anonymise/make_demo_database.py, now crate_anon/testing/providers.py) has the typo "Abstinant" for "Abstinent", but this isn't new. I don't think the intention was to test typo handling (if this is a reasonably likely typo and it probably is) I should amend the NLP regex to allow it. So I'll do that – on which branch?

all the best, Rudolf.


From: Martin Burchell @.> Sent: 23 February 2024 12:09 To: ucam-department-of-psychiatry/crate @.> Cc: @. @.>; Mention @.***> Subject: Re: [ucam-department-of-psychiatry/crate] SQL Server report fix (PR #141)

@RudolfCardinalhttps://github.com/RudolfCardinal you can see my changes at https://github.com/ucam-department-of-psychiatry/crate/compare/0246db5f99d41b2be8c0809cabf8a6ce900c47af..e504a5867281372d787675e279f94a1c42afe154

I'm calling your end-to-end tests in test_workflow.py "integration tests" and I've merged in my own database tests that are run with pytest. I think there is value in having both sorts of test. The only thing the integration tests are really testing is "can we run the whole workflow without error?" where as the lower level tests can assert for expected results. At present pytest only uses one database.

I've added GitHub workflows to run the integration tests and pytest tests with the different database engines.

I've refactored make_demo_database.py to use Factory Boy. The demo Patient, Note etc are now in testing/models.py and the test factories are in testing/factories.py. I've created Faker custom providers to do most of the demo content creation and these are in testing/providers.py.

I've renamed some of the unittest.TestCase derived classes, which were not being picked up by the pytest runner.

I've added my own simple test for the researcher report with assertions. It's still quite a high level test as it scrapes the generated PDF. It may be better to assume wkhtmltopdf is operating properly and just check the expected values are going to the template by mocking render_to_string(). The test is also probably not very realistic in that it is not using anonymised data. We could create factories for anonymised-looking data if we felt it was more authentic.

I've changed test_workflow.py to use python-on-whaleshttps://github.com/gabrieldemarmiesse/python-on-whales instead of our own python wrappers to Docker. Sorry!

I've added a general test database to the dockerfiles and test_workflow.py for use with pytest.

The startengine command now waits for the message from the container to say the databases have been created.

I've also merged in the changes from #143https://github.com/ucam-department-of-psychiatry/crate/pull/143. You might want to review those separately in that PR to reduce the clutter in this one.

— Reply to this email directly, view it on GitHubhttps://github.com/ucam-department-of-psychiatry/crate/pull/141#issuecomment-1961215392, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ABQJFSIDYICZNG266KS6W4DYVCBGJAVCNFSM6AAAAABCI2ULNKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNRRGIYTKMZZGI. You are receiving this because you were mentioned.Message ID: @.***>