opensafely-core / opencodelists

OpenCodelists is an open platform for creating and sharing codelists of clinical terms and drugs.
https://www.opencodelists.org

AssertionError when downloading SSRI codelist #2047

Open sentry-io[bot] opened 1 month ago

sentry-io[bot] commented 1 month ago

Sentry Issue: OPENCODELISTS-QE

AssertionError: 
(1 additional frame(s) were not displayed)
...
  File "codelists/views/decorators.py", line 67, in wrapped_view
    rsp = view_fn(request, version, **view_kwargs)
  File "codelists/views/version_download.py", line 16, in version_download
    clv.csv_data_for_download(
  File "codelists/models.py", line 563, in csv_data_for_download
    self.formatted_table(
  File "codelists/models.py", line 710, in formatted_table
    assert self.downloadable
Jongmassey commented 4 weeks ago

This is caused by the CSV data for this codelist missing its header row: the first line is a data row, 'VMP,321987003,Citalopram 20mg tablets,0403030D0AAAAAA'.

see updated analysis

Finding all the codelist versions with likely "bad" headers:

>>> import re
>>> badversions = [
...     c for c in CodelistVersion.objects.all()
...     if c.csv_data
...     and re.search(r'\d\d', c.csv_data.partition('\n')[0])
...     and 'icd10' not in c.csv_data.partition('\n')[0].lower()
...     and 'grouping' not in c.csv_data.partition('\n')[0].lower()
... ]
>>> len(badversions)
57
>>> urls = ['https://opencodelists.org'+b.get_absolute_url() for b in badversions]
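The heuristic in the comprehension above can also be written as a standalone function, which makes it easy to try against individual first lines without a database. This is only a sketch: `looks_like_data_row` is a hypothetical name, and the two header strings in the usage lines are made-up examples rather than headers taken from real codelists. One thing it makes visible is that 'icd10' itself contains two consecutive digits, so without the explicit exclusion any header naming an icd10 column would be flagged.

```python
import re


def looks_like_data_row(first_line: str) -> bool:
    """Return True if the first CSV line looks like data rather than a header.

    Mirrors the REPL heuristic: two consecutive digits suggest a clinical
    code, unless the line mentions 'icd10' or 'grouping'.
    """
    lowered = first_line.lower()
    return (
        re.search(r"\d\d", first_line) is not None
        and "icd10" not in lowered
        and "grouping" not in lowered
    )


# The data row quoted earlier in this issue is flagged.
flagged = looks_like_data_row("VMP,321987003,Citalopram 20mg tablets,0403030D0AAAAAA")
# Hypothetical header lines are not flagged.
plain_header = looks_like_data_row("code,term")
icd10_header = looks_like_data_row("icd10_code,description")
```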

here are those URLs:

Jongmassey commented 4 weeks ago

Should we:

Jongmassey commented 4 weeks ago

I initially thought that the large number of OPCS4 codelists in this list was an error in my "bad codelist" heuristic, but a quick inspection shows that not to be the case.

Nonetheless, I have re-run my rough analysis using the actual logic of the downloadable property, which is at the root of this 500 error:

>>> notdownloadable = [c for c in CodelistVersion.objects.all() if not c.downloadable and c.csv_data and c.coding_system_id != "null"]
>>> len(notdownloadable)
61
>>> urls = ['https://opencodelists.org'+n.get_absolute_url() for n in notdownloadable]
Jongmassey commented 4 weeks ago

We should also check whether any of these have been used in studies. Where a version is not downloadable because its first row is a data row rather than a header, there is a chance that the code in that first row will be omitted from the analysis. I'd like to think that either opensafely codelists ... or codelist_from_csv() would throw an error at the missing header, but it is worth checking.
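To illustrate the risk, here is a minimal sketch using only Python's standard csv module. It does not show how opensafely codelists or codelist_from_csv() behave, which still needs checking. The first row is the one quoted earlier in this issue; the second row is a made-up placeholder.

```python
import csv
import io

# First row is the data row quoted in this issue; the second is a
# hypothetical placeholder so that there is something left to read.
headerless = (
    "VMP,321987003,Citalopram 20mg tablets,0403030D0AAAAAA\n"
    "VMP,000000000,Hypothetical product,0000000000AAAAAA\n"
)

# DictReader treats the first line as the header when no fieldnames are given,
# so the first data row is consumed as column names without any error.
rows = list(csv.DictReader(io.StringIO(headerless)))

# The "code column" is now named after the first code, and that code is
# missing from the values that were read.
codes = [row["321987003"] for row in rows]
```

With a generic header-based reader like this, the first code is lost silently rather than raising an error, which is why it seems worth confirming what the OpenSAFELY tooling does.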

Jongmassey commented 3 weeks ago

None of these have any user/collaborator associations:

>>> users = {}
>>> for n in notdownloadable:
...   collaborators = [u.username for u in n.codelist.collaborations.all()] or ['None']
...   for c in collaborators:
...     users[c] = users.get(c,0) + 1
...
>>> users
{'None': 61}
Jongmassey commented 3 weeks ago

The author field is better populated, however:


>>> authors = {}
>>> for n in notdownloadable:
...   username = n.author.username if n.author else 'None'
...   authors[username] = authors.get(username,0) + 1
...
>>> authors
{'None': 49, 'rriefu': 4, 'yayang': 4, 'r_denholm': 1, 'colincrooks': 2, 'caroline-morton': 1}
Jongmassey commented 3 weeks ago

They're all from September 2022 or earlier:

>>> created = {}
>>> for n in notdownloadable:
...   ct = n.created_at.strftime('%Y%m') if n.created_at else 'None'
...   created[ct] = created.get(ct,0) + 1
...
>>> created
{'202010': 1, '202011': 1, '202101': 1, '202102': 13, '202103': 1, '202108': 2, '202109': 28, '202110': 1, '202111': 4, '202201': 1, '202202': 4, '202203': 3, '202209': 1}
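The collaborator, author and creation-month tallies above all follow the same pattern, which could be sketched with collections.Counter. The `tally` helper and the stand-in objects are hypothetical, used only so the snippet runs without the Django models; in a shell, `notdownloadable` would be passed in instead.

```python
from collections import Counter
from types import SimpleNamespace

# Hypothetical stand-ins for CodelistVersion objects; only the attribute
# used below is modelled, and the username is invented.
versions = [
    SimpleNamespace(author=SimpleNamespace(username="alice")),
    SimpleNamespace(author=None),
    SimpleNamespace(author=None),
]


def tally(items, key):
    """Count items by key(item), labelling missing values as 'None'."""
    return Counter(key(item) or "None" for item in items)


authors = tally(versions, lambda v: v.author.username if v.author else None)
```

Calling `authors.most_common()` then gives the groups in descending order of size.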