Open sentry-io[bot] opened 1 month ago
This is caused by the header of the CSV data for this codelist being missing (i.e. the first line is a data row: 'VMP,321987003,Citalopram 20mg tablets,0403030D0AAAAAA').
Finding all the Codelist Versions with likely "bad" headers:
>>> import re
>>> badversions = [c for c in CodelistVersion.objects.all() if c.csv_data and re.search(r'\d\d',c.csv_data.partition('\n')[0]) and 'icd10' not in c.csv_data.partition('\n')[0].lower() and 'grouping' not in c.csv_data.partition('\n')[0].lower()]
>>> len(badversions)
57
>>> urls = ['https://opencodelists.org'+b.get_absolute_url() for b in badversions]
Here are those URLs:
Should we:
I initially thought that the large number of OPCS4 codelists in this list was an error in my "bad codelist" heuristic, but a quick inspection shows that not to be the case.
Nonetheless, I have re-run my rough analysis using the actual logic of the downloadable property, which is at the root of this 500 error:
>>> notdownloadable = [c for c in CodelistVersion.objects.all() if not c.downloadable and c.csv_data and c.coding_system_id != "null"]
>>> len(notdownloadable)
61
>>> urls = ['https://opencodelists.org'+n.get_absolute_url() for n in notdownloadable]
We should also consider checking whether any of these have been used in studies. Where a version is not downloadable because its first row is a data row rather than a header, there is a chance that the code in that first row will be omitted from the analysis. I'd like to think that either opensafely codelists ... or codelist_from_csv() would throw an error at the lack of a header, but it might be worth checking.
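To make the risk concrete, here is a minimal sketch using only Python's standard-library csv.DictReader. It is not the actual logic of opensafely or codelist_from_csv(), whose behaviour still needs checking. The helper name codes_via_dictreader and the second CSV row are hypothetical; the first row is the one quoted in this issue.

```python
import csv
import io

# First row is the data row quoted in this issue; the second row is a
# hypothetical placeholder added purely for illustration.
HEADERLESS = (
    "VMP,321987003,Citalopram 20mg tablets,0403030D0AAAAAA\n"
    "VMP,000000000,Hypothetical second row,0000000000AAAAA\n"
)


def codes_via_dictreader(csv_text, code_column):
    """Read codes the way a header-trusting reader would (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # DictReader always treats the first line as the header.
    if code_column not in (reader.fieldnames or []):
        raise KeyError(f"column {code_column!r} not in header {reader.fieldnames}")
    return [row[code_column] for row in reader]


# If the caller happens to name the column after the first row's value,
# that first code is consumed as a header and silently dropped.
print(codes_via_dictreader(HEADERLESS, "321987003"))

# If the caller asks for a conventional column name, the read fails loudly.
try:
    codes_via_dictreader(HEADERLESS, "code")
except KeyError as exc:
    print("error:", exc)
```

Under this assumption, which of the two outcomes a study sees depends on how the column name was chosen, so both are worth checking for.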
None of these have any user/collaborator associations:
>>> users = {}
>>> for n in notdownloadable:
...     collaborators = [u.username for u in n.codelist.collaborations.all()] or ['None']
...     for c in collaborators:
...         users[c] = users.get(c,0) + 1
...
>>> users
{'None': 61}
The author field is better populated, however:
>>> authors = {}
>>> for n in notdownloadable:
...     username = n.author.username if n.author else 'None'
...     authors[username] = authors.get(username,0) + 1
...
>>> authors
{'None': 49, 'rriefu': 4, 'yayang': 4, 'r_denholm': 1, 'colincrooks': 2, 'caroline-morton': 1}
They're all from September 2022 or earlier:
>>> created = {}
>>> for n in notdownloadable:
...     ct = n.created_at.strftime('%Y%m')
...     created[ct] = created.get(ct,0) + 1
...
>>> created
{'202010': 1, '202011': 1, '202101': 1, '202102': 13, '202103': 1, '202108': 2, '202109': 28, '202110': 1, '202111': 4, '202201': 1, '202202': 4, '202203': 3, '202209': 1}
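The tallies above each build a dict by hand; the same counting could be sketched with collections.Counter. The tally helper and the sample values below are hypothetical stand-ins, not data from the database.

```python
from collections import Counter


def tally(values):
    """Count occurrences, mapping missing values (None) to the string 'None'."""
    return dict(Counter("None" if v is None else v for v in values))


# Stand-in for something like:
#   [n.author.username if n.author else None for n in notdownloadable]
usernames = [None, None, None, "alice", "alice"]  # hypothetical values
print(tally(usernames))
```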
Sentry Issue: OPENCODELISTS-QE