opensafely-core / cohort-extractor

Cohort extractor tool which can generate dummy data, or real data against OpenSAFELY-compliant research databases
Other
38 stars 13 forks source link

Bad line endings for codelist downloaded on Windows #380

Open inglesp opened 3 years ago

inglesp commented 3 years ago

Angel (on Windows) ran cohortextractor update_codelists and committed the results. This downloaded this CSV file.

When opened in a text editor, or read with Python's csv module, this CSV file has extra lines.

>>> list(csv.reader(open("codelists/opensafely-transient-ischaemic-attack.csv")))[:4]
[['code', 'term'], [], ['14AB.', 'H/O: TIA'], []]

Angel has confirmed this happens even if she doesn't open a text editor:

image

Looking at the contents of the file, between each line there are the bytes 0d (CR) 0d (CR) 0a (LF):

$ cat codelists/opensafely-transient-ischaemic-attack.csv | hexdump -C | head -1
00000000  63 6f 64 65 2c 74 65 72  6d 0d 0d 0a 31 34 41 42  |code,term...14AB|

cohortextractor update_codelists downloads this CSV file with requests, and writes the .text attribute of the response to a file opened with mode w:

rsp = requests.get(url)
with open(path, "w") as f:
    f.write(rsp.text)

(code)

If I run cohortextractor update_codelists on linux, the downloaded file has 0d 0a (CRLF) between lines.

Seb has reproduced on Windows and we've found that

rsp = requests.get(url)
with open(path, "wb") as f:
    f.write(rsp.content)

does the right thing.

inglesp commented 3 years ago

381 has fixed one problem, but has caused another one.

Angel has updated her version of cohortextractor, and after running update_codelists has committed the changes to her codelists. However the GH tests fail, because running update_codelists on GH changes the files.

I have reproduced this locally:

$ git rev-parse HEAD
8f2dcb17967e7fa6402cfb90927a97f2d07b76d7
$ git status --porcelain codelists/opensafely-antiphospholipid-syndrome.csv 
$ cat codelists/opensafely-antiphospholipid-syndrome.csv | hexdump -C | head -1
00000000  63 6f 64 65 2c 74 65 72  6d 0a 58 37 30 34 69 2c  |code,term.X704i,|
$ cohortextractor update_codelists 2>&1 >/dev/null
$ git status --porcelain codelists/opensafely-antiphospholipid-syndrome.csv 
 M codelists/opensafely-antiphospholipid-syndrome.csv
$ cat codelists/opensafely-antiphospholipid-syndrome.csv | hexdump -C | head -1
00000000  63 6f 64 65 2c 74 65 72  6d 0d 0a 58 37 30 34 69  |code,term..X704i|

In the committed version, the lines are separated by 0a (LF), while in the updated version, the lines are separated by 0d 0a (CR LF).

My hypothesis is that git converts the line endings of the CSV file to Unix-style (LF) line endings. But the CSV that is being served by the server has Windows-style (CR LF) line endings.

inglesp commented 3 years ago

A fix might be to run unix2dos (or the equivalent sed command) after update_codelists and before git status.

inglesp commented 3 years ago

I've added *.csv text eol=crlf to .gitattributes, but git status is still showing that there are changes, and git diff gives:

warning: LF will be replaced by CRLF in codelists/opensafely-aplastic-anaemia.csv.
The file will have its original line endings in your working directory

I'm not sure how to interpret that.

evansd commented 3 years ago

I think this is now obviated by opensafely codelists update and we can close?