nrlab-CRUK / INVAR2

restructured version of invar

Reading CSV files doesn't cope with Excel byte order marks #5

Closed: rich7409 closed this issue 1 year ago

rich7409 commented 1 year ago

We've hit problems where the first column, typically "CHROM", is reported as not existing. The reason is that the CSV file has a byte order mark (BOM) at the start of the file, so the first column name is read with the BOM attached to it.
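To make the symptom concrete, here is a minimal Python sketch (not code from the pipeline). The file content and the `header` helper are made up for illustration, and it assumes a UTF-8 BOM such as Excel writes when saving "CSV UTF-8".

```python
import csv
import io

# Hypothetical file content: a UTF-8 BOM followed by a small CSV table.
raw = b"\xef\xbb\xbfCHROM,POS\nchr1,100\n"


def header(data, encoding):
    """Decode the bytes and return the first CSV row (the column names)."""
    text = io.StringIO(data.decode(encoding), newline="")
    return next(csv.reader(text))


naive = header(raw, "utf-8")      # the BOM stays attached to the first name
aware = header(raw, "utf-8-sig")  # the BOM is stripped while decoding

print("CHROM" in naive)  # False: the first name is "\ufeffCHROM"
print("CHROM" in aware)  # True
```

A check for an exact "CHROM" column name therefore fails on the first read even though the file looks fine in an editor.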

We should find a way to detect a BOM at the start of a file and read the file correctly. It seems read_csv has problems in this regard. See:

https://github.com/ropensci-archive/gtfsr/issues/19#issuecomment-247766324
https://stackoverflow.com/questions/39593637/dealing-with-byte-order-mark-bom-in-r

rich7409 commented 1 year ago

Actually, it seems that R now handles byte order marks correctly; this was fixed after the issues linked above. The problem is that our validation.nf does not, and it stops the pipeline immediately with the error about the missing CHROM column. I've changed validation.nf to use the Apache Commons IO BOMInputStream to detect a BOM and, at present, print a warning when the file contains one. The warning may turn out not to be helpful, in which case it can be removed, but for the time being we'll leave it in.
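For anyone wanting to see the detect, warn and skip idea outside of Nextflow, here is a rough Python sketch. It is not the validation.nf change and does not use BOMInputStream (a Java class). The function name `read_header_with_bom_check` is hypothetical, and only the UTF-8 BOM is handled.

```python
import codecs
import csv
import io
import warnings


def read_header_with_bom_check(data, name="<input>"):
    """Return the CSV column names, warning if a UTF-8 BOM was present.

    Hypothetical helper: a stand-in for the detect-and-skip behaviour
    described above, limited to the UTF-8 BOM (bytes EF BB BF).
    """
    if data.startswith(codecs.BOM_UTF8):
        warnings.warn(f"{name} starts with a byte order mark; skipping it")
        data = data[len(codecs.BOM_UTF8):]
    text = io.StringIO(data.decode("utf-8"), newline="")
    return next(csv.reader(text))


columns = read_header_with_bom_check(b"\xef\xbb\xbfCHROM,POS\nchr1,100\n", "example.csv")
print(columns)
```

Files without a BOM pass through unchanged and produce no warning, so the check is safe to run on every input.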