Open taylorreiter opened 1 year ago
Probably should have tagged @bluegenes in this!
some sort of weird formatting issue that affects the csv module but not pandas.read_csv.
The file is in DOS format but ... weird. Nothing (vi, emacs, Mac OS Numbers) has a problem with it!
python code to reproduce:
import csv
r = csv.reader(open(filename, newline=''))
for row in r:
    print(row)
    break
tl;dr open, save as CSV, try again.
ya that's deeply annoying and the solution. I read it into R and wrote it out again and the problems were fixed. Doing so in vim or excel did not fix it. le sigh. thank you for your help!!!!
leave this open and I'll add something to the error output listing the headers that WERE found...
🪄 🌟 thank you!
ah-hah! figured it out:
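this is the "byte order mark (BOM)" at the very start of the file, which marks it as UTF-8 encoded. When the file is decoded as plain UTF-8, the BOM stays attached to the first header name ('\ufeffident' instead of 'ident'). See https://stackoverflow.com/questions/50130605/python-2-7-csv-file-read-write-xef-xbb-xbf-code.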
I'm not sure what the right move is here but at least I know what it is now!
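One possible move, sketched here as an illustration and not as what sourmash does: Python's built-in "utf-8-sig" codec drops a leading BOM when decoding and behaves like plain UTF-8 when there is none. The helper name read_header and the demo file contents below are made up for this example.

```python
import csv
import os
import tempfile


def read_header(path, encoding="utf-8-sig"):
    """Return the first row of a CSV file, decoded with `encoding`.

    Hypothetical helper for illustration only.
    """
    with open(path, newline="", encoding=encoding) as fp:
        return next(csv.reader(fp))


# Write a small BOM-prefixed, DOS-format CSV to a temp file (made-up data).
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"\xef\xbb\xbfident,taxid,superkingdom\r\nGCF_1,123,Bacteria\r\n")
    demo_path = tmp.name

# Plain UTF-8 keeps the BOM as U+FEFF on the first header name.
header_plain = read_header(demo_path, encoding="utf-8")
# utf-8-sig strips the BOM during decoding.
header_sig = read_header(demo_path, encoding="utf-8-sig")
os.remove(demo_path)

print(header_plain)
print(header_sig)
```

The first list starts with '\ufeffident' and the second with 'ident', which matches the symptom: every tool that quietly handles the BOM sees a normal file, while a plain UTF-8 read of the header does not.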
PR #2333 adds the following output:
% sourmash tax summarize /Users/t/Downloads/cheesegenomes.lineages.csv
== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
loading taxonomies...
ERROR while loading taxonomies!
cannot read taxonomy assignments from '/Users/t/Downloads/cheesegenomes.lineages.csv': No taxonomic identifiers found; headers are '\ufeffident','taxid','superkingdom','phylum','class','order','family','genus','species','strain'
Note that the error output ("headers are") will be standard across all CSV-loading attempts; this is just an example using the tax summarize command (also new in #2333).
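For anyone who hits the '\ufeffident' header above, here is a rough sketch of how one could check for the BOM and clean up the header names by hand. This is not code from sourmash or from #2333; has_utf8_bom and strip_bom_from_header are hypothetical names, and only codecs.BOM_UTF8 comes from the standard library.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """True if the raw bytes start with the UTF-8 byte order mark (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def strip_bom_from_header(header):
    """Return a copy of `header` with a leading U+FEFF removed from the first name."""
    if header and header[0].startswith("\ufeff"):
        return [header[0][1:]] + list(header[1:])
    return list(header)


# Made-up inputs mirroring the error message above.
raw_start = b"\xef\xbb\xbfident,taxid,superkingdom\r\n"
found_headers = ["\ufeffident", "taxid", "superkingdom"]

bom_present = has_utf8_bom(raw_start)
cleaned = strip_bom_from_header(found_headers)

print(bom_present)
print(cleaned)
```

To use has_utf8_bom on a real file, read the first three bytes in binary mode, e.g. open(path, 'rb').read(3).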
asking question here:
This Arrow PR adds support for BOM: https://github.com/apache/arrow/pull/11892
Clearer error message added in https://github.com/sourmash-bio/sourmash/pull/2333
Command and output pasted below. Lineages csv attached and reproduced!
cheesegenomes.lineages.csv:
I can't think what would be causing this...I tried to essentially copy the genbank lineage formats.