sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

`sourmash tax prepare` fails with `No taxonomic identifiers found.` #2326

Open taylorreiter opened 1 year ago

taylorreiter commented 1 year ago

Command and output pasted below. Lineages csv attached and reproduced!

sourmash tax prepare --taxonomy-csv inputs/sourmash_databases/cheesegenomes.lineages.csv -o tmp.sqldb

== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading taxonomies...
ERROR while loading taxonomies!
cannot read taxonomy assignments from 'inputs/sourmash_databases/cheesegenomes.lineages.csv': No taxonomic identifiers found.

cheesegenomes.lineages.csv:

ident,taxid,superkingdom,phylum,class,order,family,genus,species,strain
pcamembertiSAM3_3runs.flye.diamond_microbeProteome922.fs_corrected.pilon,5075,Eukaryota,Ascomycota,Eurotiomycetes,Eurotiales,Aspergillaceae,Penicillium,Penicillium camemberti,SAM3_3
pen12.pilon,2720512,Eukaryota,Ascomycota,Eurotiomycetes,Eurotiales,Aspergillaceae,Penicillium,Penicillium sp.,12
rs17.pilon,5081,Eukaryota,Ascomycota,Eurotiomycetes,Eurotiales,Aspergillaceae,Penicillium,Penicillium sp.,RS-17
geo.pilon,1173061,Eukaryota,Ascomycota,Saccharomycetes,Saccharomycetales,Dipodascaceae,Geotrichum,Geotrichum candidum,geo
JBC_canu.pilon,229535,Eukaryota,Ascomycota,Eurotiomycetes,Eurotiales,Aspergillaceae,Penicillium,Penicillium nordicum,JBC
JB370.pilon,40374,Eukaryota,Ascomycota,Sordariomycetes,Microascales,Microascaceae,Scopulariopsis,Scopulariopsis sp.,JB370
135e.pilon,45537,Eukaryota,Ascomycota,Saccharomycetes,Saccharomycetales,,Diutina,Diutina catenulata,135e
135B.pilon,4959,Eukaryota,Ascomycota,Saccharomycetes,Saccharomycetales,Debaryomycetaceae,Debaryomyces,Debaryomyces hansenii,135B

I can't think what would be causing this...I tried to essentially copy the genbank lineage formats.

taylorreiter commented 1 year ago

Probably should have tagged @bluegenes in this!

ctb commented 1 year ago

some sort of weird formatting issue that affects the csv module but not pandas.read_csv.

The file is in DOS format but ... weird. Nothing (vi, emacs, Mac OS Numbers) has a problem with it!

python code to reproduce:

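import csv

# path to the lineages CSV from the report above
filename = 'inputs/sourmash_databases/cheesegenomes.lineages.csv'

# newline='' lets the csv module handle line endings itself
with open(filename, newline='') as fp:
    r = csv.reader(fp)
    for row in r:
        print(row)  # print only the header row
        break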

tl;dr open, save as CSV, try again.

taylorreiter commented 1 year ago

ya that's deeply annoying and the solution. I read it into R and wrote it out again and the problems were fixed. Doing so in vim or excel did not fix it. le sigh. thank you for your help!!!!

ctb commented 1 year ago

leave this open and I'll add something to the error output listing the headers that WERE found...

taylorreiter commented 1 year ago

🪄 🌟 thank you!

ctb commented 1 year ago

ah-hah! figured it out:

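the file starts with a "byte order mark (BOM)", which marks it as UTF-8 encoded. The csv module reads the BOM as part of the first header, so that column comes back as '\ufeffident' rather than 'ident', which is presumably why no identifiers are found. See https://stackoverflow.com/questions/50130605/python-2-7-csv-file-read-write-xef-xbb-xbf-code.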

I'm not sure what the right move is here but at least I know what it is now!
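
For anyone hitting the same thing, here is a minimal sketch of one way to detect and work around a UTF-8 BOM using only the Python standard library. It is not what sourmash does internally. The helper names `has_utf8_bom` and `read_header` and the demo file are made up for illustration. The `utf-8-sig` codec is a standard Python codec that drops a leading BOM when one is present.

```python
import codecs
import csv
import os
import tempfile


def has_utf8_bom(path):
    """Return True if the file starts with the UTF-8 byte order mark."""
    with open(path, "rb") as fp:
        return fp.read(3) == codecs.BOM_UTF8


def read_header(path, encoding="utf-8-sig"):
    """Read the first CSV row; 'utf-8-sig' drops a leading BOM if present."""
    with open(path, newline="", encoding=encoding) as fp:
        return next(csv.reader(fp))


# Throwaway demo file that mimics the problem: BOM plus DOS line endings.
# The file name and contents are assumptions for illustration only.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.lineages.csv")
with open(demo_path, "wb") as fp:
    fp.write(codecs.BOM_UTF8 + b"ident,taxid,superkingdom\r\n")
    fp.write(b"geo.pilon,1173061,Eukaryota\r\n")

print(has_utf8_bom(demo_path))                   # BOM check on the raw bytes
print(read_header(demo_path, encoding="utf-8"))  # BOM stays glued to 'ident'
print(read_header(demo_path))                    # BOM removed by 'utf-8-sig'
```

Reading with `utf-8-sig` is harmless for files without a BOM, so it can be used as a default when the input encoding is expected to be UTF-8.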

ctb commented 1 year ago

PR #2333 adds the following output:

% sourmash tax summarize /Users/t/Downloads/cheesegenomes.lineages.csv

== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading taxonomies...
ERROR while loading taxonomies!
cannot read taxonomy assignments from '/Users/t/Downloads/cheesegenomes.lineages.csv': No taxonomic identifiers found; headers are '\ufeffident','taxid','superkingdom','phylum','class','order','family','genus','species','strain'

Note that the error output ("headers are") will be standard across all CSV-loading attempts; this is just an example using the tax summarize command (also new in #2333).
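
As a rough illustration of the idea (not the actual code in #2333), a loader can report the headers it did find when the expected column is missing. The function `load_identifiers` and the assumption that the identifier column is named exactly `ident` are hypothetical; the message format simply imitates the output shown above.

```python
import csv
import io


def load_identifiers(fp):
    """Hypothetical loader: return the 'ident' column or fail loudly."""
    reader = csv.DictReader(fp)
    headers = reader.fieldnames or []
    if "ident" not in headers:
        # repr() makes invisible characters such as a BOM show up
        found = ",".join(repr(h) for h in headers)
        raise ValueError(
            f"No taxonomic identifiers found; headers are {found}"
        )
    return [row["ident"] for row in reader]


# In-memory stand-ins for a clean file and a file with a leading BOM.
good = io.StringIO("ident,taxid\r\ngeo.pilon,1173061\r\n", newline="")
bad = io.StringIO("\ufeffident,taxid\r\ngeo.pilon,1173061\r\n", newline="")

print(load_identifiers(good))
try:
    load_identifiers(bad)
except ValueError as exc:
    print(exc)
```

Printing the headers with `repr()` is the useful part: the BOM is invisible in most editors, but it shows up as an escape sequence in the error message.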

ctb commented 1 year ago

asking question here:

https://twitter.com/ctitusbrown/status/1581666825623855104

ctb commented 1 year ago

This Arrow PR adds support for BOM: https://github.com/apache/arrow/pull/11892

ctb commented 1 year ago

Clearer error message added in https://github.com/sourmash-bio/sourmash/pull/2333