violetmadder opened this issue 6 years ago
The csv content I'm pulling from the USDA PLANTS database has weird characters in it causing errors like this:

I'm pretty sure the offending bits are in the 'Species' field of the csv, but I haven't figured out how to home in on what they are, and I don't know how to adjust my readcsv function so it can handle them.
Sounds like it's just an encoding issue. You need to figure out the encoding of the document you're reading and ensure Python is using the same encoding to "decode" it properly.
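For example, here's a rough sketch of how you could guess the encoding from the raw bytes using the third-party `chardet` package (the package choice and the `plants.txt` filename are assumptions for illustration, not anything from this thread):

```python
# Guess the file's encoding from its raw bytes, then decode with that guess.
# Requires: pip install chardet
import chardet

with open('plants.txt', 'rb') as f:  # 'rb' = raw bytes, no decoding yet
    raw = f.read()

guess = chardet.detect(raw)  # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
print(guess)

text = raw.decode(guess['encoding'])  # decode using the detected encoding
```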
Right. I don't know how to do that. I'm trying to find an example of an offending character so I can narrow it down, but there's got to be a better way than trying to scan it with my eyeballs (which is difficult because the columns are not at all lined up so I have to scroll around sideways, and the file is large).
The USDA site says this in its FAQ:

> Most of our data are downloaded as uncompressed ASCII text
but it seems to me ASCII wouldn't be causing this sort of error? Or does 'uncompressed' mean it's somehow weird?
No. Uncompressed means raw. Shouldn't be causing issues. Looking into it...
Seems to import just fine into Excel as US-ASCII (no warnings). Do you want me to export it as some other format?
FWIW ☝️
Thanks for the xlsx, it appears to work just fine...
But something is seriously weird about the original text I got from the USDA site. I tried putting it in Google Docs, and it wouldn't even open the file!
I've been using multiple different searches on the USDA database-- sometimes the text from the results has had this bug, other times it didn't. No idea what the cause is.
How did you save the text though? Did they provide the file or the raw text?
For example, here's what it would look like if you save a file from Notepad:
From here-- if I check the box that says 'Download text file without formatted display', it just comes up in my browser as a mass of text, which I then copy and paste into Notepad++ and save as a 'normal text file', which doesn't have the menu in your image. I'll try vanilla Notepad instead... and just now did that, and it crashed. Er... I'll get back to you on that.
https://plants.usda.gov/adv_search.html
Only some of their plants have extended information like pH requirements, bloom season etc. Some have Accepted Names and some are just a big mess of synonyms, which I don't know quite how to handle yet, and then there's all the cultivars and hybrids, and all sorts of stuff. There's no simple way that I've found to just download the entire database and figure things out from there, so I've been trying to do different searches and then cobble together the results I need.
I see what you're talking about now. I downloaded some data here. You can't just copy/paste it per se, because you don't know the encoding, right? What you can do is check the response in your browser's developer tools and look at the Content-Type header, which is where the server declares the charset.
If they've accurately set the encoding, then UTF-8 should be correct. If not, it's anyone's guess and they've done us all a disservice by missing that tidbit of information.
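For what it's worth, you can also ask the server directly; a sketch with the `requests` library (the URL is the servlet address that comes up later in this thread, and a bare GET without your search parameters may not return real data):

```python
# Print the encoding the server claims in its HTTP headers.
# Requires: pip install requests
import requests

resp = requests.get('https://plants.usda.gov/java/AdvancedSearchServlet')
print(resp.headers.get('Content-Type'))  # e.g. text/plain;charset=UTF-8
print(resp.encoding)                     # requests' reading of that charset
```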
That said, from most editors you'll have a way to save with some sort of encoding. I've seen "Save with encoding..." in one editor before, but in vscode I just hit Ctrl+Shift+P and type "encoding" to see the following:
and then this...
See how many choices there are! And there's a scrollbar!
Anyway, I think it's pretty much the same in Sublime, last time I used it.
BTW, I strongly suggest you switch to vscode. It is the absolute shit! (in a good way)
Wow, interesting-- for me, the results come back on this address instead:
https://plants.usda.gov/java/AdvancedSearchServlet
No document to handle the way you're describing above. Wonder what we did differently.
Okay meanwhile I think I found a weird character. They're using "×" in hybrid names and a few other places. Not sure if that's enough to cause encoding freakouts though.
Regardless of where you end up, the steps should be the same for identifying the encoding.
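And that `×` you found is more than enough to do it: it's outside the 0-127 ASCII range, so reading it with an ASCII (or otherwise mismatched) codec will throw. A quick sketch:

```python
# The multiplication sign used in hybrid names is not ASCII.
ch = '×'
print(ord(ch))             # 215, above the ASCII cutoff of 127
print(ch.encode('utf-8'))  # b'\xc3\x97' (two bytes in UTF-8)
ch.encode('ascii')         # raises UnicodeEncodeError
```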
I wonder if there's a way you could simply strip out all unrecognized characters as you read in the input file.
When I'm looking at the AdvancedSearchServlet page, there's no document to examine with the developer-tools steps you described earlier.
Because we're doing something different. I need more exact steps to reproduce.
You could strip characters if you want. I would do that with a regular expression that eliminates anything that isn't a word character or whitespace, perhaps? You might have to allow some more characters in there, like hyphens and apostrophes, too.
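Something like this, as a rough sketch (the sample string and the exact character classes are just illustrations; tune them to your data):

```python
import re

line = 'Crataegus ×media Bechst.'  # a hybrid name with the × sign
# Keep ASCII word characters, whitespace, hyphens, and apostrophes; drop the rest.
# Note: with re.ASCII, accented letters like ü get stripped too.
cleaned = re.sub(r"[^\w\s'\-]", '', line, flags=re.ASCII)
print(cleaned)  # Crataegus media Bechst
```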
Exact steps to reproduce... well, I go to the advanced search page, I hit some checkboxes on random things I want displayed, then check the "Download text file without formatted display" checkbox, then hit the Display Results button. I've only ever had results on the AdvancedSearchServlet address, I've never once seen it come up with the downloadData address you got with the doc in it.
In the meantime, though, I tried out vscode and I don't know what I did but even the code that was working before has now gone all goofy. So I dunno wtf I'm doing.
OK, I got the same results page as you now. It's also `text/plain;charset=UTF-8`. I also see the `×` character in the results.
What were you doing different that gave you the doc? I'm curious to try it if I can.
MEANWHILE, I tried adding `errors='ignore'` to the readcsv function, and now the code does run, outputting what looks like a mostly intact json dictionary.
I'm seeing weird characters turning up in that json file as things like `\u00b0F` and `\u00fc`. I can work with that for the moment, I think.
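For reference, the relevant change looks roughly like this (a sketch; the filename is a placeholder, and the open call stands in for whatever the readcsv wrapper does):

```python
import csv

# errors='ignore' silently drops any bytes that fail to decode, so the
# offending characters never reach the csv reader. Convenient, but lossy:
# characters like ° and ü simply vanish from the data.
with open('plants.txt', 'r', encoding='utf-8', errors='ignore', newline='') as f:
    rows = list(csv.reader(f))
```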
@violetmadder I did essentially the same thing, except I didn't check the "Download text file without formatted display" box. Once the results came up, I clicked the "Download" link at the top-right of the page.
As for `\u00b0F`, the part you're interested in, actually, is just the `\u00b0` portion of that, which represents the `°` character, encoded as UTF-16 (hex). Try this in your browser's developer console:
```js
console.log('\u00b0');  // °
console.log('\u00b0F'); // °F
```
As you can see, it's talking about `°F` and `°C`. That makes sense! The same could be represented in HTML as `&deg;`, which, if I type it in this GitHub comment, automatically converts to ° if I don't wrap it in backticks.
There are several different variations of the 8-bit ASCII table. The table below is according to ISO 8859-1, also called ISO Latin-1. Codes 128-159 contain the Microsoft® Windows Latin-1 extended characters.
| DEC | OCT | HEX | BIN | Symbol | HTML Number | HTML Name | Description |
|---|---|---|---|---|---|---|---|
| 176 | 260 | B0 | 10110000 | ° | `&#176;` | `&deg;` | Degree sign |
The fact that the `b0` is preceded by a `00` tells me that you're dealing with UTF-16 (hex) encoding. See this page. See also Wikipedia | UTF-16.
Before I tell you how to read a UTF-16 file, I have to tell you first how to SAVE a UTF-16 file. I'm only going to show you how to do that in vscode.
1. Copy the `°` character (or any amount of UTF-16-encoded text from the USDA site).
2. Paste it into a new file in vscode.
3. Go to View -> Command Palette... to view the command palette.
4. Type "encoding", pick "Save with Encoding", and choose UTF-16 LE.

So, how do you read it in Python (I hear you asking)? Excellent question!
```python
>>> import io
>>> file = io.open('foo.txt', 'r', encoding='utf-16-le')
>>> file.read()
'\ufeff°'
```
And what is this `\ufeff` that precedes the `°` (I also hear you asking)? Excellent question again! That is what's called a Byte Order Mark (BOM). Start reading here and make sure to read the table below it here too.
I happen to know a lot about this, because I created this npm package called bombom a while back that nobody uses. Anyway, I mapped those BOMs here.
Basically, the BOM is just a tidbit of data at the beginning of a file that tells editors what kind of file encoding they are dealing with so they don't have to "guess." That's extremely useful! If you had that BOM on the data that the USDA was providing or if they described the proper encoding in the HTTP headers like they should be, then we wouldn't have to go this great length just to detect what we were dealing with in the first place, right?
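You can see the BOM for yourself by peeking at the raw bytes of the `foo.txt` file saved above (a quick sketch):

```python
with open('foo.txt', 'rb') as f:
    print(f.read(4))  # b'\xff\xfe\xb0\x00': FF FE is the UTF-16 LE BOM, B0 00 is °
```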
I like BOMs. Some people don't, because they can create garbage in a command line in some scenarios. I get it, but I want to know WTF file I'm dealing with when I read it, don't you? This is why I created bombom, so that I could call `bombom#detect(buffer)` to detect the encoding. What that function does is peek at the first few bytes of the file raw (as a buffer) and look for those BOM signatures without opening the entire file. Once you have that, you can confidently open the file with the proper encoding!
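If you're in Python rather than Node, the standard library's `codecs` module ships the same BOM signatures, so a minimal detect sketch looks like this:

```python
import codecs

# Check longer BOMs first: the UTF-32 LE BOM starts with the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
]

def detect_bom(path):
    """Peek at the first few bytes and return the encoding its BOM implies."""
    with open(path, 'rb') as f:
        head = f.read(4)  # the longest BOM is 4 bytes
    for bom, encoding in BOMS:
        if head.startswith(bom):
            return encoding
    return None  # no BOM; you're back to guessing
```

Handy detail: opening with `encoding='utf-16'` (no `-le`) makes Python consume the BOM for you, which is why the `utf-16-le` read above left the `\ufeff` sitting in the string.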
Clear as mud?
DUDE
....And why the hell do they have the "download" link only showing up on the search results when you HAVEN'T hit the "Download as text" checkbox? This PLANTS project must not be a very serious USDA priority, it's a bit of a mess.
When I try to hit that Download link on the upper right, it tells me

> No Data Found
> Please try a different query

Even though the formatted search results were right there on display. Yeesh.
You think you can take it from here? Do you have everything you need to make some progress?
The io.open version of my readcsv function is giving me a new error:

`Error: field larger than field limit (131072)`

It might have something to do with me not being able to save the file in the manner you described and still copying and pasting instead.
At any rate! The `\u00` codes I can work with-- I mean, I can easily find and replace them, there aren't very many, and in the meantime the code at least runs. It's enough to let me focus on making sure the dictionaries are working the way I want them to. I'm sure I'll come up with more questions as I go, but at least we got this thing moving! Thanks so much!
How big is the file you saved? I think it's complaining that a single field in your file is too large, not the file itself. If that's the case, you might have to read one buffer at a time until you're done.
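For what it's worth, 131072 is the `csv` module's default cap on a single field, and it can be raised; a sketch of that workaround (though a monster field often means a stray quote character is gluing many rows together, so eyeball the data too):

```python
import csv

# The csv module rejects any single field longer than 131072 characters
# by default. Raising the cap lets the oversized field through.
csv.field_size_limit(10 ** 7)  # returns the old limit
```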
Why are you still copying/pasting? You can't save it the way I said? What's wrong?
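Also, about those `\u00` escapes in the json output: if that file ever goes back through `json.load`, Python decodes the escapes into the real characters for you, so find-and-replace shouldn't even be necessary (a quick sketch):

```python
import json

# The \u escapes are just ASCII-safe spellings of the real characters.
print(json.loads('"\\u00b0F \\u00fc"'))  # °F ü
```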
Yeah, I can't save it the way you said. When I click the download link that shows up in the upper righthand corner of the formatted search results display, it tells me "No Data Found Please try a different query" instead of giving me a document.
Well, how did you save the first document? Just take that document and re-save it as UTF-16 LE. Should work!
After that, the vanilla `readcsv()` now gives the same field error the `io.open` one does.
See https://github.com/violetmadder/plants/pull/6 for a fix for that.
Gotcha! That should work!