violetmadder / plants

Organizing data on plant species and cultivation, for visualization and permaculture design.
MIT License

Encoding error happening from USDAsearch.txt #3

Open violetmadder opened 6 years ago

violetmadder commented 6 years ago

The csv content I'm pulling from the USDA PLANTS database has weird characters in it causing errors like this:

File "C:\ProgramData\Anaconda3\lib\encodings\cp1252.py", line 23, in decode return codecs.charmap_decode(input,self.errors,decoding_table)[0]

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 6024: character maps to \

I'm pretty sure the offending bits are in the 'Species' field of the csv, but I haven't figured out how to home in on what they are, and I don't know how to adjust my readcsv function so it can handle them.
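For context, the traceback shows Python falling back to the Windows default cp1252 codec. A minimal sketch of passing an explicit encoding to the reader instead (the utf-8 guess is an assumption; the filename comes from the issue title):

import csv

# Sketch only: pass an explicit encoding instead of relying on the platform
# default (cp1252 on Windows). The utf-8 guess here is an assumption.
with open('USDAsearch.txt', newline='', encoding='utf-8') as f:
    for row in csv.reader(f):
        print(row)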

jednano commented 6 years ago

Sounds like it's just an encoding issue. You need to figure out the encoding of the document you're reading and ensure Python is using the same encoding to "decode" it properly.

violetmadder commented 6 years ago

Right. I don't know how to do that. I'm trying to find an example of an offending character so I can narrow it down, but there's got to be a better way than trying to scan it with my eyeballs (which is difficult because the columns are not at all lined up so I have to scroll around sideways, and the file is large).
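One way to avoid scanning by eye: read the file as raw bytes and print the offset and surrounding context of every byte outside the ASCII range. A minimal sketch (filename taken from the issue title):

# Sketch: locate non-ASCII bytes instead of hunting for them by eye.
with open('USDAsearch.txt', 'rb') as f:
    data = f.read()
for pos, byte in enumerate(data):
    if byte > 0x7f:
        # Print the offset, the byte value, and a little surrounding context.
        print(pos, hex(byte), data[max(0, pos - 20):pos + 20])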

violetmadder commented 6 years ago

The usda site says this in its FAQ,

Most of our data are downloaded as uncompressed ASCII text

but it seems to me ASCII wouldn't be causing this sort of error? Or does 'uncompressed' mean it's somehow weird?

jednano commented 6 years ago

No. Uncompressed means raw. Shouldn't be causing issues. Looking into it...

jednano commented 6 years ago

Seems to import just fine into Excel as US-ASCII (no warnings). Do you want me to export it as some other format?

jednano commented 6 years ago

USDAsearch.xlsx

FWIW ☝️

violetmadder commented 6 years ago

Thanks for the xlsx, it appears to work just fine...

But something is seriously weird about the original text I got from the USDA site. I tried putting it in Google Docs, and it wouldn't even open the file!

I've been using multiple different searches on the USDA database-- sometimes the text from the results has had this bug, other times it didn't. No idea what the cause is.

jednano commented 6 years ago

How did you save the text though? Did they provide the file or the raw text?

jednano commented 6 years ago

For example, here's what it would look like if you save a file from Notepad:

image

violetmadder commented 6 years ago

From here: https://plants.usda.gov/adv_search.html

If I check the box that says 'Download text file without formatted display', the results just come up in my browser as a mass of text, which I then copy & paste into Notepad++ and save as a "normal text file"; that doesn't have the menu in your image. I'll try vanilla Notepad instead... and just now did that, and it crashed. Er... I'll get back to you on that.

Only some of their plants have extended information like pH requirements, bloom season, etc. Some have Accepted Names and some are just a big mess of synonyms, which I don't quite know how to handle yet, and then there are all the cultivars and hybrids and all sorts of other stuff. I haven't found any simple way to just download the entire database and figure things out from there, so I've been trying different searches and cobbling together the results I need.

jednano commented 6 years ago

I see what you're talking about now. I downloaded some data here. You can't just copy/paste it per se, because you don't know the encoding, right? What you can do is:

  1. Hit Ctrl+Shift+J in Chrome or navigate to More Tools... Developer Tools to open the Developer Tools window.
  2. Select the Network tab and ensure "All" or "Doc" is selected in the types of responses you wish to see.
  3. You should see a document named "downloadData?fileName=garbage". If not, try refreshing the page.
  4. Select the document name and see the response headers to the right. It should look something like this:

image

If they've accurately set the encoding, then UTF-8 should be correct. If not, it's anyone's guess and they've done us all a disservice by missing that tidbit of information.
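The same header check can be done from Python instead of the browser, for example with the requests library; the URL below is just a placeholder for whatever search you actually ran:

import requests

# Placeholder URL; substitute the real search/download URL with its parameters.
response = requests.get('https://plants.usda.gov/java/AdvancedSearchServlet')
print(response.headers.get('Content-Type'))  # e.g. text/plain;charset=UTF-8
print(response.encoding)                     # what requests will use to decode
print(response.apparent_encoding)            # a guess based on the bytes themselves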

That said, most editors give you some way to save with a specific encoding. I've seen "Save with encoding..." in one editor before, but in vscode I just hit Ctrl+Shift+P and type "encoding" to see the following:

image

and then this...

image

See how many choices there are! And there's a scrollbar!

Anyway, I think it's pretty much the same in Sublime, last time I used it.

jednano commented 6 years ago

BTW, I strongly suggest you switch to vscode. It is the absolute shit! (in a good way)

violetmadder commented 6 years ago

Wow, interesting-- for me, the results come back on this address instead:

https://plants.usda.gov/java/AdvancedSearchServlet

No document to handle the way you're describing above. Wonder what we did differently.

violetmadder commented 6 years ago

Okay meanwhile I think I found a weird character. They're using "×" in hybrid names and a few other places. Not sure if that's enough to cause encoding freakouts though.

jednano commented 6 years ago

Regardless of where you end up, the steps should be the same for identifying the encoding.

amusselm commented 6 years ago

I wonder if there's a way you could simply strip out all unrecognized characters as you read in the input file.

violetmadder commented 6 years ago

When I'm looking at the AdvancedSearchServlet page, there's no document to examine with the developer-tools steps you described earlier.

untitled

jednano commented 6 years ago

That's because we're doing something different. I need more exact steps to reproduce.

jednano commented 6 years ago

You could strip characters if you want. I would do that with a regular expression that eliminates anything that isn't a word character or whitespace, perhaps? You might have to allow some more stuff, like hyphens and apostrophes, too.
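A rough sketch of that idea; the character class below is only an example of what to keep, not a definitive whitelist for this data:

import re

# Keep word characters, whitespace, and a little punctuation; drop the rest.
# The exact whitelist is a judgment call and would need tuning.
def strip_unexpected(text):
    return re.sub(r"[^\w\s.,;:'()/-]", '', text)

print(strip_unexpected('Salix \u00d7 sepulcralis'))  # the x-like hybrid sign is dropped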

violetmadder commented 6 years ago

Exact steps to reproduce... well, I go to the advanced search page, I hit some checkboxes on random things I want displayed, then check the "Download text file without formatted display" checkbox, then hit the Display Results button. I've only ever had results on the AdvancedSearchServlet address, I've never once seen it come up with the downloadData address you got with the doc in it.

In the meantime, though, I tried out vscode and I don't know what I did but even the code that was working before has now gone all goofy. So I dunno wtf I'm doing.

jednano commented 6 years ago

OK I got the same results page as you now. It's also text/plain;charset=UTF-8.

image

jednano commented 6 years ago

I also see the × character in the results.

violetmadder commented 6 years ago

What were you doing different that gave you the doc? I'm curious to try it if I can.

violetmadder commented 6 years ago

MEANWHILE, I tried adding errors='ignore' to the readcsv function, and now the code does run, outputting what looks like a mostly intact json dictionary.

I'm seeing weird characters turning up in that json file as things like \u00b0F and \u00fc. I can work with that for the moment, I think.
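For reference, that change is roughly this shape (a sketch only; the filename and the utf-8 guess are assumptions, and errors='ignore' silently drops any bytes it can't decode):

import csv

# errors='ignore' drops undecodable bytes instead of raising UnicodeDecodeError.
with open('USDAsearch.txt', newline='', encoding='utf-8', errors='ignore') as f:
    rows = list(csv.reader(f))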

jednano commented 6 years ago

@violetmadder I did essentially the same thing, except I didn't check the box: "Download text file without formatted display." Once there, I clicked the "Download" link at the top-right of the page.

image

As for \u00b0F, the part you're interested in is actually just the \u00b0 portion, which is the escape for the ° character (Unicode code point U+00B0, written in hex). Try this in your browser's developer console:

console.log('\u00b0');  // °
console.log('\u00b0F'); // °F

As you can see, it's talking about °F and °C. That makes sense! The same character can be represented in HTML as &deg;, which, if I type it in this GitHub comment without wrapping it in backticks, automatically converts to °.
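For what it's worth, Python behaves the same way, and its json module escapes all non-ASCII characters like this by default, which is one likely source of the \u00b0F strings in the output:

import json

print('\u00b0F')                             # °F
print(json.dumps('°F'))                      # "\u00b0F" (ensure_ascii=True is the default)
print(json.dumps('°F', ensure_ascii=False))  # "°F"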

The extended ASCII codes (character code 128-255)

There are several different variations of the 8-bit ASCII table. The table below is according to ISO 8859-1, also called ISO Latin-1. Codes 128-159 contain the Microsoft® Windows Latin-1 extended characters.

DEC: 176   OCT: 260   HEX: B0   BIN: 10110000   Symbol: °   HTML Number: &#176;   HTML Name: &deg;   Description: Degree sign

The fact that the b0 is preceded by a 00 tells me that you're dealing with UTF-16 (hex) encoding. See this page. See also Wikipedia | UTF-16.

Before I tell you how to read a UTF-16 file, I have to tell you first how to SAVE a UTF-16 file. I'm only going to show you how to do that in vscode.

  1. Create a new file.
  2. Paste-in the ° character or any amount of UTF-16-encoded text from the USDA site.
  3. Hit Ctrl + Shift + P or View -> Command Palette... to view the command palette.
  4. Type "enco" to filter the command list down to "Change File Encoding". You are presented with more selections.
  5. Select "UTF-16 LE utf16le"
  6. Save the file somewhere.

So, how do you read it in Python (I hear you asking)? Excellent question!

import io
file = io.open('foo.txt', 'r', encoding='utf-16-le')
file.read()  # => '\ufeff°'

See Python Docs | io.open.

And what is this \ufeff that precedes the ° (I also hear you asking)? Excellent question again! That is what's called a Byte Order Mark (BOM). Start reading here and make sure to read the table below it here too.

I happen to know a lot about this, because I created this npm package called bombom a while back that nobody uses. Anyway, I mapped those BOMs here.

Basically, the BOM is just a tidbit of data at the beginning of a file that tells editors what kind of file encoding they're dealing with so they don't have to "guess." That's extremely useful! If the data the USDA provides carried that BOM, or if they described the proper encoding in the HTTP headers like they should, we wouldn't have to go to these great lengths just to detect what we're dealing with in the first place, right?

I like BOMs. Some people don't, because they can create garbage on a command line in some scenarios. I get it, but I want to know WTF file I'm dealing with when I read it, don't you? This is why I created bombom: so that I could call bombom#detect(buffer) to detect the encoding. What this function does is peek at the first few raw bytes of a file (as a buffer) and look for those BOM signatures without reading the entire file. Once you have that, you can confidently open the file with the proper encoding!
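A minimal Python sketch of that detect-by-peeking idea (just the concept, not the bombom API), checking the first few bytes against some well-known BOM signatures:

import codecs

# Well-known BOM signatures mapped to a codec that can decode the file.
BOMS = [
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
]

def sniff_encoding(path, default='utf-8'):
    # Peek at only the first few raw bytes rather than reading the whole file.
    with open(path, 'rb') as f:
        head = f.read(4)
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return default

print(sniff_encoding('foo.txt'))  # e.g. 'utf-16-le' for the file saved above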

Clear as mud?

violetmadder commented 6 years ago

DUDE

violetmadder commented 6 years ago

....And why the hell do they have the "download" link only show up on the search results when you HAVEN'T hit the "Download as text" checkbox? This PLANTS project must not be a very serious USDA priority; it's a bit of a mess.

When I try to hit that Download link on the upper right, it tells me

No Data Found

Please try a different query

Even though the formatted search results were right there on display. Yeesh.

jednano commented 6 years ago

You think you can take it from here? Do you have everything you need to make some progress?

violetmadder commented 6 years ago

The io.open version of my readcsv function is giving me a new error:

Error: field larger than field limit (131072)

It might have something to do with the fact that I couldn't save the file the way you described, so I'm still copying & pasting.

At any rate! The \u00 codes I can work with-- I mean, I can easily find and replace them, there aren't very many, and in the meantime the code at least runs. It's enough to let me focus on making sure the dictionaries work the way I want them to. I'm sure I'll come up with more questions as I go, but at least we got this thing moving! Thanks so much!
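For reference, that limit comes from Python's csv module, which caps individual fields at 131072 characters by default; one common workaround (a sketch only, not necessarily the fix the project ends up with) is to raise the limit before parsing:

import csv

# The default per-field limit is 131072 characters; raise it before parsing.
# A concrete large number avoids the OverflowError that sys.maxsize can
# trigger on some platforms.
csv.field_size_limit(10**7)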

jednano commented 6 years ago

How big is the file you saved? I think it's complaining that your file is too large. If that's the case, you might have to read one buffer at a time until you're done.

Why are you still copying/pasting? You can't save it the way I said? What's wrong?

violetmadder commented 6 years ago

Yeah, I can't save it the way you said. When I click the download link that shows up in the upper righthand corner of the formatted search results display, it tells me "No Data Found Please try a different query" instead of giving me a document.

jednano commented 6 years ago

Well, how did you save the first document? Just take that document and re-save it as UTF-16 LE. Should work!

violetmadder commented 6 years ago

After that, the vanilla readcsv() now gives the same field-limit error the io.open one does.

jednano commented 6 years ago

See https://github.com/violetmadder/plants/pull/6 for a fix for that.

violetmadder commented 6 years ago

Gotcha! That should work!