Check to strip extraneous .csv information?

pea-arthur commented 8 years ago

I apologize if I'm breaking some community guidelines, since this is my first comment on GitHub. Perhaps it belongs on the (nonexistent) wiki page.

This code really is awesome. I work a lot with IDML. I appreciate the Unicode update.

I wanted to give a tip to anyone using Unicode characters and saving the .csv file as utf-8 with Windows Notepad. A byte order mark (BOM) of 3 extra bytes (EF BB BF) is prepended to the file, which causes an error. There may also be some issues with space characters (A0 D0).

Unfortunately there is no offset function in codecs.open() like regular open().

The solution may be to 1) Open the file using shutil 2) Create a copy in memory 3) Close the file 4) Perform a try statement which checks whether the first 3 bytes are equal to EF BB BF and removes them if they are 5) Use codecs.open() on the stripped file

Or create a dictionary of known codes which cause problems and instruct it to skip them. While I'm not having any trouble with Chinese characters, some Korean ones are not passing correctly.
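A minimal sketch of the byte-level check described in steps 1 to 5 above. It uses plain open() in binary mode rather than shutil, and the function names strip_utf8_bom and read_csv_text are hypothetical, not part of the script:

```python
import codecs


def strip_utf8_bom(data):
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def read_csv_text(path):
    """Read the whole file as bytes, drop any BOM, then decode it as UTF-8."""
    with open(path, 'rb') as f:
        raw = f.read()
    return strip_utf8_bom(raw).decode('utf-8')
```

Files without a BOM pass through unchanged, so the same call should be usable for both kinds of input.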

bsalinas commented 8 years ago

Hi @pea-arthur. Thanks for the tip. I am happy to add this into the README, but it sounds like it might be more useful for folks if we are able to fix this problem altogether. http://stackoverflow.com/a/14786752 suggests we might try using the csv module in Python 3. Do you have a specific CSV file that you have used that has caused this issue?

I think the idea of a dictionary of codes to skip also could make sense as a separate feature.

-Ben

pea-arthur commented 8 years ago

Absolutely, I'll provide you with a file. I will take a look at that Python 3 library. I need to do some pre-processing on my CSV files anyway: currently my data only takes up half an A4 page, and I want to merge every 2 rows together to make better use of the space.

Disregard the Korean character issues. I was able to get IDLE to output the files when the Windows command line would not.

However, I noticed that if file names were too similar, shutil would create subdirectories with files inside, and on occasion the tmp directory was not deleted, or Windows would not 'let go' of it fast enough, resulting in a failure of the job. A few checks here and there would make this a little more robust. For example, if a row in the CSV doesn't have the expected number of entries, it crashes.

I will get back to you soon.
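As an illustration of the kind of check meant above for rows with the wrong number of entries, here is a rough sketch. The helper name check_row_lengths is hypothetical, and it assumes the first row is the header, as in the script:

```python
import csv
import io


def check_row_lengths(rows):
    """Split rows into (header, good, bad) by comparing each length to the header's.

    `rows` is any iterable of lists, for example a csv.reader.
    `bad` holds (row_number, row) pairs, counting the header as row 1.
    """
    rows = iter(rows)
    header = next(rows)
    good, bad = [], []
    for row_no, row in enumerate(rows, start=2):
        if len(row) == len(header):
            good.append(row)
        else:
            bad.append((row_no, row))
    return header, good, bad


# Usage with in-memory sample data (the names are made up):
sample = io.StringIO("First Name,Last Name\nAda,Lovelace\nGrace\n")
header, good, bad = check_row_lengths(csv.reader(sample))
for row_no, row in bad:
    print("Skipping row {}: expected {} entries, got {}".format(
        row_no, len(header), len(row)))
```

Reporting and skipping the bad rows is only one option; stopping with a clear message might suit some jobs better.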

pea-arthur commented 8 years ago

I double-checked: since the shebang line is #!/usr/bin/python3, aren't we already using Python 3? https://docs.python.org/3/library/csv.html My version of Python has version 3.5 changes.

Here is a text file manipulated with Windows Notepad.

original_template_bad_data.txt

Here's my BOM check using .seek() after the file is opened using codecs:

def skip_csv_bom(check_file):
    print("Checking for BOM")
    check_file.seek(0)
    # The 3 BOM bytes (EF BB BF) decode to this single character in utf-8
    if check_file.read(1) == '\ufeff':
        # By calling .read(1) the offending data has been skipped over
        print('Yes BOM')
    else:
        print("No BOM, read file from beginning")
        # This is needed since the .read(1) above advanced the file position
        check_file.seek(0)

I also added the following functions to check to make sure that the tmp directory was deleted. This is helpful for people like me who frequently encountered failures.

import os
import shutil

def remove_tmp():
    # Delete the 'tmp' directory if it is still there
    if os.path.isdir('tmp'):
        shutil.rmtree('tmp/')

def remove_tmp_prompt():
    # Ask before deleting; any answer other than 'n' deletes the directory
    if os.path.isdir('tmp'):
        dlt = input("Delete 'tmp' directory? (y/n)").lower()
        if dlt != 'n':
            remove_tmp()
        else:
            print("Manually remove 'tmp/'")

Also, more Windows strangeness. I got the following error, which I'm fairly confident is caused by manipulating the Windows file system too quickly. My solution was to implement a time.sleep(.5) command and it went away.

    with codecs.open('tmp.tmp', "w", encoding='utf-8') as fout:
  File "E:\WinPython-64bit-3.4.3.7\python-3.4.3.amd64\lib\codecs.py", line 891, in open
    file = builtins.open(filename, mode, buffering)
PermissionError: [Errno 13] Permission denied: 'tmp.tmp'
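Instead of a fixed sleep before every write, one option is to retry only when the PermissionError actually shows up. This is a sketch, not what the script does; retry_on_permission_error is a hypothetical helper, and the attempt count and delay are guesses:

```python
import time


def retry_on_permission_error(action, attempts=5, delay=0.5):
    """Call action() and return its result.

    If it raises PermissionError, wait `delay` seconds and try again,
    up to `attempts` calls in total. The last failure is re-raised.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except PermissionError:
            if attempt == attempts:
                raise
            time.sleep(delay)
```

For the failing line in the traceback, usage might look like fout = retry_on_permission_error(lambda: codecs.open('tmp.tmp', 'w', encoding='utf-8')). Whether a few retries are enough on a given Windows machine is something I have not measured.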

bsalinas commented 8 years ago

Are you getting an error that looks something like

$ python3 batch_idml_editor.py 
Saving to output1468333130/...
Traceback (most recent call last):
  File "batch_idml_editor.py", line 53, in <module>
    filename = filename.replace(rep['search'], row[header.index(rep['replace_key'])])
ValueError: 'First Name' is not in list

Using your txt file I was able to get that error.

By changing

with codecs.open(csv_file, 'r', encoding='utf-8') as csvfile:

to

with codecs.open(csv_file, 'r', encoding='utf-8-sig') as csvfile:

I was able to get the problem to go away. Can you test that change in the script to see if it also goes away on your Windows computer (I'm running Mac OSX)?
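For anyone who wants to see the difference in isolation, here is a small sketch. The helper read_header is hypothetical and any file path passed to it is up to the caller; only the codecs.open call mirrors the script:

```python
import codecs
import csv


def read_header(path, encoding):
    """Return the first CSV row of `path`, decoded with `encoding`."""
    with codecs.open(path, 'r', encoding=encoding) as csvfile:
        return next(csv.reader(csvfile))
```

On a file saved with a BOM, read_header(path, 'utf-8') gives a first column name that starts with '\ufeff', which is why header.index('First Name') raises the ValueError above. With 'utf-8-sig' the BOM is dropped, and a file without a BOM reads the same either way.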

pea-arthur commented 8 years ago

Oh, of course this has been solved already! This works great on both BOM'd and non-BOM'd files. I looked at the source for codecs.py to see how utf-8-sig works and verify that it does the same thing I implemented, but it's not documented there.

I see 4 solutions:

1) Leave your code as is, and add a line to the README.md stating: "If you are a Windows user encountering ValueErrors, try using 'utf-8-sig' instead of 'utf-8' encoding."

2) Set 'utf-8-sig' as the default encoding type.

3) Add a try: / except ValueError: block to the code which either a) alerts the user with print("Please check that {} is encoded using 'utf-8' or try using 'utf-8-sig'".format(csv_file)), or b) reruns the code with 'utf-8' replaced by 'utf-8-sig'.

4) Write or find an existing module which intelligently detects what type of encoding is present in csv_file and passes it on as a variable.
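A very rough sketch of what option 4 could look like if it only looked at BOMs. The name guess_utf_encoding is hypothetical, and falling back to plain 'utf-8' when no BOM is found is an assumption; it does not detect legacy code pages:

```python
import codecs


def guess_utf_encoding(path):
    """Pick a codec name by looking for a BOM in the first bytes of the file."""
    with open(path, 'rb') as f:
        head = f.read(4)
    if head.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig'
    # Check UTF-32 before UTF-16: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return 'utf-32'
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return 'utf-16'
    return 'utf-8'  # assumption: no BOM means plain UTF-8
```

The result could then be passed on as a variable, for example codecs.open(csv_file, 'r', encoding=guess_utf_encoding(csv_file)).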

I apologize, I should be using Git properly to make these commits myself for your review. Would that be okay?

bsalinas commented 8 years ago

Hi @pea-arthur I think that we should go with option 2 (set utf-8-sig as the default). It doesn't seem to cause a problem with the original example, so it seems safe enough.

I am happy to make the change, or if you want to get your hands dirty with git, I'm happy to have you do a pull request (and walk you through the process if you need help).

And then we should try to figure out the tmp files in Windows.

pea-arthur commented 8 years ago

Hi @bsalinas Option #2 is good. I forked the code yesterday and submitted a pull request, but my code is malformed as you can see, so no hard feelings if you don't accept it - this is my first one ever. As I mentioned in the pull request, I don't think we should try to solve every Windows file system issue.

Thank you again for starting this project. I understand .IDML much better now. Once my project is done, I will write a blog post and link back here. I would like to do some more advanced insertions and dynamic content creation so my next step is to get https://github.com/Starou/SimpleIDML working.

bsalinas commented 8 years ago

Looks good @pea-arthur ! Thanks for contributing! If you do end up creating a blog post that shows how to use it, it would be great to link to it from the README.

(Sorry for taking a bit of time to get back to you, I was moving last week)

goinvo / BatchIDMLGenerator

Check to strip extraneous .csv information? #2