gsethi / addama

Automatically exported from code.google.com/p/addama
Apache License 2.0

fs-workspaces-svc: Uploading a TSV file generated in Windows Excel may have formatting issues #22

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Upload the attached file through the fs-workspaces-svc or jcr-workspaces-svc
2. Try to process the file in a script expecting normal TSV output
3. File appears empty

What is the expected output? What do you see instead?
The file appears empty once uploaded into the workspace. An ISB research
programmer has a Perl procedure that removes Windows characters from the file
before upload. He reports that files are processed normally if they are
converted before upload.

Original issue reported on code.google.com by hrov...@gmail.com on 14 Nov 2010 at 6:33

GoogleCodeExporter commented 8 years ago

Original comment by hrov...@gmail.com on 14 Nov 2010 at 6:35

Attachments:

GoogleCodeExporter commented 8 years ago
Interestingly I had issues uploading the file into this issue.  I had to zip it 
first.  May not be related.

Original comment by hrov...@gmail.com on 14 Nov 2010 at 6:36

GoogleCodeExporter commented 8 years ago
The file is notable for having old-style Mac line endings: each line other than 
the last is terminated with a CR (character 0x0D), rather than the Windows/IBM 
CR/LF pair (0x0D 0x0A) or Linux-style LF (0x0A).
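
A quick way to check which convention a file uses is to count the terminator
bytes directly. This is only a sketch: the helper name count_line_endings and
the sample bytes are made up for illustration and are not taken from the
attached file.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF pairs, bare CRs, and bare LFs in a byte string."""
    crlf = data.count(b"\r\n")
    # Every CRLF pair contains one CR and one LF, so subtract the pairs
    # to get the terminators that stand alone.
    cr = data.count(b"\r") - crlf
    lf = data.count(b"\n") - crlf
    return {"CRLF": crlf, "CR": cr, "LF": lf}


# Hypothetical three-line TSV with old-style Mac (CR-only) line endings.
sample = b"id\tvalue\r1\t10\r2\t20"
counts = count_line_endings(sample)
```

A file like the one described here would show a nonzero CR count and zero for
the other two.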

Importantly, LF is the character that C++ and Java refer to as \n. Code that 
splits lines on \n alone never sees a line terminated only with a CR, the \r 
character, as terminated. Python translates \n to \r\n when writing text files 
on Windows, and when reading it translates \r and \r\n to \n if the file is 
opened with the U ("universal") flag set, so loading the file in "rU" mode 
(read universal) makes Python treat the \r line endings as \n.
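
As a rough illustration of that translation, here is a sketch in Python 3,
where the newline argument of the text layer takes over the role of the U
flag (newline=None means universal newlines). The sample data is invented for
the example.

```python
import io

# Hypothetical CR-only TSV content, standing in for the uploaded file.
raw = b"id\tvalue\r1\t10\r2\t20"

# newline=None: universal newlines; \r, \n and \r\n all end a line and are
# translated to \n in the returned strings.
universal = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", newline=None)
translated_lines = universal.readlines()

# newline="\n": only \n ends a line, so the CR-only data comes back as a
# single long line with the \r characters still embedded.
lf_only = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", newline="\n")
untranslated_lines = lf_only.readlines()
```

With universal newlines the reader sees three lines; with LF-only splitting it
sees one.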

Java's BufferedReader.readLine() treats \r, \n, and \r\n as line terminators, 
as do similar functions in the Java standard library. A Scanner does the same 
for nextLine() and for its default whitespace delimiter, but if a custom 
delimiter is set, \r must be included explicitly or it won't be treated as one. 
I don't know off the top of my head how Java handles line endings otherwise.

Python 2 only performs line ending conversion on read if universal (U) is part 
of the file mode, so watch out. Python 3 text mode does it by default.

C++, naturally, provides no support for foreign line endings, so it is very 
easy to shoot yourself in the face in such circumstances.

What scripts are trying to process this so-called blank file? A file with no 
recognized line breaks may behave much like a blank file, since something 
trying to interpret it as a TSV will see a single line: a header row and no 
data rows.
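
To make that concrete, here is a small sketch of a naive TSV reader that only
splits on LF. The function parse_tsv_lf_only and the sample text are
hypothetical; the actual scripts involved are not shown in this issue.

```python
def parse_tsv_lf_only(text):
    """Parse TSV text, treating only LF as a line terminator."""
    lines = text.split("\n")
    header = lines[0].split("\t")
    # Every non-empty line after the first is a data row.
    rows = [line.split("\t") for line in lines[1:] if line]
    return header, rows


# CR-only input: no LF anywhere, so the whole file is one "line".
cr_only = "id\tvalue\r1\t10\r2\t20"
header, rows = parse_tsv_lf_only(cr_only)

# The same data with LF endings parses as expected.
lf_header, lf_rows = parse_tsv_lf_only("id\tvalue\n1\t10\n2\t20")
```

For the CR-only input the parser returns a garbled header with the CRs
embedded in its fields and an empty list of rows, which a downstream script
could easily report as an empty file.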

Original comment by anorberg...@gtempaccount.com on 30 Nov 2010 at 6:10

GoogleCodeExporter commented 8 years ago
Yep, bug verified. Perl scripts expect LF characters to end a line, and Excel 
for Mac uses CR.

Suggested fix: convert CR to LF when writing text files to the workspace. This 
can be done fairly trivially with FilterInputStream.
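
The normalization itself is small. Below is a sketch of the idea in Python
rather than as the Java FilterInputStream subclass suggested above; the
function name normalize_newlines and the chunk size are assumptions for
illustration, not part of the service code. It also folds CR/LF pairs into a
single LF so Windows files do not end up with doubled line breaks, including
when a pair is split across two reads.

```python
import io


def normalize_newlines(stream, chunk_size=8192):
    """Yield chunks from a binary stream with CRLF and bare CR rewritten as LF."""
    pending_cr = False  # True when the previous chunk ended with a CR
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        if pending_cr and chunk.startswith(b"\n"):
            # The trailing CR was already emitted as LF; drop the LF half
            # of a CRLF pair that was split across two reads.
            chunk = chunk[1:]
        pending_cr = chunk.endswith(b"\r")
        if chunk:
            yield chunk.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


# Mixed endings, read in tiny chunks to exercise the split-pair case.
source = io.BytesIO(b"a\rb\r\nc\nd")
result = b"".join(normalize_newlines(source, chunk_size=3))
```

A Java version would carry the same one-byte "previous byte was CR" state
inside the overridden read methods.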

Original comment by anorberg...@gtempaccount.com on 1 Dec 2010 at 1:17