uniVocity / univocity-parsers

uniVocity-parsers is a suite of extremely fast and reliable parsers for Java. It provides a consistent interface for handling different file formats, and a solid framework for the development of new parsers.

CSVParser appends whitespace at the beginning of each column #398

Closed HMazharHameed closed 4 years ago

HMazharHameed commented 4 years ago

I am new to this parser and I have a concern about CSV reading. When I read a CSV file, the parser appears to add whitespace at the beginning of each column, which I don't want. Is this the default parsing behaviour that can't be changed, or is there a method to handle this case?

Sample output:

3, "Gunnar Nielsen Aaby", 24 34 5656, NA, NA, Denmark, DEN, 1920 Summer, 1920, Summer, Antwerpen, Football, Football Men's Football, NA

jbax commented 4 years ago

Without any code that reproduces the error it's impossible to help you. Can you please provide a unit test?

HMazharHameed commented 4 years ago

Apologies for not providing the code. I tried several variations of the settings, but the result was the same, so I thought this might simply be how the parser works.

Example Code:

    CsvParserSettings settings = new CsvParserSettings();
    settings.detectFormatAutomatically();
    settings.setIgnoreLeadingWhitespaces(false);
    settings.setIgnoreTrailingWhitespaces(false);
    settings.setKeepQuotes(true);
    settings.setQuoteDetectionEnabled(false);
    settings.setSkipEmptyLines(false);

    CsvParser parser = new CsvParser(settings);
    List<Record> allRecords = parser.parseAllRecords(new File("C:/Desktop/test.csv"));

    for (Record record : allRecords) {
        recordList.add(record.toString());
    }

jbax commented 4 years ago

Remove these two lines if you don't want whitespace around each value (your input has a whitespace before each one):

//   settings.setIgnoreLeadingWhitespaces(false);
//   settings.setIgnoreTrailingWhitespaces(false);

As you provided me with one input row to parse, I had to comment out this line as well:

settings.detectFormatAutomatically();

Detection on small inputs is not reliable, and here the detected delimiter was a whitespace instead of a comma. If this affects you, pass the candidate delimiters explicitly:

settings.detectFormatAutomatically(',', ';'); // add any other candidate delimiter characters to this list
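To illustrate why a one-row sample gives a detector very little to go on, here is a rough sketch that simply counts candidate delimiter characters in the sample row posted above. It is a naive frequency count, not the algorithm univocity-parsers uses, and the `DelimiterGuess` class and its method names are made up for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of univocity-parsers: a naive frequency count
// used only to show why a single row is a weak basis for delimiter detection.
class DelimiterGuess {

    // Counts how often each candidate character occurs in the sample text.
    static Map<Character, Integer> count(String sample, char... candidates) {
        Map<Character, Integer> counts = new LinkedHashMap<>();
        for (char candidate : candidates) {
            int n = 0;
            for (int i = 0; i < sample.length(); i++) {
                if (sample.charAt(i) == candidate) {
                    n++;
                }
            }
            counts.put(candidate, n);
        }
        return counts;
    }

    // Returns the candidate that occurs most often (the first one wins on ties).
    // Assumes at least one candidate is supplied.
    static char mostFrequent(String sample, char... candidates) {
        char best = candidates[0];
        int bestCount = -1;
        for (Map.Entry<Character, Integer> e : count(sample, candidates).entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // The single sample row from the first comment of this thread.
        String row = "3, \"Gunnar Nielsen Aaby\", 24 34 5656, NA, NA, Denmark, DEN, "
                + "1920 Summer, 1920, Summer, Antwerpen, Football, Football Men's Football, NA";
        System.out.println(count(row, ',', ';', ' '));
        // With whitespace among the candidates it can outnumber the commas.
        System.out.println("[" + mostFrequent(row, ',', ';', ' ') + "]");
        // Restricting the candidates takes whitespace out of the running.
        System.out.println("[" + mostFrequent(row, ',', ';') + "]");
    }
}
```

Restricting the candidate list, as suggested above, is the same idea applied to the real parser: whitespace can no longer be picked if it is not among the characters considered.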

Hope this helps.

HMazharHameed commented 4 years ago

I have removed these 2 lines and tried all the other possibilities, but the result is the same, which is why I have posted this issue here. And the dataset contains 212000 rows and every row has the same problem.

HMazharHameed commented 4 years ago

The output below was copied from the univocity website and shows the same problem I am talking about; you can confirm it there.

https://www.univocity.com/pages/univocity_parsers_csv.html#further-reading

Data in /examples/example.csv:
Printing 6 rows
Row 1 (length 5): [Year, Make, Model, Description, Price]
Row 2 (length 5): [1997, Ford, E350, ac, abs, moon, 3000.00]
Row 3 (length 5): [1999, Chevy, Venture "Extended Edition", null, 4900.00]
Row 4 (length 5): [1996, Jeep, Grand Cherokee, MUST SELL! air, moon roof, loaded, 4799.00]
Row 5 (length 5): [1999, Chevy, Venture "Extended Edition, Very Large", null, 5000.00]
Row 6 (length 5): [null, null, Venture "Extended Edition", null, 4900.00]

Data in /examples/european.csv:
Printing 6 rows
Row 1 (length 5): [Year, Make, Model, Description, Price]
Row 2 (length 5): [1997, Ford, E350, ac; abs; moon, 3000,00]
Row 3 (length 5): [1999, Chevy, Venture "Extended Edition", null, 4900,00]
Row 4 (length 5): [1996, Jeep, Grand Cherokee, MUST SELL! air; moon roof; loaded, 4799,00]
Row 5 (length 5): [1999, Chevy, Venture "Extended Edition; Very Large", null, 5000,00]
Row 6 (length 5): [null, null, Venture "Extended Edition", null, 4900,00]

jbax commented 4 years ago

This works for me just fine. Do you have other invisible characters on your file perhaps? Try providing the file encoding explicitly here, as in the example below (assuming UTF-8):

parser.parseAllRecords(new File("C:/Desktop/test.csv"), "UTF-8");

Also, calling record.toString() will add a whitespace in front of each value. Maybe that's what's confusing you (see RecordImpl.java line 618).
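One way to check for invisible characters is to dump the code points of a suspicious value. The sketch below uses only the Java standard library; `CodePoints` and `describe` are hypothetical names chosen for this example, and the strings in `main` are made-up stand-ins for values you would take from your own records.

```java
// Hypothetical debugging helper, not part of univocity-parsers.
class CodePoints {

    // Renders every code point of a value as U+XXXX so that invisible characters
    // (spaces, tabs, a byte order mark, non-breaking spaces) become visible.
    static String describe(String value) {
        StringBuilder out = new StringBuilder();
        value.codePoints().forEach(cp -> {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", cp));
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe("NA"));        // a clean value
        System.out.println(describe(" NA"));       // a value with a real leading space
        System.out.println(describe("\uFEFFID"));  // a value preceded by a byte order mark
    }
}
```

If a value really starts with whitespace, the dump begins with `U+0020` (or another whitespace code point); if it begins with the first visible character, the leading space seen on screen came from the formatting, not from the value.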

HMazharHameed commented 4 years ago

I have also tried this but still, the result is the same.

The thing that is confusing me is the results on your website. As I mentioned before you can see the results are also showing the whitespace at the beginning of every column.

https://www.univocity.com/pages/univocity_parsers_csv.html#further-reading

jbax commented 4 years ago

You are looking at the output of a toString() call. That's formatted for readability and the individual values returned by the parser DO NOT HAVE whitespaces in front of them.

If the error looks way too absurd to be true, it probably is.

There is absolutely nothing wrong with the parser.
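The difference between a formatted string and the underlying values can be reproduced without the parser at all. The following is a minimal sketch that uses `java.util.Arrays.toString` as a stand-in for a readable `toString()`; it does not reproduce the library's `RecordImpl` code, and the `ToStringSpacing` class is a hypothetical name used only here.

```java
import java.util.Arrays;

// Stand-in demo based on java.util.Arrays.toString. It is NOT the library's
// RecordImpl.toString(); it only joins values with ", " in a similar spirit.
class ToStringSpacing {

    // Formats values for display: the ", " separator contributes the spaces.
    static String display(String[] values) {
        return Arrays.toString(values);
    }

    // True if any individual value itself starts with a whitespace character.
    static boolean anyLeadingWhitespace(String[] values) {
        for (String value : values) {
            if (value != null && !value.isEmpty() && Character.isWhitespace(value.charAt(0))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] values = {"3", "Gunnar Nielsen Aaby", "M", "24", "NA"};
        System.out.println(display(values));              // a space follows each comma
        System.out.println(anyLeadingWhitespace(values)); // the values themselves have none
    }
}
```

The spaces in the printed line belong to the separator, so checking the individual values is the only reliable way to tell whether the data itself contains leading whitespace.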

On Tue, 16 Jun 2020 at 17:22, HMazharHameed notifications@github.com wrote:

I have also tried this but still, the result is the same.

The thing that is confusing me is the results on your website. As I mentioned before you can see the results are also showing the whitespace at the beginning of every column.

https://www.univocity.com/pages/univocity_parsers_csv.html#further-reading

HMazharHameed commented 4 years ago

But now I am not using toString(). I am storing the records in a List and accessing them character by character, and it still shows the same problem.

jbax commented 4 years ago

Can you please provide the full code for that, including whatever input you are parsing? You can copy a section of your file into a string and give it to the parser in a StringReader.

HMazharHameed commented 4 years ago

I tried but still the same.

Let me show you the input excerpt and the basic code.

    CsvParserSettings settings = new CsvParserSettings();
    settings.detectFormatAutomatically();
    settings.setIgnoreLeadingWhitespaces(false);
    settings.setIgnoreTrailingWhitespaces(false);
    settings.setKeepQuotes(true);
    settings.setQuoteDetectionEnabled(false);
    settings.setSkipEmptyLines(false);

    CsvParser parser = new CsvParser(settings);
    List<Record> allRecords = parser.parseAllRecords(new File("C:/Desktop/test.csv"));
    for (Record record : allRecords) {
        System.out.println(record);
    }

P.S.: As I said before, I have tried every possibility with the record; here I use only a simple print to show that the results contain whitespace.

Input:

"ID","Name","Sex","Age","Height","Weight","Team","NOC","Games","Year","Season","City","Sport","Event","Medal"
"1","A, Dijiang","M",24,180,80,"China","CHN","1992 Summer",1992,"Summer","Barcelona","Basketball","Basketball Men's Basketball",NA
"2","A Lamusi","M",23,170,60,"China","CHN","2012 Summer",2012,"Summer","London","Judo","Judo Men's Extra-Lightweight",NA
"3","Gunnar Nielsen Aaby","M",24,NA,NA,"Denmark","DEN","1920 Summer",1920,"Summer","Antwerpen","Football","Football Men's Football",NA
"4","Edgar Lindenau Aabye","M",34,NA,NA,"Denmark/Sweden","DEN","1900 Summer",1900,"Summer","Paris","Tug-Of-War","Tug-Of-War Men's Tug-Of-War","Gold"
"5","Christine Jacoba Aaftink","F",21,185,82,"Netherlands","NED","1988 Winter",1988,"Winter","Calgary","Speed Skating","Speed Skating Women's 500 metres",NA
"5","Christine Jacoba Aaftink","F",21,185,82,"Netherlands","NED","1988 Winter",1988,"Winter","Calgary","Speed Skating","Speed Skating Women's 1,000 metres",NA
"5","Christine Jacoba Aaftink","F",25,185,82,"Netherlands","NED","1992 Winter",1992,"Winter","Albertville","Speed Skating","Speed Skating Women's 500 metres",NA
"5","Christine Jacoba Aaftink","F",25,185,82,"Netherlands","NED","1992 Winter",1992,"Winter","Albertville","Speed Skating","Speed Skating Women's 1,000 metres",NA

Output:

"ID", "Name", "Sex", "Age", "Height", "Weight", "Team", "NOC", "Games", "Year", "Season", "City", "Sport", "Event", "Medal"
"1", "A, Dijiang", "M", 24, 180, 80, "China", "CHN", "1992 Summer", 1992, "Summer", "Barcelona", "Basketball", "Basketball Men's Basketball", NA
"2", "A Lamusi", "M", 23, 170, 60, "China", "CHN", "2012 Summer", 2012, "Summer", "London", "Judo", "Judo Men's Extra-Lightweight", NA
"3", "Gunnar Nielsen Aaby", "M", 24, NA, NA, "Denmark", "DEN", "1920 Summer", 1920, "Summer", "Antwerpen", "Football", "Football Men's Football", NA
"4", "Edgar Lindenau Aabye", "M", 34, NA, NA, "Denmark/Sweden", "DEN", "1900 Summer", 1900, "Summer", "Paris", "Tug-Of-War", "Tug-Of-War Men's Tug-Of-War", "Gold"
"5", "Christine Jacoba Aaftink", "F", 21, 185, 82, "Netherlands", "NED", "1988 Winter", 1988, "Winter", "Calgary", "Speed Skating", "Speed Skating Women's 500 metres", NA
"5", "Christine Jacoba Aaftink", "F", 21, 185, 82, "Netherlands", "NED", "1988 Winter", 1988, "Winter", "Calgary", "Speed Skating", "Speed Skating Women's 1,000 metres", NA
"5", "Christine Jacoba Aaftink", "F", 25, 185, 82, "Netherlands", "NED", "1992 Winter", 1992, "Winter", "Albertville", "Speed Skating", "Speed Skating Women's 500 metres", NA
"5", "Christine Jacoba Aaftink", "F", 25, 185, 82, "Netherlands", "NED", "1992 Winter", 1992, "Winter", "Albertville", "Speed Skating", "Speed Skating Women's 1,000 metres", NA
"5", "Christine Jacoba Aaftink", "F", 27, 185, 82, "Netherlands", "NED", "1994 Winter", 1994, "Winter", "Lillehammer", "Speed Skating", "Speed Skating Women's 500 metres", NA

jbax commented 4 years ago

As I said, you are using the result of record.toString, which is meant to be like that. The values in the record do not contain any whitespace. Try printing out each value individually with:

    for (Record record : allRecords) {
        System.out.println("=========== NEXT ============");
        String[] values = record.getValues();
        for (String value : values) {
            System.out.println("[" + value + "]");
        }
    }

HMazharHameed commented 4 years ago

I guess I'm not able to explain my problem, and it's getting worse. Anyway, thank you so much for your time; let me try another resource.