stephenberry / glaze

Extremely fast, in memory, JSON and interface library for modern C++
MIT License

CSV parsing does not take into account non raw_strings (it fails if ',', '[' or ']' are inside quotes) #1445

Open sjanel opened 1 day ago

sjanel commented 1 day ago

Hi!

I wanted to use glaze to parse the currencies from this CSV with this structure:

struct CurrencyCSV {
  vector<string> Entity;
  vector<string> Currency;
  vector<string> AlphabeticCode;
  vector<string> NumericCode;
  vector<string> MinorUnit;
  vector<string> WithdrawalDate;
};

However, the parser still treats ',' (and '[' and ']') as delimiters when they appear inside a quoted value. For instance, these lines fail:

"MOLDOVA, REPUBLIC OF",Russian Ruble,RUR,810,,1993-12
"FALKLAND ISLANDS (THE) [MALVINAS]",Falkland Islands Pound,FKP,238,2,

I think that this behavior could be expected if .raw_string is true, but it should work if we set .raw_string to false.
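To make the expected behavior concrete, here is a minimal sketch of quote-aware field splitting. It is not Glaze's implementation; `split_csv_line` is a hypothetical helper, and it assumes RFC 4180 style quoting, where a doubled quote inside a quoted field stands for one literal quote.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical helper, not part of Glaze: splits one CSV record into fields.
// Inside a quoted field, ',' '[' and ']' are ordinary characters, and a
// doubled quote ("") stands for one literal quote (assumed RFC 4180 style).
std::vector<std::string> split_csv_line(std::string_view line) {
  std::vector<std::string> fields;
  std::string current;
  bool in_quotes = false;
  for (std::size_t i = 0; i < line.size(); ++i) {
    const char c = line[i];
    if (in_quotes) {
      if (c == '"') {
        if (i + 1 < line.size() && line[i + 1] == '"') {
          current += '"';  // escaped quote: keep one, skip the other
          ++i;
        } else {
          in_quotes = false;  // closing quote
        }
      } else {
        current += c;  // delimiters are plain text while quoted
      }
    } else if (c == '"') {
      in_quotes = true;  // opening quote
    } else if (c == ',') {
      fields.push_back(std::move(current));  // field boundary
      current.clear();
    } else {
      current += c;
    }
  }
  fields.push_back(std::move(current));  // last field (may be empty)
  return fields;
}
```

With this approach, both failing lines above split into six fields, and the first field keeps its comma or brackets.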

What do you think?

If I have the time, I will try to make a PR.

stephenberry commented 21 hours ago

Actively working on this issue here: #1446

stephenberry commented 21 hours ago

I think the string parsing has been fixed. It also supports escaped quotes per the CSV specification, which were needed to parse this document. However, there is a bug where it parses the WithdrawalDate into the MinorUnit. I have to work on some other stuff at the moment. If you want to try to fix this from the csv_currency branch, that would be great! Otherwise, I'll get back to this when I'm able.

sjanel commented 16 hours ago

I think I found the bug and fixed it in this commit. Feel free to take it into your branch. The cause is that we skip the trailing ',' both in the from<CSV for string_t and in the main loop at line 616, whereas the parsers for other types do not skip the comma. I removed the skip from the string_t parser, and it seems to work.
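For readers following along, here is a toy model of how a double comma skip could shift WithdrawalDate into MinorUnit. This is not Glaze's code: `read_field`, `read_row`, and the `reader_skips_comma` flag are hypothetical names, and the real parser may differ in detail.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Toy model, not Glaze's code. The field reader stops at ',' and, when
// `reader_skips_comma` is true, also consumes it (the suspected bug).
std::string read_field(std::string_view line, std::size_t& pos, bool reader_skips_comma) {
  const std::size_t start = pos;
  while (pos < line.size() && line[pos] != ',') ++pos;
  std::string field(line.substr(start, pos - start));
  if (reader_skips_comma && pos < line.size()) ++pos;  // the extra skip
  return field;
}

// The row loop consumes one ',' between fields on its own. If the reader
// has already consumed a comma, the loop swallows the next one as well,
// so an empty field (two adjacent commas) disappears.
std::vector<std::string> read_row(std::string_view line, bool reader_skips_comma) {
  std::vector<std::string> fields;
  std::size_t pos = 0;
  while (true) {
    fields.push_back(read_field(line, pos, reader_skips_comma));
    if (pos >= line.size()) break;
    if (line[pos] == ',') ++pos;  // the row loop's own delimiter skip
  }
  return fields;
}
```

In this model, reading `RUR,810,,1993-12` with the flag set yields three fields, with the date sitting in the slot that should hold the empty MinorUnit. With the flag cleared, it yields four fields and the empty one is preserved.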

However, I did not see any test for rowwise parsing, and I cannot make my commented-out test pass (see csv_test.cpp:606-607). Am I calling it incorrectly?

stephenberry commented 15 hours ago

Thanks, I got your fix on the branch so that it parses column wise correctly. I'll look at your commented test now.

stephenberry commented 15 hours ago

I updated the rowwise test on the csv_currency branch. The first issue was that it was still reading in a column wise CSV file. Now it writes out the data in rowwise format and then tries to read that back in. The outstanding issue is that our string writing for CSV does not escape quotes.
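As an illustration of what escaped writing could look like, here is a minimal sketch. It is not Glaze's writer; `escape_csv_field` is a hypothetical helper that assumes RFC 4180 style escaping. A Glaze-specific writer might need to quote additional characters, such as '[' and ']'.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper, not Glaze's writer: wraps a field in quotes only when
// it contains a comma, a quote, or a line break, and doubles embedded quotes.
std::string escape_csv_field(std::string_view field) {
  const bool needs_quotes = field.find_first_of(",\"\r\n") != std::string_view::npos;
  if (!needs_quotes) return std::string(field);

  std::string out;
  out.reserve(field.size() + 2);
  out += '"';  // opening quote
  for (const char c : field) {
    if (c == '"') out += '"';  // double each embedded quote
    out += c;
  }
  out += '"';  // closing quote
  return out;
}
```

A field written this way can be read back by a quote-aware parser without the embedded comma or quote being mistaken for a delimiter.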

stephenberry commented 15 hours ago

I'm going to merge the current fixes for column wise support, but I'm going to keep this issue open until rowwise support and escaped writing have been added. I just want to get these changes merged as a first step.

stephenberry commented 15 hours ago

I could add you as a contributor to Glaze if you'd like to make branches directly in Glaze for these CSV fixes.