rgamble / libcsv

Fast and flexible CSV library written in pure ANSI C that can read and write CSV data.
GNU Lesser General Public License v2.1

Suggestion: Add a "CSV_ABORT" option flag which would cause csv_parse() to stop parsing (set this in the callback) #27

Open bobhairgrove opened 2 years ago

bobhairgrove commented 2 years ago

It looks like there isn't any way to stop csv_parse() from running all the way to the end of the data. I am doing input validation in my "notify field" callback function.

If certain errors occur (not errors that would cause csv_parse() to stop anyway, but data validation errors such as a failed regular-expression match), it would be nice to be able to set some kind of "abort" flag in the csv_parser->options struct member. The main parsing loop would check this flag and, if it is set, return from the function (after cleaning up memory allocations, etc.). Since I still have access to the parser struct during processing, I could simply set the additional (to be determined) "CSV_ABORT" flag in the options. Anything else done to the parser struct would probably not work, or would end up being very messy, I think.
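The idea can be sketched with a toy loop instead of libcsv itself. Everything named `toy_*`, along with `TOY_ABORT` and `stop_on_bang`, is hypothetical and exists only for illustration; the real change would have to live inside csv_parse(). The sketch assumes the callback receives the parser through its user-data pointer.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_ABORT 32  /* hypothetical option bit, mirroring the proposed CSV_ABORT */

struct toy_parser {
    unsigned options;  /* option bits; a callback may set TOY_ABORT here */
    size_t seen;       /* number of bytes handed to the callback so far */
};

typedef void (*toy_cb)(unsigned char c, void *data);

/* Feed bytes to cb one at a time and return how many bytes were consumed.
 * The abort bit is tested at the top of every iteration, so a callback
 * that sets it stops the loop before the next byte is processed. */
size_t toy_parse(struct toy_parser *p, const void *s, size_t len,
                 toy_cb cb, void *data)
{
    const unsigned char *us = s;
    size_t pos = 0;

    assert(p != NULL && cb != NULL);
    while (pos < len) {
        if (p->options & TOY_ABORT)
            return pos;  /* less than len: the caller can see the early stop */
        cb(us[pos++], data);
        p->seen++;
    }
    return pos;
}

/* Example validation callback: treat '!' as invalid input and request
 * an abort. The parser arrives through the user-data pointer. */
void stop_on_bang(unsigned char c, void *data)
{
    struct toy_parser *p = data;

    if (c == '!')
        p->options |= TOY_ABORT;
}
```

With the five-byte input "ab!cd", the callback sets the bit while handling '!', so toy_parse() returns 3 rather than 5. A return value smaller than the input length is how the caller learns that parsing stopped early.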

This would be very useful if the calling code throws a C++ exception, for example -- throwing an exception would not prevent csv_parse() from doing its thing until it runs out of data.

Another use case would be parsing the field headers on the first line. Since the headers are only meaningful to the application using libcsv, and can be missing, any header field could theoretically contain embedded newline characters (although they probably shouldn't). libcsv can parse these, and once the first real end-of-line is reached, one might want to stop parsing. By contrast, using fgets() or similar to look for a newline is bound to fail if any of the headers contain embedded newlines.
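To illustrate why a plain newline search is not enough, here is a minimal sketch of finding the first "real" end-of-line. `first_record_len` is a hypothetical helper, not part of libcsv. It assumes `"` is the quote character and that embedded quotes are doubled, and it ignores the more lenient quoting that libcsv accepts in non-strict mode.

```c
#include <assert.h>
#include <stddef.h>

/* Return the number of bytes in the first CSV record of s, including its
 * line terminator, or len if no unquoted terminator is found.
 * Because embedded quotes are assumed to be doubled (""), toggling the
 * state on every quote character keeps it correct. */
size_t first_record_len(const char *s, size_t len)
{
    int in_quotes = 0;
    size_t i;

    assert(s != NULL || len == 0);
    for (i = 0; i < len; i++) {
        if (s[i] == '"') {
            in_quotes = !in_quotes;
        } else if (!in_quotes && (s[i] == '\n' || s[i] == '\r')) {
            /* Treat CRLF as a single terminator. */
            if (s[i] == '\r' && i + 1 < len && s[i + 1] == '\n')
                i++;
            return i + 1;
        }
    }
    return len;
}
```

For a header line whose second field is quoted and contains a line feed, the line feed inside the quotes is skipped and only the one after the closing quote ends the record. fgets() would have stopped at the embedded one.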

bobhairgrove commented 2 years ago

I suppose that running libcsv in a single-threaded process would imply that if an exception were thrown in the callback function, execution would exit the csv_parse() function and continue in the catch block, so one wouldn't have to worry about having the callback function called afterwards.

However, since not all applications use the C++ exception mechanism, it would be useful to have such an abort mechanism.

bobhairgrove commented 2 years ago

Another way of implementing this as a quick-and-dirty hack would be to (mis)use the malloc_func member of the csv_parser struct, which is not used anywhere else in the library. However, the code in the csv_parse() function would still have to be changed to check that.

An additional error code, perhaps CSV_EABORTED, would also need to be added. csv_parse() would store it in the status member of the parser struct when it finds the new option set.
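Adding a status code also means keeping any code-to-message table in step with it. The following is a rough sketch with hypothetical `toy_*` names, not libcsv's actual definitions. It assumes the new code takes the slot just before the "invalid" sentinel, so the sentinel moves up by one and the message table gains one entry in the matching position.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes. The sentinel stays last so that range
 * checks against it keep working after a new code is inserted. */
enum toy_status {
    TOY_SUCCESS = 0,
    TOY_EPARSE,
    TOY_ENOMEM,
    TOY_ETOOBIG,
    TOY_EABORTED,  /* new: parsing was stopped on request */
    TOY_EINVALID   /* sentinel: any unknown code maps here */
};

/* One message per code, in the same order as the enum. */
static const char *const toy_errors[] = {
    "success",
    "error parsing data while strict checking enabled",
    "memory exhausted while increasing buffer size",
    "data size too large",
    "parsing aborted",
    "invalid status code"
};

/* Map a status code to its message; out-of-range codes fall back to
 * the sentinel's message. */
const char *toy_strerror(int status)
{
    /* Sanity check: the table must have exactly one entry per code. */
    assert(sizeof toy_errors / sizeof toy_errors[0] == TOY_EINVALID + 1);

    if (status < TOY_SUCCESS || status > TOY_EINVALID)
        status = TOY_EINVALID;
    return toy_errors[status];
}
```

A caller could then tell a requested stop apart from a genuine parse failure by comparing the stored status against the new code, instead of treating every short return as an error.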

bobhairgrove commented 2 years ago

I went ahead and implemented this. Here's a patch for what I did:

diff -u ./csv.h ../PATCH_for_abort_flag/csv.h
--- ./csv.h 2021-08-20 16:36:46.000000000 +0200
+++ ../PATCH_for_abort_flag/csv.h   2022-08-16 11:40:15.827490000 +0200
@@ -31,22 +31,25 @@
 #define CSV_RELEASE 3

 /* Error Codes */
-#define CSV_SUCCESS 0
-#define CSV_EPARSE 1   /* Parse error in strict mode */
-#define CSV_ENOMEM 2   /* Out of memory while increasing buffer size */
-#define CSV_ETOOBIG 3  /* Buffer larger than SIZE_MAX needed */
-#define CSV_EINVALID 4 /* Invalid code,should never be received from csv_error*/
+#define CSV_SUCCESS  0
+#define CSV_EPARSE   1   /* Parse error in strict mode */
+#define CSV_ENOMEM   2   /* Out of memory while increasing buffer size */
+#define CSV_ETOOBIG  3   /* Buffer larger than SIZE_MAX needed */
+#define CSV_EABORTED 4   /* Parsing was aborted */
+#define CSV_EINVALID 5   /* Invalid code,should never be received from csv_error*/

 /* parser options */
-#define CSV_STRICT 1    /* enable strict mode */
-#define CSV_REPALL_NL 2 /* report all unquoted carriage returns and linefeeds */
-#define CSV_STRICT_FINI 4 /* causes csv_fini to return CSV_EPARSE if last
-                             field is quoted and doesn't containg ending 
-                             quote */
-#define CSV_APPEND_NULL 8 /* Ensure that all fields are null-terminated */
+#define CSV_STRICT         1 /* enable strict mode */
+#define CSV_REPALL_NL      2 /* report all unquoted carriage returns and linefeeds */
+#define CSV_STRICT_FINI    4 /* causes csv_fini to return CSV_EPARSE if last
+                                field is quoted and doesn't containg ending
+                                quote */
+#define CSV_APPEND_NULL    8 /* Ensure that all fields are null-terminated */
 #define CSV_EMPTY_IS_NULL 16 /* Pass null pointer to cb1 function when
                                 empty, unquoted fields are encountered */
+#define CSV_ABORT         32 /* Flag which is checked in the csv_parse() function
+                                with each iteration of the main loop. */

 /* Character values */
diff -u ./libcsv.c ../PATCH_for_abort_flag/libcsv.c
--- ./libcsv.c  2021-08-20 16:36:46.000000000 +0200
+++ ../PATCH_for_abort_flag/libcsv.c    2022-08-16 11:54:44.455013000 +0200
@@ -74,7 +74,8 @@
                              "error parsing data while strict checking enabled",
                              "memory exhausted while increasing buffer size",
                              "data size too large",
-                             "invalid status code"};
+                             "parsing aborted",
+                             "invalid status code" };

 int
 csv_error(const struct csv_parser *p)
@@ -164,6 +165,8 @@
 {
   if (p == NULL)
     return -1;
+  if (p->status == CSV_EABORTED)
+    return -1;

   /* Finalize parsing.  Needed, for example, when file does not end in a newline */
   int quoted = p->quoted;
@@ -283,7 +286,8 @@
 {
   if (p == NULL) return 0;
   if (p->realloc_func == NULL) return 0;
-  
+  if (p->status == CSV_EABORTED) return 0;
+
   /* Increase the size of the entry buffer.  Attempt to increase size by 
    * p->blk_size, if this is larger than SIZE_MAX try to increase current
    * buffer size to SIZE_MAX.  If allocation fails, try to allocate halve 
@@ -346,6 +350,13 @@
   }

   while (pos < len) {
+    /* Check the abort flag: */
+    if (p->options & CSV_ABORT) {
+      p->status = CSV_EABORTED;
+      p->quoted = quoted, p->pstate = pstate, p->spaces = spaces, p->entry_pos = entry_pos;
+      return pos;
+    }
+
     /* Check memory usage, increase buffer if necessary */
     if (entry_pos == ((p->options & CSV_APPEND_NULL) ? p->entry_size - 1 : p->entry_size) ) {
       if (csv_increase_buffer(p) != 0) {

bobhairgrove commented 2 years ago

Oops ... didn't mean to close this right now!