Determine if data has delimiters from cb1

chere005 commented 5 years ago

Hello,

Thanks for the work on libcsv. One feature I like in cb2 is it gives the character that was used as the new line to differentiate if it was \r or \n. When cb1 gets called, we get the data with the delimiters removed, which is great, but there does not seem to be any indication if there was a delimiter present or not. While according to the csv specification 1 and "1" are the same thing, it would be nice to be able to interpret the 1 as a number and the "1" as a string.

bobhairgrove commented 5 years ago

I think it would be difficult to expose this feature ... whether or not a field is quoted is determined by code located deep inside the csv_parse() function. And I doubt that it would prove to be very useful, mainly because the software generating the CSV files can usually be configured to quote everything or else just the fields containing other quote characters (which need to be escaped, of course).

In one of my applications, for example, I received several different CSV files from different sources that had to be merged into one database. All had more or less the same format, but some exporters quoted the data and others didn't. Simply relying on whether or not there was a quote character wouldn't have worked because I had no control over the various programs generating the CSV files.

What I did was to set up a configuration file in XML format containing regular expressions for each field. Most of the time, one will be importing CSV files into a database or spreadsheet and expect each column to receive the associated data type. In your callback function cb1 you can check the data against the regular expression you define for the current column. This is also much more powerful than merely checking for quotes, and it ensures consistency of the import rather than letting the generator of the CSV file do that. With regular expressions, you can also allow or disallow various different date formats, field lengths, and other patterns.

Character sets can also prove to be a problem if some files are encoded as UTF-8 and others in some 8-bit ISO character set. If you have determined, for example, that the current import file is encoded in a different encoding than required for the import, you can convert the data in your callback function before inserting it into the database.

But maybe Robert Gamble can give you a better answer?

chere005 commented 5 years ago

@bobhairgrove thanks for the input.

I was thinking about just passing the csv_parser struct through the data pointer and just checking the 'quoted' state as a workaround. I just think this is something that is useful for some users who don't necessarily use CSV generated from excel etc.. Some users really care if the value in the file is 5 or "5" even if the RFC spec does not. Data scientists with huge datasets definitely aren't using every day software. The difference is basically whether we store it as an int or string, and this feature can be turned on/off. It is absolutely and undeniably useful. So the "different sources do different things" doesn't mean there isn't an application for this.

The XML for types configuration is interesting, hopefully you're not running a regex every cb1, that would be awfully slow. If only during first row to get types, yes this could be useful, but a rather specific application that again doesn't apply to what I am doing.

My software elegantly handles character encodings already, this is a non-issue.

In short, I know what I want, a version of cb1 that has a flag for if the field was quoted so I don't have to do a hack and pass the csv_parser struct around. (I'm not even sure if this hack is safe but from reading through the code I think it is, and plan to test it in the next few weeks).

bobhairgrove commented 5 years ago

@chere005 The idea about examining the csv_parser struct seems like a good one to me. But I'm pretty sure that it won't work as it is now and would involve a change to the sources because the csv_parse() function doesn't set the corresponding member quoted of that struct until just before the function returns (when it is much too late for this to do anyone any good.)

So I'll leave this discussion now and let the author of libcsv take over from here.

rgamble commented 5 years ago

libcsv does not expose information about state encountered during the parsing process that isn't relevant to the final result. This includes whether the field was quoted, what unquoted space characters appeared before the field, etc. While I'm not opposed to exposing this information, the question is best way to do so. The ideas that spring to mind are:

Create a new csv_parse2 function that accepts a cb1 function that contains an extra field that includes this information (perhaps as a pointer to a struct that provides various extra information).
Add a new option flag to indicate that the provided cb1 function should actually accept a fourth argument that contains this information.
Add a function to register a new callback that is invoked immediately after cb1 with the extra field information (which could also include the same information that cb1 is called with).

Each of these solutions has drawbacks, I'm open to other ideas.

As a quick hack, it should be trivial to modify SUBMIT_FIELD in libcsv.c to e.g. call cb1 with a null third-argument when the field is quoted and the provided data pointer (which should be non-null for this purpose) when the field is not quoted.

chere005 commented 5 years ago

@rgamble I really appreciate you taking the time to get back to me and for brainstorming solutions. I have been using libcsv for a couple years now and I have been very happy so far.

I agree that the most prominent use case is the resulting data. This is absolutely a power user feature, someone who really understands the underlying data and wants more control. As such, one should not stumble upon this code trying to do basic tasks, but should explicitly seek these features out, and we should consider this in the design/docs. Going further to expose the unquoted spaces/tabs etc may be of much use as well, but is not my most immediate primary concern. While doing this, why not provide all data anyway, though.

1 and 2 accomplish basically the same thing, and i tend to avoid macros when possible. 3 sounds like it could impact performance which is the most important reason that I am using libcsv.

I need to spend some more time reading the source and perhaps trying a few things to better comment on 1/2, I will get back to you next week with more feedback.

I cannot modify the source per LGPL, otherwise I may have considered some small hacks already. I am happy to help with pull requests, both contributing and reviewing/testing. Can you confirm if checking the state of csv_parser won't work like @bobhairgrove suggested? This would allow me to hack it without modifying the source.. To be fair, if we plan on doing this properly and making it an official part of libcsv, I am in no rush for such hacks.

chere005 commented 5 years ago

This is perhaps something of interest:

https://locklessinc.com/articles/overloading/

chere005 commented 5 years ago

Another idea, a struct to hold the state knowledge that is updated at the time the callback is executed and can be optionally queried. It could be passed through data* as well.

rgamble commented 5 years ago

Regarding calling cb1 with a pointer to the parser object to extract quote information: while I'm sure that could be made to work, I would advise against it as the particulars of the internal state information are not intended to be relied on by the client and could change in a future release.

Regarding the license, what is the perceived limitation of modifying the source code? While I am not a lawyer and nothing I say should be construed as legal advice, my understanding is that you can do anything with the source code that the authors can do short of re-licensing the source code (which could be done with the permission of all the contributors as the joint copyright holders).

chere005 commented 5 years ago

@rgamble thank you for confirming that checking the parser object is not a viable solution.

I am not a lawyer either. Upon reading LGPL again, there may be provisions to allow this. I could follow up further in that direction if we don't make this change here, but I'd rather do this via this channel in case this work benefits any others.

bobhairgrove commented 5 years ago

A change which presumably would not break any existing code would be to update the members of the csv_parser struct within the body of csv_parse() just before calling SUBMIT_FIELD (and perhaps also before calling SUBMIT_ROW). The body of the function csv_parse doesn't seem to use the struct members quoted, pstate, spaces, or entry_pos at all since everything is done with local variables.

Alternatively, one could consider doing away with the locals and just use the struct members instead. That way, clients would have full access to all of the internal bookkeeping during the parsing. To me, this would be a more consistent design.

chere005 commented 5 years ago

@bobhairgrove I would not mind this approach, but it does feel a bit hacky from the user's perspective. This is probably a good change regardless of adding an overloaded cb1 somehow; it's the behavior I would have expected from csv_parser without digging into the code.

rgamble / libcsv

Determine if data has delimiters from cb1 #17