mrodrig / json-2-csv

Convert JSON to CSV *or* CSV to JSON!
https://mrodrig.github.io/json-2-csv
MIT License
415 stars 57 forks source link

No possibility to use unescaped dots in headers #245

Closed Travellerme closed 7 months ago

Travellerme commented 7 months ago

Background Information

The issue I'm reporting is with:

I have...

Expected Behavior

Please provide a possibility to define this option during csv generation.

escapeNestedDots = false results in: ['a.a', 'a.b.c', 'a.b.c.d']

Actual Behavior

The escapeNestedDots option is hardcoded for the https://www.npmjs.com/package/deeks library. image

As a result, it's not possible to generate csv without a preceding \ character in case one of json key has a . character in a key name.

escapeNestedDots = true results in: ['a\\.a', 'a\\.b.c', 'a\\.b.c\\.d']

Data Sample

CSV:

{
    "a.a": "1",
    "a.b": {
        "c": "2",
        "c.d": "3"
    }
 }

Code Example

// Please include a simple example to replicate the issue
import { json2csv } from 'json-2-csv'
const list = {
    "a.a": "1",
    "a.b": {
        "c": "2",
        "c.d": "3"
    }
 }
json2csv(list, { unwindArrays: false, emptyFieldValue: '', checkSchemaDifferences: false, expandNestedObjects: true, expandArrayObjects: true })
mrodrig commented 7 months ago

Hi @Travellerme, thanks for opening this. I can definitely see the use case for wanting to have keys with nested dots without having them appear escaped in the output CSV header. The reason the dots are currently being escaped is because it was the only way that I could ensure that the correct values were being retrieved for edge cases like a document like this:

{
  "a.b": { "c": 0 },
  "a": { "b": { "c": 2 } }
}

While it seems unlikely to happen, it's a valid possibility and without escaping the nested dots, there was no way to guarantee the correct key path's value was retrieved. Specifically, would a.b.c reference 0 or 2? Whereas escaping the nested dot, a\\.b.c provides some assurance that we're looking for 0. Some libraries make assumptions that don't always pan out, but I relied on escaping the nested dot to be certain that the CSV would be correct. I agree that it doesn't look the prettiest in the output though.

I'm fairly certain that the functionality you're looking for is still entirely possible, but it is a bit more involved to implement since it will require reworking the deeks module a bit to enable tracking the separate components of each key path that's identified in order to keep the same level of certainty. Additionally, the doc-path module will need an update to add support for a string[] to be specified with the key path components (eg. the key strings at each layer) instead of the string that it currently expects for the key path.

I'm open to working to support this, but it's likely going to be a longer timeframe than some of the other open issues that have a much more concise scope, and my time to support this is limited to after my full-time work. If you, or someone else that would like this sooner, is up for it, then I'd also be happy to review and merge pull requests for any of the updates needed to add support for this.

Thanks for your understanding. šŸ™‚

Travellerme commented 7 months ago

Hello @mrodrig Thanks for your clarification! I understood the edge case you mentioned and it's more than valid.

I have a couple of quick ways to resolve it and would like to know your opinion about them:

  1. As initially proposed, add a possibility to override escapeNestedDots but in case we have several equal headers after it - throw an error. I think it will not be a breaking change, as the error could be thrown only if someone passes the new option for overriding escapeNestedDots. The documentation can describe the error so that developers will be aware of it.
  2. You can still use escapeNestedDots = true as it's done right now, so nothing will be changed internally for the provided example, but you can add one more option such as escapeHeaders = boolean, and if this option is provided, before the final generation of csv file you can manually escape / characters in every header something like const header = params.header.replaceAll('\\', '')

image

Also, the disclaimer about possible header name duplication can be added to the documentation for a new option, so developers should accept this risk in case they use it.

Please let me know what you think about it. Thank you

mrodrig commented 7 months ago

Hi @Travellerme, Thanks for your thoughtful response and additional options I hadn't considered. You've raised some great points with these additional paths forward.

For option 1 you mentioned, I think that is reasonable with the possibility that an error could be thrown for data where multiple keys have equivalent paths when not escaped. The only possible issue that I see with this, would be that the doc-path module depends on any keys with nested dots to be escaped in order for it to navigate through the object and retrieve the associated value. For example:

const { evaluatePath } = require('doc-path')
const data = {
    "a.b": {
        "c": "2"
    }
};
console.log( evaluatePath(data, 'a.b.c') );   // => prints: null
console.log( evaluatePath(data, 'a\\.b.c') ); // => prints: "2"

I think it's possible to update doc-path to support that, but I'd want to be cautious about how it's implemented so there isn't a performance degradation that would impact dependent packages or apps.

Based on that, I've been giving more thought to option 2 that you outlined with an option. It seems this would be the quickest and probably the simplest way to achieve this with minimal risk. Since it would effectively be updating more of the CSV "presentation" post-processing of the provided JSON, this seems like it would be ideal given that there's less risk of incorrect data being retrieved. One possible downside that I see is that any keys that have \\ present elsewhere in them would be updated to have those instances removed. It seems like a reasonable mitigation to decrease that likelihood would be to do something like .replaceAll(/\\\./g, '.') so only \\. is replaced with . inside the keys. This potential side effect could also be documented alongside the option flag to make the possibility clearer.

I think based on that I'm leaning towards option 2, but am open to option 1 if you think that's a better implementation of this for your use case. Thanks again!

Travellerme commented 7 months ago

Hi @mrodrig I agree that option 2 seems to be the most painless and easy to implement. I would appreciate it if you managed to implement this feature soon. Thank you!

mrodrig commented 7 months ago

Thanks @Travellerme for your collaboration on this! I was able to make the simple changes necessary to support this tonight and published version 5.2.0 with a new escapeHeaderNestedDots option. When set to false, it will replace the escaped dots in header keys with unescaped dots so the CSV is more readable.

Prior to this option, console.logging the generated CSV from your example would result in:

a\.a,a\.b.c,a\.b.c\.d
1,2,3

However, with version 5.2.0 with the new escapeHeaderNestedDots option set to false, the CSV output when console.logging it will look like this:

a.a,a.b.c,a.b.c.d
1,2,3

For others for future reference, here's an updated code snippet with the option specified:

import { json2csv } from 'json-2-csv'

const list = {
    "a.a": "1",
    "a.b": {
        "c": "2",
        "c.d": "3"
    }
};

const csv = json2csv(list, {
    escapeHeaderNestedDots: false,
    unwindArrays: false,
    emptyFieldValue: '',
    checkSchemaDifferences: false,
    expandNestedObjects: true,
    expandArrayObjects: true
})

console.log(csv);

Hopefully this helps! šŸ™‚

Travellerme commented 7 months ago

Thanks @mrodrig a lot! It really helped me!

Travellerme commented 6 months ago

Hi @mrodrig Sorry for resurrecting it, but I've just updated the version to the latest 5.4.0 and it seems the problem still exists.

Input:

[
  {
    "tempFile1": "blabla/tempFile1",
    "a": 1,
    "b": {
      "b1": [
        "b11",
        "b12"
      ]
    },
    "c": {
      "c1": {
        "c2.c4": "c_value"
      }
    }
  },
  {
    "tempFile2": "blabla/tempFile2",
    "a": 1,
    "b": {
      "b1": [
        "b11",
        "b12"
      ]
    },
    "c": {
      "c1": {
        "c2.c4": "c_value"
      }
    }
  },
  {
    "tempFile3": "blabla/tempFile3",
    "a": 1,
    "b": {
      "b1": [
        "b11",
        "b12"
      ]
    },
    "c": {
      "c1": {
        "c2.c4": "c_value"
      }
    }
  }
]

Output

    tempFile1,a,b.b1,c.c1.c2\.c4,tempFile2,tempFile3
    blabla/tempFile1,1,"[""b11"",""b12""]",c_value,,
    ,1,"[""b11"",""b12""]",c_value,blabla/tempFile2,
    ,1,"[""b11"",""b12""]",c_value,,blabla/tempFile3

As you may see, the c.c1.c2\.c4 was not escaped.

thx

mrodrig commented 6 months ago

Hi @Travellerme, can you please confirm that you're specifying escapeHeaderNestedDots: false in the options you're passing to the json2csv method? The default value is true, so setting it to false will trigger the logic to remove any escaped dots in the header keys. I tried the data sample you provided with that option set to false and I'm not seeing the escaped dot in the CSV output with the same example code from https://github.com/mrodrig/json-2-csv/issues/245#issuecomment-1960681667:

tempFile1,a,b.b1,c.c1.c2.c4,tempFile2,tempFile3
blabla/tempFile1,1,"[""b11"",""b12""]",c_value,,
,1,"[""b11"",""b12""]",c_value,blabla/tempFile2,
,1,"[""b11"",""b12""]",c_value,,blabla/tempFile3
Travellerme commented 6 months ago

Hi @mrodrig , sorry I made a mistake. For some reason, I thought escapeHeaderNestedDots must be equal to true to avoid escaping. My bad, everything works just fine, thx!

MathRobin commented 6 months ago

Just to notify, for some users, it's a breaking changes. Lot of my scripts were adapted to dotted headers (not my choice). So when update to latest, Pow! nothing works anymore. I think v6.0 would have been more indicated. My 2cts.

Fortunately, my CI has detected it before going to prod. But i think it was a breaking changes.

mrodrig commented 6 months ago

Hi @MathRobin, I'm happy to take a look if you have a sample that replicates your use case breaking from 5.0.1 to 5.4.0, but the behavior for this is behind the escapeHeaderNestedDots option, which has a default value set to preserve the existing behavior. All of my test cases that had nested dots in their header keys continued to pass with this change, and new test cases defined for this new option also worked as expected, so there's no indication from my testing that this had any breaking changes.