microsoft / vscode-data-wrangler

iso country code "NA" recognized as "N/A" #219

Open tachsin opened 3 months ago

tachsin commented 3 months ago

Type: Bug

Behaviour

Expected vs. Actual

My data contains "NA", the ISO country code for Namibia, but Data Wrangler seems to read it as N/A and reports "Missing value".

Steps to reproduce:

  1. Open a CSV file that has "NA" in a column.

Diagnostic data

{
  "installed": {
    "pandas": "2.0.3",
    "pyarrow": "11.0.0"
  },
  "required": {
    "pandas": "0.25.2"
  },
  "unsatisfied": []
}

Extension version: 1.4.1
VS Code version: Code 1.90.2 (Universal) (5437499feb04f7a586f677b155b039bc2b3669eb, 2024-06-18T22:37:41.291Z)
OS version: Darwin arm64 24.0.0
Modes:

System Info

|Item|Value|
|---|---|
|CPUs|Apple M2 (8 x 2400)|
|GPU Status|2d_canvas: enabled; canvas_oop_rasterization: enabled_on; direct_rendering_display_compositor: disabled_off_ok; gpu_compositing: enabled; multiple_raster_threads: enabled_on; opengl: enabled_on; rasterization: enabled; raw_draw: disabled_off_ok; skia_graphite: disabled_off; video_decode: enabled; video_encode: enabled; webgl: enabled; webgl2: enabled; webgpu: enabled|
|Load (avg)|4, 3, 3|
|Memory (System)|16.00GB (0.26GB free)|
|Process Argv|--crash-reporter-id 8dabb8e3-418e-47c2-96f1-ae6d5db5476f|
|Screen Reader|no|
|VM|0%|

A/B Experiments

```
vsliv368cf:30146710
vspor879:30202332
vspor708:30202333
vspor363:30204092
vscoreces:30445986
vscod805:30301674
binariesv615:30325510
vsaa593:30376534
py29gd2263:31024239
c4g48928:30535728
azure-dev_surveyonecf:30548226
962ge761:30959799
pythongtdpath:30769146
welcomedialogc:30910334
pythonnoceb:30805159
asynctok:30898717
pythonregdiag2:30936856
pythonmypyd1:30879173
2e7ec940:31000449
pythontbext0:30879054
accentitlementsc:30995553
dsvsc016:30899300
dsvsc017:30899301
dsvsc018:30899302
cppperfnew:31000557
dsvsc020:30976470
pythonait:31006305
jchc7451:31067544
chatpanelc:31048052
dsvsc021:30996838
da93g388:31013173
pythoncenvpt:31062603
a69g1124:31058053
dvdeprecation:31068756
dwnewjupyter:31046869
newcmakeconfigv2:31071590
legacy_priority:31082724
```

pwang347 commented 3 months ago

Hi @tachsin, thank you for reporting this issue!

This happens because "NA" is one of the default strings that pandas parses as a missing value when reading CSV files. As a workaround, you can change this behaviour by editing the code of the import step.
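For illustration, here is a minimal sketch of that default behaviour in plain pandas, outside Data Wrangler. The two-row CSV is made-up sample data, not the file from the report:

```python
import io

import pandas as pd

# Hypothetical sample data: "NA" here is the ISO code for Namibia.
csv_text = "country,code\nNamibia,NA\nNorway,NO\n"

# With default settings, read_csv treats the string "NA" as a missing value.
df = pd.read_csv(io.StringIO(csv_text))

# Flags which entries in the "code" column were parsed as missing.
missing_flags = df["code"].isna().tolist()
print(missing_flags)
```

The first flag is `True`: the Namibia code has been turned into a missing value, which matches what the issue describes.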

To edit the import step, you can first click on the first cleaning step:

image

Then edit the code directly to opt out of the default NA value parsing by adding `keep_default_na=False`, and press the Update button:

image

To specify a custom list of values to be parsed as missing, you can also pass in something like na_values=["", "null"]. See the documentation for more information about these settings.
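As a rough sketch of what the edited import call could look like, the snippet below combines both options. The inline CSV and its column names are invented for the example, and the code that Data Wrangler generates for your file will differ (it reads from your file path, for instance):

```python
import io

import pandas as pd

# Hypothetical sample data; the "population" column has an empty cell and a "null".
csv_text = (
    "country,code,population\n"
    "Namibia,NA,\n"
    "Norway,NO,null\n"
    "Nauru,NR,12000\n"
)

df = pd.read_csv(
    io.StringIO(csv_text),
    keep_default_na=False,   # do not use pandas' built-in list of NA strings
    na_values=["", "null"],  # only these strings count as missing
)

codes = df["code"].tolist()
population_missing = df["population"].isna().tolist()
print(codes)
print(population_missing)
```

With these settings "NA" stays a regular string, while empty cells and "null" are still parsed as missing.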

--

Workaround aside, we'd love to streamline the import flow to simplify the configuration process for settings like these. It's an item on the roadmap and we'll keep you posted on any updates. Hope this helps!