Integrate CSV linter - GitHub Issues

wesley-dean-flexion commented 2 months ago

tl;dr: I would like to add csvclean (from csvkit) as a linter. I'm happy to do the work if people think this is a good idea.

Is your feature request related to a problem? Please describe.

One of my repos includes CSV files and they can be sub-optimal. Just as we have linters (and reformatters) for JSON, XML, and YAML, I would like to add a CSV linter.

Describe the solution you'd like

There exists a package, csvkit, that includes a tool to lint and clean up CSV files:

https://csvkit.readthedocs.io/en/latest/scripts/csvclean.html

I would like to add CSV_CSVCLEAN (the name isn't consequential to me; I just picked it because of YAML_YAMLLINT) that would lint the list of files. When run with APPLY_FIXES, it would not pass the --dry-run flag to csvclean; when run without APPLY_FIXES set, it would pass the --dry-run flag.

Running csvclean on a CSV file results in two files being created, per the documentation:

Outputs [basename]_out.csv and [basename]_err.csv, the former containing all valid rows and the latter containing all error rows along with line numbers and descriptions

I noticed that running csvclean on a known messy file (i.e., one that produces errors due to being not totally valid) will NOT set $? but it will generate [basename]_err.csv, so something like this might be helpful:

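csvcleanwrapper() {
  # add --dry-run unless APPLY_FIXES is set; fail if an error file was written
  local dry_run=()
  [ -z "$APPLY_FIXES" ] && dry_run=(--dry-run)
  csvclean "$@" "${dry_run[@]}" -- - && [ ! -e stdin_err.csv ]
}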

(so $? is set, $@ would have the additional arguments, $APPLY_FIXES is set to something when we want to fix stuff, etc.. point: a little "syntactic sugar" could be helpful in making it work the way we want.)

Describe alternatives you've considered

I haven't thought through very many alternatives. I did look through prettier to see if it could clean up CSV like it can for YAML and such; however, it does not appear to have that functionality. If it's there and I missed it, cool, there's that much less work that needs to be done.

Additional context

There are a few images on Docker Hub that provide csvkit, but they're largely several years old. For what it's worth, csvkit regularly provides revisions, the most recent of which (latest / v1.5.0) was released on 28 March, 2024. (point: existing images are behind the current release). I can put together a pipeline to watch the csvkit repo for new releases and package / publish updated images.

I'm happy to do the work to implement this and submit a PR assuming folks are cool with the idea.

It's a given that CSV has a bunch of limitations and that JSON, TOML, XML, or YAML (etc.) may be a better match to represent data; I don't dispute it. Unfortunately, it's not my call how the data are represented, but I do have a responsibility to make sure the pipeline from developer to production detects (and notifies me about) as much noise as possible.

nvuillam commented 2 months ago

Hi @wesley-dean-flexion :)

That seems to be a good idea, you have my go-ahead to start implementing :)

About the complexity of calling csvclean, you might need to create a Python class to handle it :)

wesley-dean-flexion commented 2 months ago

(apologies.. I muscle-memoried cvsclean instead of csvclean when creating the branch... ugh...)

Started with csv-clean which is not yet ready for anything. A few questions:

linter_name is the command to run to do the linting; in this case, it would be csvclean because the name of the executable is csvclean .. right?
cli_lint_extra_args can be used to pass --dry-run while cli_lint_fix_remove_args can also be set to --dry-run so that in non-fix mode, it'll pass --dry-run while fix mode will not pass --dry-run... right?
we can use Python regex mechanics (e.g., (?i) to make a regex case-insensitive, \s+ for 1 or more whitespace characters, etc.).. right?
csvclean gives differently-formatted output depending on if it's run with --dry-run or not:

$ csvclean --dry-run acronyms.csv
Line 289: Expected 4 columns, found 5 columns
Line 1196: Expected 4 columns, found 6 columns
Line 1241: Expected 4 columns, found 8 columns
Line 1242: Expected 4 columns, found 2 columns
Line 1307: Expected 4 columns, found 3 columns

### note: NO acronyms_err.csv generated here

$ csvclean acronyms.csv
5 errors logged to acronyms_err.csv

### note: acronyms_err.csv and acronyms_out.csv ARE generated here

So, this is where a wrapper living in the linters directory would come in... right? The Python class would check whether we're running in fix mode, apply the --dry-run flag as needed, grab the correct output, make sure the original file is what's pushed when in fix mode, etc.. right?
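
For what it's worth, the two output formats shown above could be normalized in one place. Here is a minimal sketch, assuming nothing about MegaLinter's actual linter base class: build_args and count_errors are hypothetical helper names (not part of MegaLinter or csvkit), and the patterns are derived only from the sample output in this comment, so real csvclean messages may differ.

```python
import re

# Patterns derived from the sample output above (assumption: messages keep this shape).
# Python's re module uses \s for whitespace; (?i) makes a pattern case-insensitive.
DRY_RUN_LINE = re.compile(r"(?i)^line\s+(\d+):\s+(.+)$")
SUMMARY_LINE = re.compile(r"(?i)^(\d+)\s+errors?\s+logged\s+to\s+(\S+)$")


def build_args(files, apply_fixes):
    """Return a csvclean argument list, adding --dry-run unless fixing."""
    args = ["csvclean"]
    if not apply_fixes:
        args.append("--dry-run")
    return args + list(files)


def count_errors(output):
    """Count errors reported in either output format."""
    total = 0
    for line in output.splitlines():
        line = line.strip()
        summary = SUMMARY_LINE.match(line)
        if summary:
            # fix mode: a single summary line carries the error count
            total += int(summary.group(1))
        elif DRY_RUN_LINE.match(line):
            # dry-run mode: one line per error
            total += 1
    return total


dry_run_output = """Line 289: Expected 4 columns, found 5 columns
Line 1196: Expected 4 columns, found 6 columns"""

print(build_args(["acronyms.csv"], apply_fixes=False))
print(count_errors(dry_run_output))
print(count_errors("5 errors logged to acronyms_err.csv"))
```

A non-zero count could then be mapped to a failing status, which would work around csvclean not setting $? on its own.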

wesley-dean-flexion commented 2 months ago

I'm working with @jpmckinney on some interface changes (wireservice/csvkit#1239) that ought to simplify this integration. As a result, when v2.0.0 comes out, a lot of what I wrote before will no longer matter.

Additionally, I submitted wireservice/csvkit#1240 to containerize the tool and publish official images that could be used instead of building the tool via pip during the MegaLinter build process. Hopefully this will simplify the build and isolate MegaLinter from any build problems, interface refactoring, etc..

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

If you think this issue should stay open, please remove the O: stale 🤖 label or comment on the issue.

wesley-dean-flexion commented 1 month ago

I'm waiting on a PR approval from the csvkit folks so I can move forward with this.

github-actions[bot] commented 1 day ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

If you think this issue should stay open, please remove the O: stale 🤖 label or comment on the issue.

oxsecurity / megalinter

Integrate CSV linter #3493
