Open wesley-dean-flexion opened 2 months ago
Hi @wesley-dean-flexion :)
That seems to be a good idea, you have my go to start implementing :)
About the complexity to call csvkit, you might need to create a python class to handle it :)
(apologies.. I muscle-memoried cvsclean instead of csvclean when creating the branch... ugh...)
Started with csv-clean which is not yet ready for anything. A few questions:
linter_name is the command to run to do the linting; in this case, it would be csvclean because the name of the executable is csvclean.. right?
cli_lint_extra_args can be used to pass --dry-run while cli_lint_fix_remove_args can also be set to --dry-run so that in non-fix mode, it'll pass --dry-run while fix mode will not pass --dry-run... right?
(?i) to make a regex case-insensitive, [[:space:]]+ for 1 or more white spaces, etc.).. right?
csvclean gives differently-formatted output depending on if it's run with --dry-run or not:
$ csvclean --dry-run acronyms.csv
Line 289: Expected 4 columns, found 5 columns
Line 1196: Expected 4 columns, found 6 columns
Line 1241: Expected 4 columns, found 8 columns
Line 1242: Expected 4 columns, found 2 columns
Line 1307: Expected 4 columns, found 3 columns
### note: NO acronyms_err.csv generated here
$ csvclean acronyms.csv
5 errors logged to acronyms_err.csv
### note: acronyms_err.csv and acronyms_out.csv ARE generated here
So, this is where a wrapper, living in the linters directory, would come in... right? The Python class would check whether we're running in fix mode, apply the --dry-run flag as needed, grab the correct output, make sure the original file is what's pushed when in fix mode, etc.. right?
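As a rough illustration of the wrapper logic discussed above, here is a minimal Python sketch. It assumes the csvkit 1.x behaviour shown in this thread (the "Line N: Expected X columns, found Y columns" dry-run format and the [basename]_err.csv file, assumed to be written alongside the input). All function names are hypothetical and are not part of MegaLinter's or csvkit's API; how this would plug into a real MegaLinter linter class is left out.

```python
import re
from pathlib import Path

# Matches dry-run lines such as "Line 289: Expected 4 columns, found 5 columns".
# The format is taken from the output pasted in this thread (csvkit 1.x).
DRY_RUN_ERROR = re.compile(
    r"^Line (?P<line>\d+): Expected (?P<expected>\d+) columns, "
    r"found (?P<found>\d+) columns$"
)


def build_csvclean_command(csv_file, apply_fixes):
    """Return the argv list; --dry-run is passed only when NOT fixing."""
    command = ["csvclean"]
    if not apply_fixes:
        command.append("--dry-run")
    command.append(str(csv_file))
    return command


def parse_dry_run_output(output):
    """Turn dry-run output into a list of (line, expected, found) tuples."""
    errors = []
    for raw in output.splitlines():
        match = DRY_RUN_ERROR.match(raw.strip())
        if match:
            errors.append(
                (int(match["line"]), int(match["expected"]), int(match["found"]))
            )
    return errors


def error_file_for(csv_file):
    """Path of the [basename]_err.csv file (assumed to sit next to the input)."""
    path = Path(csv_file)
    return path.with_name(f"{path.stem}_err.csv")


def exit_status(errors):
    """Non-zero when any row error was found, since csvclean itself may exit 0."""
    return 1 if errors else 0
```

In non-fix mode the wrapper would run the command from build_csvclean_command(path, apply_fixes=False), feed its output to parse_dry_run_output, and report exit_status(errors); in fix mode it could instead check whether error_file_for(path) exists after the run.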
I'm working with @jpmckinney on some interface changes (wireservice/csvkit#1239) that ought to simplify this integration. As a result, when v2.0.0 comes out, a lot of what I wrote before will no longer matter.
Additionally, I submitted wireservice/csvkit#1240 to containerize the tool and publish official images that could be used instead of building the tool via pip during the MegaLinter build process. Hopefully this will simplify the build and isolate MegaLinter from any build problems, interface refactoring, etc.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
If you think this issue should stay open, please remove the O: stale 🤖 label or comment on the issue.
I'm waiting on a PR approval from the csvkit folks so I can move forward with this.
tl;dr: I would like to add csvclean (from csvkit) as a linter. I'm happy to do the work if people think this is a good idea.
Is your feature request related to a problem? Please describe.
One of my repos includes CSV files and they can be sub-optimal. Just as we have linters (and reformatters) for JSON, XML, and YAML, I would like to add a CSV linter.
Describe the solution you'd like
There exists a package, csvkit, that includes a tool to lint and clean up CSV files:
https://csvkit.readthedocs.io/en/latest/scripts/csvclean.html
I would like to add CSV_CSVCLEAN (the name isn't consequential to me; I just picked it because of YAML_YAMLLINT) that would lint the list of files. When run with APPLY_FIXES, it would not include the --dry-run flag to csvclean; when run without APPLY_FIXES set, it would include the --dry-run flag.
Running csvclean on a CSV file results in two files being created, per the documentation.
I noticed that running csvclean on a known messy file (i.e., one that produces errors due to being not totally valid) will NOT set $? but it will generate [basename]_err.csv, so something like this might be helpful:
(so $? is set, $@ would have the additional arguments, $APPLY_FIXES is set to something when we want to fix stuff, etc.. point: a little "syntactic sugar" could be helpful in making it work the way we want.)
Describe alternatives you've considered
I haven't thought through very many alternatives. I did look through prettier to see if it could clean up CSV like it can for YAML and such; however, it does not appear to have that functionality. If it's there and I missed it, cool, there's that much less work that needs to be done.
Additional context
There are a few images on Docker Hub that provide csvkit, but they're largely several years old. For what it's worth, csvkit regularly provides revisions, the most recent of which (latest / v1.5.0) was released on 28 March, 2024. (point: existing images are behind the current release). I can put together a pipeline to watch the csvkit repo for new releases and package / publish updated images.
I'm happy to do the work to implement this and submit a PR assuming folks are cool with the idea.
CSV has a bunch of limitations, and JSON, TOML, XML, or YAML (etc.) may be a better match to represent data. That's a given and I don't dispute it. Unfortunately, it's not my call how the data are represented, but I do have responsibilities to make sure the pipeline from developer to production detects (and notifies me of) as much noise as possible.