theodi / octopub

Publish data easily, quickly and correctly
https://octopub.io/

Validating prepublished dataset files #717

Open langphil opened 6 years ago

langphil commented 6 years ago

Octopub currently passes CSVLint a URL in the query string; CSVLint validates the CSV at that URL and returns a status to Octopub.

<td><a href="https://csvlint.io/?uri=<%= @dataset.gh_pages_url %>/data/<%= file.filename %>"><img src="https://csvlint.io/?uri=<%= @dataset.gh_pages_url %>/data/<%= file.filename %>&format=svg" alt="CSVlint validation result" /></a></td>

This query string points at the GitHub repository that was created during Octopub's publishing process. As part of the current development, we should instead be passing the S3 upload URL as the query string.

The issue is this: when a CSV is provided to CSVLint, it not only validates the file but also publishes it, making it available for download. That is outside the scope of the current Octopub development.

Without private validation of CSV files, the vision for Octopub as a prepublishing tool is lost.

Goals

olivierthereaux commented 6 years ago

I agree that the way octopub currently calls csvlint.io is not acceptable given the shift to a pre-publishing workflow.

Passing the S3 uri to csvlint.io is not acceptable either – regardless of whatever security policy we use in S3, I do not think it would be OK to expose the secure-through-obscurity URI of a yet unpublished resource, as csvlint would automatically publicise the URI on its "recent validations" page.

Think we've got 4 options:

  1. Kill off the "recent validations" page on csvlint.io. It's not helpful at all, and causes us recurring grief. Pros: relatively easy to do, and kills two birds with one stone. Cons: I would worry that this may still expose the S3 uri beyond what is reasonable.
  2. Spin off a separate instance of csvlint.io purely for use with octopub. Pros: fairly trivial to do. Cons: more maintenance and hosting cost, plus see above on point 1)
  3. Stop using csvlint.io and use the csvlint.rb library internally instead. Pros: would be way more secure, and probably much more efficient. Cons: would require more significant development, and may also make it harder in the long run to integrate with lintol rather than csvlint, if/when we decide to switch.
  4. Keep using csvlint.io, but instead of href-ing to it, POST the actual file payload to it. csvlint.io accepts both a GET with the uri as query string, or a POST for file upload; and it does not list resources POSTed to it in its recent validations page. Pros: much safer, and does not preclude switching to a similar access point in Lintol in the future. Cons: similar to the solution of using csvlint.rb above.
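
To make option 3 a bit more concrete, here is a minimal sketch of in-process validation with the csvlint.rb gem. It assumes the gem is in the Gemfile and relies on `Csvlint::Validator` accepting a `StringIO` and exposing `valid?`, `errors` and `warnings`. The method name `lint_privately` and the injectable `validator_class:` argument are hypothetical, added only so the sketch can be run without the gem installed.

```ruby
require 'stringio'

# Validate CSV text inside the Octopub process, so no URI for the
# unpublished file is ever handed to an external service.
# `validator_class:` is a hypothetical hook for testing; when omitted,
# the csvlint gem is loaded and Csvlint::Validator is used.
def lint_privately(csv_text, validator_class: nil)
  validator_class ||= begin
    require 'csvlint' # assumption: the csvlint gem is available
    Csvlint::Validator
  end

  # The validator takes a URL, a File or a StringIO as its source.
  validator = validator_class.new(StringIO.new(csv_text))

  {
    valid: validator.valid?,
    errors: validator.errors.map { |e| { type: e.type, row: e.row } },
    warnings: validator.warnings.map { |w| { type: w.type, row: w.row } }
  }
end
```

The returned hash could drive the same pass/fail badge the view shows today, without linking out to csvlint.io at all.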

My preference would be for 4) if we think it is doable, or 1) as a quick-and-dirty workaround.
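
For reference, option 4 could look roughly like the sketch below, using only Ruby's standard `Net::HTTP`. The upload endpoint and the multipart field name are not stated anywhere in this thread, so both are left as required arguments and would need checking against csvlint.io itself; `build_csvlint_upload` and `post_to_csvlint` are hypothetical helper names.

```ruby
require 'net/http'
require 'stringio'
require 'uri'

# Build a multipart POST that carries the CSV body itself rather than a URI.
# `endpoint:` and `field:` are assumptions to be confirmed against csvlint.io.
def build_csvlint_upload(csv_text, filename, endpoint:, field:)
  uri = URI.parse(endpoint)
  request = Net::HTTP::Post.new(uri)
  request['Accept'] = 'application/json'
  request.set_form(
    [[field, StringIO.new(csv_text), { filename: filename, content_type: 'text/csv' }]],
    'multipart/form-data'
  )
  [uri, request]
end

# Send the request and return the raw Net::HTTPResponse.
def post_to_csvlint(csv_text, filename, endpoint:, field:)
  uri, request = build_csvlint_upload(csv_text, filename, endpoint: endpoint, field: field)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end
end
```

Because the file travels in the request body, neither the GitHub Pages URL nor the S3 URI appears in the query string, which is the whole point of this option.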

rachelwilson commented 6 years ago

Thanks for this!

(Just double-checking I'm not missing something.) For point 4's Cons, when you say "similar to the solution of using csvlint.rb above", did you mean just the "would require more significant development" part, but not the "may also make it harder in the long run to integrate with lintol" part?

Out of interest: do we know why POSTed validations don't appear in the "recent validations" list? Was that a conscious decision or a technical quirk?

olivierthereaux commented 6 years ago

> For point 4's Cons, when you say "similar to the solution of using csvlint.rb above", did you mean just the "would require more significant development" part, but not the "may also make it harder in the long run to integrate with lintol" part?

Correct!

> Out of interest: do we know why POSTed validations don't appear in the "recent validations" list? Was that a conscious decision or a technical quirk?

As far as I can tell, it is because the content payload is POSTed: unless csvlint stores the file and creates a URI for it, there is no URI to point to, so it makes no sense to add it to the recent validations.