Documentation and Installation question(s)

JvD007 commented 3 years ago

Some questions around installation and user documentation:

What do we need to install to get Kamu up and running without Docker?
Could you give examples for all the functions of kamu-cli? The help output does not explain well what you can do with it. Add and Pull are clear, and so are sql and notebook.
Some examples with Python, SparkR and maybe others

Create a dataset on S3 etc

TX, Jaco

sergiimk commented 3 years ago

Hi Jaco, thanks for the feedback!

I agree, we will focus on a better first-time user experience in the next few days, including:

better help messages
documentation
and examples

I will keep you updated on the progress here as we add/update things.

Regarding Docker, we use it for these purposes:

As a simple way to distribute data processing engines like Flink and Spark, without requiring users to install these big frameworks by hand
As a way to distribute auxiliary tools like Jupyter and its dependencies
As a way to ensure reproducibility:
- by running engines in isolated "sandbox" environments so they don't intentionally or accidentally depend on external non-reproducible resources
- and by associating specific versions of engines with every transformation
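
To make the last point concrete, here is a minimal sketch in Python of what recording an engine version alongside each transformation could look like. It is illustrative only and is not kamu's actual metadata format: `TransformStep`, `is_pinned`, `unpinned_steps`, and the image names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransformStep:
    """One recorded transformation (hypothetical structure, not kamu's format)."""
    query: str
    engine: str
    engine_image: str  # container image reference used to run the query


def is_pinned(image: str) -> bool:
    """Return True if the image names a digest or an explicit non-'latest' tag."""
    if "@sha256:" in image:
        return True
    # Look for a tag only in the last path segment, so that a registry
    # port such as "localhost:5000" is not mistaken for a tag.
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False
    tag = last_segment.rsplit(":", 1)[1]
    return tag not in ("", "latest")


def unpinned_steps(steps):
    """Return the steps whose engine image could change between runs."""
    return [s for s in steps if not is_pinned(s.engine_image)]


# Hypothetical image names, used purely for illustration.
steps = [
    TransformStep("SELECT * FROM input", "spark", "example.org/engine-spark:0.1.0"),
    TransformStep("SELECT * FROM input", "flink", "example.org/engine-flink:latest"),
]
print([s.engine for s in unpinned_steps(steps)])  # prints ['flink']
```

The idea is that a step that references a floating tag like `latest` may produce different results later, while a step tied to an exact version or digest can be re-run with the same engine.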

So I think getting rid of Docker completely can be our long-term goal. In the short term, we can start moving more and more features out of Docker, allowing the majority of users (data consumers) to use kamu without it.

For example:

Using a simple built-in SQL engine for kamu sql shell, not to depend on Spark
Using the same for data ingestion (e.g. reading CSV into ODF dataset)
Allowing people to use Jupyter that is already installed on their machines

Overall, Docker is becoming more and more popular in the data science community, as it simplifies repeatability and sharing of data projects. So if you encounter specific issues with it, we can try to help you address them (e.g. by linking a step-by-step guide on setting up Docker with WSL2).

sergiimk commented 3 years ago

In #35 I've added detailed help to all commands along with common usage examples (released in v0.38.2).

I also gave Metadata Reference a better structure to make the job of writing dataset manifests easier.

Also added a new section of documentation on Merge Strategies with examples.

sergiimk commented 3 years ago

Update: kamu now supports podman for running containers without the need for sudo or any other privilege escalation.

kamu-data / kamu-cli

Documentation and Installation question(s) #34