learningOrchestra / mlToolKits

learningOrchestra is a distributed Machine Learning integration tool that facilitates and streamlines iterative processes in a Data Science project.
https://learningorchestra.github.io
GNU General Public License v3.0
75 stars 23 forks

Improve Docs #91

Open · pottekkat opened this issue 3 years ago

pottekkat commented 3 years ago

The docs of learningOrchestra need to be validated, tested, and improved by test users. The docs here include both the README and the docs page.

LaChapeliere commented 3 years ago

Hi, I'm happy to help with the docs, because I had trouble understanding what the project was about when looking through the README. Focusing on user docs (rather than dev docs), I propose going over the README with the view of a user who knows how to write data mining scripts but might not be familiar with infrastructure, cloud services, microservices...

LaChapeliere commented 3 years ago

Proposed outline:

  1. One-sentence summary
  2. Thumbnail and indicators
  3. Introduction: what is the project, who is it for, why should I use it?
  4. Table of Contents
  5. Quick-start
  6. Installation instructions
  7. Usage instructions
  8. About learningOrchestra

LaChapeliere commented 3 years ago

I'll start a branch and update here when I'm missing info

pottekkat commented 3 years ago

@LaChapeliere Thank you for contributing. Sure, you can make the changes to the README and the docs repo. You could start a draft PR so we can track the progress.

LaChapeliere commented 3 years ago

First question, probably for @riibeirogabriel: the docs say "learningOrchestra facilitates and streamlines iterative processes in a Data Science project pipeline", but also "learningOrchestra is a distributed Machine Learning processing tool". After reading your monograph, I think I understand why you used data science in one case and machine learning in the other, but for user-friendliness purposes, I strongly recommend picking one to describe your project. Personally, I'd use data science/data mining, because you're talking to data scientists. Then you can refer to machine learning methods for the analysis microservices, of course.

pottekkat commented 3 years ago

I think so too. I also think that the exact purpose of this project gets lost somewhere in the docs. How can we make things clearer in the README and the docs?

PS: We can first fix the README and then move on to the rest of the docs.

LaChapeliere commented 3 years ago

@navendu-pottekkat I've shared a first proposition for the intro in this draft PR :arrow_up: You are right, let's not try to change everything at the same time. Plus, improving the docs requires understanding the code, so it's more work ^^

riibeirogabriel commented 3 years ago

> First question, probably for @riibeirogabriel: [...] for user-friendliness purposes, I strongly recommend picking one to describe your project. Personally, I'd use data science/data mining, because you're talking to data scientists.

Yep, I think so too. We can write the README with the data scientist in mind; the "distributed machine learning processing tool" description came from the beginning of the project, when we wanted to build a distributed tool, but that mindset has changed.

LaChapeliere commented 3 years ago

> First question, probably for @riibeirogabriel: [...] I strongly recommend picking one to describe your project.

> Yep, I think so too. We can write the README with the data scientist in mind [...]

I'll write the README from that perspective, and you can decide later whether to change the motto and thumbnail to match.

LaChapeliere commented 3 years ago

@riibeirogabriel I'm trying to figure out the installation process. There's one big thing that is unclear to me: you mention Linux hosts (plural) and clusters, so I'm guessing you can run learningOrchestra distributed over several machines? If that's the case, how/where do you link the machines together? Also, you need to already own the machines on which you are running learningOrchestra, right? So you have to rent some VMs from some cloud provider first? Do you plan to add a feature where learningOrchestra can facilitate that?

riibeirogabriel commented 3 years ago

We link the machines with Docker swarm. A requirement to run learningOrchestra is a Docker swarm cluster provided by the user; with this cluster, we can run learningOrchestra without worrying about infrastructure. And yes, the user needs to rent machines in the cloud to use learningOrchestra, but if they already have local machines, it's possible to run it locally. What kind of feature do you think could facilitate this infrastructure and cloud environment setup? Does that make sense?

riibeirogabriel commented 3 years ago

The user needs to create a Docker swarm cluster to link the machines.
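The linking step described above can be sketched with standard Docker CLI commands (the IP address and token below are placeholders that come from your own machines, not values from this project):

```shell
# On the machine that will act as the swarm manager:
docker swarm init --advertise-addr <MANAGER-IP>

# "swarm init" prints a join command with a token; run it on each worker node:
docker swarm join --token <TOKEN> <MANAGER-IP>:2377

# Back on the manager, verify that all nodes have joined the swarm:
docker node ls
```

Once `docker node ls` lists all machines, the swarm cluster that learningOrchestra requires is in place.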

LaChapeliere commented 3 years ago

Yup, I understand, thank you! I have no idea how to set up a swarm cluster, but I guess I'll have to read up on it then :D I'm not sure what kind of feature would facilitate that, but I know it would make a huge difference for non-"systems and architecture" oriented data scientists like me. Honestly, even ssh-ing into the university-maintained cluster makes some of us run away screaming :scream_cat:

riibeirogabriel commented 3 years ago

We put a link in the requirements teaching how to create a Docker swarm cluster; it is easy! I agree with you, we need to make the architecture/infrastructure part of learningOrchestra easier. I will create an issue!

LaChapeliere commented 3 years ago

@riibeirogabriel The microservices' REST APIs have to be called with curl, right? Could you give me an example of a complete command you would enter in the terminal of your manager instance to use one of the microservices? (Sorry for my many tech questions)

riibeirogabriel commented 3 years ago

@LaChapeliere curl doesn't give a friendly experience; we can use other programs with a GUI, like Postman or Insomnia, to call the REST APIs. The calls for each API are shown in each microservice's docs. Take a look at a GUI REST client (Postman) calling a learningOrchestra microservice: [screenshot] You can see each microservice's API calls at https://learningorchestra.github.io/learningOrchestra-docs/database-api/

LaChapeliere commented 3 years ago

@riibeirogabriel I'm not sure whether to cry or be happy; I've always muddled through with curl because I thought I didn't have any other option... Thanks!

riibeirogabriel commented 3 years ago

Hahaha, the Python package was created to abstract the calls away from the user; the user only needs to call the methods of each microservice's class and it is done!
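As an illustration of what such an abstraction looks like (the class and method names below are hypothetical, not the real package's API), a thin client can simply build URLs and JSON bodies on the user's behalf:

```python
# Hypothetical sketch of a thin REST client wrapper; endpoint names and
# JSON fields are illustrative, not taken from the real Database API.
import json
from urllib.parse import urljoin


class DatabaseApiClient:
    """Builds requests against a Database-API-style microservice."""

    def __init__(self, base_url: str):
        # Normalize so urljoin appends resources instead of replacing paths.
        self.base_url = base_url.rstrip("/") + "/"

    def url_for(self, resource: str) -> str:
        # e.g. "files" -> "http://<manager-ip>:5000/files"
        return urljoin(self.base_url, resource)

    def insert_payload(self, dataset_name: str, dataset_url: str) -> str:
        # Serialize the JSON body a POST request would carry.
        return json.dumps({"filename": dataset_name, "url": dataset_url})
```

With a wrapper like this, the user calls a method instead of hand-writing a curl command or a Postman request.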

LaChapeliere commented 3 years ago

@riibeirogabriel I've pushed the quick start to PR #95. Could you please check that I haven't missed anything? Then I'll dive into the more detailed instructions.

LaChapeliere commented 3 years ago

@riibeirogabriel Would learningOrchestra work on a single-node swarm or is it a deal breaker?

riibeirogabriel commented 3 years ago

If the node has around 12 GB of RAM, a quad-core processor, and 100 GB of disk, it can work with small datasets, but to handle big data, a cluster is necessary!

LaChapeliere commented 3 years ago

> If the node has around 12 GB of RAM, a quad-core processor, and 100 GB of disk, it can work with small datasets, but to handle big data, a cluster is necessary!

Wow, quad-core processor?

Yeah, I guessed resources would be the problem, though I didn't imagine it to require that much. I just wanted to check that learningOrchestra wouldn't crash on a single node.

LaChapeliere commented 3 years ago

Another noob question, because I don't have the setup to run it on my own computer: I imagine that when we run sudo ./run.sh, it runs in a loop and we have to run the Python client commands/the REST calls from another terminal? And we can terminate learningOrchestra like any command-line program, with Ctrl+C? Bonus question: have you tried to see what happens with terminal multiplexers that enable persistent sessions, like tmux?

riibeirogabriel commented 3 years ago

> Yeah, I guessed resources would be the problem, though I didn't imagine it to require that much. I just wanted to check that learningOrchestra wouldn't crash on a single node.

Yep, learningOrchestra was not planned to run on a single node.

riibeirogabriel commented 3 years ago

> Another noob question, because I don't have the setup to run it on my own computer: I imagine that when we run sudo ./run.sh, it runs in a loop and we have to run the Python client commands/the REST calls from another terminal? And we can terminate learningOrchestra like any command-line program, with Ctrl+C? [...]

It doesn't work that way: the run script performs the deploy and then finishes, haha. To shut down learningOrchestra, you need to run docker stack rm microservice. You reminded me to put this in the docs; there is no information on how to shut down the software.
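The lifecycle described here can be sketched as follows (the stack name microservice is taken from this comment; the rest is standard Docker swarm usage):

```shell
# Deploy: run.sh performs the stack deploy and then exits.
# The services keep running inside the swarm afterwards.
sudo ./run.sh

# Tear down: remove the deployed stack to shut learningOrchestra down.
docker stack rm microservice
```

So there is no long-running foreground process to Ctrl+C; the swarm keeps the services alive until the stack is removed.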

riibeirogabriel commented 3 years ago

> Bonus question: have you tried to see what happens with terminal multiplexers that enable persistent sessions, like tmux?

That would only apply if the run.sh execution were endless, right?

LaChapeliere commented 3 years ago

Yes, it doesn't matter since the deploy is one-time.

LaChapeliere commented 3 years ago

How long is the deploy, typically?

riibeirogabriel commented 3 years ago

Around 10 minutes, but it depends on the machine's resources.

LaChapeliere commented 3 years ago

@riibeirogabriel Could you confirm the categories I've assigned to each microservice, please?

Database API: Gather data
Projection API: Visualize data and results (I'm assuming we're talking about mapping data points into display points?)
Data type API: Clean data
Histogram API: Visualize data and results
t-SNE API: Train machine learning models + Visualize data and results
PCA API: Train machine learning models + Visualize data and results
Model builder API: Train machine learning models

Can any of them also be tagged "Evaluate machine learning models"?

LaChapeliere commented 3 years ago

Just a checklist to keep track of the progress:

riibeirogabriel commented 3 years ago

Database API: CRUD operations on preprocessed data/new data and results (except for the t-SNE and PCA APIs, which each have their own CRUD operations)
Projection API: Preprocessing data, creating a projection from a stored dataset. I don't understand your question.
Data type API: Preprocessing data, changing the type of data fields (between text and number types, i.e. the main JSON types)
Histogram API: Preprocessing data, creating a histogram from some fields of a stored dataset
t-SNE API: Preprocessing data, creating a t-SNE image plot from the dataset content
PCA API: Preprocessing data, creating a PCA image plot from the dataset content
Model builder API: Training, evaluating, and predicting with machine learning models, using several classifiers

riibeirogabriel commented 3 years ago

@LaChapeliere Do you understand each microservice's function now?

riibeirogabriel commented 3 years ago

All microservices except t-SNE and PCA do CRUD on your data through the database API microservice, so the user must use the database API microservice to visualize and handle the results from the other microservices.

riibeirogabriel commented 3 years ago

t-SNE and PCA don't store your data in MongoDB, so each of them has its own CRUD operations.

LaChapeliere commented 3 years ago

I understand the microservices better now, thanks :+1: I was actually trying to sort them into categories corresponding to the data science pipeline steps you are covering. So could you label each microservice with those: Gather data, Clean data, Train machine learning models, Evaluate machine learning models, Visualize data and results? Or tell me how else you'd like to categorise them if this doesn't work :cactus:

riibeirogabriel commented 3 years ago

When we wrote the first monograph, we cataloged the microservices into 8 types: Load Data, Load Model, Preprocessing, Tuning, Training, Evaluation, Production, and Observer. Not all of these microservice types exist yet; we plan to create all of them by the second monograph. The existing microservices are cataloged as follows:

Database API: Load Data
Projection API: Preprocessing
Data type API: Preprocessing
Histogram API: Preprocessing
t-SNE API: Preprocessing
PCA API: Preprocessing
Model builder API: Preprocessing, training, and evaluation (we will decouple this microservice by the second monograph)

riibeirogabriel commented 3 years ago

So maybe what you meant was how to categorize the types listed under preprocessing, right?

LaChapeliere commented 3 years ago

Hmm, is your advisor a data scientist? Your "preprocessing" sounds weird to me. For me, preprocessing means preparing the data so that you can run analyses on it: data cleaning, type casting, date formatting, stop word removal, ...

riibeirogabriel commented 3 years ago

One of my advisors is a data scientist. learningOrchestra needs more microservices to also cover those steps, but you need to know how the model builder microservice works: the model builder takes Python 3 code as a parameter. This code is written by the user, and with it the data scientist can handle the particularities of their dataset. We figured that no single microservice could handle the particularities of all datasets, so we created the preprocess code parameter. Did you see the preprocess code created for the Titanic dataset? It performs all the steps you mentioned (data cleaning, type casting (the data type handler microservice, haha), date formatting, stop word removal) and more! Please take a quick look at https://learningorchestra.github.io/learningOrchestra-docs/modelbuilder-api/. The model builder needs a string (the Python 3 preprocess code); it then interprets this string as Python code and runs those instructions on the dataset (a pyspark DataFrame). If you scroll down to the preprocessor_code Example section, you will see the Titanic code, which is sent as a string (the Python client package is the best way to make a request with this code). What do you think?
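For illustration only (the column names and cleaning steps below are made up, not the actual Titanic example from the docs), the idea is that the client ships a string of Python code that the model builder later executes against the dataset:

```python
# Hypothetical preprocessor code, sent as a string to the model builder.
# Inside the microservice it would run against a pyspark DataFrame; the
# variable name "dataset" is illustrative, not confirmed from the docs.
preprocessor_code = """
dataset = dataset.drop("Cabin", "Ticket")   # drop unhelpful columns
dataset = dataset.fillna({"Age": 30})       # fill missing ages
"""

# Before sending the request, the client can at least verify that the
# string parses as valid Python 3:
compile(preprocessor_code, "<preprocessor>", "exec")
```

Passing code as a string keeps the microservice generic: each user supplies dataset-specific cleaning logic without the service having to anticipate it.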

riibeirogabriel commented 3 years ago

And I will work on your PR tomorrow!

LaChapeliere commented 3 years ago

I think that you definitely need to cut that microservice into several ones :rofl: But I agree with your labelling for that microservice. I was more surprised that you labelled the Projection API, Histogram API, t-SNE API, and PCA API as "Preprocessing". For me, t-SNE and PCA are models. The projection can be simple visualisation or a machine learning model, depending on what kind of projection it is. Histograms are definitely visualisation. I understand that they don't actually produce a graph in this microservice setup, so I'm not sure how to label them, but I think "Preprocessing" is too confusing.

riibeirogabriel commented 3 years ago

You are right! I also think that "preprocessing" was confusing, but we will improve it in the next releases leading up to the second monograph. And yes, we will split the model builder microservice into several microservices.

riibeirogabriel commented 3 years ago

PCA and t-SNE are definitely not preprocessing microservices; my advisor thought of using them to visualize the state of a dataset at each step of the pipeline.

LaChapeliere commented 3 years ago

I'll give some thought to the category names tonight and propose something else tomorrow?

riibeirogabriel commented 3 years ago

Sounds good, there's no rush. And thanks!