tiangolo / full-stack

Full stack, modern web application generator. Using Flask, PostgreSQL DB, Docker, Swagger, automatic HTTPS and more.
MIT License

Error after running docker-compose up -d #7

Closed: headshothottie closed this issue 4 years ago

headshothottie commented 5 years ago

I would like to get a cluster set up with Docker Swarm Mode and Traefik. Do you have more detailed docs for that? (I've just started teaching myself the command line and Python in the last 30 days, so it's hard to understand.)

headshothottie commented 5 years ago

I get this error when I go to http://localhost/swagger/:

Fetch error: NetworkError when attempting to fetch resource. http://dev.test.com/api/v1/swagger/
Fetch error: Possible cross-origin (CORS) issue? The URL origin (http://dev.test.com) does not match the page (http://localhost). Check the server returns the correct 'Access-Control-Allow-*' headers.

What is {{cookiecutter.domain_dev}} supposed to be in the hosts file?

tiangolo commented 5 years ago

Wow, you're learning very fast if you got up to this point starting only 30 days ago! :clap: :tada:

So, you probably ran this command in your command line:

pip install cookiecutter
cookiecutter https://github.com/tiangolo/full-stack

It should have asked you for your project name and specific configurations. After that, it should have created a directory with all the generated code, specific for your project.

Inside that generated code there should be a README.md file, specific to your project too, with instructions for you (the developer) to set up your environment.

Just in case, files ending in .md are written in "Markdown" format. That makes them readable as simple text files, but they can also be formatted nicely by some applications, like GitHub. If you want to read the README.md file generated for your project with all the formatting, I suggest you use VS Code; it has a command for "Markdown preview". I also recommend VS Code for code editing, as it has a nice plugin for Python that can help you with code coloring, completion, etc.

Back to your generated README.md file, it is generated from this template. That's where I guess you are seeing the {{cookiecutter.domain_dev}}.

If you open your own README.md, in the sections where there's a {{cookiecutter.domain_dev}} you should see your own domain, in this case: http://dev.test.com.


I understand you are using Linux, right? Here are the specific steps for you.

You are going to edit a file that tells your computer some custom domains that should point to custom IP addresses.

sudo cp /etc/hosts /etc/hosts-backup

Note: when you use sudo, it asks you for your password to perform the command with administrative privileges. It won't show you the password as you type, not even as *****, but it is receiving it. After you (blindly) type it, hit Enter.

sudo nano /etc/hosts

That will open the command-line program nano, which lets you write to that file; when you're done, hit Ctrl + x and it will ask whether you want to save the file. When you first open it, it probably shows something like:

# GNU nano 2.9.3                                     /etc/hosts

127.0.0.1       localhost

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix

^G Get Help    ^O Write Out   ^W Where Is    ^K Cut Text    ^J Justify     ^C Cur Pos     M-U Undo
^X Exit        ^R Read File   ^\ Replace     ^U Uncut Text  ^T To Spell    ^_ Go To Line  M-E Redo

You need to add a line below the one that looks like 127.0.0.1 localhost. The line you need to add, in your case, given that you set the development domain to dev.test.com, would be:

0.0.0.0    dev.test.com

It would also work with:

127.0.0.1    dev.test.com

So, the contents of your final file, would look like:

127.0.0.1       localhost
0.0.0.0    dev.test.com

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix

Now, when your browser (let's say Chrome) asks for http://dev.test.com, it won't talk to a remote server but to the same machine (your computer). So you can open http://dev.test.com/swagger/ instead of http://localhost/swagger/; there you should see the same web UI (user interface), but this time you shouldn't get the error you are getting now.
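Just as an optional sanity check (not required, only to verify the hosts entry works), you could run this small Python snippet:

import socket

# Should print 0.0.0.0 (or 127.0.0.1, depending on which line you added),
# meaning the name now resolves to your own machine via /etc/hosts.
print(socket.gethostbyname("dev.test.com"))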


Let me know if that works for you and any doubts or questions you might have.

After that, I can help you set up your remote "cloud" system with your complete web application and set up the Docker Swarm mode cluster with Traefik. I just wrote this article about exactly that. But leave that for later, for now, let's get your local environment up and running.

headshothottie commented 5 years ago

Ok, I am back after a long week at work whew and OMG thank you so much for explaining all that. I am about to dive in. Much too kind <3

tiangolo commented 5 years ago

I'm glad to help! :) ...but now, forget most of the previous instructions.

I just updated several parts of the project generation to make it a lot easier to get started. It should be a lot simpler to do now. And you shouldn't have to do all the configurations above.


I'm sorry if you already went through all that. Still, if you did, you learned a lot about Linux and that trick might be useful later :grin:

Now, I suggest you generate a project from scratch to take advantage of these new configurations, including a new very simple frontend (just a working login and dashboard template page).

It should be about 3 steps and you should be up and running :)


mv ./test ./test-backup
cookiecutter https://github.com/tiangolo/full-stack

...it should tell you something like:

You've downloaded /home/user/.cookiecutters/full-stack before. Is it okay to delete and re-download it? [yes]:

...then you have to type yes; that way you get the latest changes. If it doesn't ask, cancel it with Ctrl + c and try again. After that, it will prompt you for your project's data.

cd ./test2
docker-compose build
docker-compose up -d

...it should work without having to fiddle with all the stuff above about the hosts file or anything else.

You can now go to: http://localhost/swagger/ and it should just work.

You can also go to: http://localhost and use your "first superuser" admin account to log in. You will only see an empty dashboard, but it shows you that the login is working.


I hope that helps!

headshothottie commented 5 years ago

You did some amazing work there... thank you! For the most part it's all up and running, except Flower; I can't log into that on localhost:5555.

Now I am a little confused and have a couple of questions:

  1. Do I still need to play around with that hosts file, per your instructions two comments above?
  2. How do I add another container as a service? My goal is to take this tutorial, https://www.toptal.com/apache/apache-spark-streaming-twitter , and hack it. So Spark+PySpark lives in one container and ingests and sorts the data, then it sends that "cleaned" data to another container where I will use FeatureTools for feature engineering, and then outputs with that Swagger API. I am basing this concept off of these articles: A) https://towardsdatascience.com/a-beginners-guide-to-the-data-science-pipeline-a4904b2d8ad3 and B) http://tlfvincent.github.io/2016/09/25/kafka-spark-pipeline-part-1/ <- that diagram makes the most sense to me.
  3. Is a cloud based cluster something I can log into my desktop at home, with my laptop when I am in a cafe, and use the Jupyter Notebook?
  4. What is a traefik_constraint_tag for and what should I name it?
tiangolo commented 5 years ago

Flower is just an administrative visualization tool to monitor Celery jobs. It might be that RabbitMQ has not started yet or something. For Flower you should see a prompt asking for a user and password. Those are not the same as the ones for the app; you can check what you have declared in your .env file, in the variable FLOWER_BASIC_AUTH. Also check the logs with docker-compose logs to see if there's any problem with Flower or Traefik.

  1. Nope. You don't have to play around with the hosts file if you use the new setup and a new generated project.
  2. Whoa! That doesn't seem like something you would plan to do if you just started 30 days ago :joy: ...Anyway, as this is quite an advanced topic, I'll just assume you already know about all this. A good idea might be to set up the machine learning stuff (Spark) in a Celery worker container. That worker container can process the data and save a model somewhere (and you can start several worker containers to parallelize). Then you would save the model as a file in a volume shared between your workers and your backend service / container, or save it to a remote file storage system such as Amazon S3 or Google Cloud Storage. Then, in your backend / API, you can load that model and serve it with Swagger / OpenAPI. There's a rough sketch of this idea right after this list.
  3. If you want a Jupyter Notebook just for quick Data Science stuff and exploration it might be easier to just start a Docker container directly in your remote host. You wouldn't deploy a Jupyter Notebook as a production app, it's more or less, only exploratory (and it's great at that), but not for a stable, user-facing, production application.
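To make that second point a bit more concrete, here's a very rough sketch of the idea in Python. All the names, the broker URL, and the /app/models path are just illustrative assumptions, not the actual code this generator produces:

import pickle

from celery import Celery
from flask import Flask, jsonify

# Illustrative names and broker URL, just to show the architecture.
celery_app = Celery("worker", broker="amqp://guest@queue//")
app = Flask(__name__)
MODEL_PATH = "/app/models/model.pkl"  # a volume shared by the worker and the backend

@celery_app.task
def train_model(training_data):
    # The heavy work happens in the worker; you can run several workers in parallel.
    model = {"mean": sum(training_data) / len(training_data)}  # stand-in for a real model
    with open(MODEL_PATH, "wb") as f:
        pickle.dump(model, f)

@app.route("/api/v1/predict/")
def predict():
    # The backend / API only loads the already-trained model and serves results.
    with open(MODEL_PATH, "rb") as f:
        model = pickle.load(f)
    return jsonify(model)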

Good luck!

headshothottie commented 5 years ago

Ok, so I got Flower going by restarting my computer and spinning everything back up with Docker. As for the hosts file thing, that's cool you took care of that automatically in the build. As for the pipelines, I just read stuff online and see how people are doing it - I coded C++ in high school for like a semester, my major is mathematics, and we use Verilog (making circuits) and MATLAB (physics). I've only been doing Python for about 5 weeks now. But I just look at the tutorials, like the diagram in the link I posted above, because it made sense.

I found this -> https://github.com/gregbaker/spark-celery . I am just confused as to how I can get Spark into the app.

Btw... for all your advice I owe you coffee or a gift card or something. Let me know, because I am very appreciative of your guidance <3

tiangolo commented 5 years ago

You solved Flower! :tada:

Math, C++, Matlab... nice!


I think the problem is still not very clear; it would be good to define where the data is coming from. Is your code pulling it, or is something sending it to you? Is it a stream of data, like listening to Twitter live? Is it sensor data being sent to an API endpoint in your system? Are you pulling static, stored data from some place or system? Can you process the data, save the results, and then just serve the results as an API? The final architecture will depend a lot on all that.

It's also quite possible that Spark and family are overkill for your project; you might be able to solve it with just Celery workers (still, it depends on all of the above).


For example, you could have a Celery worker with the code it needs to process the data and save it to the DB. Then you replicate that Celery worker container like crazy and get a lot of Celery worker containers. Code-wise it's super simple, but you get a massively distributed, parallel data processing system.

And then you create a Flask endpoint that receives the parameters you need to pull the data, for example "since 2017-01-01". Based on that, from the same Flask endpoint, you send a bunch of Celery jobs, a lot of them, each to process one little batch of the data. Creating the jobs is not expensive, so you can do it in the Flask endpoint. They get sent to RabbitMQ and queued for processing, and the workers execute them asynchronously. That way your Flask app stays responsive while the data is still being ingested by the workers.
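As a very rough sketch of that pattern (illustrative names and broker URL, not the generator's actual code), it could look something like this:

from celery import Celery
from flask import Flask, jsonify, request

celery_app = Celery("worker", broker="amqp://guest@queue//")  # assumed broker URL
app = Flask(__name__)

@celery_app.task
def process_batch(year, month):
    # Pull one small batch of data (here, one month), clean it, save it to the DB.
    ...

@app.route("/api/v1/ingest/")
def ingest():
    since_year = int(request.args.get("since", "2017-01-01")[:4])
    # Creating the jobs is cheap: enqueue one per batch and return immediately,
    # while RabbitMQ queues them and the workers process them asynchronously.
    for year in range(since_year, 2020):  # end year hardcoded only for the sketch
        for month in range(1, 13):
            process_batch.delay(year, month)
    return jsonify({"status": "queued"})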

There are still several corner cases but you get the idea. And that could be a simpler architecture that might work for you and actually fit the problem better.

Also, if you are going to handle "Big Data", you might benefit from using a NoSQL DB instead of PostgreSQL. Like Couchbase: https://github.com/tiangolo/full-stack-flask-couchbase


You owe me a coffee! I'm curious (as are @mariacamilagl and @cesarandreslopez now too) about who you actually are in real life :coffee: :joy:

headshothottie commented 5 years ago

Where do I send the coffee gift card? I am me: single mom and personal trainer by day, math super nerd by night :)

Also, I’ll try Couchbase as I need spark - it’s Twitter and SQL.

Is that link to full-stack/couchbase the same type of set up as this one?

tiangolo commented 5 years ago

I am me: single mom and personal trainer by day, math super nerd by night :)

Wow, ok, that's impressive, awesome. Hats off to you then. :tophat: :muscle:

Where do I send the coffee gift card?

Don't worry, I'm kidding.

Also, I’ll try Couchbase as I need spark - it’s Twitter and SQL.

If you want/can, share a bit more of the details of what you want to build, that way I can help/orient you better.

About Couchbase: OK, so Couchbase is not a requirement for Spark, but it might be of use if you are putting in that much data. But if you are already familiar with relational databases and SQL (PostgreSQL), you can just try with your current stack. If it's all the same to you, then try Couchbase.

About Twitter: if you want to read Twitter data to ingest into the DB, you might not need Spark. I imagine you are taking a bunch of tweets related to some topic (or something similar). Would the cleaning and transformation of a single tweet fit in a single Python function? Like in pure Python, without Spark? If that's the case, then you can put that in the Celery worker and do all the massive processing with Celery workers. The benefit of doing that instead of Spark is that the actual code you will need to write is way simpler, and still very powerful, parallel, and concurrent.
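For example (purely illustrative, not code from this project), if "cleaning" a tweet boils down to a plain Python function like this one, a Celery worker can run it at scale without Spark:

import re

def clean_tweet(text: str) -> str:
    # Strip URLs and @mentions, collapse whitespace, and lowercase the text.
    text = re.sub(r"https?://\S+|@\w+", "", text)
    return re.sub(r"\s+", " ", text).strip().lower()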

About Spark: here's the thing. Spark is designed to be integrated into some specific cluster management systems, very tightly integrated with the rest of those clusters, including even memory management, as it is built in Scala, which in turn runs on the Java VM. So, to run Spark in a distributed manner, you need to set up a cluster the way it expects, including additional services like YARN or alternatives. I guess the "simplest" way to set that up would be with "Spark Standalone Mode". But still, that requires a lot of setup and configuration.

About ML (Machine Learning): Are you planning on doing ML with that? If that's the case, you might want to do Deep Learning (neural networks), probably with Keras. You can use general pretrained models and get great results with that, using "embeddings" (like "Word2Vec"). But that doesn't need Spark, if that's why you are interested in Spark.

About Data Science projects vs Web Apps:

Data Science projects tend to require sophisticated techniques, tools, etc. For example, in a Data Science project you might have so much data that it's not feasible to keep it in a database, not even in a distributed Couchbase cluster, and you want to run ad hoc SQL-like queries on that data and let the thing run for some hours or a day and get the results later. That's OK because you need to do huge processing that isn't possible any other way, and you don't care if the whole process crashes, as long as it finishes at some point. And you don't need an API, a frontend, asynchronous jobs, HTTPS, etc. Then that's more of a pure, massive Data Science project, and you probably don't need this full-stack project or its components :disappointed: . In that case, you might just need a remote notebook connected to a Spark cluster, and you might want to check the official Jupyter/Spark Docker image (https://jupyter-docker-stacks.readthedocs.io/en/latest/using/specifics.html#apache-spark) or also Zeppelin.

But let's say that what you need is more of a full-stack web application. You need an API that is very stable and always responds to your users, independently of the processing done behind the scenes (even if it's Machine Learning), and it has to respond immediately (in seconds, not hours). You could expose an API endpoint that runs a pretrained model and gives the predictions back (the model is pretrained and can be loaded live in that API). You need security / login in place, a modern frontend dashboard, and everything served over secure HTTPS. You might have the frontend built by a frontend development team that needs to talk to your backend API and can take advantage of the self-documented / interactive API exploration system that comes integrated for free, or you might need developers of other apps to use that same API documentation and have it clear what they need to provide, what to expect, and which endpoints are available. And you can do the ML you need in a function of the Celery worker (like training a Keras/TensorFlow deep learning model), but you don't need to store stuff in a Hadoop-like system or process massive stored files with Spark. That's more of a full-stack web application, maybe with some Data Science on top too, but you care about having the API, serving users, etc. In that case, this project (full-stack) or its siblings would probably be a very good fit :smile:

The main difference, I think, is that Spark and family care a lot about being massively distributed, computing complex stuff, and being allowed to crash, and they don't expect to handle final users directly. What I'm calling full-stack web apps are mainly concerned with serving final users, probably lots of them, in a very reliable way, and everything else works around that.

Here's how I think you could do something with this project, without Spark:

Now, if you absolutely need both worlds, Spark and family, and full-stack stuff for final users, what you can do is this:

That previous thing would require setting your own Spark cluster, which is quite complex. But for the rest of the components you can use this project.

About the articles you cited before: the Toptal article is a very toy-ish project, doing very dangerous things such as storing stuff in a global variable and expecting Spark to always be available on a TCP port (Spark is awesome, but it crashes a lot). I skimmed the second article but didn't see a lot of technical content, more like general ideas. The diagram in the third article makes sense, but it's a different architecture, although somewhat equivalent to what you could have with this project alone.

About that 3rd diagram: I think the event producers probably wouldn't exist in your problem, because you wouldn't have data being sent to you; you would have to go to Twitter to ask for it and get it (if your actual project is something similar to what I'm imagining). In that case, the closest equivalent would be your Celery workers getting the data.

The Kafka layer is a message system. It's very similar to RabbitMQ (which is what this project uses and what Celery itself uses), but it is designed for more massive clusters and problems. I understand Kafka has the "best performance" of all the message brokers, but it has a couple of difficult disadvantages: it might deliver messages more than once (so your task might be executed more than once, which can be a problem), and it doesn't necessarily respect the order of the messages, which might be a problem too. I also understand it has several components and requires a cluster of multiple nodes. RabbitMQ can give you great performance in a single-node app, probably not as good as Kafka's, but it also gives you that couple of advantages that simplify your code and architecture.

The Spark layer: if what you need is to process each tweet with a function and generate an output from it, the equivalent would be another Celery worker function. You could have one task (described in the paragraph above) that just takes the messages from Twitter and, from inside its own execution (you can do that from a Celery task), sends another Celery task that says "here's the message, process it". Then another Celery worker task function (in the same worker or a different one) takes it and processes that data. This second Celery task function would be the equivalent of Spark in the diagram.

Then this second Celery function, once it finishes processing the data, could write the data to the DB. Let's say you use Couchbase. In the diagram, Couchbase would be the equivalent of the features of HBase, MongoDB and Redis combined.
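As a rough sketch of that chain (illustrative names only; a real version would reuse a cleaning function like the one above and actually write to the DB), it could look like this:

from celery import Celery

celery_app = Celery("worker", broker="amqp://guest@queue//")  # assumed broker URL

@celery_app.task
def ingest_tweets(tweets):
    # First task: take a batch of raw tweets pulled from Twitter and fan each
    # one out as its own job (a Celery task can enqueue other tasks).
    for tweet in tweets:
        process_tweet.delay(tweet)

@celery_app.task
def process_tweet(tweet):
    # Second task (the "Spark equivalent" in the diagram): clean one tweet.
    # This is also where you would write the result to the DB (PostgreSQL or Couchbase).
    cleaned = {"text": tweet.get("text", "").strip().lower()}
    return cleaned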

Then the last layer of the diagram, the "frontends", would be the frontend of this full-stack project. The "Dashboard/Analytics" might also be that same frontend using a visualization tool, or it could be Tableau connected to the API.

Now, there would be a layer, not shown in the diagram, between the "Data and Result Storage" system and the "Data and Result Distribution" (the one with frontends and analytics/dashboards). This layer would be the API / backend. It would read from the data storage directly and expose endpoints secured with credentials; those endpoints would read the queries from the frontend and return the responses in JSON.

Is that link to full-stack/couchbase the same type of set up as this one?

Yes, https://github.com/tiangolo/full-stack-flask-couchbase and this project share a lot of the same code. So, if you already have some code using full-stack, keep it; moving it into the Couchbase project might be mostly copy-paste.

I think this post got too long, and maybe somewhat confusing. Sorry for that. If you want, just share a bit more of what you want to achieve and I can give you just some simple tips.

tiangolo commented 4 years ago

Hi @headshothottie , I'll assume you were able to solve your problem and I'll close this issue now.


If you are still using this project, I suggest you check the equivalent project generator for FastAPI that solves the same use cases in a much better way.

Because of that, this Flask-based project generator is now going to be deprecated. You are still free to use it, but it won't receive any new features, changes, or bug fixes.