Scalable data ingestion architecture with microservices

teocns commented 4 months ago

Building the Docker image, I noticed that it weighed in at 7GB and that the build took 5 minutes to complete.

REPOSITORY	TAG	IMAGE ID	CREATED	SIZE
super-rag-api	latest	1c9aa42e2450	2 days ago	7.09GB

It turns out PyTorch is responsible: torch[gpu] pulls in the massive NVIDIA CUDA libraries it is packaged with, while its sibling torch[cpu] bottoms out at 56MB.

pip list sorted by size

```
$ pip list | tail -n +3 | awk '{print $1}' | xargs pip show | grep -E 'Location:|Name:' | cut -d ' ' -f 2 | paste -d ' ' - - | awk '{print $2 "/" tolower($1)}' | xargs du -sh 2> /dev/null | sort -hr
1.5G	/app/.venv/lib/python3.11/site-packages/torch
419M	/app/.venv/lib/python3.11/site-packages/triton
72M	/app/.venv/lib/python3.11/site-packages/cmake
62M	/app/.venv/lib/python3.11/site-packages/onnx
46M	/app/.venv/lib/python3.11/site-packages/pandas
40M	/app/.venv/lib/python3.11/site-packages/transformers
33M	/app/.venv/lib/python3.11/site-packages/matplotlib
27M	/app/.venv/lib/python3.11/site-packages/sympy
23M	/app/.venv/lib/python3.11/site-packages/layoutparser
21M	/app/.venv/lib/python3.11/site-packages/lxml
17M	/app/.venv/lib/python3.11/site-packages/onnxruntime
15M	/app/.venv/lib/python3.11/site-packages/pip
14M	/app/.venv/lib/python3.11/site-packages/rapidfuzz
14M	/app/.venv/lib/python3.11/site-packages/cryptography
13M	/app/.venv/lib/python3.11/site-packages/sqlalchemy
13M	/app/.venv/lib/python3.11/site-packages/debugpy
12M	/app/.venv/lib/python3.11/site-packages/tokenizers
12M	/app/.venv/lib/python3.11/site-packages/fastavro
11M	/app/.venv/lib/python3.11/site-packages/torchvision
11M	/app/.venv/lib/python3.11/site-packages/jedi
7.6M	/app/.venv/lib/python3.11/site-packages/timm
7.1M	/app/.venv/lib/python3.11/site-packages/networkx
6.5M	/app/.venv/lib/python3.11/site-packages/unstructured
6.5M	/app/.venv/lib/python3.11/site-packages/nltk
6.4M	/app/.venv/lib/python3.11/site-packages/tiktoken
5.5M	/app/.venv/lib/python3.11/site-packages/pydantic_core
5.3M	/app/.venv/lib/python3.11/site-packages/kiwisolver
4.9M	/app/.venv/lib/python3.11/site-packages/safetensors
4.8M	/app/.venv/lib/python3.11/site-packages/pygments
4.8M	/app/.venv/lib/python3.11/site-packages/aiohttp
3.1M	/app/.venv/lib/python3.11/site-packages/emoji
2.9M	/app/.venv/lib/python3.11/site-packages/regex
2.7M	/app/.venv/lib/python3.11/site-packages/tzdata
2.7M	/app/.venv/lib/python3.11/site-packages/pytz
2.7M	/app/.venv/lib/python3.11/site-packages/pikepdf
2.6M	/app/.venv/lib/python3.11/site-packages/setuptools
2.4M	/app/.venv/lib/python3.11/site-packages/langdetect
2.2M	/app/.venv/lib/python3.11/site-packages/greenlet
2.1M	/app/.venv/lib/python3.11/site-packages/mpmath
1.8M	/app/.venv/lib/python3.11/site-packages/tornado
1.7M	/app/.venv/lib/python3.11/site-packages/pydantic
1.6M	/app/.venv/lib/python3.11/site-packages/pycocotools
1.5M	/app/.venv/lib/python3.11/site-packages/openai
1.4M	/app/.venv/lib/python3.11/site-packages/openpyxl
1.3M	/app/.venv/lib/python3.11/site-packages/pypdf
1.3M	/app/.venv/lib/python3.11/site-packages/joblib
1.3M	/app/.venv/lib/python3.11/site-packages/authlib
1.2M	/app/.venv/lib/python3.11/site-packages/chardet
1.1M	/app/.venv/lib/python3.11/site-packages/yarl
1.1M	/app/.venv/lib/python3.11/site-packages/psutil
984K	/app/.venv/lib/python3.11/site-packages/contourpy
916K	/app/.venv/lib/python3.11/site-packages/frozenlist
812K	/app/.venv/lib/python3.11/site-packages/xlsxwriter
760K	/app/.venv/lib/python3.11/site-packages/fastapi
724K	/app/.venv/lib/python3.11/site-packages/fsspec
660K	/app/.venv/lib/python3.11/site-packages/black
640K	/app/.venv/lib/python3.11/site-packages/pycparser
528K	/app/.venv/lib/python3.11/site-packages/jinja2
488K	/app/.venv/lib/python3.11/site-packages/wcwidth
488K	/app/.venv/lib/python3.11/site-packages/effdet
464K	/app/.venv/lib/python3.11/site-packages/urllib3
452K	/app/.venv/lib/python3.11/site-packages/multidict
452K	/app/.venv/lib/python3.11/site-packages/ipykernel
436K	/app/.venv/lib/python3.11/site-packages/jupyter_client
412K	/app/.venv/lib/python3.11/site-packages/pyparsing
412K	/app/.venv/lib/python3.11/site-packages/cffi
412K	/app/.venv/lib/python3.11/site-packages/anyio
396K	/app/.venv/lib/python3.11/site-packages/olefile
388K	/app/.venv/lib/python3.11/site-packages/omegaconf
384K	/app/.venv/lib/python3.11/site-packages/xlrd
376K	/app/.venv/lib/python3.11/site-packages/parso
372K	/app/.venv/lib/python3.11/site-packages/markdown
368K	/app/.venv/lib/python3.11/site-packages/traitlets
364K	/app/.venv/lib/python3.11/site-packages/click
344K	/app/.venv/lib/python3.11/site-packages/httpx
336K	/app/.venv/lib/python3.11/site-packages/pyflakes
328K	/app/.venv/lib/python3.11/site-packages/starlette
328K	/app/.venv/lib/python3.11/site-packages/httpcore
312K	/app/.venv/lib/python3.11/site-packages/humanfriendly
308K	/app/.venv/lib/python3.11/site-packages/certifi
292K	/app/.venv/lib/python3.11/site-packages/uvicorn
288K	/app/.venv/lib/python3.11/site-packages/idna
280K	/app/.venv/lib/python3.11/site-packages/wrapt
252K	/app/.venv/lib/python3.11/site-packages/flake8
252K	/app/.venv/lib/python3.11/site-packages/cassio
248K	/app/.venv/lib/python3.11/site-packages/tqdm
248K	/app/.venv/lib/python3.11/site-packages/pypdfium2
248K	/app/.venv/lib/python3.11/site-packages/h11
244K	/app/.venv/lib/python3.11/site-packages/h2
244K	/app/.venv/lib/python3.11/site-packages/cohere
232K	/app/.venv/lib/python3.11/site-packages/hpack
224K	/app/.venv/lib/python3.11/site-packages/pipdeptree
224K	/app/.venv/lib/python3.11/site-packages/pexpect
220K	/app/.venv/lib/python3.11/site-packages/requests
204K	/app/.venv/lib/python3.11/site-packages/marshmallow
184K	/app/.venv/lib/python3.11/site-packages/pdfplumber
184K	/app/.venv/lib/python3.11/site-packages/packaging
156K	/app/.venv/lib/python3.11/site-packages/astrapy
152K	/app/.venv/lib/python3.11/site-packages/coloredlogs
148K	/app/.venv/lib/python3.11/site-packages/soupsieve
148K	/app/.venv/lib/python3.11/site-packages/iopath
116K	/app/.venv/lib/python3.11/site-packages/vulture
112K	/app/.venv/lib/python3.11/site-packages/flatbuffers
108K	/app/.venv/lib/python3.11/site-packages/validators
108K	/app/.venv/lib/python3.11/site-packages/jupyter_core
108K	/app/.venv/lib/python3.11/site-packages/filetype
104K	/app/.venv/lib/python3.11/site-packages/tabulate
96K	/app/.venv/lib/python3.11/site-packages/pillow_heif
96K	/app/.venv/lib/python3.11/site-packages/pathspec
96K	/app/.venv/lib/python3.11/site-packages/dirtyjson
88K	/app/.venv/lib/python3.11/site-packages/platformdirs
88K	/app/.venv/lib/python3.11/site-packages/markupsafe
88K	/app/.venv/lib/python3.11/site-packages/executing
84K	/app/.venv/lib/python3.11/site-packages/asttokens
76K	/app/.venv/lib/python3.11/site-packages/tenacity
68K	/app/.venv/lib/python3.11/site-packages/toml
68K	/app/.venv/lib/python3.11/site-packages/geomet
64K	/app/.venv/lib/python3.11/site-packages/distro
60K	/app/.venv/lib/python3.11/site-packages/pypandoc
60K	/app/.venv/lib/python3.11/site-packages/portalocker
56K	/app/.venv/lib/python3.11/site-packages/backoff
48K	/app/.venv/lib/python3.11/site-packages/ptyprocess
48K	/app/.venv/lib/python3.11/site-packages/pdf2image
48K	/app/.venv/lib/python3.11/site-packages/hyperframe
44K	/app/.venv/lib/python3.11/site-packages/filelock
32K	/app/.venv/lib/python3.11/site-packages/deprecated
32K	/app/.venv/lib/python3.11/site-packages/attrs
24K	/app/.venv/lib/python3.11/site-packages/zipp
24K	/app/.venv/lib/python3.11/site-packages/termcolor
24K	/app/.venv/lib/python3.11/site-packages/sniffio
24K	/app/.venv/lib/python3.11/site-packages/pytesseract
24K	/app/.venv/lib/python3.11/site-packages/cycler
24K	/app/.venv/lib/python3.11/site-packages/colorlog
20K	/app/.venv/lib/python3.11/site-packages/comm
12K	/app/.venv/lib/python3.11/site-packages/docx2txt
12K	/app/.venv/lib/python3.11/site-packages/aiosignal
8.0K	/app/.venv/lib/python3.11/site-packages/ruff
```
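One caveat with the shell pipeline above: it assumes each package lives in a directory named after its lowercased distribution name, and `2> /dev/null` silently drops every package where that guess is wrong. A rough Python alternative, offered only as a sketch, sums the files each installed distribution declares in its metadata instead. The helper names `installed_sizes` and `human` are made up for this example.

```python
from importlib import metadata
from pathlib import Path


def installed_sizes() -> list[tuple[str, int]]:
    """Return (distribution name, bytes on disk) pairs, largest first."""
    sizes: dict[str, int] = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name") or "unknown"
        total = 0
        # dist.files is None when the distribution ships no file listing.
        for entry in dist.files or []:
            path = Path(str(entry.locate()))
            try:
                if path.is_file():
                    total += path.stat().st_size
            except OSError:
                continue  # unreadable or vanished file: skip it
        # Keep the largest figure if the same name shows up twice.
        sizes[name] = max(total, sizes.get(name, 0))
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)


def human(num_bytes: int) -> str:
    """Format a byte count roughly the way `du -h` does."""
    size = float(num_bytes)
    for unit in ("B", "K", "M", "G"):
        if size < 1024 or unit == "G":
            return f"{size:.1f}{unit}"
        size /= 1024
    return f"{size:.1f}G"  # not reached; keeps type checkers happy


if __name__ == "__main__":
    for name, size in installed_sizes()[:20]:
        print(f"{human(size):>8}  {name}")
```

Run inside the image's virtualenv, this should also surface packages whose install directory does not match their distribution name, which the `du` approach misses.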

Being new to Poetry, I went through several community discussions covering the same kind of issue:

While the solution could have been as simple as adding extras = [ "cpu" ] or setting the /cpu branch as the source URL for torch wheels, that wasn't possible.

PyTorch's wheel index does not implement the PEP-standard protocol that consumers (such as Poetry) look for when enumerating a package's wheels.

The last resort is to bundle a list of .whl URLs, passed as PyTorch's dependency source, built by intersecting:

Architecture	Python Version	Platform
arm64	3.9	darwin
x86	3.10	windows
aarch64	3.11	linux

I stuck to the Python range 3.9 to 3.12, as it was initially defined in pyproject.toml.

The list got quite long, and the build took about three times as long as the original.
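To give a feel for how quickly that list grows, here is a minimal sketch of the intersection, using only the values from the table above. Not every triplet corresponds to a wheel that PyTorch actually publishes, so a real list would have to be filtered by hand. The function names are made up for this example; the `cp39`-style tag follows the usual CPython wheel naming.

```python
from itertools import product

# Values taken from the table above.
ARCHITECTURES = ["arm64", "x86", "aarch64"]
PYTHON_VERSIONS = ["3.9", "3.10", "3.11"]
PLATFORMS = ["darwin", "windows", "linux"]


def python_tag(version: str) -> str:
    """Turn '3.10' into the CPython wheel tag 'cp310'."""
    return "cp" + version.replace(".", "")


def wheel_matrix(archs, pythons, platforms):
    """Every (architecture, python version, platform) combination."""
    return list(product(archs, pythons, platforms))


matrix = wheel_matrix(ARCHITECTURES, PYTHON_VERSIONS, PLATFORMS)
print(len(matrix))  # 27 candidate triplets before filtering out invalid ones
for arch, version, platform in matrix[:3]:
    print(python_tag(version), platform, arch)
```

Each extra Python version or platform multiplies the number of wheel URLs that Poetry has to resolve, which is consistent with the longer build.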

The XY problem?

Super-rag's vision is a highly available and scalable API backed by workers, so we are looking at a microservice-oriented architecture.

Torch is a heavy-lifting, CPU/GPU-bound toolkit that should be decoupled from the IO-bound API. It is one use case, and it is likely that as the project grows, other "strategies" will be implemented, each with its own use cases.

It is essential for workers' images to be minimal. Image size directly impacts launch time (availability): you want a worker's image to be pulled, loaded into memory, and started as quickly as possible.

Therefore, analyzing and understanding the use cases for each dependency helps identify common libraries or services that can be defined as reusable, granular image layers.

For now, my proposed solution is to have individual images (or layers) that ship only what is strictly necessary for a given triplet (platform, architecture, Python version).

Feel free to brainstorm with me on this subject; ideas are always welcome!

homanp commented 4 months ago

> For now, my proposed solution is to have individual images (or layers) that ship only what is strictly necessary for a given triplet (platform, architecture, Python version).

I agree with this. I'm not sure that bundling in the individual libs/packages the way we currently do (specifically for encoders) is the best route forward.

How would one accomplish the layering part?!

teocns commented 4 months ago

@homanp

> How would one accomplish the layering part?!

Good question!

A one-size-fits-all solution is rare; nevertheless, I find inspiration in the ones that drive modern technology (Elasticsearch, Netflix, Kubernetes).

To have a clearer idea I'd need to understand the expected super-rag workflow in detail (maybe a flow chart), but essentially we are looking at two fundamental design patterns:

a) Producer-Consumer

Traditional Queue-Broker/Exchange-Celery pub/sub as you know it

b) Control Plane - Data plane

The pipeline is entirely driven by the topic message payload. It is comparable to LangGraph's "Graph": nodes and edges that implement an upstream/downstream communication system.

What I can tell from my experience is that the traditional design (a) might not deliver the kind of scalability modern technology demands (and super-rag might be such a case), unless you want a DevOps team shooting itself in the foot as queues start spilling over overnight.
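To make (b) a bit more concrete, here is a minimal, in-process sketch of a payload-driven pipeline: each message carries the list of nodes it still has to visit, and a generic worker simply runs the next node and forwards the result downstream. The node names (`parse`, `chunk`, `embed`) and their toy handlers are invented for illustration and are not super-rag's actual workflow; in a real deployment each step would be published to a topic rather than called in a loop.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Message:
    data: str
    route: list[str]  # nodes still to visit, in order
    trace: list[str] = field(default_factory=list)  # nodes already visited


# Hypothetical nodes: toy stand-ins for real processing stages.
NODES: dict[str, Callable[[str], str]] = {
    "parse": lambda text: text.strip(),
    "chunk": lambda text: "|".join(text.split()),
    "embed": lambda text: f"vec({text})",
}


def step(msg: Message) -> Message:
    """Run the next node named in the payload and return the downstream message."""
    node, *remaining = msg.route
    return Message(
        data=NODES[node](msg.data),
        route=remaining,
        trace=msg.trace + [node],
    )


def run(msg: Message) -> Message:
    """Drive a message until its route is exhausted."""
    while msg.route:
        msg = step(msg)
    return msg


result = run(Message(data="  hello world ", route=["parse", "chunk", "embed"]))
print(result.data)   # vec(hello|world)
print(result.trace)  # ['parse', 'chunk', 'embed']
```

The point of the design is that workers stay generic: changing the pipeline means changing the `route` in the payload, not rewiring queues and consumers as in (a).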

You may find detailed context in these articles:

elisalimli commented 4 months ago

> For now, my proposed solution is to have individual images (or layers) that ship only what is strictly necessary for a given triplet (platform, architecture, Python version).

@teocns I don't think we should worry about this, since we prefer horizontal scaling over serverless functions at the moment. So that's not a big deal, I guess.

teocns commented 4 months ago

@elisalimli Are we looking at a monolithic multi-threaded worker application sharing the same process runtime?

homanp commented 4 months ago

I added some optimisations in there now to allow for some concurrency without hitting rate limits.
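The comment does not say how this was implemented. One common way to get concurrency while staying under a provider's rate limit is to cap the number of in-flight calls with a semaphore; the sketch below shows that idea only, and `bounded_gather` is a hypothetical helper, not a function from super-rag.

```python
import asyncio
from typing import Awaitable, Callable, Sequence, TypeVar

T = TypeVar("T")


async def bounded_gather(
    factories: Sequence[Callable[[], Awaitable[T]]],
    limit: int,
) -> list[T]:
    """Await every call, but never run more than `limit` of them at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(factory: Callable[[], Awaitable[T]]) -> T:
        async with semaphore:
            return await factory()

    # gather preserves the input order of the results.
    return list(await asyncio.gather(*(guarded(f) for f in factories)))


async def fake_embedding_call() -> int:
    """Stand-in for a rate-limited network request."""
    await asyncio.sleep(0.01)
    return 1


if __name__ == "__main__":
    calls = [fake_embedding_call] * 10
    print(asyncio.run(bounded_gather(calls, limit=3)))
```

This caps concurrency, not requests per second; a strict per-second quota would need a token bucket or similar on top.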

teocns commented 4 months ago

@homanp perfect solution for not blocking I/O, though let's keep in mind that it operates on one CPU core.

Depending on how we plan to deploy this in the future, we might want to use multiprocessing pools and let the user specify a parallelism factor n, falling back to the host's number of CPUs.
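Roughly what I have in mind, as a sketch only: the names `resolve_workers` and `run_parallel` are placeholders, and the function handed to the pool has to be defined at module level so it can be pickled.

```python
import os
from multiprocessing import Pool
from typing import Callable, Iterable, Optional


def resolve_workers(n: Optional[int] = None) -> int:
    """Use the user's parallelism factor, else fall back to the host CPU count."""
    if n is not None:
        if n < 1:
            raise ValueError("parallelism factor must be >= 1")
        return n
    # os.cpu_count() can return None on some platforms.
    return os.cpu_count() or 1


def run_parallel(func: Callable, items: Iterable, n: Optional[int] = None) -> list:
    """Map a CPU-bound function over items using a pool of worker processes."""
    with Pool(processes=resolve_workers(n)) as pool:
        return pool.map(func, items)
```

Calling `run_parallel(encode, documents, n=4)` from under an `if __name__ == "__main__":` guard would spread the work over four processes, and omitting `n` would use every core the host reports. Here `encode` and `documents` stand for whatever CPU-bound step and inputs we end up with.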

homanp commented 4 months ago

> @homanp perfect solution for not blocking I/O, though let's keep in mind that it operates on one CPU core.
>
> Depending on how we plan to deploy this in the future, we might want to use multiprocessing pools and let the user specify a parallelism factor n, falling back to the host's number of CPUs.

Makes sense

homanp commented 4 months ago

I see two ways forward:

1) Keep using lightweight SDK wrappers around the unstructured library. This would require the user to spin up an instance of unstructured on their own and pass in the config.

2) Decouple unstructured as a microservice, similar to @teocns's idea.

Not sure which approach is best atm.

elisalimli commented 4 months ago

> @elisalimli Are we looking at a monolithic multi-threaded worker application sharing the same process runtime?

For the moment, yes.

homanp commented 4 months ago

I have now decoupled the unstructured package and am only utilising the client SDK. This makes the API much more lightweight, and it also gives the user the option to run unstructured locally.

elisalimli commented 3 months ago

@teocns we have implemented this in #91

superagent-ai / super-rag