Closed BeachWang closed 1 month ago
Perhaps it is better to have a pre-check script (run before the process starts) that installs the required packages by analyzing the .yaml configuration.
Compared to installing dependencies for specific OPs, this approach has the advantage of accommodating different algorithms for an OP that may require different PyPI packages based on the configuration. A typical example is spaCy, where we might install different packages (actually models) depending on the language (en, zh, fr, ...) and type (sm, md, lg, trf, or even with CUDA support).
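A minimal sketch of such a pre-check is shown here, under several assumptions: the configuration has already been parsed into a dict (for example with a YAML loader) and lists its OPs under a `process` key, and the `OP_REQUIREMENTS` table, the `hypothetical_spacy_op` name, and its `lang` and `size` parameters are all invented for illustration. It only reports which modules are missing; the actual install step is left out.

```python
import importlib.util


def _spacy_model(params):
    """Pick a spaCy model package from the OP's parameters (illustrative only)."""
    lang = params.get("lang", "en")
    size = params.get("size", "sm")
    # Assumption: only languages whose models follow the "<lang>_core_web_<size>"
    # naming (such as en and zh) are handled in this sketch.
    return [f"{lang}_core_web_{size}"]


# Hypothetical table: OP name -> function returning the modules that OP needs.
OP_REQUIREMENTS = {
    "text_length_filter": lambda params: [],
    "hypothetical_spacy_op": _spacy_model,
}


def required_modules(config):
    """Collect the modules needed by every OP listed in the parsed config."""
    modules = []
    for entry in config.get("process", []):
        for op_name, params in entry.items():
            resolver = OP_REQUIREMENTS.get(op_name)
            if resolver is None:
                continue  # unknown OP: nothing to check in this sketch
            for module in resolver(params or {}):
                if module not in modules:
                    modules.append(module)
    return modules


def missing_modules(modules):
    """Return the top-level modules that cannot be found in this environment."""
    return [m for m in modules if importlib.util.find_spec(m) is None]


config = {
    "process": [
        {"text_length_filter": {"min_len": 10}},
        {"hypothetical_spacy_op": {"lang": "zh", "size": "md"}},
    ]
}
needed = required_modules(config)
```

The point of resolving requirements from the parameters rather than from the OP name alone is that the same OP can map to different packages depending on its configuration.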
Update: we are working on offering Data-Juicer as a service. TODO items include providing a one-command script for installing and deploying the Data-Juicer service, defining easy-to-use interfaces for Data-Juicer's existing capabilities, and documenting integration practices that incorporate the DJ service into other projects such as AgentScope and cloud-native applications.
@yxdyc @HYLcool
In branch service/fastapi, I implemented a prototype for the DJ service (still needs further improvement but is available for initial testing).
To keep the API invocation simple, stateless operator calls have been implemented, where each function call dynamically instantiates an object. For example, an operator can be invoked like this:
# Other methods such as compute_stats and process are also callable
curl -X POST "http://localhost:8000/data_juicer/ops/filter/TextLengthFilter/run?dataset=xxx" \
  -H "Content-Type: application/json" \
  -d '{"min_len": 10, "max_len": "100"}'
In this approach, parameters for the called method are passed as query parameters in the URL, while parameters for the __init__ method are sent as the JSON payload. The TextLengthFilter OP is automatically instantiated before the run method executes, so a single API call suffices and no separate object instantiation step is needed. This works because most of the operators (though not fully verified) are designed to function as stateless method invocations. For the relevant Filter operators, the compute_stats method typically writes results to datasets without altering instance attributes, and the process method filters based solely on the defined thresholds without modifying instance properties.
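The dispatch idea can be sketched without any web framework. This is not the code in the service/fastapi branch; `ToyTextLengthFilter`, `REGISTRY`, and `call_stateless` are hypothetical stand-ins that only illustrate "instantiate from the JSON payload, then call the method with the query parameters".

```python
class ToyTextLengthFilter:
    """Stand-in for an OP; it keeps no state beyond its thresholds."""

    def __init__(self, min_len=0, max_len=10**9):
        # Coerce to int so that string values from a JSON payload also work.
        self.min_len = int(min_len)
        self.max_len = int(max_len)

    def process(self, text):
        # Filters purely on the thresholds; no instance attribute is modified.
        return self.min_len <= len(text) <= self.max_len


# Hypothetical registry mapping the name in the URL path to a class.
REGISTRY = {"ToyTextLengthFilter": ToyTextLengthFilter}


def call_stateless(op_name, method_name, init_kwargs, call_kwargs):
    """Create a fresh OP from init_kwargs and invoke one public method on it."""
    if method_name.startswith("_"):
        raise ValueError("private methods are not callable")
    op = REGISTRY[op_name](**init_kwargs)  # JSON payload -> __init__
    method = getattr(op, method_name)
    return method(**call_kwargs)           # query parameters -> method


kept = call_stateless(
    "ToyTextLengthFilter",
    "process",
    {"min_len": 10, "max_len": "100"},
    {"text": "a sentence of reasonable length"},
)
```

Because every call builds a new object, two requests never share state, which is the property the stateless design relies on. An OP that caches a model or mutates its own attributes would need different handling.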
If this approach seems good, we can move forward with further development and refinement; if not, we can look into implementing stateful service calls.
Start the service from the project directory:
uvicorn service:app --reload --log-level debug
In another terminal, access the endpoint:
curl -X POST "http://localhost:8000/data_juicer/ops/filter/TextLengthFilter/use_cuda"
This design demonstrates simplicity and good scalability.
I think we can discuss a bit more about the following aspects before moving forward:
As the title says, Data-Juicer's dependencies are heavy. I have to install the full environment even if I only want to use one OP.
TODO: Support per-OP installation, for example
pip install .[OP_NAME]
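One way this could be wired up is through setuptools extras. The sketch below builds the dict that would be passed as `extras_require` in `setup.py`; the OP names and their package lists are hypothetical placeholders, not Data-Juicer's actual dependency map.

```python
def build_extras(op_requirements):
    """Turn an OP -> packages mapping into an extras_require dict plus 'all'."""
    extras = {op: sorted(set(pkgs)) for op, pkgs in op_requirements.items()}
    # 'all' is the union of every OP's packages, for a full install.
    extras["all"] = sorted({p for pkgs in op_requirements.values() for p in pkgs})
    return extras


# Hypothetical per-OP requirements, for illustration only.
OP_REQUIREMENTS = {
    "hypothetical_text_op": [],
    "hypothetical_nlp_op": ["spacy"],
    "hypothetical_lang_id_op": ["fasttext", "spacy"],
}

EXTRAS = build_extras(OP_REQUIREMENTS)

# In setup.py this would be passed as: setup(..., extras_require=EXTRAS)
```

With such a mapping, `pip install ".[hypothetical_nlp_op]"` would pull in only that OP's packages. Quoting the argument avoids glob expansion of the brackets in shells such as zsh.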