Heavy dependency of Data-Juicer

BeachWang commented 2 months ago

As the title says, Data-Juicer's dependencies are heavy. I have to install the entire environment even if I only want to use one OP.

TODO: Support per-OP installation, for example pip install .[OP_NAME]
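One way to get there is through setuptools extras. The following is a minimal sketch, not the project's actual setup.py: the OP names and package lists in OP_REQUIREMENTS are hypothetical placeholders, and build_extras is a helper invented for illustration.

```python
# Hypothetical per-OP requirement table; the real lists would come from each OP.
OP_REQUIREMENTS = {
    "text_length_filter": [],
    "hypothetical_image_op": ["pillow"],
    "hypothetical_nlp_op": ["spacy"],
}


def build_extras(op_requirements):
    """Turn a per-OP requirement table into a setuptools `extras_require` dict."""
    extras = {op: sorted(set(pkgs)) for op, pkgs in op_requirements.items()}
    # An aggregate extra keeps an "install everything" option available.
    extras["all"] = sorted(
        {pkg for pkgs in op_requirements.values() for pkg in pkgs}
    )
    return extras


extras_require = build_extras(OP_REQUIREMENTS)
# In setup.py this dict would be passed as: setup(..., extras_require=extras_require)
```

With something like this in place, pip install .[hypothetical_nlp_op] would pull in only the packages listed for that OP.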

drcege commented 2 months ago

Perhaps it would be better to have a pre-check script, run before processing, that analyzes the .yaml configuration and installs the required packages.

Compared to installing dependencies for specific OPs, this approach has the advantage of accommodating different algorithms for an OP that may require different PyPI packages based on the configuration. A typical example is spaCy, where we might install different packages (actually models) depending on the language (en, zh, fr, ...) and type (sm, md, lg, trf, or even with cuda support).
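A rough sketch of such a pre-check, assuming the configuration has already been loaded into a dict (for example with yaml.safe_load) and lists its operators under a process key as single-entry mappings. The operator name hypothetical_ner_filter, its lang and size arguments, and the RESOLVERS table are assumptions made for illustration. The sketch only reports what is missing; the installation step is left out.

```python
import importlib.util

# Real spaCy pipeline package names, keyed by (language, size).
SPACY_MODELS = {
    ("en", "sm"): "en_core_web_sm",
    ("zh", "sm"): "zh_core_web_sm",
    ("fr", "sm"): "fr_core_news_sm",
}

# Hypothetical table: OP name -> function mapping the OP's arguments
# to the importable modules it needs.
RESOLVERS = {
    "text_length_filter": lambda args: [],
    "hypothetical_ner_filter": lambda args: [
        "spacy",
        SPACY_MODELS[(args.get("lang", "en"), args.get("size", "sm"))],
    ],
}


def required_modules(config):
    """Collect the modules needed by every OP listed under `process`."""
    needed = set()
    for entry in config.get("process", []):
        for op_name, args in entry.items():
            resolver = RESOLVERS.get(op_name)
            if resolver is not None:
                needed.update(resolver(args or {}))
    return sorted(needed)


def missing_modules(modules):
    """Return the top-level modules that cannot be found in this environment."""
    return [m for m in modules if importlib.util.find_spec(m) is None]
```

Because the resolver sees the OP's arguments, the same OP can require different packages depending on the configured language or model size.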

yxdyc commented 2 months ago

Update: we are working on offering Data-Juicer as a service. TODO items include providing a one-command script for installing and deploying the Data-Juicer service, defining easy-to-use interfaces for Data-Juicer's existing capabilities, and some integration practices that incorporate the DJ service into other projects such as AgentScope and cloud-native applications.

drcege commented 2 months ago

@yxdyc @HYLcool

In branch service/fastapi, I implemented a prototype for the DJ service (still needs further improvement but is available for initial testing).

To keep the API invocation simple, stateless operator calls have been implemented, where each function call dynamically instantiates an object. For example, an operator can be invoked like this:

# Other methods like compute_stats and process are also callable
curl -X POST "http://localhost:8000/data_juicer/ops/filter/TextLengthFilter/run?dataset=xxx" \
   -H "Content-Type: application/json" \
   -d '{"min_len": 10, "max_len": 100}'

In this approach, parameters for the called method are passed as query parameters in the URL, while parameters for the __init__ method are sent as the JSON payload. The TextLengthFilter OP will be automatically instantiated before executing the run method. This design allows for a single API call without the need for a separate object instantiation step. This works because most of the operators (though not fully verified) are designed to function as stateless method invocations. For the relevant Filter operators, the compute_stats method typically writes results to datasets without altering instance attributes, and the process method filters based solely on defined thresholds without modifying instance properties.
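To make the dispatch concrete, here is a minimal Python sketch of the idea rather than the code in the service/fastapi branch. ToyTextLengthFilter is a stand-in written for this example, not the real operator, and call_stateless is a hypothetical helper name.

```python
class ToyTextLengthFilter:
    """Stand-in for an operator; not the real Data-Juicer class."""

    def __init__(self, min_len=0, max_len=10**6):
        self.min_len = min_len
        self.max_len = max_len

    def process(self, text):
        # Decides from the thresholds only; no instance attribute is modified.
        return self.min_len <= len(text) <= self.max_len


def call_stateless(op_cls, method_name, init_kwargs, call_kwargs):
    """Build a fresh operator from the JSON payload, then call one method
    with the query parameters."""
    op = op_cls(**init_kwargs)  # new object for every request
    method = getattr(op, method_name)
    return method(**call_kwargs)


# init_kwargs plays the role of the JSON body, call_kwargs of the URL query.
kept = call_stateless(
    ToyTextLengthFilter,
    "process",
    {"min_len": 10, "max_len": 100},
    {"text": "a" * 20},
)
```

Since the object is discarded after the call, two requests cannot observe each other's state, which is what makes the single-call design safe for operators that behave this way.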

If this approach seems good, we can move forward with further development and refinement; if not, we can look into implementing stateful service calls.

Simple testing

Start the service from the project directory:

uvicorn service:app --reload --log-level debug

In another terminal, access the endpoint:

curl -X POST "http://localhost:8000/data_juicer/ops/filter/TextLengthFilter/use_cuda"

yxdyc commented 2 months ago

This design demonstrates simplicity and good scalability.

Stateless operators and dynamic instantiation are beneficial for concurrency and prevent side effects between requests.
In addition, passing method-specific parameters as URL query parameters and initialization parameters as a JSON payload is a common RESTful practice, which makes the API intuitive and aligns well with REST principles.

I think we can discuss a bit more about the following aspects before moving forward:

Does this stateless design work with our OP-fusion mechanism? Also, can statefulness provide significant benefits (such as reduced latency or better resource utilization)? If so, it may be worth revisiting the design to include stateful services where appropriate.
With this feature, we need to put more effort into several new things:
- error handling. Especially how we make the API requests and responses highly usable and informative.
- robust documentation. We need to include detailed descriptions of endpoints, required parameters, and expected responses in the API documentation.
- security considerations. Since we are using JSON payloads for sensitive information like initialization parameters, we may need to implement security measures such as HTTPS and validating the source of incoming requests to prevent unauthorized access.
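On the error-handling point, one possible direction is to check the JSON payload against the operator's __init__ signature before instantiating it, so the response can name the offending parameter. This is only a sketch under assumptions: ToyFilter, validate_init_params, and the 422/200 response shape are illustrative choices, not part of the prototype.

```python
import inspect


class ToyFilter:
    """Stand-in operator with one required and one optional parameter."""

    def __init__(self, min_len, max_len=100):
        self.min_len = min_len
        self.max_len = max_len


def validate_init_params(op_cls, payload):
    """Return human-readable problems with a JSON payload for op_cls.__init__."""
    params = inspect.signature(op_cls.__init__).parameters
    named_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    accepted = {
        name for name, p in params.items()
        if name != "self" and p.kind in named_kinds
    }
    accepts_extra = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    errors = []
    for name in payload:
        if name not in accepted and not accepts_extra:
            errors.append(f"unknown parameter: {name}")
    for name, p in params.items():
        required = p.default is inspect.Parameter.empty
        if name in accepted and required and name not in payload:
            errors.append(f"missing required parameter: {name}")
    return errors


def check_request(op_cls, payload):
    """Map validation results to a (status_code, body) pair."""
    errors = validate_init_params(op_cls, payload)
    if errors:
        return 422, {"detail": errors}
    return 200, {"detail": "ok"}
```

Returning every problem at once, rather than failing on the first one, keeps the response informative for a caller who cannot see the server-side traceback.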

modelscope / data-juicer

Heavy dependency of Data-Juicer #398
