modelscope / data-juicer

A one-stop data processing system to make data higher-quality, juicier, and more digestible for (multimodal) LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
Apache License 2.0
2.93k stars 175 forks

Heavy dependency of Data-Juicer #398

Closed: BeachWang closed this issue 1 month ago

BeachWang commented 2 months ago

As the title says, Data-Juicer's dependencies are heavy. I have to install the entire environment even if I only want to use a single OP.

TODO: Provide a separate installation target for each OP, for example pip install .[OP_NAME]
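
One way the proposed pip install .[OP_NAME] could be wired up is through setuptools extras. The following is only a minimal sketch: the `OP_REQUIREMENTS` table, the `example_*` OP names, and the package lists are illustrative assumptions, not Data-Juicer's actual dependency table.

```python
# Hypothetical per-OP requirement table. The OP names and package lists are
# illustrative assumptions, not Data-Juicer's real dependencies.
OP_REQUIREMENTS = {
    "text_length_filter": [],
    "example_nlp_op": ["spacy"],
    "example_vision_op": ["torch", "transformers"],
}


def build_extras(op_requirements):
    """Turn an OP -> packages mapping into a setuptools `extras_require` dict.

    An extra named "all" holds the de-duplicated union of every OP's
    packages, so a full install stays a single command.
    """
    extras = {op: sorted(set(pkgs)) for op, pkgs in op_requirements.items()}
    extras["all"] = sorted(
        {pkg for pkgs in op_requirements.values() for pkg in pkgs}
    )
    return extras


EXTRAS = build_extras(OP_REQUIREMENTS)
# In setup.py this dict would be passed as: setup(..., extras_require=EXTRAS)
```

With such a table, `pip install .[example_nlp_op]` would pull in only that OP's packages, while `pip install .[all]` would keep today's behaviour.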

drcege commented 2 months ago

Perhaps it is better to have a pre-check script (before running process) that installs the required packages by analyzing the .yaml configuration.

Compared to installing dependencies per OP, this approach can accommodate an OP whose different algorithms require different PyPI packages depending on the configuration. A typical example is spaCy, where we might install different packages (actually models) depending on the language (en, zh, fr, ...) and the model type (sm, md, lg, trf, or even a variant with CUDA support).
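
The pre-check idea could look roughly like the sketch below. It assumes the .yaml file has already been parsed into a dict (for example with `yaml.safe_load`) whose `process` key is a list of single-key `{op_name: args}` entries. `REQUIREMENT_RULES`, `example_spacy_op`, and the language-to-corpus rule are hypothetical and simplified; this is not an existing Data-Juicer script.

```python
import importlib.util


def spacy_modules(op_args):
    """Derive the importable module names a spaCy-based OP would need.

    Simplified assumption: English and Chinese models use the "web" corpus
    and every other language uses "news".
    """
    lang = op_args.get("lang", "en")
    size = op_args.get("size", "sm")
    corpus = "web" if lang in ("en", "zh") else "news"
    return ["spacy", f"{lang}_core_{corpus}_{size}"]


# Hypothetical registry: OP name -> function(op_args) -> required modules.
REQUIREMENT_RULES = {
    "text_length_filter": lambda op_args: [],
    "example_spacy_op": spacy_modules,
}


def required_modules(config):
    """Collect, in order and without duplicates, the modules a config needs."""
    modules = []
    for entry in config.get("process", []):
        for op_name, op_args in entry.items():
            rule = REQUIREMENT_RULES.get(op_name)
            if rule is None:
                continue  # unknown OP: nothing to check in this sketch
            for name in rule(op_args or {}):
                if name not in modules:
                    modules.append(name)
    return modules


def missing_modules(modules):
    """Return the modules that cannot be found in the current environment."""
    return [m for m in modules if importlib.util.find_spec(m) is None]
```

A pre-check script would call `missing_modules(required_modules(config))` before processing starts and then install, or at least report, whatever is missing.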

yxdyc commented 2 months ago

Update: we are working on offering Data-Juicer as a service. The TODO items include providing a one-command script for installing and deploying the Data-Juicer service, defining easy-to-use interfaces for Data-Juicer's existing capabilities, and some integration practices that incorporate the DJ service into other projects such as AgentScope and cloud-native applications.

drcege commented 2 months ago

@yxdyc @HYLcool

In branch service/fastapi, I implemented a prototype for the DJ service (still needs further improvement but is available for initial testing).

To keep the API invocation simple, stateless operator calls have been implemented, where each function call dynamically instantiates an object. For example, an operator can be invoked like this:

# Other methods, such as compute_stats and process, are also callable.
curl -X POST "http://localhost:8000/data_juicer/ops/filter/TextLengthFilter/run?dataset=xxx" \
   -H "Content-Type: application/json" \
   -d '{"min_len": 10, "max_len": 100}'

In this approach, parameters for the called method are passed as query parameters in the URL, while parameters for the __init__ method are sent as the JSON payload. The TextLengthFilter OP will be automatically instantiated before executing the run method. This design allows for a single API call without the need for a separate object instantiation step. This works because most of the operators (though not fully verified) are designed to function as stateless method invocations. For the relevant Filter operators, the compute_stats method typically writes results to datasets without altering instance attributes, and the process method filters based solely on defined thresholds without modifying instance properties.
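
The stateless call pattern described above can be sketched in plain Python, without any web framework. Everything here is an assumption made for illustration: the toy `TextLengthFilter` only stands in for the real OP, and `OPS` and `call_stateless` are hypothetical names rather than code from the service/fastapi branch.

```python
class TextLengthFilter:
    """Toy stand-in for the real OP, only here to make the sketch runnable."""

    def __init__(self, min_len=0, max_len=10**6):
        self.min_len = min_len
        self.max_len = max_len

    def process(self, text):
        # Decides from the thresholds alone; no instance attribute changes.
        return self.min_len <= len(text) <= self.max_len


# Hypothetical lookup table from the OP name in the URL path to its class.
OPS = {"TextLengthFilter": TextLengthFilter}


def call_stateless(op_name, method_name, init_kwargs, call_kwargs):
    """Instantiate an OP and invoke one method within a single call.

    init_kwargs plays the role of the JSON payload (arguments to __init__);
    call_kwargs plays the role of the URL query parameters (arguments to
    the called method).
    """
    op = OPS[op_name](**init_kwargs)   # fresh instance for every request
    method = getattr(op, method_name)  # e.g. "process"
    return method(**call_kwargs)
```

Because each call builds a fresh instance, nothing is carried over between requests, which is only safe as long as the invoked methods do not rely on state left behind by an earlier call.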

If this approach seems good, we can move forward with further development and refinement; if not, we can look into implementing stateful service calls.


Simple testing

  1. Start the service from the project directory:

    uvicorn service:app --reload --log-level debug

  2. In another terminal, access the endpoint:

    curl -X POST "http://localhost:8000/data_juicer/ops/filter/TextLengthFilter/use_cuda"

yxdyc commented 2 months ago

This design demonstrates simplicity and good scalability.

I think we can discuss a bit more about the following aspects before moving forward: