How can we leverage the GPU for inference formatting?

HSLUCKY commented 5 months ago

Describe the issue as clearly as possible:

When using Outlines with vLLM for JSON- and regex-constrained inference, GPU utilization stays at 0% while CPU utilization peaks at 100%. What causes this, and how can we use the GPU for inference formatting? Furthermore, when running JSON mode with a length limit of 2048 (see the request below), CPU utilization reaches 100% while GPU utilization remains at 0%, and the process appears to hang indefinitely, which suggests a possible deadlock.

We've integrated Outlines into FastChat.

Steps/code to reproduce the bug:

{
    "model": "Qwen1.5-7B-Chat",
    "prompt": "<|im_start|>system\n 基于以下已知的信息, 专业、简要的回答用户的问题,\n            如果无法从提供的内容中获取答案, 请说: '知识库中提供的内容不足以回答此问题' 禁止胡乱编造, 回答的时候最好按照1.2.3.点进行总结。 \n            已知内容: \n            测试环境三台机子分别称为：CDH01,CDH02,CHD03\n5整体安装步骤\n得到安装好centos系统，并且网络正常的集群机器后，安装步骤分为七步：\n\n\n\n第一步 检查生产环境\n1确认硬件信息\n硬件信息一般装系统的同学会提供，且集群的机器也会按照我们的要求进行配置，检验几个核心的系统和硬件参数：\n内存、磁盘空间和目录划分、系统核数\n可用如下命令查看：\n查看内存大小：free -m\n\n查看磁盘空间：df -h\n\n查看CPU核数：cat /proc/cpuinfo | grep 'processor' | sort | wc -l\n\n2确认系统信息\n保证操作系统是要求的centos系列，最低要求7.4，位数建议64位\n使用命令查看Linux版本：\ncat /etc/redhat-release\n\n使用命令查看内核版本：\nuname -r\n\n3确认网络\n保证各机器已经分配内网IP，且各机器之间互通\n利用 ip add命令查看IP，用ping命令验证网络互通\n\n注意: 最好使用静态IP，防止IP发生变更!\n第二步 优化、修改系统配置（每个节点）\n点击metastore manager可以查看表信息\n\n点击query editors可以输入sql语句查询hive表\n\n\n9添加Spark服务\n1、登录CM界面，添加服务――下拉“操作”――添加服务――选择spark\n\n\n选择spark服务，然后配置相应的服务节点\n\nSpark服务要依赖CDH的Zookeeper、HDFS、YARN\n确定后，一路继续下去既可，等待安装完成既可！\n2、修改CDH6.2中Spark的动态资源分配配置（Hive默认用Spark作为引擎时）\nspark.dynamicallocation.enabled=true 这个配置会导致，spark-submit提交的任务中，资源配置失效，因为会配置的是自动进行分配，所以要将配置改为false\n该配置位于Hive的HiveServer2服务中\n\n去掉勾既可！\n\n10添加Kafka服务\n登录CM界面，添加服务――下拉“操作”――添加服务――选择spark\n确定服务角色分配\n\n配置使用默认配置既可，点击继续完成安装。\n问题记录\n1 YARN 问题记录\n1、userid 选项默认1000 需要修改为100\n第零步 简介\n1目的\n该手册旨在标准化公司内部人员部署大数据CDH集群的流程，提高利用CM部署集群的效率和减少出现问题的可能性，帮助开发人员更好，更快地完成部署任务。\n2前提要求\n搭建测试环境时按该手册要求，需要部署环境已经安装好操作系统，建议centos7.4，且网络已经配置完毕（不要求连互联网）。\n获取离线软件安装包，由公司提供，目录结构如下，具体是做什么用的，在下面的内容中会提到：\n\n3典型安装环境\n一般情况下，给到测试环境大数据的是几台安装好操作系统的虚拟机，典型的配置如下：\n操作系统版本：centos7.4  / 64位\n内存：4G（Master），2G（Agent）\n磁盘：50G\n网络环境：专网（局域网），无法连通互联网，因此全部需要离线安装\n系统核数：2核\n4术语列表\nCloudera Manager（CM）：Cloudera公司的Hadoop系统组件的安装管理工具\nCDH：Cloudera's Distribution Hadoop，是Cloudera公司发布的Hadoop版本\n测试环境三台机子分别称为：CDH01,CDH02,CHD03\n5整体安装步骤\nsystemctl disable firewalld\n注意: 关闭防火墙是因为在CDH组件中需要开放的端口是非常多的，为了避免多余的操作，直接将防火墙关掉即可，当然，如果为了安全考虑，你可以将端口一一开放出去。\n1.5禁用selinux\n临时关闭：\nsetenforce 0 \n永久关闭：\nvi /etc/selinux/config  \n将SELINUX=enforcing改成SELINUX=disabled\n\n检查状态：\nsestatus\n注意: 要求SELinux 
status的状态为disabled才是正确的，但如果使用临时关闭的方式去关闭的话则状态是不会发生变更的，只要自己确保已经关闭了即可。\n\n\n1.6 配置本地yum源（Http协议）\n此步骤是为了安装CM各种依赖的插件，因为默认的裸操作系统没有CM需要的大部分插件，但安装CM时又是必须的，在没有联网的情况下，想要通过yum安装CM需要的插件，必须配置本地的yum源。\n1.6.1 离线安装createrepo服务\n注意: CDH01节点参考第三步1，仅CDH01节点安装即可 !\n1.6.2 离线安装Httpd服务\n2 CDH6.0安装\n  \n\n\n\n继续下一步：\n\n点击更多选项，添加cdh6的安装包地址：\n注意: 这边设置的地址是我们在第三步2中已经配置好的Http服务地址。\n\n保存更改后会出现 CDH-6.0.1-1...\n\n点击继续，选择安装JDK\n\n点击继续：\n\n输入3台主机的root帐号密码，点击继续：\n\n等待Agent安装完成，CMS会给未安装CMA的节点装上CMA和Java JDK\n\n等待CMS安装CDH6.0，安装完成后点击继续，跳转到主机检查界面：\n\n检查告警无影响，点击继续：\n下面开始进行CDH6.0.0相关服务组件的安装：\n\n选择自定义服务，然后往下按顺序进行服务的添加：\n第六步 安装CDH组件\n1添加Zookeeper服务\n1、登录CM界面，添加服务――下拉“操作”――添加服务――选择zookerper\n分配主机和角色：\n\n这里选择dn1.gdjgj，dn2.gdjgj和dn3.gdjgj三台主机作zookeeper服务器，点击继续：\n\n采用默认配置即可，点击继续，完成Zookeeper的安装既可！\n2、Zookeeper服务验证\n            问题:\n            测试环境三台机子分别称为,请使用和用户相同的语言进行回答.\n<|im_end|>\n<|im_start|>user\n测试环境三台机子分别称为<|im_end|>\n<|im_start|>assistant\n",
    "max_tokens": 4096,
    "temperature": 0.5,
    "is_format": true,
    "json_schema": "{\"type\": \"string\", \"maxLength\": 2048}"
}

Expected result:

15165 root      20   0 55.633g 0.014t 4.518g R  99.7 13.2   5:58.13 python3      
 2622 root      20   0 2761264  51092  14620 S   0.7  0.0 105:40.07 expressvpnd

Error message:

No response

Outlines/Python version information:

Version information

``` (command output here) ```

Context for the issue:

No response

HSLUCKY commented 5 months ago

I've identified the root cause: Outlines creates a massive cache object during FSM initialization. Even with the process using 77G of virtual memory (VIRT) on our server, initialization did not complete; memory was eventually exhausted and the process was terminated.

HSLUCKY commented 5 months ago

After removing the "maxLength": 2048 parameter from the schema {"type": "string", "maxLength": 2048}, the initial load is still slow, but subsequent executions locate the cache and return results immediately.
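As a rough illustration of this workaround, the sketch below strips every "maxLength" keyword from a schema before it is sent in the `json_schema` field. The helper name `strip_max_length` is hypothetical and not part of Outlines or FastChat, and the approach is deliberately naive: it would also drop a property that happens to be named "maxLength". Once the keyword is removed, the output length is no longer constrained by the schema, so a cap such as `max_tokens` is the only remaining limit.

```python
import json
from typing import Any


def strip_max_length(schema: Any) -> Any:
    """Return a copy of a JSON Schema with every "maxLength" key removed.

    Hypothetical helper: it walks dicts and lists recursively and does not
    distinguish the keyword from a property that is literally named "maxLength".
    """
    if isinstance(schema, dict):
        return {
            key: strip_max_length(value)
            for key, value in schema.items()
            if key != "maxLength"
        }
    if isinstance(schema, list):
        return [strip_max_length(item) for item in schema]
    return schema


# The schema from the request in this issue.
original = '{"type": "string", "maxLength": 2048}'
relaxed = json.dumps(strip_max_length(json.loads(original)))
print(relaxed)  # {"type": "string"}
```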

HSLUCKY commented 5 months ago

def __init__(self, regex_string: str, tokenizer):
    @cache()
    def create_states_mapping(
        regex_string: str, cacheable_vocabulary: Tuple[Tuple[str, int], ...]
    ) -> Tuple[dict, set]:
        """Create the variables related to the mapping between states and tokens
        The parameters of the function are used for caching purpose
        """
        regex_pattern = interegular.parse_pattern(regex_string)
        regex_fsm, _ = make_deterministic_fsm(regex_pattern.to_fsm().reduce())
        states_to_token_maps, empty_token_ids = create_fsm_index_tokenizer(
            regex_fsm, tokenizer
        )

        # We make sure that it is possible to generate strings in the language
        # of the regular expression with the tokens present in the model's
        # vocabulary.
        if not any(
            regex_fsm.finals.intersection(v.values())
            for v in states_to_token_maps.values()
        ):
            raise ValueError(
                "The vocabulary does not allow us to build a sequence that matches the input regex"
            )

        return states_to_token_maps, empty_token_ids

    self.states_to_token_maps, self.empty_token_ids = create_states_mapping(
        regex_string, tuple(sorted(tokenizer.vocabulary.items()))
    )
    self.vocabulary = list(tokenizer.vocabulary.values())
    self.eos_token_id = tokenizer.eos_token_id

The problematic code resides within the following lines in fsm.py:

    regex_fsm, _ = make_deterministic_fsm(regex_pattern.to_fsm().reduce())
    states_to_token_maps, empty_token_ids = create_fsm_index_tokenizer(
        regex_fsm, tokenizer
    )
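For readers wondering why `create_states_mapping` receives `tuple(sorted(tokenizer.vocabulary.items()))` instead of the tokenizer itself: a memoizing decorator needs hashable, order-independent arguments to use as a key. The sketch below shows the idea with `functools.lru_cache` as an in-memory stand-in; it is not Outlines' `cache` decorator, and the function body is a placeholder for the expensive index construction.

```python
from functools import lru_cache
from typing import Dict, Tuple

build_count = 0  # counts how often the expensive step actually runs


@lru_cache(maxsize=None)
def create_states_mapping(
    regex_string: str, cacheable_vocabulary: Tuple[Tuple[str, int], ...]
) -> int:
    """Stand-in for the expensive index construction (placeholder result)."""
    global build_count
    build_count += 1
    return len(regex_string) * len(cacheable_vocabulary)


def cache_key(vocabulary: Dict[str, int]) -> Tuple[Tuple[str, int], ...]:
    """Turn a vocabulary dict into a hashable key that ignores insertion order."""
    return tuple(sorted(vocabulary.items()))


vocab_a = {"a": 0, "b": 1}
vocab_b = {"b": 1, "a": 0}  # same content, different insertion order

create_states_mapping("[ab]+", cache_key(vocab_a))      # miss: builds
create_states_mapping("[ab]+", cache_key(vocab_b))      # hit: same key
create_states_mapping("[ab]{0,4}", cache_key(vocab_a))  # miss: new regex
print(build_count)  # 2
```

A cache of this kind only helps on repeated calls with the same regex and vocabulary; it does nothing to shrink the first, expensive build.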

HSLUCKY commented 5 months ago

Can you leverage the GPU to generate the cache for improved speed and address the bug causing the generation of excessively large cache indices?

rlouf commented 5 months ago

Yes, maxLength generates a DFA with many repeated states, which leads to large memory usage. We are considering ways to dramatically reduce the memory usage, but they require low-level work (replacing the DFA with something else), which takes some time.
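To make the growth concrete, here is a deliberately simplified sketch and not the Outlines implementation. It models "any character, at most n times" as a chain of n + 1 states and builds a state-to-token index over a toy vocabulary, in the spirit of what `create_fsm_index_tokenizer` is described as producing. All names and the vocabulary are assumptions for illustration.

```python
from typing import Dict, List


def build_bounded_dfa(max_length: int) -> Dict[int, int]:
    """Transitions for "any character, at most max_length times".

    State i means "i characters consumed"; every character moves i to i + 1.
    The last state (max_length) has no outgoing transition.
    """
    return {state: state + 1 for state in range(max_length)}


def build_token_index(
    transitions: Dict[int, int], num_states: int, vocabulary: List[str]
) -> Dict[int, Dict[int, int]]:
    """For each state, map every token that can be consumed to its end state."""
    index: Dict[int, Dict[int, int]] = {}
    for start in range(num_states):
        allowed: Dict[int, int] = {}
        for token_id, token in enumerate(vocabulary):
            state = start
            for _ in token:  # walk the DFA one character at a time
                state = transitions.get(state)
                if state is None:  # token would exceed the length bound
                    break
            if state is not None:
                allowed[token_id] = state
        index[start] = allowed
    return index


def index_entries(max_length: int, vocabulary: List[str]) -> int:
    """Total number of (state, token) entries stored in the index."""
    transitions = build_bounded_dfa(max_length)
    index = build_token_index(transitions, max_length + 1, vocabulary)
    return sum(len(allowed) for allowed in index.values())


toy_vocabulary = ["a", "b", "ab", "abc", "abcd"]
for n in (8, 64, 512):
    print(n, index_entries(n, toy_vocabulary))
# 8 34
# 64 314
# 512 2554
```

In this toy model the number of entries grows roughly as the length bound times the vocabulary size, so a bound of 2048 combined with a real tokenizer vocabulary yields a very large index, which is consistent with the memory usage reported above.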

Gintasz commented 3 months ago

I've been using the SGLang library, which depends on Outlines for regex-constrained output. I've observed that when a regex is used, throughput is 37x lower than without it, and GPU utilization stays below 5% during most of the work.

I wanted to ask: does Outlines perform better if a context-free grammar is used instead of a regex?

Link to my issue on SGLang: https://github.com/sgl-project/sglang/issues/450
