Closed BiqiangWang closed 8 months ago
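Hello, and thank you for your interest in and use of Data-Juicer!
We have not yet run complete tests on Windows. Based on the error output you posted, we noticed two important points: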
24/01/12 15:47:02 WARN Shell: Did not find winutils.exe: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
......
java.lang.UnsatisfiedLinkError: 'boolean org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(java.lang.String, int)'
......
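After searching on these two messages, we found that the cause may be a compatibility problem between Spark and Windows. Here are two resources that may help; you can consult and try them first:
If you have further questions, feel free to contact us. We will also try to improve our testing on Windows going forward.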
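The two log lines quoted above suggest that Spark could not find the Hadoop helper files it expects on Windows. As a rough sketch of how one might check that setup before running predict.py, the snippet below verifies that HADOOP_HOME is set and that its bin directory contains the helper files and is on PATH. The function name check_hadoop_home is hypothetical, and the assumption that both winutils.exe and hadoop.dll belong in %HADOOP_HOME%\bin comes from commonly reported fixes for these errors, not from this thread.

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional


def check_hadoop_home(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return a list of problems found with the Hadoop helper setup (empty if none).

    Hypothetical helper for illustration; it is not part of Data-Juicer or PySpark.
    """
    env = os.environ if env is None else env
    hadoop_home = env.get("HADOOP_HOME")
    if not hadoop_home:
        return ["HADOOP_HOME is not set"]

    problems: List[str] = []
    bin_dir = Path(hadoop_home) / "bin"

    # Assumption: both helper files are expected in %HADOOP_HOME%\bin.
    for name in ("winutils.exe", "hadoop.dll"):
        if not (bin_dir / name).is_file():
            problems.append(f"{name} not found in {bin_dir}")

    # The native library is only loadable if its directory is on PATH.
    path_entries = [Path(p) for p in env.get("PATH", "").split(os.pathsep) if p]
    if bin_dir not in path_entries:
        problems.append(f"{bin_dir} is not on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_hadoop_home():
        print(problem)
```

An empty result only means the files are where this sketch expects them; it does not guarantee that their version matches the Hadoop build bundled with the installed Spark.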
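Following those two articles and several others, I downloaded these files separately, set the environment variables, and added the files to the system's dynamic link libraries. After that, the WARN Shell: Did not find winutils.exe error was resolved, at least initially. However, a new problem appeared: Python worker failed to connect back.
The details are as follows: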
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (WBQ executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
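Configuring the PYSPARK_PYTHON environment variable as described in https://stackoverflow.com/questions/70571389/py4jjavaerror-an-error-occurred-while-calling-zorg-apache-spark-api-python-pyt still does not solve the problem, and the other answers I could find seem to focus on code-level changes.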
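For readers who want to try the same configuration from code rather than from the system settings, here is a minimal sketch that points PySpark at the interpreter currently running the script. PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are environment variables that PySpark reads; the function name is hypothetical. As reported in this comment, this setting alone did not resolve the error in this environment, so treat it as one thing to rule out and not as a fix.

```python
import os
import sys


def point_pyspark_at_current_interpreter() -> None:
    """Make Spark launch its Python workers with the interpreter running this script.

    Hypothetical helper for illustration. It must run before any SparkSession or
    SparkContext is created, because the variables are read at startup.
    """
    os.environ["PYSPARK_PYTHON"] = sys.executable
    os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable


point_pyspark_at_current_interpreter()
print(os.environ["PYSPARK_PYTHON"])
```

Setting the variables inside the process avoids depending on whether a new terminal has picked up changes to the system environment.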
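Hello, we are currently running comprehensive tests of data-juicer on Windows and are working on the related compatibility problems. If we have any further news or new findings about this issue, we will share them with you promptly. Please be patient, and thank you for your understanding.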
This issue is marked as stale because there has been no activity for 21 days. Remove stale label or add new comments or this issue will be closed in 3 days.
Close this stale issue.
Before Reporting
[x] I have pulled the latest code of main branch to run again and the bug still existed.
[X] I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend that you ask a question using the Question template)
Search before reporting
OS
win11
Installation Method
from source
Data-Juicer Version
v0.1.3
Python Version
python3.9
Describe the bug
Following the quality classifier documentation, I used predict.py to predict the "quality" score of a document by running
python .\tools\quality_classifier\predict.py .\demos\data\demo-dataset-deduplication.jsonl .\outputs\demo-quality\demo-quality.jsonl
The full output is as follows:
To Reproduce
No code changes; I used the dataset in the demos/data folder.
Configs
No response
Logs
Screenshots
No response
Additional
No response