gomate-community / GoMate

GoMate: RAG Framework within Reliable input, Trusted output
Features-Retrievers-BM25 #6

Closed: yanqiangmiffy closed this issue 5 months ago

yanqiangmiffy commented 5 months ago

https://github.com/nmslib/hnswlib/issues/442

https://github.com/castorini/pyserini

yanqiangmiffy commented 5 months ago

https://github.com/nmslib/hnswlib/issues/442#issuecomment-1519066101

Installing pyserini on Windows fails: nmslib does not compile under Python 3.11.

      C:\Users\yanqiang\AppData\Local\Temp\pip-install-2kfmp7iy\nmslib_f749ac918bd94c558de5595a8acc1d8b\similarity_search\include\experiments.h(175): note: see reference to function template instantiation 'void similarity::ParallelFor<similarity::Experiments<dist_t>::Execute::<lambda_1c8e25ed05513503c633f4a3d08a6ce1>>(size_t,size_t,size_t,Function)' being compiled
              with
              [
                  dist_t=float,
                  Function=similarity::Experiments<float>::Execute::<lambda_1c8e25ed05513503c633f4a3d08a6ce1>
              ]
      C:\Users\yanqiang\AppData\Local\Temp\pip-install-2kfmp7iy\nmslib_f749ac918bd94c558de5595a8acc1d8b\similarity_search\include\thread_pool.h(90): warning C4267: 'argument': conversion from 'size_t' to 'unsigned int', possible loss of data
      C:\Users\yanqiang\AppData\Local\Temp\pip-install-2kfmp7iy\nmslib_f749ac918bd94c558de5595a8acc1d8b\similarity_search\include\thread_pool.h(90): warning C4267: 'argument': conversion from 'size_t' to 'unsigned int', possible loss of data
      "H:\Program Files (x86)\Microsoft Visual Studio\Installer\VC\Tools\MSVC\14.40.33807\bin\HostX86\x64\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -I.\similarity_search\include -Itensorflow -Ic:\users\yanqiang\appdata\local\temp\pip-install-2kfmp7iy\nmslib_f749ac918bd
94c558de5595a8acc1d8b\.eggs\pybind11-2.6.1-py3.11.egg\pybind11\include -Ic:\users\yanqiang\appdata\local\temp\pip-install-2kfmp7iy\nmslib_f749ac918bd94c558de5595a8acc1d8b\.eggs\pybind11-2.6.1-py3.11.egg\pybind11\include -Ic:\users\yanqiang\appdata\local\temp\pip-install-
2kfmp7iy\nmslib_f749ac918bd94c558de5595a8acc1d8b\.eggs\pybind11-2.6.1-py3.11.egg\pybind11\include -Ic:\users\yanqiang\appdata\local\temp\pip-install-2kfmp7iy\nmslib_f749ac918bd94c558de5595a8acc1d8b\.eggs\pybind11-2.6.1-py3.11.egg\pybind11\include -IF:\ProgramData\anacond
a3\Lib\site-packages\numpy\core\include -IF:\ProgramData\anaconda3\include -IF:\ProgramData\anaconda3\Include "-IH:\Program Files (x86)\Microsoft Visual Studio\Installer\VC\Tools\MSVC\14.40.33807\include" "-IH:\Program Files (x86)\Microsoft Visual Studio\Installer\VC\Too
ls\MSVC\14.40.33807\ATLMFC\include" "-IH:\Program Files (x86)\Microsoft Visual Studio\Installer\VC\Auxiliary\VS\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\um" "-IC:\Progra
m Files (x86)\Windows Kits\10\\include\10.0.22621.0\\shared" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\winrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\cppwinrt" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" /EHsc /Tp.\similarity_search\src\space.cc /Fobuild\temp.win-amd64-cpython-311\Release\.\similarity_search\src\space.obj /EHsc /openmp /O2 /DVERSION_INFO=\\\"2.1.1\\\"
      space.cc
      C:\Users\yanqiang\AppData\Local\Temp\pip-install-2kfmp7iy\nmslib_f749ac918bd94c558de5595a8acc1d8b\similarity_search\include\object.h(127): warning C4267: '=': conversion from 'size_t' to 'int', possible loss of data
      .\similarity_search\src\space.cc(35): warning C4267: 'argument': conversion from 'size_t' to 'similarity::IdType', possible loss of data
      .\similarity_search\src\space.cc(35): note: the template instantiation context (the oldest one first) is
      .\similarity_search\src\space.cc(108): note: see reference to class template instantiation 'similarity::Space<int>' being compiled
      .\similarity_search\src\space.cc(27): note: while compiling class template member function 'std::unique_ptr<similarity::DataFileInputState,std::default_delete<similarity::DataFileInputState>> similarity::Space<int>::ReadDataset(similarity::ObjectVector &,std::vector<std::string,std::allocator<std::string>> &,const std::string &,const similarity::IdTypeUnsign) const'
      "H:\Program Files (x86)\Microsoft Visual Studio\Installer\VC\Tools\MSVC\14.40.33807\bin\HostX86\x64\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -I.\similarity_search\include -Itensorflow -Ic:\users\yanqiang\appdata\local\temp\pip-install-2kfmp7iy\nmslib_f749ac918bd
94c558de5595a8acc1d8b\.eggs\pybind11-2.6.1-py3.11.egg\pybind11\include -Ic:\users\yanqiang\appdata\local\temp\pip-install-2kfmp7iy\nmslib_f749ac918bd94c558de5595a8acc1d8b\.eggs\pybind11-2.6.1-py3.11.egg\pybind11\include -Ic:\users\yanqiang\appdata\local\temp\pip-install-
2kfmp7iy\nmslib_f749ac918bd94c558de5595a8acc1d8b\.eggs\pybind11-2.6.1-py3.11.egg\pybind11\include -Ic:\users\yanqiang\appdata\local\temp\pip-install-2kfmp7iy\nmslib_f749ac918bd94c558de5595a8acc1d8b\.eggs\pybind11-2.6.1-py3.11.egg\pybind11\include -IF:\ProgramData\anacond
a3\Lib\site-packages\numpy\core\include -IF:\ProgramData\anaconda3\include -IF:\ProgramData\anaconda3\Include "-IH:\Program Files (x86)\Microsoft Visual Studio\Installer\VC\Tools\MSVC\14.40.33807\include" "-IH:\Program Files (x86)\Microsoft Visual Studio\Installer\VC\Too
ls\MSVC\14.40.33807\ATLMFC\include" "-IH:\Program Files (x86)\Microsoft Visual Studio\Installer\VC\Auxiliary\VS\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\um" "-IC:\Progra
m Files (x86)\Windows Kits\10\\include\10.0.22621.0\\shared" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\winrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\cppwinrt" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" /EHsc /Tp.\similarity_search\src\space\space_ab_diverg.cc /Fobuild\temp.win-amd64-cpython-311\Release\.\similarity_search\src\space\space_ab_diverg.obj /EHsc /openmp /O2 /DVERSION_INFO=\\\"2.1.1\\\"
      space_ab_diverg.cc
      .\similarity_search\include\space/space_ab_diverg.h(1): warning C4819: The file contains a character that cannot be represented in the current code page (936). Save the file in Unicode format to prevent data loss
      .\similarity_search\include\object.h(127): warning C4267: '=': conversion from 'size_t' to 'int', possible loss of data
      .\similarity_search\include\distcomp.h(1): warning C4819: The file contains a character that cannot be represented in the current code page (936). Save the file in Unicode format to prevent data loss
      .\similarity_search\include\distcomp.h(260): warning C4244: 'initializing': conversion from 'size_t' to 'float', possible loss of data
      .\similarity_search\src\space\space_ab_diverg.cc(34): warning C4267: 'argument': conversion from 'size_t' to 'const int', possible loss of data
      .\similarity_search\src\space\space_ab_diverg.cc(34): note: the template instantiation context (the oldest one first) is
      .\similarity_search\src\space\space_ab_diverg.cc(55): note: see reference to class template instantiation 'similarity::SpaceAlphaBetaDivergSlow<float>' being compiled
      .\similarity_search\src\space\space_ab_diverg.cc(27): note: while compiling class template member function 'dist_t similarity::SpaceAlphaBetaDivergSlow<dist_t>::HiddenDistance(const similarity::Object *,const similarity::Object *) const'
              with
              [
                  dist_t=float
              ]
      c:\users\yanqiang\appdata\local\temp\pip-install-2kfmp7iy\nmslib_f749ac918bd94c558de5595a8acc1d8b\.eggs\pybind11-2.6.1-py3.11.egg\pybind11\include\pybind11/pybind11.h(2223): error C2027: use of undefined type '_frame'
      F:\ProgramData\anaconda3\include\pytypedefs.h(22): note: see declaration of '_frame'
      c:\users\yanqiang\appdata\local\temp\pip-install-2kfmp7iy\nmslib_f749ac918bd94c558de5595a8acc1d8b\.eggs\pybind11-2.6.1-py3.11.egg\pybind11\include\pybind11/pybind11.h(2222): error C2660: 'PyDict_GetItem': function does not take 1 arguments
      F:\ProgramData\anaconda3\include\dictobject.h(22): note: see declaration of 'PyDict_GetItem'
      c:\users\yanqiang\appdata\local\temp\pip-install-2kfmp7iy\nmslib_f749ac918bd94c558de5595a8acc1d8b\.eggs\pybind11-2.6.1-py3.11.egg\pybind11\include\pybind11/pybind11.h(2222): note: while trying to match the argument list '()'
      C:\Users\yanqiang\AppData\Local\Temp\pip-install-2kfmp7iy\nmslib_f749ac918bd94c558de5595a8acc1d8b\similarity_search\include\object.h(127): warning C4267: '=': conversion from 'size_t' to 'int', possible loss of data
      C:\Users\yanqiang\AppData\Local\Temp\pip-install-2kfmp7iy\nmslib_f749ac918bd94c558de5595a8acc1d8b\similarity_search\include\distcomp.h(1): warning C4819: The file contains a character that cannot be represented in the current code page (936). Save the file in Unicode format to prevent data loss
      C:\Users\yanqiang\AppData\Local\Temp\pip-install-2kfmp7iy\nmslib_f749ac918bd94c558de5595a8acc1d8b\similarity_search\include\distcomp.h(260): warning C4244: 'initializing': conversion from 'size_t' to 'float', possible loss of data
      nmslib.cc(454): error C2017: illegal escape sequence
      nmslib.cc(454): error C2001: newline in constant
      nmslib.cc(459): error C2143: syntax error: missing ')' before 'pybind11::enum_<similarity::DistType>'
      nmslib.cc(459): error C2143: syntax error: missing ';' before 'pybind11::enum_<similarity::DistType>'
      nmslib.cc(730): warning C4267: '=': conversion from 'size_t' to 'T', possible loss of data
              with
              [
                  T=int
              ]
      error: command 'H:\\Program Files (x86)\\Microsoft Visual Studio\\Installer\\VC\\Tools\\MSVC\\14.40.33807\\bin\\HostX86\\x64\\cl.exe' failed with exit code 2
      [end of output]

      note: This error originates from a subprocess, and is likely not a problem with pip.
      ERROR: Failed building wheel for nmslib
      Running setup.py clean for nmslib
    Failed to build nmslib
    ERROR: Could not build wheels for nmslib, which is required to install pyproject.toml-based projects

yanqiangmiffy commented 5 months ago

nmslib/hnswlib#442 (comment)

Installation succeeds in a Python 3.8 environment.

https://pypi.org/project/nmslib/#files

yanqiangmiffy commented 5 months ago

Extend BM25Retriever to work with non-Elasticsearch

https://github.com/deepset-ai/haystack/issues/3509

yanqiangmiffy commented 5 months ago

https://pypi.org/project/rank-bm25/

https://github.com/dorianbrown/rank_bm25
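
For reference, the kind of scoring a pure-Python BM25 retriever performs can be sketched without any compiled dependency. This is a minimal illustration only, not the code from rank-bm25 or from GoMate: the function name `bm25_scores`, the defaults `k1=1.5` and `b=0.75`, the whitespace tokenization, and the toy corpus are all assumptions, and the IDF used here is the Lucene-style non-negative variant, which may differ from the IDF a given library applies.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25.

    Hypothetical helper for illustration; k1 and b are common defaults,
    not values taken from this thread.
    """
    n_docs = len(corpus_tokens)
    if n_docs == 0:
        return []
    avgdl = sum(len(doc) for doc in corpus_tokens) / n_docs

    # Document frequency: number of documents that contain each term.
    df = Counter()
    for doc in corpus_tokens:
        df.update(set(doc))

    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Lucene-style IDF: the "1 +" keeps the value non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            denom = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


# Toy corpus, tokenized by whitespace (an assumption; real text needs a proper tokenizer).
corpus = [
    "hello there good man".split(),
    "it is quite windy in london".split(),
    "how is the weather today".split(),
]
scores = bm25_scores("windy london".split(), corpus)
best = max(range(len(corpus)), key=scores.__getitem__)
print(best, scores)
```

Documents that share no term with the query score exactly zero, so the ranking here is driven only by the second document. For Chinese text the whitespace split would have to be replaced by a word segmenter before scoring.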

yanqiangmiffy commented 5 months ago

https://github.com/gomate-community/GoMate/commit/50eaa3d9c818e4113c9ae2fecd0d5f06a7f131f7

yanqiangmiffy commented 5 months ago

Related principles: