huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

[building on windows] onig_sys/oniguruma two or more data types in declaration specifiers #1581

Open louis030195 opened 2 months ago

louis030195 commented 2 months ago

I'm using tokenizers with candle, and one of my Windows users is hitting two issues when trying to build screenpipe.

CPU only

PS D:\AI\Screen\screen-pipe> cargo build --release
   Compiling rustls-webpki v0.102.6
   Compiling onig_sys v69.8.1
   Compiling symphonia-utils-xiph v0.5.4
   Compiling hyper-tls v0.5.0
   Compiling http-body v1.0.1
   Compiling spm_precompiled v0.1.4
   Compiling idna v0.5.0
   Compiling tokio v1.39.2
   Compiling indexmap v1.9.3
   Compiling macro_rules_attribute v0.2.0
   Compiling ntapi v0.4.1
   Compiling which v6.0.1
   Compiling windows v0.54.0
   Compiling digest v0.10.7
   Compiling rayon-cond v0.3.0
   Compiling parking_lot v0.12.3
   Compiling fancy-regex v0.13.0
   Compiling zune-jpeg v0.4.13
   Compiling candle-nn v0.6.0
   Compiling tiff v0.9.1
   Compiling regex v1.10.5
   Compiling crossbeam-utils v0.8.20
   Compiling derive_builder v0.20.0
   Compiling esaxx-rs v0.1.10
   Compiling futures v0.3.30
   Compiling tokio-stream v0.1.15
   Compiling serde_plain v1.0.2
   Compiling qoi v0.4.1
   Compiling schannel v0.1.23
   Compiling axum-core v0.3.4
   Compiling cexpr v0.6.0
   Compiling serde_json v1.0.121
   Compiling tracing-core v0.1.32
   Compiling native-tls v0.2.12
   Compiling unicode-normalization-alignments v0.1.12
   Compiling thiserror v1.0.63
   Compiling reqwest v0.11.27
   Compiling bitflags v2.6.0
   Compiling heck v0.5.0
   Compiling fastrand v2.1.0
   Compiling itoa v1.0.11
   Compiling unicode_categories v0.1.1
   Compiling log v0.4.22
   Compiling lazy_static v1.5.0
   Compiling shlex v1.3.0
   Compiling cpufeatures v0.2.12
   Compiling strsim v0.11.1
   Compiling clap_lex v0.7.2
   Compiling rustc-hash v1.1.0
   Compiling ryu v1.0.18
   Compiling crc-catalog v2.4.0
   Compiling lazycell v1.3.0
   Compiling crossbeam-queue v0.3.11
   Compiling url v2.5.2
   Compiling clap_derive v4.5.11
   Compiling sha2 v0.10.8
The following warnings were emitted during compilation:

warning: onig_sys@69.8.1: In file included from oniguruma\src/regenc.h:36,
warning: onig_sys@69.8.1:                  from oniguruma\src/regint.h:103,
warning: onig_sys@69.8.1:                  from oniguruma\src\regexec.c:36:
warning: onig_sys@69.8.1: D:\AI\Screen\screen-pipe\target\release\build\onig_sys-308529629068d6af\out/config.h:32:15: error: two or more data types in declaration specifiers
warning: onig_sys@69.8.1:    32 | #define uid_t int
warning: onig_sys@69.8.1:       |               ^~~
warning: onig_sys@69.8.1: D:\AI\Screen\screen-pipe\target\release\build\onig_sys-308529629068d6af\out/config.h:33:15: error: two or more data types in declaration specifiers
warning: onig_sys@69.8.1:    33 | #define gid_t int
warning: onig_sys@69.8.1:       |               ^~~

error: failed to run custom build command for `onig_sys v69.8.1`

Caused by:
  process didn't exit successfully: `D:\AI\Screen\screen-pipe\target\release\build\onig_sys-d764c792d708edb0\build-script-build` (exit code: 1)
  --- stdout
  cargo:rerun-if-env-changed=RUSTONIG_DYNAMIC_LIBONIG
  cargo:rerun-if-env-changed=RUSTONIG_STATIC_LIBONIG
  cargo:rerun-if-env-changed=RUSTONIG_SYSTEM_LIBONIG
  OUT_DIR = Some(D:\AI\Screen\screen-pipe\target\release\build\onig_sys-308529629068d6af\out)
  TARGET = Some(x86_64-pc-windows-gnu)
  OPT_LEVEL = Some(3)
  HOST = Some(x86_64-pc-windows-gnu)
  cargo:rerun-if-env-changed=CC_x86_64-pc-windows-gnu
  CC_x86_64-pc-windows-gnu = None
  cargo:rerun-if-env-changed=CC_x86_64_pc_windows_gnu
  CC_x86_64_pc_windows_gnu = None
  cargo:rerun-if-env-changed=HOST_CC
  HOST_CC = None
  cargo:rerun-if-env-changed=CC
  CC = None
  cargo:rerun-if-env-changed=CC_ENABLE_DEBUG_OUTPUT
  RUSTC_WRAPPER = None
  cargo:rerun-if-env-changed=CRATE_CC_NO_DEFAULTS
  CRATE_CC_NO_DEFAULTS = None
  DEBUG = Some(false)
  CARGO_CFG_TARGET_FEATURE = Some(cmpxchg16b,fxsr,sse,sse2,sse3)
  cargo:rerun-if-env-changed=CFLAGS_x86_64-pc-windows-gnu
  CFLAGS_x86_64-pc-windows-gnu = None
  cargo:rerun-if-env-changed=CFLAGS_x86_64_pc_windows_gnu
  CFLAGS_x86_64_pc_windows_gnu = None
  cargo:rerun-if-env-changed=HOST_CFLAGS
  HOST_CFLAGS = None
  cargo:rerun-if-env-changed=CFLAGS
  CFLAGS = None
  cargo:warning=In file included from oniguruma\src/regenc.h:36,
  cargo:warning=                 from oniguruma\src/regint.h:103,
  cargo:warning=                 from oniguruma\src\regexec.c:36:
  cargo:warning=D:\AI\Screen\screen-pipe\target\release\build\onig_sys-308529629068d6af\out/config.h:32:15: error: two or more data types in declaration specifiers
  cargo:warning=   32 | #define uid_t int
  cargo:warning=      |               ^~~
  cargo:warning=D:\AI\Screen\screen-pipe\target\release\build\onig_sys-308529629068d6af\out/config.h:33:15: error: two or more data types in declaration specifiers
  cargo:warning=   33 | #define gid_t int
  cargo:warning=      |               ^~~

  --- stderr

  error occurred: Command "gcc.exe" "-O3" "-ffunction-sections" "-fdata-sections" "-m64" "-I" "D:\\AI\\Screen\\screen-pipe\\target\\release\\build\\onig_sys-308529629068d6af\\out" "-I" "oniguruma\\src" "-o" "D:\\AI\\Screen\\screen-pipe\\target\\release\\build\\onig_sys-308529629068d6af\\out\\abd886268342579b-regexec.o" "-c" "oniguruma\\src\\regexec.c" with args gcc.exe did not execute successfully (status code exit code: 1).

warning: build failed, waiting for other jobs to finish...
    Building [=================>       ] 446/606: windows 

CUDA

https://github.com/huggingface/candle/issues/353


I'm just trying to make it work on CPU at least. I tried disabling onig on Windows, because it looks optional in your Cargo.toml, but the build then complains that onig is missing.

How do you build tokenizers on Windows?

ArthurZucker commented 1 month ago

You can also enable the unstable_wasm feature, which uses fancy-regex instead of onig.
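
As a rough sketch of what that could look like in a downstream Cargo.toml: the fragment below turns off the default features (which include onig) and opts into unstable_wasm instead. The version number is an assumption for illustration, and this only helps if the tokenizers release in use still exposes the unstable_wasm feature and no other crate in the dependency graph re-enables onig.

```toml
# Hypothetical fragment of the downstream crate's Cargo.toml.
[dependencies]
# The version is an assumption; keep whatever version the project already resolves to.
# default-features = false drops onig (and the other default features);
# unstable_wasm switches the regex backend to fancy-regex.
tokenizers = { version = "0.19", default-features = false, features = ["unstable_wasm"] }
```

Cargo unifies features across the whole dependency graph, so if another dependency requests tokenizers with onig enabled, onig_sys will still be compiled. Running `cargo tree -e features -i tokenizers` should show which crates enable which features, and `cargo tree -i onig_sys` should show whether onig_sys is still being pulled in.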

ArthurZucker commented 1 month ago

You are right that it was supposed to be optional at some point 😓 Now it's not.