huggingface / dataset-viewer

Backend that powers the dataset viewer on Hugging Face dataset pages through a public API.
https://huggingface.co/docs/dataset-viewer
Apache License 2.0

e2e is broken due to KenLM install #2636

Closed (severo closed this issue 5 months ago)

severo commented 5 months ago

We get:

Note: This error originates from the build backend, and is likely not a problem with poetry but with kenlm (0.2.0 https://github.com/kpu/kenlm/archive/master.zip) not supporting PEP 517 builds. You can verify this by running 'pip wheel --no-cache-dir --use-pep517 "kenlm @ https://github.com/kpu/kenlm/archive/master.zip"'.

Running this command gives:

$ poetry run pip wheel --no-cache-dir --use-pep517 "kenlm @ https://github.com/kpu/kenlm/archive/master.zip"
Collecting kenlm@ https://github.com/kpu/kenlm/archive/master.zip
  Downloading https://github.com/kpu/kenlm/archive/master.zip
     \ 553.6 kB 4.7 MB/s 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: kenlm
  Building wheel for kenlm (pyproject.toml) ... error
  error: subprocess-exited-with-error

  × Building wheel for kenlm (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [60 lines of output]
      Will build with KenLM max_order set to 6
      running bdist_wheel
      running build
      running build_ext
      Traceback (most recent call last):
        File "/tmp/pip-build-env-irk6jevh/overlay/bin/cmake", line 5, in <module>
          from cmake import cmake
        File "/tmp/pip-build-env-irk6jevh/overlay/lib/python3.9/site-packages/cmake/__init__.py", line 6, in <module>
          from importlib_metadata import distribution
      ModuleNotFoundError: No module named 'importlib_metadata'
      Traceback (most recent call last):
        File "/home/slesage/hf/datasets-server/services/worker/.venv/lib/python3.9/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
          main()
        File "/home/slesage/hf/datasets-server/services/worker/.venv/lib/python3.9/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
        File "/home/slesage/hf/datasets-server/services/worker/.venv/lib/python3.9/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 251, in build_wheel
          return _build_backend().build_wheel(wheel_directory, config_settings,
        File "/tmp/pip-build-env-irk6jevh/overlay/lib/python3.9/site-packages/setuptools/build_meta.py", line 410, in build_wheel
          return self._build_with_temp_dir(
        File "/tmp/pip-build-env-irk6jevh/overlay/lib/python3.9/site-packages/setuptools/build_meta.py", line 395, in _build_with_temp_dir
          self.run_setup()
        File "/tmp/pip-build-env-irk6jevh/overlay/lib/python3.9/site-packages/setuptools/build_meta.py", line 487, in run_setup
          super().run_setup(setup_script=setup_script)
        File "/tmp/pip-build-env-irk6jevh/overlay/lib/python3.9/site-packages/setuptools/build_meta.py", line 311, in run_setup
          exec(code, locals())
        File "<string>", line 124, in <module>
        File "/tmp/pip-build-env-irk6jevh/overlay/lib/python3.9/site-packages/setuptools/__init__.py", line 104, in setup
          return distutils.core.setup(**attrs)
        File "/tmp/pip-build-env-irk6jevh/overlay/lib/python3.9/site-packages/setuptools/_distutils/core.py", line 185, in setup
          return run_commands(dist)
        File "/tmp/pip-build-env-irk6jevh/overlay/lib/python3.9/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
          dist.run_commands()
        File "/tmp/pip-build-env-irk6jevh/overlay/lib/python3.9/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
          self.run_command(cmd)
        File "/tmp/pip-build-env-irk6jevh/overlay/lib/python3.9/site-packages/setuptools/dist.py", line 967, in run_command
          super().run_command(command)
        File "/tmp/pip-build-env-irk6jevh/overlay/lib/python3.9/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
          cmd_obj.run()
        File "/tmp/pip-build-env-irk6jevh/overlay/lib/python3.9/site-packages/wheel/bdist_wheel.py", line 368, in run
          self.run_command("build")
        File "/tmp/pip-build-env-irk6jevh/overlay/lib/python3.9/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
          self.distribution.run_command(command)
        File "/tmp/pip-build-env-irk6jevh/overlay/lib/python3.9/site-packages/setuptools/dist.py", line 967, in run_command
          super().run_command(command)
        File "/tmp/pip-build-env-irk6jevh/overlay/lib/python3.9/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
          cmd_obj.run()
        File "/tmp/pip-build-env-irk6jevh/overlay/lib/python3.9/site-packages/setuptools/_distutils/command/build.py", line 131, in run
          self.run_command(cmd_name)
        File "/tmp/pip-build-env-irk6jevh/overlay/lib/python3.9/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
          self.distribution.run_command(command)
        File "/tmp/pip-build-env-irk6jevh/overlay/lib/python3.9/site-packages/setuptools/dist.py", line 967, in run_command
          super().run_command(command)
        File "/tmp/pip-build-env-irk6jevh/overlay/lib/python3.9/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
          cmd_obj.run()
        File "<string>", line 64, in run
        File "/home/slesage/.pyenv/versions/3.9.18/lib/python3.9/subprocess.py", line 424, in check_output
          return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
        File "/home/slesage/.pyenv/versions/3.9.18/lib/python3.9/subprocess.py", line 528, in run
          raise CalledProcessError(retcode, process.args,
      subprocess.CalledProcessError: Command '['cmake', '--version']' returned non-zero exit status 1.
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for kenlm
Failed to build kenlm
ERROR: Failed to build one or more wheels
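
In the log above, the first traceback is the informative one: the `cmake` launcher installed in pip's isolated build environment fails on `from importlib_metadata import distribution`, so `cmake --version` exits with status 1 and the KenLM build aborts. A minimal sketch for checking whether given modules can be found in an environment follows. The helper name `missing_modules` is hypothetical, and the script has to be run with the interpreter of the environment being inspected, which may differ from the temporary build environment pip creates.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # `importlib_metadata` is the module the cmake launcher fails to import in the log.
    print(missing_modules(["importlib_metadata", "cmake"]))
```

An empty list means both modules were found; any name printed is missing from that interpreter's environment.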

severo commented 5 months ago

I'm not sure we still need KenLM?

It would be interesting to:

severo commented 5 months ago

687 canonical datasets today:

acronym_identification
ade_corpus_v2
adv_glue
aeslc
afrikaans_ner_corpus
ag_news
air_dialogue
ajgt_twitter_ar
allegro_reviews
allocine
alt
amazon_polarity
ambig_qa
ami
amttl
app_reviews
aqua_rat
aquamuse
ar_res_reviews
ar_sarcasm
arabic_billion_words
arabic_pos_dialect
arabic_speech_corpus
arcd
arsentd_lev
art
arxiv_dataset
ascent_kb
aslg_pc12
asnq
assin
assin2
atomic
autshumato
banking77
bbaw_egyptian
bbc_hindi_nli
bc2gm_corpus
beans
best2009
bible_para
big_patent
bigbench
billsum
bing_coronavirus_query_set
biomrc
biosses
biwi_kinect_head_pose
blended_skill_talk
blog_authorship_corpus
bn_hate_speech
bnl_newspapers
bookcorpus
bprec
break_data
brwac
bsd_ja_en
bswac
c3
cail2018
caner
capes
casino
catalonia_independence
cats_vs_dogs
cawac
cbt
cc_news
cc100
ccaligned_multilingual
cdsc
cdt
cedr
cfq
chr_en
cifar10
cifar100
circa
clickbait_news_bg
climate_fever
clinc_oos
clue
cmrc2018
cmu_hinglish_dog
cnn_dailymail
coached_conv_pref
coarse_discourse
codah
code_search_net
code_x_glue_cc_clone_detection_big_clone_bench
code_x_glue_cc_clone_detection_poj104
code_x_glue_cc_cloze_testing_all
code_x_glue_cc_cloze_testing_maxmin
code_x_glue_cc_code_completion_line
code_x_glue_cc_code_completion_token
code_x_glue_cc_code_refinement
code_x_glue_cc_code_to_code_trans
code_x_glue_cc_defect_detection
code_x_glue_ct_code_to_text
code_x_glue_tc_nl_code_search_adv
code_x_glue_tc_text_to_code
code_x_glue_tt_text_to_text
com_qa
common_language
compguesswhat
conceptnet5
conceptual_12m
conceptual_captions
conll2000
conll2002
conll2003
conll2012_ontonotesv5
conllpp
consumer-finance-complaints
conv_ai
conv_ai_2
conv_ai_3
conv_questions
cornell_movie_dialog
cos_e
cosmos_qa
counter
covid_qa_castorini
covid_qa_deepset
covid_qa_ucsd
covid_tweets_japanese
covost2
cppe-5
craigslist_bargains
crawl_domain
crd3
crime_and_punish
crows_pairs
cryptonite
cs_restaurants
cuad
curiosity_dialogs
daily_dialog
dane
danish_political_comments
dart
datacommons_factcheck
dbrd
deal_or_no_dialog
definite_pronoun_resolution
dengue_filipino
dialog_re
diplomacy_detection
disaster_response_messages
discofuse
discovery
disfl_qa
doc2dial
docred
doqa
dream
duorc
dutch_social
dyk
e2e_nlg
e2e_nlg_cleaned
ecb
ecthr_cases
eduge
ehealth_kd
electricity_load_diagrams
eli5_category
elkarhizketak
emea
emo
emotone_ar
empathetic_dialogues
enriched_web_nlg
enwik8
eraser_multi_rc
esnli
eth_py150_open
ethos
ett
eu_regulatory_ir
eurlex
euronews
europa_eac_tm
europa_ecdc_tm
europarl_bilingual
event2Mind
evidence_infer_treatment
factckbr
fake_news_english
fake_news_filipino
farsi_news
fashion_mnist
fever
few_rel
financial_phrasebank
finer
flores
flue
food101
fquad
freebase_qa
gap
gem
generated_reviews_enth
generics_kb
german_legal_entity_recognition
germaner
germeval_14
giga_fren
gigaword
glucose
glue
gnad10
go_emotions
gooaq
google_wellformed_query
grail_qa
great_code
gsm8k
guardian_authorship
gutenberg_time
hans
hansards
hard
harem
has_part
hate_offensive
hate_speech_filipino
hate_speech_pl
hate_speech_portuguese
hate_speech18
hatexplain
hausa_voa_ner
hausa_voa_topics
hda_nli_hindi
head_qa
health_fact
hebrew_projectbenyehuda
hebrew_sentiment
hebrew_this_world
hind_encorp
hindi_discourse
hippocorpus
hkcancor
hlgd
hope_edi
hotpot_qa
hover
hrenwac_para
hrwac
humicroedit
hybrid_qa
hyperpartisan_news_detection
iapp_wiki_qa_squad
id_clickbait
id_liputan6
id_nergrit_corpus
id_newspapers_2018
id_panl_bppt
id_puisi
igbo_english_machine_translation
igbo_monolingual
igbo_ner
ilist
imagenet_sketch
imagenet-1k
imdb_urdu_reviews
imppres
indic_glue
indonli
inquisitive_qg
interpress_news_category_tr
interpress_news_category_tr_lite
irc_disentangle
isixhosa_ner_corpus
isizulu_ner_corpus
iwslt2017
jeopardy
jnlpba
journalists_questions
kan_hope
kannada_news
kd_conv
kde4
kelm
kilt_tasks
kilt_wikipedia
kinnews_kirnews
klue
kor_3i4k
kor_hate
kor_ner
kor_nli
kor_nlu
kor_qpair
kor_sae
kor_sarcasm
labr
lama
lambada
large_spanish_corpus
laroseda
lc_quad
lccc
lener_br
liar
librispeech_asr
librispeech_lm
limit
lince
linnaeus
liveqa
lj_speech
lm1b
lst20
m_lama
mac_morpho
makhzan
masakhaner
math_dataset
math_qa
matinf
mbpp
mc_taco
md_gender_bias
mdd
med_hop
medal
medical_questions_pairs
menyo20k_mt
meta_woz
metashift
metooma
metrec
miam
mkb
mkqa
mlqa
mlsum
mnist
mocha
monash_tsf
moroco
movie_rationales
mrqa
ms_marco
ms_terms
msr_genomics_kbcomp
msr_sqa
msr_text_compression
msr_zhen_translation_parity
msra_ner
mt_eng_vietnamese
muchocine
multi_booked
multi_news
multi_nli_mismatch
multi_para_crawl
multi_re_qa
multi_woz_v22
multi_x_science_sum
multidoc2dial
multilingual_librispeech
mutual_friends
mwsc
myanmar_news
narrativeqa
narrativeqa_manual
natural_questions
ncbi_disease
nchlt
ncslgr
nell
neural_code_search
newsgroup
newsph
newsph_nli
newspop
newsqa
newsroom
nkjp-ner
nli_tr
nlu_evaluation_data
norec
norne
norwegian_ner
nq_open
nsmc
numer_sense
numeric_fused_head
oclar
offcombr
offenseval_dravidian
offenseval2020_tr
ofis_publik
ohsumed
ollie
omp
onestop_english
onestop_qa
open_subtitles
openai_humaneval
openslr
opinosis
opus_books
opus_dgt
opus_dogc
opus_elhuyar
opus_euconst
opus_finlex
opus_fiskmo
opus_gnome
opus_infopankki
opus_memat
opus_montenegrinsubs
opus_openoffice
opus_paracrawl
opus_rf
opus_tedtalks
opus_ubuntu
opus_wikipedia
opus_xhosanavy
opus100
orange_sum
oscar
para_crawl
para_pat
parsinlu_reading_comprehension
pass
paws
paws-x
pec
peoples_daily_ner
per_sent
persian_ner
pg19
php
pib
piqa
pn_summary
poem_sentiment
polemo2
poleval2019_cyberbullying
poleval2019_mt
polsum
polyglot_ner
prachathai67k
pragmeval
proto_qa
psc
ptb_text_only
pubmed
py_ast
qa_srl
qa_zre
qa4mre
qangaroo
qanta
qed
qed_amara
quac
quail
quarel
quickdraw
quora
quoref
re_dial
reasoning_bg
recipe_nlg
reclor
red_caps
reddit_tifu
refresd
reuters21578
riddle_sense
ro_sent
ro_sts
ro_sts_parallel
roman_urdu
roman_urdu_hate_speech
ronec
rotten_tomatoes
samsum
sanskrit_classic
saudinewsnet
sberquad
sbu_captions
scan
scb_mt_enth_2020
scene_parse_150
schema_guided_dstc8
scielo
scientific_papers
search_qa
sede
selqa
sem_eval_2010_task_8
sem_eval_2014_task_1
sem_eval_2018_task_1
sem_eval_2020_task_11
sent_comp
senti_lex
senti_ws
sentiment140
sepedi_ner
sesotho_ner_corpus
setimes
setswana_ner_corpus
sharc_modified
sick
silicone
simple_questions_v2
siswati_ner_corpus
smartdata
sms_spam
snips_built_in_intents
snow_simplified_japanese_corpus
so_stacksample
social_bias_frames
social_i_qa
sofc_materials_articles
sogou_news
spanish_billion_words
spc
species_800
speech_commands
spider
squad_adversarial
squad_es
squad_it
squad_kor_v1
squad_kor_v2
squad_v1_pt
squadshifts
srwac
sst
stereoset
story_cloze
stsb_mt_sv
stsb_multi_mt
style_change_detection
subjqa
super_glue
superb
svhn
swag
swahili
swahili_news
swda
swedish_medical_ner
swedish_ner_corpus
swedish_reviews
tab_fact
tamilmixsentiment
tanzil
tapaco
tashkeela
taskmaster1
taskmaster2
taskmaster3
tatoeba
ted_hrlr
ted_iwlst2013
ted_multi
ted_talks_iwslt
telugu_books
telugu_news
tep_en_fa_para
text2log
textvqa
thai_toxicity_tweet
thainer
thaiqa_squad
thaisum
tilde_model
time_dial
times_of_india_news_headlines
timit_asr
tiny_shakespeare
tlc
tmu_gfm_dataset
tne
told-br
totto
trec
truthful_qa
tsac
ttc4900
tunizi
tuple_ie
turk
turkic_xwmt
turkish_movie_sentiment
turkish_ner
turkish_product_reviews
turkish_shrinked_ner
turku_ner_corpus
tweet_eval
tweet_qa
tweets_ar_en_parallel
tweets_hate_speech_detection
twi_text_c3
twi_wordsim353
tydiqa
ubuntu_dialogs_corpus
udhr
um005
un_ga
un_multi
un_pc
universal_dependencies
universal_morphologies
urdu_fake_news
urdu_sentiment_corpus
vctk
visual_genome
vivos
web_nlg
web_of_science
web_questions
weibo_ner
wi_locness
wider_face
wiki_asp
wiki_atomic_edits
wiki_auto
wiki_bio
wiki_dpr
wiki_hop
wiki_lingua
wiki_movies
wiki_qa
wiki_qa_ar
wiki_snippets
wiki_source
wiki_split
wiki_summary
wiki40b
wikiann
wikicorpus
wikihow
wikisql
wikitablequestions
wikitext
wikitext_tl39
wili_2018
wino_bias
winograd_wsc
winogrande
wiqa
wisesight_sentiment
wisesight1000
wmt_t2t
wmt14
wmt15
wmt16
wmt17
wmt18
wmt19
wmt20_mlqe_task1
wmt20_mlqe_task2
wmt20_mlqe_task3
wnut_17
wongnai_reviews
woz_dialogue
wrbsc
x_stance
xcopa
xcsr
xed_en_fi
xglue
xnli
xor_tydi_qa
xquad
xquad_r
xsum_factuality
xtreme
yahoo_answers_qa
yahoo_answers_topics
yelp_polarity
yelp_review_full
yoruba_bbc_topics
yoruba_gv_ner
yoruba_text_c3
yoruba_wordsim353
youtube_caption_corrections
zest

And 27 other authorized datasets:

espnet/yodas
gaia-benchmark/GAIA
google/fleurs
mozilla-foundation/common_voice_1_0
mozilla-foundation/common_voice_10_0
mozilla-foundation/common_voice_11_0
mozilla-foundation/common_voice_12_0
mozilla-foundation/common_voice_13_0
mozilla-foundation/common_voice_14_0
mozilla-foundation/common_voice_15_0
mozilla-foundation/common_voice_16_0
mozilla-foundation/common_voice_16_1
mozilla-foundation/common_voice_2_0
mozilla-foundation/common_voice_3_0
mozilla-foundation/common_voice_4_0
mozilla-foundation/common_voice_5_0
mozilla-foundation/common_voice_5_1
mozilla-foundation/common_voice_6_0
mozilla-foundation/common_voice_6_1
mozilla-foundation/common_voice_7_0
mozilla-foundation/common_voice_8_0
mozilla-foundation/common_voice_9_0
poloclub/diffusiondb
pufanyi/MIMICIT
speechcolab/gigaspeech
togethercomputer/RedPajama-Data-1T
togethercomputer/RedPajama-Data-V2

severo commented 5 months ago

I ran https://huggingface.co/spaces/severo/find_script_based_datasets_dependencies on these 714 datasets (687 canonical + 27 authorized).

The only dependencies we need in services/worker are:

bigbench
conllu
datasets
faiss
h5py
huggingface_hub
lxml
numpy
openpyxl
pandas
py7zr
pyarrow
scipy
sentencepiece
tqdm
zstandard

Note that we don't support bigbench. I'll open a PR to remove all the dependencies we don't need.
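
How the Space linked above works is not shown in this thread. As a rough illustration of the general idea, the sketch below collects the top-level modules imported by a dataset loading script using Python's standard `ast` module. The function name `top_level_imports` and the example script are assumptions made for illustration only.

```python
import ast


def top_level_imports(source: str) -> set:
    """Collect the top-level module names imported by a Python source string."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # "os.path" -> "os": keep only the top-level package name.
                modules.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports (level > 0): they refer to the script's own package.
            if node.level == 0 and node.module:
                modules.add(node.module.split(".")[0])
    return modules


if __name__ == "__main__":
    # Hypothetical loading script, not taken from any real dataset.
    example = (
        "import datasets\n"
        "import numpy as np\n"
        "from lxml import etree\n"
        "from .utils import helper\n"
    )
    print(sorted(top_level_imports(example)))
```

Import names do not always match package names on PyPI (for example, `faiss` is typically installed as `faiss-cpu`), so a result like this still needs a manual mapping step before editing `pyproject.toml`.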

severo commented 5 months ago

See https://github.com/huggingface/datasets-server/pull/2637